pep-0001 PEP Purpose and Guidelines
| PEP: | 1 |
|---|---|
| Title: | PEP Purpose and Guidelines |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Barry Warsaw, Jeremy Hylton, David Goodger, Nick Coghlan |
| Status: | Active |
| Type: | Process |
| Content-Type: | text/x-rst |
| Created: | 13-Jun-2000 |
| Post-History: | 21-Mar-2001, 29-Jul-2002, 03-May-2003, 05-May-2012, 07-Apr-2013 |
What is a PEP?
PEP stands for Python Enhancement Proposal. A PEP is a design document providing information to the Python community, or describing a new feature for Python or its processes or environment. The PEP should provide a concise technical specification of the feature and a rationale for the feature.
We intend PEPs to be the primary mechanisms for proposing major new features, for collecting community input on an issue, and for documenting the design decisions that have gone into Python. The PEP author is responsible for building consensus within the community and documenting dissenting opinions.
Because the PEPs are maintained as text files in a versioned repository, their revision history is the historical record of the feature proposal [1].
PEP Types
There are three kinds of PEP:
- A Standards Track PEP describes a new feature or implementation for Python. It may also describe an interoperability standard that will be supported outside the standard library for current Python versions before a subsequent PEP adds standard library support in a future version.
- An Informational PEP describes a Python design issue, or provides general guidelines or information to the Python community, but does not propose a new feature. Informational PEPs do not necessarily represent a Python community consensus or recommendation, so users and implementers are free to ignore Informational PEPs or follow their advice.
- A Process PEP describes a process surrounding Python, or proposes a change to (or an event in) a process. Process PEPs are like Standards Track PEPs but apply to areas other than the Python language itself. They may propose an implementation, but not to Python's codebase; they often require community consensus; unlike Informational PEPs, they are more than recommendations, and users are typically not free to ignore them. Examples include procedures, guidelines, changes to the decision-making process, and changes to the tools or environment used in Python development. Any meta-PEP is also considered a Process PEP.
PEP Workflow
Python's BDFL
There are several references in this PEP to the "BDFL". This acronym stands for "Benevolent Dictator for Life" and refers to Guido van Rossum, the original creator of, and the final design authority for, the Python programming language.
PEP Editors
The PEP editors are individuals responsible for managing the administrative and editorial aspects of the PEP workflow (e.g. assigning PEP numbers and changing their status). See PEP Editor Responsibilities & Workflow for details. The current editors are:
- Chris Angelico
- Anthony Baxter
- Georg Brandl
- Brett Cannon
- David Goodger
- Jesse Noller
- Berker Peksag
- Guido van Rossum
- Barry Warsaw
PEP editorship is by invitation of the current editors. The address <peps@python.org> is a mailing list for contacting the PEP editors. All email related to PEP administration (such as requesting a PEP number or providing an updated version of a PEP for posting) should be sent to this address (no cross-posting please).
Submitting a PEP
The PEP process begins with a new idea for Python. It is highly recommended that a single PEP contain a single key proposal or new idea. Small enhancements or patches often don't need a PEP and can be injected into the Python development workflow with a patch submission to the Python issue tracker [6]. The more focused the PEP, the more successful it tends to be. The PEP editors reserve the right to reject PEP proposals if they appear too unfocused or too broad. If in doubt, split your PEP into several well-focused ones.
Each PEP must have a champion -- someone who writes the PEP using the style and format described below, shepherds the discussions in the appropriate forums, and attempts to build community consensus around the idea. The PEP champion (a.k.a. Author) should first attempt to ascertain whether the idea is PEP-able. Posting to the comp.lang.python newsgroup (a.k.a. python-list@python.org mailing list) or the python-ideas mailing list is the best way to go about this.
Vetting an idea publicly before going as far as writing a PEP is meant to save the potential author time. Many ideas have been brought forward for changing Python that have been rejected for various reasons. Asking the Python community first if an idea is original helps prevent too much time being spent on something that is guaranteed to be rejected based on prior discussions (searching the internet does not always do the trick). It also helps to make sure the idea is applicable to the entire community and not just the author. Just because an idea sounds good to the author does not mean it will work for most people in most areas where Python is used.
Once the champion has asked the Python community whether an idea has any chance of acceptance, a draft PEP should be presented to python-ideas. This gives the author a chance to flesh out the draft PEP to make it properly formatted, of high quality, and to address initial concerns about the proposal.
Following a discussion on python-ideas, the proposal should be sent as a draft PEP to the PEP editors <peps@python.org>. The draft must be written in PEP style as described below, else it will be sent back without further regard until proper formatting rules are followed (although minor errors will be corrected by the editors).
If the PEP editors approve, they will assign the PEP a number, label it as Standards Track, Informational, or Process, give it status "Draft", and create and check-in the initial draft of the PEP. The PEP editors will not unreasonably deny a PEP. Reasons for denying PEP status include duplication of effort, being technically unsound, not providing proper motivation or addressing backwards compatibility, or not in keeping with the Python philosophy. The BDFL can be consulted during the approval phase, and is the final arbiter of the draft's PEP-ability.
Developers with hg push privileges for the PEP repository [10] may claim PEP numbers directly by creating and committing a new PEP. When doing so, the developer must handle the tasks that would normally be taken care of by the PEP editors (see PEP Editor Responsibilities & Workflow). This includes ensuring the initial version meets the expected standards for submitting a PEP. Alternately, even developers may choose to submit PEPs through the PEP editors. When doing so, let the PEP editors know you have hg push privileges and they can guide you through the process of updating the PEP repository directly.
As updates are necessary, the PEP author can check in new versions if they (or a collaborating developer) have hg push privileges, or else they can email new PEP versions to the PEP editors for publication.
After a PEP number has been assigned, a draft PEP may be discussed further on python-ideas (getting a PEP number assigned early can be useful for ease of reference, especially when multiple draft PEPs are being considered at the same time). Eventually, all Standards Track PEPs must be sent to the python-dev list for review as described in the next section.
Standards Track PEPs consist of two parts, a design document and a reference implementation. It is generally recommended that at least a prototype implementation be co-developed with the PEP, as ideas that sound good in principle sometimes turn out to be impractical when subjected to the test of implementation.
PEP authors are responsible for collecting community feedback on a PEP before submitting it for review. However, wherever possible, long open-ended discussions on public mailing lists should be avoided. Strategies to keep the discussions efficient include: setting up a separate SIG mailing list for the topic, having the PEP author accept private comments in the early design phases, setting up a wiki page, etc. PEP authors should use their discretion here.
PEP Review & Resolution
Once the authors have completed a PEP, they may request a review for style and consistency from the PEP editors. However, the content and final acceptance of the PEP must be requested of the BDFL, usually via an email to the python-dev mailing list. PEPs are reviewed by the BDFL and his chosen consultants, who may accept or reject a PEP or send it back to the author(s) for revision. For a PEP that is predetermined to be acceptable (e.g., it is an obvious win as-is and/or its implementation has already been checked in) the BDFL may also initiate a PEP review, first notifying the PEP author(s) and giving them a chance to make revisions.
The final authority for PEP approval is the BDFL. However, whenever a new PEP is put forward, any core developer that believes they are suitably experienced to make the final decision on that PEP may offer to serve as the BDFL's delegate (or "PEP czar") for that PEP. If their self-nomination is accepted by the other core developers and the BDFL, then they will have the authority to approve (or reject) that PEP. This process happens most frequently with PEPs where the BDFL has granted in principle approval for something to be done, but there are details that need to be worked out before the PEP can be accepted.
If the final decision on a PEP is to be made by a delegate rather than directly by the BDFL, this will be recorded by including the "BDFL-Delegate" header in the PEP.
PEP review and resolution may also occur on a list other than python-dev (for example, distutils-sig for packaging related PEPs that don't immediately affect the standard library). In this case, the "Discussions-To" heading in the PEP will identify the appropriate alternative list where discussion, review and pronouncement on the PEP will occur.
For a PEP to be accepted it must meet certain minimum criteria. It must be a clear and complete description of the proposed enhancement. The enhancement must represent a net improvement. The proposed implementation, if applicable, must be solid and must not complicate the interpreter unduly. Finally, a proposed enhancement must be "pythonic" in order to be accepted by the BDFL. (However, "pythonic" is an imprecise term; it may be defined as whatever is acceptable to the BDFL. This logic is intentionally circular.) See PEP 2 [2] for standard library module acceptance criteria.
Once a PEP has been accepted, the reference implementation must be completed. When the reference implementation is complete and incorporated into the main source code repository, the status will be changed to "Final".
A PEP can also be assigned status "Deferred". The PEP author or an editor can assign the PEP this status when no progress is being made on the PEP. Once a PEP is deferred, a PEP editor can re-assign it to draft status.
A PEP can also be "Rejected". Perhaps after all is said and done it was not a good idea. It is still important to have a record of this fact. The "Withdrawn" status is similar - it means that the PEP author themselves has decided that the PEP is actually a bad idea, or has accepted that a competing proposal is a better alternative.
When a PEP is Accepted, Rejected or Withdrawn, the PEP should be updated accordingly. In addition to updating the status field, at the very least the Resolution header should be added with a link to the relevant post in the python-dev mailing list archives.
PEPs can also be superseded by a different PEP, rendering the original obsolete. This is intended for Informational PEPs, where version 2 of an API can replace version 1.
The possible paths of the status of PEPs are as follows:
[diagram: PEP status transitions, omitted from this text version]
Some Informational and Process PEPs may also have a status of "Active" if they are never meant to be completed. E.g. PEP 1 (this PEP).
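The transitions described in the preceding paragraphs can be summarized in a small sketch. The transition table below is inferred from the prose of this PEP, not taken from any official machine-readable specification; the authoritative picture is the status diagram in the original document.

```python
# PEP status transitions, inferred from the prose in this PEP.
TRANSITIONS = {
    "Draft": {"Accepted", "Rejected", "Withdrawn", "Deferred", "Active"},
    "Deferred": {"Draft"},   # an editor can re-assign a deferred PEP to Draft
    "Accepted": {"Final"},   # Final once the reference implementation is complete
    "Final": {"Superseded"}, # e.g. version 2 of an API replacing version 1
}

def can_transition(old: str, new: str) -> bool:
    """Return True if the sketch above allows moving from old to new."""
    return new in TRANSITIONS.get(old, set())
```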
PEP Maintenance
In general, Standards track PEPs are no longer modified after they have reached the Final state. Once a PEP has been completed, the Language and Standard Library References become the formal documentation of the expected behavior.
Informational and Process PEPs may be updated over time to reflect changes to development practices and other details. The precise process followed in these cases will depend on the nature and purpose of the PEP being updated.
What belongs in a successful PEP?
Each PEP should have the following parts:
Preamble -- RFC 822 style headers containing meta-data about the PEP, including the PEP number, a short descriptive title (limited to a maximum of 44 characters), the names, and optionally the contact info for each author, etc.
Abstract -- a short (~200 word) description of the technical issue being addressed.
Copyright/public domain -- Each PEP must either be explicitly labeled as placed in the public domain (see this PEP as an example) or licensed under the Open Publication License [7].
Specification -- The technical specification should describe the syntax and semantics of any new language feature. The specification should be detailed enough to allow competing, interoperable implementations for at least the current major Python platforms (CPython, Jython, IronPython, PyPy).
Motivation -- The motivation is critical for PEPs that want to change the Python language. It should clearly explain why the existing language specification is inadequate to address the problem that the PEP solves. PEP submissions without sufficient motivation may be rejected outright.
Rationale -- The rationale fleshes out the specification by describing what motivated the design and why particular design decisions were made. It should describe alternate designs that were considered and related work, e.g. how the feature is supported in other languages.
The rationale should provide evidence of consensus within the community and discuss important objections or concerns raised during discussion.
Backwards Compatibility -- All PEPs that introduce backwards incompatibilities must include a section describing these incompatibilities and their severity. The PEP must explain how the author proposes to deal with these incompatibilities. PEP submissions without a sufficient backwards compatibility treatise may be rejected outright.
Reference Implementation -- The reference implementation must be completed before any PEP is given status "Final", but it need not be completed before the PEP is accepted. While there is merit to the approach of reaching consensus on the specification and rationale before writing code, the principle of "rough consensus and running code" is still useful when it comes to resolving many discussions of API details.
The final implementation must include test code and documentation appropriate for either the Python language reference or the standard library reference.
PEP Formats and Templates
There are two PEP formats available to authors: plaintext and reStructuredText [8]. Both are UTF-8-encoded text files.
Plaintext PEPs are written with minimal structural markup that adheres to a rigid style. PEP 9 contains a instructions and a template [3] you can use to get started writing your plaintext PEP.
ReStructuredText [8] PEPs allow for rich markup that is still quite easy to read, but results in much better-looking and more functional HTML. PEP 12 contains instructions and a template [4] for reStructuredText PEPs.
There is a Python script that converts both styles of PEPs to HTML for viewing on the web [5]. Parsing and conversion of plaintext PEPs is self-contained within the script. reStructuredText PEPs are parsed and converted by Docutils [9] code called from the script.
PEP Header Preamble
Each PEP must begin with an RFC 822 style header preamble. The headers must appear in the following order. Headers marked with "*" are optional and are described below. All other headers are required.
PEP: <pep number>
Title: <pep title>
Version: <version string>
Last-Modified: <date string>
Author: <list of authors' real names and optionally, email addrs>
* BDFL-Delegate: <PEP czar's real name>
* Discussions-To: <email address>
Status: <Draft | Active | Accepted | Deferred | Rejected |
Withdrawn | Final | Superseded>
Type: <Standards Track | Informational | Process>
* Content-Type: <text/plain | text/x-rst>
* Requires: <pep numbers>
Created: <date created on, in dd-mmm-yyyy format>
* Python-Version: <version number>
Post-History: <dates of postings to python-list and python-dev>
* Replaces: <pep number>
* Superseded-By: <pep number>
* Resolution: <url>
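Because the preamble uses RFC 822 style headers, it can be read with the standard library's email parser. The preamble below is purely illustrative (PEP 9999 and its author do not exist):

```python
from email.parser import HeaderParser

# A hypothetical preamble; every value here is illustrative only.
preamble = """\
PEP: 9999
Title: An Example Title
Version: $Revision$
Last-Modified: $Date$
Author: Random J. User <address@dom.ain>
Status: Draft
Type: Informational
Content-Type: text/x-rst
Created: 14-Aug-2001
Post-History: 14-Aug-2001
"""

headers = HeaderParser().parsestr(preamble)
print(headers["PEP"], headers["Status"])   # → 9999 Draft
```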
The Author header lists the names, and optionally the email addresses, of all the authors/owners of the PEP. The format of the Author header value must be
Random J. User <address@dom.ain>
if the email address is included, and just
Random J. User
if the address is not given. For historical reasons the format "address@dom.ain (Random J. User)" may appear in a PEP, however new PEPs must use the mandated format above, and it is acceptable to change to this format when PEPs are updated.
If there are multiple authors, each should be on a separate line following RFC 2822 continuation line conventions. Note that personal email addresses in PEPs will be obscured as a defense against spam harvesters.
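As a sketch, the standard library's email.utils.parseaddr handles both the mandated "Real Name <addr>" form and the historical "addr (Real Name)" form, which can be convenient when normalizing older PEPs to the mandated format (the addresses below are the placeholders from this PEP, not real ones):

```python
from email.utils import parseaddr

# The mandated format: real name, then the address in angle brackets.
name, addr = parseaddr("Random J. User <address@dom.ain>")

# The historical format still found in some older PEPs.
old_name, old_addr = parseaddr("address@dom.ain (Random J. User)")
```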
The BDFL-Delegate field is used to record cases where the final decision to approve or reject a PEP rests with someone other than the BDFL. (The delegate's email address is currently omitted due to a limitation in the email address masking for reStructuredText PEPs)
Note: The Resolution header is required for Standards Track PEPs only. It contains a URL that should point to an email message or other web resource where the pronouncement about the PEP is made.
For a PEP where final pronouncement will be made on a list other than python-dev, a Discussions-To header will indicate the mailing list or URL where the pronouncement will occur. A temporary Discussions-To header may also be used when a draft PEP is being discussed prior to submission for pronouncement. No Discussions-To header is necessary if the PEP is being discussed privately with the author, or on the python-list, python-ideas or python-dev mailing lists. Note that email addresses in the Discussions-To header will not be obscured.
The Type header specifies the type of PEP: Standards Track, Informational, or Process.
The format of a PEP is specified with a Content-Type header. The acceptable values are "text/plain" for plaintext PEPs (see PEP 9 [3]) and "text/x-rst" for reStructuredText PEPs (see PEP 12 [4]). Plaintext ("text/plain") is the default if no Content-Type header is present.
The Created header records the date that the PEP was assigned a number, while Post-History is used to record the dates of when new versions of the PEP are posted to python-list and/or python-dev. Both headers should be in dd-mmm-yyyy format, e.g. 14-Aug-2001.
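For example, a dd-mmm-yyyy date can be parsed with `datetime.strptime` (this assumes an English month-abbreviation locale, which is the C-locale default):

```python
from datetime import datetime

# Parse a PEP date header value such as Created or a Post-History entry.
created = datetime.strptime("14-Aug-2001", "%d-%b-%Y").date()
print(created.isoformat())   # → 2001-08-14
```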
Standards Track PEPs will typically have a Python-Version header which indicates the version of Python that the feature will be released with. Standards Track PEPs without a Python-Version header indicate interoperability standards that will initially be supported through external libraries and tools, and then supplemented by a later PEP to add support to the standard library. Informational and Process PEPs do not need a Python-Version header.
PEPs may have a Requires header, indicating the PEP numbers that this PEP depends on.
PEPs may also have a Superseded-By header indicating that a PEP has been rendered obsolete by a later document; the value is the number of the PEP that replaces the current document. The newer PEP must have a Replaces header containing the number of the PEP that it rendered obsolete.
Auxiliary Files
PEPs may include auxiliary files such as diagrams. Such files must be named pep-XXXX-Y.ext, where "XXXX" is the PEP number, "Y" is a serial number (starting at 1), and "ext" is replaced by the actual file extension (e.g. "png").
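The naming rule can be sketched as a regular expression; the pattern below is an illustration of the convention, not part of the PEP process tooling:

```python
import re

# pep-XXXX-Y.ext: four-digit PEP number, serial number starting at 1,
# then the actual file extension (e.g. "png").
AUX_NAME = re.compile(r"^pep-\d{4}-[1-9]\d*\.[A-Za-z0-9]+$")

def is_valid_aux_name(name: str) -> bool:
    return AUX_NAME.match(name) is not None

print(is_valid_aux_name("pep-0001-1.png"))   # → True
print(is_valid_aux_name("pep-1-1.png"))      # → False (number not zero-padded)
```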
Reporting PEP Bugs, or Submitting PEP Updates
How you report a bug, or submit a PEP update depends on several factors, such as the maturity of the PEP, the preferences of the PEP author, and the nature of your comments. For the early draft stages of the PEP, it's probably best to send your comments and changes directly to the PEP author. For more mature, or finished PEPs you may want to submit corrections to the Python issue tracker [6] so that your changes don't get lost. If the PEP author is a Python developer, assign the bug/patch to them, otherwise assign it to a PEP editor.
When in doubt about where to send your changes, please check first with the PEP author and/or a PEP editor.
PEP authors with hg push privileges for the PEP repository can update the PEPs themselves by using "hg push" to submit their changes.
Transferring PEP Ownership
It occasionally becomes necessary to transfer ownership of PEPs to a new champion. In general, it is preferable to retain the original author as a co-author of the transferred PEP, but that's really up to the original author. A good reason to transfer ownership is because the original author no longer has the time or interest in updating it or following through with the PEP process, or has fallen off the face of the 'net (i.e. is unreachable or not responding to email). A bad reason to transfer ownership is because the author doesn't agree with the direction of the PEP. One aim of the PEP process is to try to build consensus around a PEP, but if that's not possible, an author can always submit a competing PEP.
If you are interested in assuming ownership of a PEP, send a message asking to take over, addressed to both the original author and the PEP editors <peps@python.org>. If the original author doesn't respond to email in a timely manner, the PEP editors will make a unilateral decision (it's not like such decisions can't be reversed :).
PEP Editor Responsibilities & Workflow
A PEP editor must subscribe to the <peps@python.org> list. All correspondence related to PEP administration should be sent (or forwarded) to <peps@python.org> (but please do not cross-post!).
For each new PEP that comes in an editor does the following:
- Read the PEP to check if it is ready: sound and complete. The ideas must make technical sense, even if they don't seem likely to be accepted.
- The title should accurately describe the content.
- Edit the PEP for language (spelling, grammar, sentence structure, etc.), markup (for reST PEPs), code style (examples should match PEP 8 & 7).
If the PEP isn't ready, an editor will send it back to the author for revision, with specific instructions.
Once the PEP is ready for the repository, a PEP editor will:
Assign a PEP number (almost always just the next available number, but sometimes it's a special/joke number, like 666 or 3141). (Clarification: For Python 3, numbers in the 3000s were used for Py3k-specific proposals. But now that all new features go into Python 3 only, the process is back to using numbers in the 100s again. Remember that numbers below 100 are meta-PEPs.)
Add the PEP to a local clone of the PEP repository. For mercurial workflow instructions, follow The Python Developers Guide
The mercurial repo for the peps is:
http://hg.python.org/peps/
Run ./genpepindex.py and ./pep2html.py <PEP Number> to ensure the index and the PEP's HTML are generated without errors. If either triggers errors, then the web site will not be updated to reflect the PEP changes.
Commit and push the new (or updated) PEP
Monitor python.org to make sure the PEP gets added to the site properly. If it fails to appear, running make will build all of the current PEPs. If any of these are triggering errors, they must be corrected before any PEP will update on the site.
Send email back to the PEP author with next steps (post to python-list & -dev).
Updates to existing PEPs also come in to peps@python.org. Many PEP authors are not Python committers yet, so PEP editors do the commits for them.
Many PEPs are written and maintained by developers with write access to the Python codebase. The PEP editors monitor the python-checkins list for PEP changes, and correct any structure, grammar, spelling, or markup mistakes they see.
PEP editors don't pass judgment on PEPs. They merely do the administrative & editorial part (which is generally a low volume task).
References and Footnotes
| [1] | This historical record is available by the normal hg commands for retrieving older revisions, and can also be browsed via HTTP here: http://hg.python.org/peps/ |
| [2] | PEP 2, Procedure for Adding New Modules, Faassen (http://www.python.org/dev/peps/pep-0002) |
| [3] | (1, 2) PEP 9, Sample Plaintext PEP Template, Warsaw (http://www.python.org/dev/peps/pep-0009) |
| [4] | (1, 2) PEP 12, Sample reStructuredText PEP Template, Goodger, Warsaw (http://www.python.org/dev/peps/pep-0012) |
| [5] | The script referred to here is pep2pyramid.py, the successor to pep2html.py, both of which live in the same directory in the hg repo as the PEPs themselves. Try pep2html.py --help for details. The URL for viewing PEPs on the web is http://www.python.org/dev/peps/. |
| [6] | (1, 2) http://bugs.python.org/ |
| [7] | http://www.opencontent.org/openpub/ |
| [8] | (1, 2) http://docutils.sourceforge.net/rst.html |
| [9] | http://docutils.sourceforge.net/ |
| [10] | http://hg.python.org/peps |
Copyright
This document has been placed in the public domain.
pep-0002 Procedure for Adding New Modules
| PEP: | 2 |
|---|---|
| Title: | Procedure for Adding New Modules |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Martijn Faassen <faassen at infrae.com> |
| Status: | Final |
| Type: | Process |
| Created: | 07-Jul-2001 |
| Post-History: | 07-Jul-2001, 09-Mar-2002 |
PEP Replacement
This PEP has been superseded by the updated material in the Python
Developer's Guide [1].
Introduction
The Python Standard Library contributes significantly to Python's
success. The language comes with "batteries included", so it is
easy for people to become productive with just the standard
library alone. It is therefore important that this library grows
with the language, and that such growth is supported and
encouraged.
Many contributions to the library are not created by core
developers but by people from the Python community who are experts
in their particular field. Furthermore, community members are
also the users of the standard library, applying it in a great
diversity of settings. This makes the community well equipped to
detect and report gaps in the library; things that are missing but
should be added.
New functionality is commonly added to the library in the form of
new modules. This PEP will describe the procedure for the
_addition_ of new modules. PEP 4 deals with procedures for
deprecation of modules; the _removal_ of old and unused modules
from the standard library. Finally there is also the issue of
_changing_ existing modules to make the picture of library
evolution complete. PEP 3 and PEP 5 give some guidelines on this.
The continued maintenance of existing modules is an integral part
of the decision on whether to add a new module to the standard
library. Therefore, this PEP also introduces concepts
(integrators, maintainers) relevant to the maintenance issue.
Integrators
The integrators are a group of people with the following
responsibilities:
- They determine if a proposed contribution should become part of
the standard library.
- They integrate accepted contributions into the standard library.
- They produce standard library releases.
This group of people shall be PythonLabs, led by Guido.
Maintainer(s)
All contributions to the standard library need one or more
maintainers. This can be an individual, but it is frequently a
group of people such as the XML-SIG. Groups may subdivide
maintenance tasks among themselves.  One or more maintainers
shall be the _head maintainer(s)_ (usually this is also the main
developer). Head maintainers are convenient people the
integrators can address if they want to resolve specific issues,
such as the ones detailed later in this document.
Developer(s)
Contributions to the standard library have been developed by one
or more developers. The initial maintainers are the original
developers unless there are special circumstances (which should be
detailed in the PEP proposing the contribution).
Acceptance Procedure
When developers wish to have a contribution accepted into the
standard library, they will first form a group of maintainers
(normally initially consisting of themselves).
Then, this group shall produce a PEP called a library PEP. A
library PEP is a special form of standards track PEP. The library
PEP gives an overview of the proposed contribution, along with the
proposed contribution as the reference implementation. This PEP
should also contain a motivation on why this contribution should
be part of the standard library.
One or more maintainers shall step forward as PEP champion (the
people listed in the Author field are the champions). The PEP
champion(s) shall be the initial head maintainer(s).
As described in PEP 1, a standards track PEP should consist of a
design document and a reference implementation. The library PEP
differs from a normal standard track PEP in that the reference
implementation should in this case always already have been
written before the PEP is to be reviewed for inclusion by the
integrators and to be commented upon by the community; the
reference implementation _is_ the proposed contribution.
This different requirement exists for the following reasons:
- The integrators can only properly evaluate a contribution to the
standard library when there is source code and documentation to
look at; i.e. the reference implementation is always necessary
to aid people in studying the PEP.
- Even rejected contributions will be useful outside the standard
library, so there will be a lower risk of wasted effort by the
developers.
- It will impress upon the integrators the seriousness of the
contribution and will help guard them against having to evaluate
too many frivolous proposals.
Once the library PEP has been submitted for review, the
integrators will then evaluate it. The PEP will follow the normal
PEP work flow as described in PEP 1. If the PEP is accepted, they
will work through the head maintainers to make the contribution
ready for integration.
Maintenance Procedure
After a contribution has been accepted, the job is not over for
both integrators and maintainers. The integrators will forward
any bug reports in the standard library to the appropriate head
maintainers.
Before the feature freeze preparing for a release of the standard
library, the integrators will check with the head maintainers for
all contributions, to see if there are any updates to be included
in the next release. The integrators will evaluate any such
updates for issues like backwards compatibility and may require
PEPs if the changes are deemed to be large.
The head maintainers should take an active role in keeping up to
date with the Python development process. If a head maintainer is
unable to function in this way, he or she should announce the
intention to step down to the integrators and the rest of the
maintainers, so that a replacement can step forward. The
integrators should at all times be capable of reaching the head
maintainers by email.
In the case where no head maintainer can be found (possibly
because there are no maintainers left), the integrators will issue
a call to the community at large asking for new maintainers to
step forward. If no one does, the integrators can decide to
declare the contribution deprecated as described in PEP 4.
Open issues
There needs to be some procedure so that the integrators can
always reach the maintainers (or at least the head maintainers).
This could be accomplished by a mailing list to which all head
maintainers should be subscribed (this could be python-dev).
Another possibility, which may be useful in any case, is the
maintenance of a list similar to that of the list of PEPs which
lists all the contributions and their head maintainers with
contact info. This could in fact be part of the list of the PEPs,
as a new contribution requires a PEP. But since the
authors/owners of a PEP introducing a new module may eventually be
different from those who maintain it, this wouldn't resolve all
issues yet.
Should there be a list of what criteria integrators use for
evaluating contributions? (Source code but also things like
documentation and a test suite, as well as such vague things as
'dependability of the maintainers'.)
This relates to all the technical issues; check-in privileges,
coding style requirements, documentation requirements, test suite
requirements. These are preferably part of another PEP.
Should the current standard library be subdivided among
maintainers? Many parts already have (informal) maintainers; it
may be good to make this more explicit.
Perhaps there is a better word for 'contribution'; the word
'contribution' may not imply enough that the process (of
development and maintenance) does not stop after the contribution
is accepted and integrated into the library.
Relationship to the mythical Catalog?
References
[1] Adding to the Stdlib
http://docs.python.org/devguide/stdlibchanges.html
Copyright
This document has been placed in the public domain.
pep-0003 Guidelines for Handling Bug Reports
| PEP: | 3 |
|---|---|
| Title: | Guidelines for Handling Bug Reports |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Jeremy Hylton <jeremy at alum.mit.edu> |
| Status: | Withdrawn |
| Type: | Process |
| Created: | 25-Sep-2000 |
| Post-History: |
Introduction
This PEP contained guidelines for handling bug reports in
the Python bug tracker. It has been replaced by the Developer's
Guide description of issue triaging at
https://docs.python.org/devguide/triaging.html
Guidelines for people submitting Python bugs are at
http://docs.python.org/bugs.html
Original Guidelines
1. Make sure the bug category and bug group are correct. If they
are correct, it is easier for someone interested in helping to
find out, say, what all the open Tkinter bugs are.
2. If it's a minor feature request that you don't plan to address
right away, add it to PEP 42 or ask the owner to add it for
you. If you add the bug to PEP 42, mark the bug as "feature
request", "later", and "closed"; and add a comment to the bug
saying that this is the case (mentioning the PEP explicitly).
XXX do we prefer the tracker or PEP 42?
3. Assign the bug a reasonable priority. We don't yet have a
clear sense of what each priority should mean. One rule,
however, is that bugs with priority "urgent" or higher must
be fixed before the next release.
4. If a bug report doesn't have enough information to allow you to
reproduce or diagnose it, ask the original submitter for more
information. If the original report is really thin and your
email doesn't get a response after a reasonable waiting period,
you can close the bug.
5. If you fix a bug, mark the status as "Fixed" and close it. In
the comments, include the SVN revision numbers of the commit(s).
In the SVN checkin message, include the issue number *and* a
normal description of the change, mentioning the contributor
if a patch was applied.
6. If you are assigned a bug that you are unable to deal with,
assign it to someone else if you think they will be able to
deal with it, otherwise it's probably best to unassign it.
References
[1] http://bugs.python.org/
pep-0004 Deprecation of Standard Modules
| PEP: | 4 |
|---|---|
| Title: | Deprecation of Standard Modules |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Martin von Löwis <martin at v.loewis.de> |
| Status: | Active |
| Type: | Process |
| Content-Type: | text/x-rst |
| Created: | 1-Oct-2000 |
| Post-History: |
Contents
Introduction
When new modules were added to the standard Python library in the past, it was not possible to foresee whether they would still be useful in the future. Even though Python "Comes With Batteries Included", batteries may discharge over time. Carrying old modules around is a burden on the maintainer, especially when there is no interest in the module anymore.
At the same time, removing a module from the distribution is difficult, as it is not known in general whether anybody is still using it. This PEP defines a procedure for removing modules from the standard Python library. Usage of a module may be 'deprecated', which means that it may be removed from a future Python release. The rationale for deprecating a module is also collected in this PEP. If the rationale turns out faulty, the module may become 'undeprecated'.
Procedure for declaring a module deprecated
Since the status of module deprecation is recorded in this PEP, proposals for deprecating modules MUST be made by providing a change to the text of this PEP, which SHOULD be a patch posted to bugs.python.org.
A proposal for deprecation of the module MUST include the date of the proposed deprecation and a rationale for deprecating it. In addition, the proposal MUST include a change to the documentation of the module; deprecation is indicated by saying that the module is "obsolete" or "deprecated". The proposal SHOULD include a patch for the module's source code to indicate deprecation there as well, by raising a DeprecationWarning. The proposal MUST include patches to remove any use of the deprecated module from the standard library.
It is expected that deprecated modules are included in the Python release that immediately follows the deprecation; later releases may ship without the deprecated modules.
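The procedure above asks that a deprecated module raise a DeprecationWarning. A minimal sketch of what such a patch typically adds at the top of the module (the module name here is hypothetical, not taken from this PEP):

```python
# spam.py -- hypothetical module being deprecated
import warnings

# Emitted once at import time; stacklevel=2 points the warning at the
# importer rather than at this module itself.
warnings.warn("the spam module is deprecated; use eggs instead",
              DeprecationWarning, stacklevel=2)
```

Users who import the module then see the warning (when DeprecationWarnings are enabled) but the module keeps working until it is actually removed.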
Procedure for declaring a module undeprecated
When a module becomes deprecated, a rationale is given for its deprecation. In some cases, an alternative interface for the same functionality is provided, so the old interface is deprecated. In other cases, the need for having the functionality of the module may not exist anymore.
If the rationale is faulty, again a change to this PEP's text MUST be submitted. This change MUST include the date of undeprecation and a rationale for undeprecation. Modules that are undeprecated under this procedure MUST be listed in this PEP for at least one major release of Python.
Obsolete modules
A number of modules are already listed as obsolete in the library documentation. These are listed here for completeness.
cl, sv, timing
All these modules have been declared as obsolete in Python 2.0, some even earlier.
The following obsolete modules were removed in Python 2.5:
addpack, cmp, cmpcache, codehack, dircmp, dump, find, fmt, grep, lockfile, newdir, ni, packmail, Para, poly, rand, reconvert, regex, regsub, statcache, tb, tzparse, util, whatsound, whrandom, zmod
The following modules were removed in Python 2.6:
gopherlib, rgbimg, macfs
The following modules currently lack a DeprecationWarning:
rfc822, mimetools, multifile
Deprecated modules
Module name: posixfile
Rationale: Locking is better done by fcntl.lockf().
Date: Before 1-Oct-2000.
Documentation: Already documented as obsolete. Deprecation
warning added in Python 2.6.
Module name: gopherlib
Rationale: The gopher protocol is not in active use anymore.
Date: 1-Oct-2000.
Documentation: Documented as deprecated since Python 2.5. Removed
in Python 2.6.
Module name: rgbimgmodule
Rationale: In a 2001-04-24 c.l.py post, Jason Petrone mentions
that he occasionally uses it; no other references to
its use can be found as of 2003-11-19.
Date: 1-Oct-2000
Documentation: Documented as deprecated since Python 2.5. Removed
in Python 2.6.
Module name: pre
Rationale: The underlying PCRE engine doesn't support Unicode, and
has been unmaintained since Python 1.5.2.
Date: 10-Apr-2002
Documentation: It was only mentioned as an implementation detail,
and never had a section of its own. This mention
has now been removed.
Module name: whrandom
Rationale: The module's default seed computation was
inherently insecure; the random module should be
used instead.
Date: 11-Apr-2002
Documentation: This module has been documented as obsolete since
Python 2.1, but listing in this PEP was neglected.
The deprecation warning will be added to the module
one year after Python 2.3 is released, and the
module will be removed one year after that.
Module name: rfc822
Rationale: Supplanted by Python 2.2's email package.
Date: 18-Mar-2002
Documentation: Documented as "deprecated since release 2.3" since
Python 2.2.2.
Module name: mimetools
Rationale: Supplanted by Python 2.2's email package.
Date: 18-Mar-2002
Documentation: Documented as "deprecated since release 2.3" since
Python 2.2.2.
Module name: MimeWriter
Rationale: Supplanted by Python 2.2's email package.
Date: 18-Mar-2002
Documentation: Documented as "deprecated since release 2.3" since
Python 2.2.2. Raises a DeprecationWarning as of
Python 2.6.
Module name: mimify
Rationale: Supplanted by Python 2.2's email package.
Date: 18-Mar-2002
Documentation: Documented as "deprecated since release 2.3" since
Python 2.2.2. Raises a DeprecationWarning as of
Python 2.6.
Module name: rotor
Rationale: Uses insecure algorithm.
Date: 24-Apr-2003
Documentation: The documentation has been removed from the library
reference in Python 2.4.
Module name: TERMIOS.py
Rationale: The constants in this file are now in the 'termios' module.
Date: 10-Aug-2004
Documentation: This module has been documented as obsolete since
Python 2.1, but listing in this PEP was neglected.
Removed from the library reference in Python 2.4.
Module name: statcache
Rationale: Using the cache can be fragile and error-prone;
applications should just use os.stat() directly.
Date: 10-Aug-2004
Documentation: This module has been documented as obsolete since
Python 2.2, but listing in this PEP was neglected.
Removed from the library reference in Python 2.5.
Module name: mpz
Rationale: Third-party packages provide similar features
and wrap more of GMP's API.
Date: 10-Aug-2004
Documentation: This module has been documented as obsolete since
Python 2.2, but listing in this PEP was neglected.
Removed from the library reference in Python 2.4.
Module name: xreadlines
Rationale: Using 'for line in file', introduced in 2.3, is preferable.
Date: 10-Aug-2004
Documentation: This module has been documented as obsolete since
Python 2.3, but listing in this PEP was neglected.
Removed from the library reference in Python 2.4.
Module name: multifile
Rationale: Supplanted by the email package.
Date: 21-Feb-2006
Documentation: Documented as deprecated as of Python 2.5.
Module name: sets
Rationale: The built-in set/frozenset types, introduced in
Python 2.4, supplant the module.
Date: 12-Jan-2007
Documentation: Documented as deprecated as of Python 2.6.
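For this entry the migration is mechanical; a minimal sketch (the old `sets` import is shown only in a comment, since the module no longer exists in modern Python):

```python
# Old (deprecated): from sets import Set; a = Set([1, 2])
# New: the built-in set type, available since Python 2.4.
a = set([1, 2])
b = set([2, 3])
print(a | b)  # union -> {1, 2, 3}
print(a & b)  # intersection -> {2}
```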
Module name: buildtools
Rationale: Unknown.
Date: 15-May-2007
Documentation: Documented as deprecated as of Python 2.3, but
listing in this PEP was neglected. Raised a
DeprecationWarning as of Python 2.6.
Module name: cfmfile
Rationale: Unknown.
Date: 15-May-2007
Documentation: Documented as deprecated as of Python 2.4, but
listing in this PEP was neglected. A
DeprecationWarning was added in Python 2.6.
Module name: macfs
Rationale: Unknown.
Date: 15-May-2007
Documentation: Documented as deprecated as of Python 2.3, but
listing in this PEP was neglected. Removed in
Python 2.6.
Module name: md5
Rationale: Replaced by the 'hashlib' module.
Date: 15-May-2007
Documentation: Documented as deprecated as of Python 2.5, but
listing in this PEP was neglected.
DeprecationWarning raised as of Python 2.6.
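The replacement is direct: `md5.new(data)` becomes `hashlib.md5(data)`, with the same digest methods available on the result:

```python
import hashlib

# Old (deprecated): import md5; digest = md5.new("abc").hexdigest()
digest = hashlib.md5(b"abc").hexdigest()
print(digest)  # 900150983cd24fb0d6963f7d28e17f72
```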
Module name: sha
Rationale: Replaced by the 'hashlib' module.
Date: 15-May-2007
Documentation: Documented as deprecated as of Python 2.5, but
listing in this PEP was neglected.
DeprecationWarning added in Python 2.6.
Module name: plat-freebsd2/IN and plat-freebsd3/IN
Rationale: Platforms are obsolete (last released in 2000)
Removed from 2.6
Date: 15-May-2007
Documentation: None
Module name: plat-freebsd4/IN and possibly plat-freebsd5/IN
Rationale: Platforms are obsolete/unsupported.
Removed from 2.7.
Date: 15-May-2007
Documentation: None
Module name: formatter
Rationale: Lack of use in the community, no tests to keep
code working.
Documentation: Deprecated as of Python 3.4 by raising
PendingDeprecationWarning. Slated for removal in
Python 3.6.
Deprecation of modules removed in Python 3.0
PEP 3108 lists all modules that have been removed from Python 3.0. They are all documented as deprecated in Python 2.6, and raise a DeprecationWarning when the -3 flag is activated.
Undeprecated modules
None.
Copyright
This document has been placed in the public domain.
pep-0005 Guidelines for Language Evolution
| PEP: | 5 |
|---|---|
| Title: | Guidelines for Language Evolution |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Paul Prescod <paul at prescod.net> |
| Status: | Active |
| Type: | Process |
| Created: | 26-Oct-2000 |
| Post-History: |
Abstract
In the natural evolution of programming languages it is sometimes
necessary to make changes that modify the behavior of older
programs. This PEP proposes a policy for implementing these
changes in a manner respectful of the installed base of Python
users.
Implementation Details
Implementation of this PEP requires the addition of a formal
warning and deprecation facility that will be described in another
proposal.
Scope
These guidelines apply to future versions of Python that introduce
backward-incompatible behavior. Backward incompatible behavior is
a major deviation in Python interpretation from an earlier
behavior described in the standard Python documentation. Removal
of a feature also constitutes a change of behavior.
This PEP does not replace or preclude other compatibility
strategies such as dynamic loading of backwards-compatible
parsers. On the other hand, if execution of "old code" requires a
special switch or pragma then that is indeed a change of behavior
from the point of view of the user and that change should be
implemented according to these guidelines.
In general, common sense must prevail in the implementation of
these guidelines. For instance changing "sys.copyright" does not
constitute a backwards-incompatible change of behavior!
Steps For Introducing Backwards-Incompatible Features
1. Propose backwards-incompatible behavior in a PEP. The PEP must
include a section on backwards compatibility that describes in
detail a plan to complete the remainder of these steps.
2. Once the PEP is accepted as a productive direction, implement
an alternate way to accomplish the task previously provided by
the feature that is being removed or changed. For instance if
the addition operator were scheduled for removal, a new version
of Python could implement an "add()" built-in function.
3. Formally deprecate the obsolete construct in the Python
documentation.
4. Add an optional warning mode to the parser that will inform
users when the deprecated construct is used. In other words,
all programs that will behave differently in the future must
trigger warnings in this mode. Compile-time warnings are
preferable to runtime warnings. The warning messages should
steer people from the deprecated construct to the alternative
construct.
5. There must be at least a one-year transition period between the
release of the transitional version of Python and the release
of the backwards incompatible version. Users will have at
least a year to test their programs and migrate them from use
of the deprecated construct to the alternative one.
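The warning step described above can be sketched with the warnings machinery that grew out of this requirement; the function names here are illustrative, not drawn from this PEP:

```python
import warnings

def new_api(x):
    # The construct users should migrate to.
    return x * 2

def old_api(x):
    # Deprecated construct: warn (steering callers to the
    # alternative by name), then delegate to the replacement so
    # existing programs keep working during the transition period.
    warnings.warn("old_api() is deprecated; use new_api() instead",
                  DeprecationWarning, stacklevel=2)
    return new_api(x)

print(old_api(3))  # 6, plus a DeprecationWarning when warnings are enabled
```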
pep-0006 Bug Fix Releases
| PEP: | 6 |
|---|---|
| Title: | Bug Fix Releases |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Aahz <aahz at pythoncraft.com>, Anthony Baxter <anthony at interlink.com.au> |
| Status: | Active |
| Type: | Process |
| Created: | 15-Mar-2001 |
| Post-History: | 15-Mar-2001 18-Apr-2001 19-Aug-2004 |
Abstract
Python has historically had only a single fork of development,
with releases having the combined purpose of adding new features
and delivering bug fixes (these kinds of releases will be referred
to as "major releases"). This PEP describes how to fork off
maintenance, or bug fix, releases of old versions for the primary
purpose of fixing bugs.
This PEP is not, repeat NOT, a guarantee of the existence of bug fix
releases; it only specifies a procedure to be followed if bug fix
releases are desired by enough of the Python community willing to
do the work.
Motivation
With the move to SourceForge, Python development has accelerated.
There is a sentiment among part of the community that there was
too much acceleration, and many people are uncomfortable with
upgrading to new versions to get bug fixes when so many features
have been added, sometimes late in the development cycle.
One solution for this issue is to maintain the previous major
release, providing bug fixes until the next major release. This
should make Python more attractive for enterprise development,
where Python may need to be installed on hundreds or thousands of
machines.
Prohibitions
Bug fix releases are required to adhere to the following restrictions:
1. There must be zero syntax changes. All .pyc and .pyo files
must work (no regeneration needed) with all bugfix releases
forked off from a major release.
2. There must be zero pickle changes.
3. There must be no incompatible C API changes. All extensions
must continue to work without recompiling in all bugfix releases
in the same fork as a major release.
Breaking any of these prohibitions requires a BDFL proclamation
(and a prominent warning in the release notes).
Not-Quite-Prohibitions
Where possible, bug fix releases should also:
1. Have no new features. The purpose of a bug fix release is to
fix bugs, not add the latest and greatest whizzo feature from
the HEAD of the CVS root.
2. Be a painless upgrade. Users should feel confident that an
upgrade from 2.x.y to 2.x.(y+1) will not break their running
systems. This means that, unless it is necessary to fix a bug,
the standard library should not change behavior, or worse yet,
APIs.
Applicability of Prohibitions
The above prohibitions and not-quite-prohibitions apply both
for a final release to a bugfix release (for instance, 2.4 to
2.4.1) and for one bugfix release to the next in a series
(for instance 2.4.1 to 2.4.2).
Following the prohibitions listed in this PEP should help keep
the community happy that a bug fix release is a painless and safe
upgrade.
Helping the Bug Fix Releases Happen
Here are a few pointers on helping the bug fix release process along.
1. Backport bug fixes. If you fix a bug, and it seems appropriate,
port it to the CVS branch for the current bug fix release. If
you're unwilling or unable to backport it yourself, make a note
in the commit message, with words like 'Bugfix candidate' or
'Backport candidate'.
2. If you're not sure, ask. Ask the person managing the current bug
fix releases if they think a particular fix is appropriate.
3. If there's a particular bug you'd particularly like fixed in a
bug fix release, jump up and down and try to get it done. Do not
wait until 48 hours before a bug fix release is due, and then
start asking for bug fixes to be included.
Version Numbers
Starting with Python 2.0, all major releases are required to have
a version number of the form X.Y; bugfix releases will always be of
the form X.Y.Z.
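The ordering implied by this scheme can be made concrete with a small helper (illustrative only, not part of the PEP): pad a major release X.Y to (X, Y, 0) so that it sorts immediately before its own bugfix releases.

```python
def parse_version(s):
    """Parse 'X.Y' or 'X.Y.Z' into a comparable tuple of ints."""
    parts = tuple(int(p) for p in s.split("."))
    # Pad 'X.Y' to (X, Y, 0) so comparisons across forms work.
    return parts + (0,) * (3 - len(parts))

print(parse_version("2.4") < parse_version("2.4.1") < parse_version("2.4.2"))  # True
```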
The current major release under development is referred to as
release N; the just-released major version is referred to as N-1.
In CVS, the bug fix releases happen on a branch. For release 2.x,
the branch is named 'release2x-maint'. For example, the branch for
the 2.3 maintenance releases is release23-maint.
Procedure
The process for managing bugfix releases is modeled in part on the
Tcl system [1].
The Patch Czar is the counterpart to the BDFL for bugfix releases.
However, the BDFL and designated appointees retain veto power over
individual patches. A Patch Czar might only be looking after a single
branch of development - it's quite possible that a different person
might be maintaining the 2.3.x and the 2.4.x releases.
As individual patches get contributed to the current trunk of CVS,
each patch committer is requested to consider whether the patch is
a bug fix suitable for inclusion in a bugfix release. If the patch is
considered suitable, the committer can either commit the fix to
the maintenance branch, or else mark the patch in the commit message.
In addition, anyone from the Python community is free to suggest
patches for inclusion. Patches may be submitted specifically for
bugfix releases; they should follow the guidelines in PEP 3 [2].
In general, though, it's probably better that a bug in a specific
release also be fixed on the HEAD as well as the branch.
The Patch Czar decides when there are a sufficient number of patches
to warrant a release. The release gets packaged up, including a
Windows installer, and made public. If any new bugs are found, they
must be fixed immediately and a new bugfix release publicized (with
an incremented version number). For the 2.3.x cycle, the Patch Czar
(Anthony) has been trying for a release approximately every six
months, but this should not be considered binding in any way on
any future releases.
Bug fix releases are expected to occur at an interval of roughly
six months. This is only a guideline, however - obviously, if a
major bug is found, a bugfix release may be appropriate sooner. In
general, only the N-1 release will be under active maintenance at
any time. That is, during Python 2.4's development, Python 2.3 gets
bugfix releases. If, however, someone qualified wishes to continue
the work to maintain an older release, they should be encouraged.
Patch Czar History
Anthony Baxter is the Patch Czar for 2.3.1 through 2.3.4.
Barry Warsaw is the Patch Czar for 2.2.3.
Guido van Rossum is the Patch Czar for 2.2.2.
Michael Hudson is the Patch Czar for 2.2.1.
Anthony Baxter is the Patch Czar for 2.1.2 and 2.1.3.
Thomas Wouters is the Patch Czar for 2.1.1.
Moshe Zadka is the Patch Czar for 2.0.1.
History
This PEP started life as a proposal on comp.lang.python. The
original version suggested a single patch for the N-1 release to
be released concurrently with the N release. The original version
also argued for sticking with a strict bug fix policy.
Following feedback from the BDFL and others, the draft PEP was
written containing an expanded bugfix release cycle that permitted
any previous major release to obtain patches and also relaxed
the strict bug fix requirement (mainly due to the example of PEP
235 [3], which could be argued as either a bug fix or a feature).
Discussion then mostly moved to python-dev, where the BDFL finally
issued a proclamation basing the Python bugfix release process on
Tcl's, which essentially returned to the original proposal in
terms of being only the N-1 release and only bug fixes, but
allowing multiple bugfix releases until release N is published.
Anthony Baxter then took this PEP and revised it, based on
lessons from the 2.3 release cycle.
References
[1] http://www.tcl.tk/cgi-bin/tct/tip/28.html
[2] PEP 3, Guidelines for Handling Bug Reports, Hylton
http://www.python.org/dev/peps/pep-0003/
[3] PEP 235, Import on Case-Insensitive Platforms, Peters
http://www.python.org/dev/peps/pep-0235/
Copyright
This document has been placed in the public domain.
pep-0007 Style Guide for C Code
| PEP: | 7 |
|---|---|
| Title: | Style Guide for C Code |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Guido van Rossum <guido at python.org> |
| Status: | Active |
| Type: | Process |
| Content-Type: | text/x-rst |
| Created: | 05-Jul-2001 |
| Post-History: |
Contents
Introduction
This document gives coding conventions for the C code comprising the C implementation of Python. Please see the companion informational PEP describing style guidelines for Python code [1].
Note, rules are there to be broken. Two good reasons to break a particular rule:
- When applying the rule would make the code less readable, even for someone who is used to reading code that follows the rules.
- To be consistent with surrounding code that also breaks it (maybe for historic reasons) -- although this is also an opportunity to clean up someone else's mess (in true XP style).
C dialect
- Use ANSI/ISO standard C (the 1989 version of the standard). This means (amongst many other things) that all declarations must be at the top of a block (not necessarily at the top of a function).
- Don't use GCC extensions (e.g. don't write multi-line strings without trailing backslashes).
- All function declarations and definitions must use full prototypes (i.e. specify the types of all arguments).
- Never use C++ style // one-line comments.
- No compiler warnings with major compilers (gcc, VC++, a few others).
Code lay-out
Use 4-space indents and no tabs at all.
No line should be longer than 79 characters. If this and the previous rule together don't give you enough room to code, your code is too complicated -- consider using subroutines.
No line should end in whitespace. If you think you need significant trailing whitespace, think again -- somebody's editor might delete it as a matter of routine.
Function definition style: function name in column 1, outermost curly braces in column 1, blank line after local variable declarations.
static int
extra_ivars(PyTypeObject *type, PyTypeObject *base)
{
    int t_size = PyType_BASICSIZE(type);
    int b_size = PyType_BASICSIZE(base);

    assert(t_size >= b_size); /* type smaller than base! */
    ...
    return 1;
}

Code structure: one space between keywords like if, for and the following left paren; no spaces inside the paren; braces may be omitted where C permits but when present, they should be formatted as shown:
if (mro != NULL) {
    ...
}
else {
    ...
}

The return statement should not get redundant parentheses:
return Py_None; /* correct */
return(Py_None); /* incorrect */
Function and macro call style: foo(a, b, c) -- no space before the open paren, no spaces inside the parens, no spaces before commas, one space after each comma.
Always put spaces around assignment, Boolean and comparison operators. In expressions using a lot of operators, add spaces around the outermost (lowest-priority) operators.
Breaking long lines: if you can, break after commas in the outermost argument list. Always indent continuation lines appropriately, e.g.:
PyErr_Format(PyExc_TypeError,
             "cannot create '%.100s' instances",
             type->tp_name);

When you break a long expression at a binary operator, the operator goes at the end of the previous line, e.g.:
if (type->tp_dictoffset != 0 && base->tp_dictoffset == 0 &&
    type->tp_dictoffset == b_size &&
    (size_t)t_size == b_size + sizeof(PyObject *))
    return 0; /* "Forgive" adding a __dict__ only */

Put blank lines around functions, structure definitions, and major sections inside functions.
Comments go before the code they describe.
All functions and global variables should be declared static unless they are to be part of a published interface.
For external functions and variables, we always have a declaration in an appropriate header file in the "Include" directory, which uses the PyAPI_FUNC() macro, like this:
PyAPI_FUNC(PyObject *) PyObject_Repr(PyObject *);
Naming conventions
- Use a Py prefix for public functions; never for static functions. The Py_ prefix is reserved for global service routines like Py_FatalError; specific groups of routines (e.g. specific object type APIs) use a longer prefix, e.g. PyString_ for string functions.
- Public functions and variables use MixedCase with underscores, like this: PyObject_GetAttr, Py_BuildValue, PyExc_TypeError.
- Occasionally an "internal" function has to be visible to the loader; we use the _Py prefix for this, e.g.: _PyObject_Dump.
- Macros should have a MixedCase prefix and then use upper case, for example: PyString_AS_STRING, Py_PRINT_RAW.
Documentation Strings
Use the PyDoc_STR() or PyDoc_STRVAR() macro for docstrings to support building Python without docstrings (./configure --without-doc-strings).
For C code that needs to support versions of Python older than 2.3, you can include this after including Python.h:
#ifndef PyDoc_STR
#define PyDoc_VAR(name) static char name[]
#define PyDoc_STR(str) (str)
#define PyDoc_STRVAR(name, str) PyDoc_VAR(name) = PyDoc_STR(str)
#endif
The first line of each function docstring should be a "signature line" that gives a brief synopsis of the arguments and return value. For example:
PyDoc_STRVAR(myfunction__doc__,
"myfunction(name, value) -> bool\n\n\
Determine whether name and value make a valid pair.");
Always include a blank line between the signature line and the text of the description.
If the return value for the function is always None (because there is no meaningful return value), do not include the indication of the return type.
When writing multi-line docstrings, be sure to always use backslash continuations, as in the example above, or string literal concatenation:
PyDoc_STRVAR(myfunction__doc__,
"myfunction(name, value) -> bool\n\n"
"Determine whether name and value make a valid pair.");
Though some C compilers accept string literals without either:
/* BAD -- don't do this! */
PyDoc_STRVAR(myfunction__doc__,
"myfunction(name, value) -> bool\n\n
Determine whether name and value make a valid pair.");
not all do; the MSVC compiler is known to complain about this.
References
| [1] | PEP 8, "Style Guide for Python Code", van Rossum, Warsaw (http://www.python.org/dev/peps/pep-0008) |
Copyright
This document has been placed in the public domain.
pep-0008 Style Guide for Python Code
| PEP: | 8 |
|---|---|
| Title: | Style Guide for Python Code |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Guido van Rossum <guido at python.org>, Barry Warsaw <barry at python.org>, Nick Coghlan <ncoghlan at gmail.com> |
| Status: | Active |
| Type: | Process |
| Content-Type: | text/x-rst |
| Created: | 05-Jul-2001 |
| Post-History: | 05-Jul-2001, 01-Aug-2013 |
Contents
- Introduction
- A Foolish Consistency is the Hobgoblin of Little Minds
- Code lay-out
- String Quotes
- Whitespace in Expressions and Statements
- Comments
- Version Bookkeeping
- Naming Conventions
- Programming Recommendations
- References
- Copyright
Introduction
This document gives coding conventions for the Python code comprising the standard library in the main Python distribution. Please see the companion informational PEP describing style guidelines for the C code in the C implementation of Python [1].
This document and PEP 257 (Docstring Conventions) were adapted from Guido's original Python Style Guide essay, with some additions from Barry's style guide [2].
This style guide evolves over time as additional conventions are identified and past conventions are rendered obsolete by changes in the language itself.
Many projects have their own coding style guidelines. In the event of any conflicts, such project-specific guides take precedence for that project.
A Foolish Consistency is the Hobgoblin of Little Minds
One of Guido's key insights is that code is read much more often than it is written. The guidelines provided here are intended to improve the readability of code and make it consistent across the wide spectrum of Python code. As PEP 20 says, "Readability counts".
A style guide is about consistency. Consistency with this style guide is important. Consistency within a project is more important. Consistency within one module or function is most important.
But most importantly: know when to be inconsistent -- sometimes the style guide just doesn't apply. When in doubt, use your best judgment. Look at other examples and decide what looks best. And don't hesitate to ask!
In particular: do not break backwards compatibility just to comply with this PEP!
Some other good reasons to ignore a particular guideline:
- When applying the guideline would make the code less readable, even for someone who is used to reading code that follows this PEP.
- To be consistent with surrounding code that also breaks it (maybe for historic reasons) -- although this is also an opportunity to clean up someone else's mess (in true XP style).
- Because the code in question predates the introduction of the guideline and there is no other reason to be modifying that code.
- When the code needs to remain compatible with older versions of Python that don't support the feature recommended by the style guide.
Code lay-out
Indentation
Use 4 spaces per indentation level.
Continuation lines should align wrapped elements either vertically using Python's implicit line joining inside parentheses, brackets and braces, or using a hanging indent [5]. When using a hanging indent the following should be considered: there should be no arguments on the first line, and further indentation should be used to clearly distinguish the continuation lines.
Yes:
# Aligned with opening delimiter.
foo = long_function_name(var_one, var_two,
                         var_three, var_four)

# More indentation included to distinguish this from the rest.
def long_function_name(
        var_one, var_two, var_three,
        var_four):
    print(var_one)

# Hanging indents should add a level.
foo = long_function_name(
    var_one, var_two,
    var_three, var_four)
No:
# Arguments on first line forbidden when not using vertical alignment.
foo = long_function_name(var_one, var_two,
    var_three, var_four)

# Further indentation required as indentation is not distinguishable.
def long_function_name(
    var_one, var_two, var_three,
    var_four):
    print(var_one)
The 4-space rule is optional for continuation lines.
Optional:
# Hanging indents *may* be indented to other than 4 spaces.
foo = long_function_name(
  var_one, var_two,
  var_three, var_four)
When the conditional part of an if-statement is long enough to require that it be written across multiple lines, it's worth noting that the combination of a two character keyword (i.e. if), plus a single space, plus an opening parenthesis creates a natural 4-space indent for the subsequent lines of the multiline conditional. This can produce a visual conflict with the indented suite of code nested inside the if-statement, which would also naturally be indented to 4 spaces. This PEP takes no explicit position on how (or whether) to further visually distinguish such conditional lines from the nested suite inside the if-statement. Acceptable options in this situation include, but are not limited to:
# No extra indentation.
if (this_is_one_thing and
    that_is_another_thing):
    do_something()

# Add a comment, which will provide some distinction in editors
# supporting syntax highlighting.
if (this_is_one_thing and
    that_is_another_thing):
    # Since both conditions are true, we can frobnicate.
    do_something()

# Add some extra indentation on the conditional continuation line.
if (this_is_one_thing
        and that_is_another_thing):
    do_something()
The closing brace/bracket/parenthesis on multi-line constructs may either line up under the first non-whitespace character of the last line of the list, as in:
my_list = [
    1, 2, 3,
    4, 5, 6,
    ]
result = some_function_that_takes_arguments(
    'a', 'b', 'c',
    'd', 'e', 'f',
    )
or it may be lined up under the first character of the line that starts the multi-line construct, as in:
my_list = [
    1, 2, 3,
    4, 5, 6,
]
result = some_function_that_takes_arguments(
    'a', 'b', 'c',
    'd', 'e', 'f',
)
Tabs or Spaces?
Spaces are the preferred indentation method.
Tabs should be used solely to remain consistent with code that is already indented with tabs.
Python 3 disallows mixing the use of tabs and spaces for indentation.
Python 2 code indented with a mixture of tabs and spaces should be converted to using spaces exclusively.
When invoked with the -t option, the Python 2 command line interpreter issues warnings about code that illegally mixes tabs and spaces. When using -tt these warnings become errors. These options are highly recommended!
Maximum Line Length
Limit all lines to a maximum of 79 characters.
For flowing long blocks of text with fewer structural restrictions (docstrings or comments), the line length should be limited to 72 characters.
Limiting the required editor window width makes it possible to have several files open side-by-side, and works well when using code review tools that present the two versions in adjacent columns.
The default wrapping in most tools disrupts the visual structure of the code, making it more difficult to understand. The limits are chosen to avoid wrapping in editors with the window width set to 80, even if the tool places a marker glyph in the final column when wrapping lines. Some web based tools may not offer dynamic line wrapping at all.
Some teams strongly prefer a longer line length. For code maintained exclusively or primarily by a team that can reach agreement on this issue, it is okay to increase the nominal line length from 80 to 100 characters (effectively increasing the maximum length to 99 characters), provided that comments and docstrings are still wrapped at 72 characters.
The Python standard library is conservative and requires limiting lines to 79 characters (and docstrings/comments to 72).
The preferred way of wrapping long lines is by using Python's implied line continuation inside parentheses, brackets and braces. Long lines can be broken over multiple lines by wrapping expressions in parentheses. These should be used in preference to using a backslash for line continuation.
Backslashes may still be appropriate at times. For example, long, multiple with-statements cannot use implicit continuation, so backslashes are acceptable:
with open('/path/to/some/file/you/want/to/read') as file_1, \
     open('/path/to/some/file/being/written', 'w') as file_2:
    file_2.write(file_1.read())
(See the previous discussion on multiline if-statements for further thoughts on the indentation of such multiline with-statements.)
Another such case is with assert statements.
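For instance, a backslash-continued assert (condition and message here are hypothetical, purely to illustrate the layout) might look like:

```python
value = 10

# assert is a statement, not an expression, so the whole statement
# cannot be wrapped in parentheses; a backslash continuation is used.
assert 0 < value < 100, \
    "value must be between 0 and 100 (exclusive)"
```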
Make sure to indent the continued line appropriately. The preferred place to break around a binary operator is after the operator, not before it. Some examples:
class Rectangle(Blob):

    def __init__(self, width, height,
                 color='black', emphasis=None, highlight=0):
        if (width == 0 and height == 0 and
                color == 'red' and emphasis == 'strong' or
                highlight > 100):
            raise ValueError("sorry, you lose")
        if width == 0 and height == 0 and (color == 'red' or
                                           emphasis is None):
            raise ValueError("I don't think so -- values are %s, %s" %
                             (width, height))
        Blob.__init__(self, width, height,
                      color, emphasis, highlight)
Blank Lines
Surround top-level function and class definitions with two blank lines.
Method definitions inside a class are surrounded by a single blank line.
Extra blank lines may be used (sparingly) to separate groups of related functions. Blank lines may be omitted between a bunch of related one-liners (e.g. a set of dummy implementations).
Use blank lines in functions, sparingly, to indicate logical sections.
Python accepts the control-L (i.e. ^L) form feed character as whitespace; many tools treat these characters as page separators, so you may use them to separate pages of related sections of your file. Note that some editors and web-based code viewers may not recognize control-L as a form feed and will show another glyph in its place.
Source File Encoding
Code in the core Python distribution should always use UTF-8 (or ASCII in Python 2).
Files using ASCII (in Python 2) or UTF-8 (in Python 3) should not have an encoding declaration.
In the standard library, non-default encodings should be used only for test purposes or when a comment or docstring needs to mention an author name that contains non-ASCII characters; otherwise, using \x, \u, \U, or \N escapes is the preferred way to include non-ASCII data in string literals.
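A sketch of the preferred escape forms (the example strings are arbitrary):

```python
# Prefer escapes over literal non-ASCII characters in string literals.
check = '\N{CHECK MARK}'   # named escape for U+2713
e_acute = '\u00e9'         # 16-bit hex escape for 'e' with acute accent
bullet = '\u2022'          # bullet character, escaped rather than literal
```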
For Python 3.0 and beyond, the following policy is prescribed for the standard library (see PEP 3131): All identifiers in the Python standard library MUST use ASCII-only identifiers, and SHOULD use English words wherever feasible (in many cases, abbreviations and technical terms are used which aren't English). In addition, string literals and comments must also be in ASCII. The only exceptions are (a) test cases testing the non-ASCII features, and (b) names of authors. Authors whose names are not based on the latin alphabet MUST provide a latin transliteration of their names.
Open source projects with a global audience are encouraged to adopt a similar policy.
Imports
Imports should usually be on separate lines, e.g.:
Yes: import os
     import sys
No:  import sys, os
It's okay to say this though:
from subprocess import Popen, PIPE
Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants.
Imports should be grouped in the following order:
- standard library imports
- related third party imports
- local application/library specific imports
You should put a blank line between each group of imports.
Put any relevant __all__ specification after the imports.
Absolute imports are recommended, as they are usually more readable and tend to be better behaved (or at least give better error messages) if the import system is incorrectly configured (such as when a directory inside a package ends up on sys.path):
import mypkg.sibling
from mypkg import sibling
from mypkg.sibling import example
However, explicit relative imports are an acceptable alternative to absolute imports, especially when dealing with complex package layouts where using absolute imports would be unnecessarily verbose:
from . import sibling
from .sibling import example
Standard library code should avoid complex package layouts and always use absolute imports.
Implicit relative imports should never be used and have been removed in Python 3.
When importing a class from a class-containing module, it's usually okay to spell this:
from myclass import MyClass
from foo.bar.yourclass import YourClass
If this spelling causes local name clashes, then spell them
import myclass
import foo.bar.yourclass
and use "myclass.MyClass" and "foo.bar.yourclass.YourClass".
Wildcard imports (from <module> import *) should be avoided, as they make it unclear which names are present in the namespace, confusing both readers and many automated tools. There is one defensible use case for a wildcard import, which is to republish an internal interface as part of a public API (for example, overwriting a pure Python implementation of an interface with the definitions from an optional accelerator module and exactly which definitions will be overwritten isn't known in advance).
When republishing names this way, the guidelines below regarding public and internal interfaces still apply.
String Quotes
In Python, single-quoted strings and double-quoted strings are the same. This PEP does not make a recommendation for this. Pick a rule and stick to it. When a string contains single or double quote characters, however, use the other one to avoid backslashes in the string. It improves readability.
For triple-quoted strings, always use double quote characters to be consistent with the docstring convention in PEP 257.
Whitespace in Expressions and Statements
Pet Peeves
Avoid extraneous whitespace in the following situations:
Immediately inside parentheses, brackets or braces.
Yes: spam(ham[1], {eggs: 2})
No:  spam( ham[ 1 ], { eggs: 2 } )
Immediately before a comma, semicolon, or colon:
Yes: if x == 4: print x, y; x, y = y, x
No:  if x == 4 : print x , y ; x , y = y , x
However, in a slice the colon acts like a binary operator, and should have equal amounts on either side (treating it as the operator with the lowest priority). In an extended slice, both colons must have the same amount of spacing applied. Exception: when a slice parameter is omitted, the space is omitted.
Yes:
ham[1:9], ham[1:9:3], ham[:9:3], ham[1::3], ham[1:9:]
ham[lower:upper], ham[lower:upper:], ham[lower::step]
ham[lower+offset : upper+offset]
ham[: upper_fn(x) : step_fn(x)], ham[:: step_fn(x)]
ham[lower + offset : upper + offset]
No:
ham[lower + offset:upper + offset]
ham[1: 9], ham[1 :9], ham[1:9 :3]
ham[lower : : upper]
ham[ : upper]
Immediately before the open parenthesis that starts the argument list of a function call:
Yes: spam(1) No: spam (1)
Immediately before the open parenthesis that starts an indexing or slicing:
Yes: dct['key'] = lst[index] No: dct ['key'] = lst [index]
More than one space around an assignment (or other) operator to align it with another.
Yes:
x = 1
y = 2
long_variable = 3
No:
x             = 1
y             = 2
long_variable = 3
Other Recommendations
Always surround these binary operators with a single space on either side: assignment (=), augmented assignment (+=, -= etc.), comparisons (==, <, >, !=, <>, <=, >=, in, not in, is, is not), Booleans (and, or, not).
If operators with different priorities are used, consider adding whitespace around the operators with the lowest priority(ies). Use your own judgment; however, never use more than one space, and always have the same amount of whitespace on both sides of a binary operator.
Yes:
i = i + 1
submitted += 1
x = x*2 - 1
hypot2 = x*x + y*y
c = (a+b) * (a-b)
No:
i=i+1
submitted +=1
x = x * 2 - 1
hypot2 = x * x + y * y
c = (a + b) * (a - b)
Don't use spaces around the = sign when used to indicate a keyword argument or a default parameter value.
Yes:
def complex(real, imag=0.0):
    return magic(r=real, i=imag)
No:
def complex(real, imag = 0.0):
    return magic(r = real, i = imag)
Do use spaces around the = sign of an annotated function definition. Additionally, use a single space after the :, as well as a single space on either side of the -> sign representing an annotated return value.
Yes:
def munge(input: AnyStr):
def munge(sep: AnyStr = None):
def munge() -> AnyStr:
def munge(input: AnyStr, sep: AnyStr = None, limit=1000):
No:
def munge(input: AnyStr=None):
def munge(input:AnyStr):
def munge(input: AnyStr)->PosInt:
Compound statements (multiple statements on the same line) are generally discouraged.
Yes:
if foo == 'blah':
    do_blah_thing()
do_one()
do_two()
do_three()
Rather not:
if foo == 'blah': do_blah_thing()
do_one(); do_two(); do_three()
While sometimes it's okay to put an if/for/while with a small body on the same line, never do this for multi-clause statements. Also avoid folding such long lines!
Rather not:
if foo == 'blah': do_blah_thing()
for x in lst: total += x
while t < 10: t = delay()
Definitely not:
if foo == 'blah': do_blah_thing()
else: do_non_blah_thing()

try: something()
finally: cleanup()

do_one(); do_two(); do_three(long, argument,
                             list, like, this)

if foo == 'blah': one(); two(); three()
Comments
Comments that contradict the code are worse than no comments. Always make a priority of keeping the comments up-to-date when the code changes!
Comments should be complete sentences. If a comment is a phrase or sentence, its first word should be capitalized, unless it is an identifier that begins with a lower case letter (never alter the case of identifiers!).
If a comment is short, the period at the end can be omitted. Block comments generally consist of one or more paragraphs built out of complete sentences, and each sentence should end in a period.
You should use two spaces after a sentence-ending period.
When writing English, follow Strunk and White.
Python coders from non-English speaking countries: please write your comments in English, unless you are 120% sure that the code will never be read by people who don't speak your language.
Block Comments
Block comments generally apply to some (or all) code that follows them, and are indented to the same level as that code. Each line of a block comment starts with a # and a single space (unless it is indented text inside the comment).
Paragraphs inside a block comment are separated by a line containing a single #.
Inline Comments
Use inline comments sparingly.
An inline comment is a comment on the same line as a statement. Inline comments should be separated by at least two spaces from the statement. They should start with a # and a single space.
Inline comments are unnecessary and in fact distracting if they state the obvious. Don't do this:
x = x + 1 # Increment x
But sometimes, this is useful:
x = x + 1 # Compensate for border
Documentation Strings
Conventions for writing good documentation strings (a.k.a. "docstrings") are immortalized in PEP 257.
Write docstrings for all public modules, functions, classes, and methods. Docstrings are not necessary for non-public methods, but you should have a comment that describes what the method does. This comment should appear after the def line.
PEP 257 describes good docstring conventions. Note that most importantly, the """ that ends a multiline docstring should be on a line by itself, e.g.:
"""Return a foobang Optional plotz says to frobnicate the bizbaz first. """
For one liner docstrings, please keep the closing """ on the same line.
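A one-liner in that style (the function and its return value are hypothetical):

```python
def kos_root():
    """Return the hypothetical KOS root directory name."""
    # Opening and closing quotes stay on one line for a one-liner.
    return '/usr/local/kos'
```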
Version Bookkeeping
If you have to have Subversion, CVS, or RCS crud in your source file, do it as follows.
__version__ = "$Revision$" # $Source$
These lines should be included after the module's docstring, before any other code, separated by a blank line above and below.
Naming Conventions
The naming conventions of Python's library are a bit of a mess, so we'll never get this completely consistent -- nevertheless, here are the currently recommended naming standards. New modules and packages (including third party frameworks) should be written to these standards, but where an existing library has a different style, internal consistency is preferred.
Overriding Principle
Names that are visible to the user as public parts of the API should follow conventions that reflect usage rather than implementation.
Descriptive: Naming Styles
There are a lot of different naming styles. It helps to be able to recognize what naming style is being used, independently from what they are used for.
The following naming styles are commonly distinguished:
b (single lowercase letter)
B (single uppercase letter)
lowercase
lower_case_with_underscores
UPPERCASE
UPPER_CASE_WITH_UNDERSCORES
CapitalizedWords (or CapWords, or CamelCase -- so named because of the bumpy look of its letters [3]). This is also sometimes known as StudlyCaps.
Note: When using abbreviations in CapWords, capitalize all the letters of the abbreviation. Thus HTTPServerError is better than HttpServerError.
mixedCase (differs from CapitalizedWords by initial lowercase character!)
Capitalized_Words_With_Underscores (ugly!)
There's also the style of using a short unique prefix to group related names together. This is not used much in Python, but it is mentioned for completeness. For example, the os.stat() function returns a tuple whose items traditionally have names like st_mode, st_size, st_mtime and so on. (This is done to emphasize the correspondence with the fields of the POSIX system call struct, which helps programmers familiar with that.)
The X11 library uses a leading X for all its public functions. In Python, this style is generally deemed unnecessary because attribute and method names are prefixed with an object, and function names are prefixed with a module name.
In addition, the following special forms using leading or trailing underscores are recognized (these can generally be combined with any case convention):
_single_leading_underscore: weak "internal use" indicator. E.g. from M import * does not import objects whose name starts with an underscore.
single_trailing_underscore_: used by convention to avoid conflicts with a Python keyword, e.g.
Tkinter.Toplevel(master, class_='ClassName')
__double_leading_underscore: when naming a class attribute, invokes name mangling (inside class FooBar, __boo becomes _FooBar__boo; see below).
__double_leading_and_trailing_underscore__: "magic" objects or attributes that live in user-controlled namespaces. E.g. __init__, __import__ or __file__. Never invent such names; only use them as documented.
Prescriptive: Naming Conventions
Names to Avoid
Never use the characters 'l' (lowercase letter el), 'O' (uppercase letter oh), or 'I' (uppercase letter eye) as single character variable names.
In some fonts, these characters are indistinguishable from the numerals one and zero. When tempted to use 'l', use 'L' instead.
Package and Module Names
Modules should have short, all-lowercase names. Underscores can be used in the module name if it improves readability. Python packages should also have short, all-lowercase names, although the use of underscores is discouraged.
Since module names are mapped to file names, and some file systems are case insensitive and truncate long names, it is important that module names be chosen to be fairly short -- this won't be a problem on Unix, but it may be a problem when the code is transported to older Mac or Windows versions, or DOS.
When an extension module written in C or C++ has an accompanying Python module that provides a higher level (e.g. more object oriented) interface, the C/C++ module has a leading underscore (e.g. _socket).
Class Names
Class names should normally use the CapWords convention.
The naming convention for functions may be used instead in cases where the interface is documented and used primarily as a callable.
Note that there is a separate convention for builtin names: most builtin names are single words (or two words run together), with the CapWords convention used only for exception names and builtin constants.
Exception Names
Because exceptions should be classes, the class naming convention applies here. However, you should use the suffix "Error" on your exception names (if the exception actually is an error).
Global Variable Names
(Let's hope that these variables are meant for use inside one module only.) The conventions are about the same as those for functions.
Modules that are designed for use via from M import * should use the __all__ mechanism to prevent exporting globals, or use the older convention of prefixing such globals with an underscore (which you might want to do to indicate these globals are "module non-public").
Function Names
Function names should be lowercase, with words separated by underscores as necessary to improve readability.
mixedCase is allowed only in contexts where that's already the prevailing style (e.g. threading.py), to retain backwards compatibility.
Function and method arguments
Always use self for the first argument to instance methods.
Always use cls for the first argument to class methods.
If a function argument's name clashes with a reserved keyword, it is generally better to append a single trailing underscore rather than use an abbreviation or spelling corruption. Thus class_ is better than clss. (Perhaps better is to avoid such clashes by using a synonym.)
Method Names and Instance Variables
Use the function naming rules: lowercase with words separated by underscores as necessary to improve readability.
Use one leading underscore only for non-public methods and instance variables.
To avoid name clashes with subclasses, use two leading underscores to invoke Python's name mangling rules.
Python mangles these names with the class name: if class Foo has an attribute named __a, it cannot be accessed by Foo.__a. (An insistent user could still gain access by calling Foo._Foo__a.) Generally, double leading underscores should be used only to avoid name conflicts with attributes in classes designed to be subclassed.
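A minimal sketch of how the mangling plays out across a subclass (class names are arbitrary):

```python
class Foo:
    def __init__(self):
        self.__a = 1          # stored as _Foo__a

class Bar(Foo):
    def __init__(self):
        super().__init__()
        self.__a = 2          # stored as _Bar__a; no clash with Foo's

b = Bar()
# b.__a would raise AttributeError; the mangled names coexist:
# b._Foo__a == 1 and b._Bar__a == 2
```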
Note: there is some controversy about the use of __names (see below).
Constants
Constants are usually defined on a module level and written in all capital letters with underscores separating words. Examples include MAX_OVERFLOW and TOTAL.
Designing for inheritance
Always decide whether a class's methods and instance variables (collectively: "attributes") should be public or non-public. If in doubt, choose non-public; it's easier to make it public later than to make a public attribute non-public.
Public attributes are those that you expect unrelated clients of your class to use, with your commitment to avoid backward incompatible changes. Non-public attributes are those that are not intended to be used by third parties; you make no guarantees that non-public attributes won't change or even be removed.
We don't use the term "private" here, since no attribute is really private in Python (without a generally unnecessary amount of work).
Another category of attributes are those that are part of the "subclass API" (often called "protected" in other languages). Some classes are designed to be inherited from, either to extend or modify aspects of the class's behavior. When designing such a class, take care to make explicit decisions about which attributes are public, which are part of the subclass API, and which are truly only to be used by your base class.
With this in mind, here are the Pythonic guidelines:
Public attributes should have no leading underscores.
If your public attribute name collides with a reserved keyword, append a single trailing underscore to your attribute name. This is preferable to an abbreviation or corrupted spelling. (However, notwithstanding this rule, 'cls' is the preferred spelling for any variable or argument which is known to be a class, especially the first argument to a class method.)
Note 1: See the argument name recommendation above for class methods.
For simple public data attributes, it is best to expose just the attribute name, without complicated accessor/mutator methods. Keep in mind that Python provides an easy path to future enhancement, should you find that a simple data attribute needs to grow functional behavior. In that case, use properties to hide functional implementation behind simple data attribute access syntax.
Note 1: Properties only work on new-style classes.
Note 2: Try to keep the functional behavior side-effect free, although side-effects such as caching are generally fine.
Note 3: Avoid using properties for computationally expensive operations; the attribute notation makes the caller believe that access is (relatively) cheap.
If your class is intended to be subclassed, and you have attributes that you do not want subclasses to use, consider naming them with double leading underscores and no trailing underscores. This invokes Python's name mangling algorithm, where the name of the class is mangled into the attribute name. This helps avoid attribute name collisions should subclasses inadvertently contain attributes with the same name.
Note 1: Note that only the simple class name is used in the mangled name, so if a subclass chooses both the same class name and attribute name, you can still get name collisions.
Note 2: Name mangling can make certain uses, such as debugging and __getattr__(), less convenient. However the name mangling algorithm is well documented and easy to perform manually.
Note 3: Not everyone likes name mangling. Try to balance the need to avoid accidental name clashes with potential use by advanced callers.
Public and internal interfaces
Any backwards compatibility guarantees apply only to public interfaces. Accordingly, it is important that users be able to clearly distinguish between public and internal interfaces.
Documented interfaces are considered public, unless the documentation explicitly declares them to be provisional or internal interfaces exempt from the usual backwards compatibility guarantees. All undocumented interfaces should be assumed to be internal.
To better support introspection, modules should explicitly declare the names in their public API using the __all__ attribute. Setting __all__ to an empty list indicates that the module has no public API.
Even with __all__ set appropriately, internal interfaces (packages, modules, classes, functions, attributes or other names) should still be prefixed with a single leading underscore.
An interface is also considered internal if any containing namespace (package, module or class) is considered internal.
Imported names should always be considered an implementation detail. Other modules must not rely on indirect access to such imported names unless they are an explicitly documented part of the containing module's API, such as os.path or a package's __init__ module that exposes functionality from submodules.
Programming Recommendations
Code should be written in a way that does not disadvantage other implementations of Python (PyPy, Jython, IronPython, Cython, Psyco, and such).
For example, do not rely on CPython's efficient implementation of in-place string concatenation for statements in the form a += b or a = a + b. This optimization is fragile even in CPython (it only works for some types) and isn't present at all in implementations that don't use refcounting. In performance sensitive parts of the library, the ''.join() form should be used instead. This will ensure that concatenation occurs in linear time across various implementations.
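A side-by-side sketch of the two patterns (the chunks are placeholder data):

```python
chunks = ['a', 'b', 'c']

# Quadratic in the worst case on implementations without the
# in-place concatenation optimization:
slow = ''
for chunk in chunks:
    slow += chunk

# Linear everywhere: accumulate the parts, then join once.
parts = []
for chunk in chunks:
    parts.append(chunk)
result = ''.join(parts)
```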
Comparisons to singletons like None should always be done with is or is not, never the equality operators.
Also, beware of writing if x when you really mean if x is not None -- e.g. when testing whether a variable or argument that defaults to None was set to some other value. The other value might have a type (such as a container) that could be false in a boolean context!
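A common illustration of the pitfall (the function is hypothetical):

```python
def append_to(item, target=None):
    # `if not target:` would wrongly replace a caller-supplied
    # empty list, since [] is false in a boolean context.
    if target is None:
        target = []
    target.append(item)
    return target
```

With the is None test, a caller's existing empty list is reused rather than silently discarded.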
Use is not operator rather than not ... is. While both expressions are functionally identical, the former is more readable and preferred.
Yes:
if foo is not None:
No:
if not foo is None:
When implementing ordering operations with rich comparisons, it is best to implement all six operations (__eq__, __ne__, __lt__, __le__, __gt__, __ge__) rather than relying on other code to only exercise a particular comparison.
To minimize the effort involved, the functools.total_ordering() decorator provides a tool to generate missing comparison methods.
PEP 207 indicates that reflexivity rules are assumed by Python. Thus, the interpreter may swap y > x with x < y, y >= x with x <= y, and may swap the arguments of x == y and x != y. The sort() and min() operations are guaranteed to use the < operator and the max() function uses the > operator. However, it is best to implement all six operations so that confusion doesn't arise in other contexts.
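A minimal use of the decorator, defining only __eq__ and __lt__ and letting functools supply the rest (the class is illustrative):

```python
import functools

@functools.total_ordering
class Version:
    def __init__(self, number):
        self.number = number

    def __eq__(self, other):
        return self.number == other.number

    def __lt__(self, other):
        # total_ordering derives __le__, __gt__ and __ge__ from
        # this method together with __eq__.
        return self.number < other.number
```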
Always use a def statement instead of an assignment statement that binds a lambda expression directly to an identifier.
Yes:
def f(x): return 2*x
No:
f = lambda x: 2*x
The first form means that the name of the resulting function object is specifically 'f' instead of the generic '<lambda>'. This is more useful for tracebacks and string representations in general. The use of the assignment statement eliminates the sole benefit a lambda expression can offer over an explicit def statement (i.e. that it can be embedded inside a larger expression).
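The difference is easy to observe (the lambda form appears only for contrast; it is the discouraged spelling):

```python
def double(x):
    return 2*x

also_double = lambda x: 2*x   # discouraged: shown only for comparison

# double.__name__ is 'double'; also_double.__name__ is '<lambda>',
# which is what tracebacks and repr() will display.
```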
Derive exceptions from Exception rather than BaseException. Direct inheritance from BaseException is reserved for exceptions where catching them is almost always the wrong thing to do.
Design exception hierarchies based on the distinctions that code catching the exceptions is likely to need, rather than the locations where the exceptions are raised. Aim to answer the question "What went wrong?" programmatically, rather than only stating that "A problem occurred" (see PEP 3151 for an example of this lesson being learned for the builtin exception hierarchy).
Class naming conventions apply here, although you should add the suffix "Error" to your exception classes if the exception is an error. Non-error exceptions that are used for non-local flow control or other forms of signaling need no special suffix.
Use exception chaining appropriately. In Python 3, "raise X from Y" should be used to indicate explicit replacement without losing the original traceback.
When deliberately replacing an inner exception (using "raise X" in Python 2 or "raise X from None" in Python 3.3+), ensure that relevant details are transferred to the new exception (such as preserving the attribute name when converting KeyError to AttributeError, or embedding the text of the original exception in the new exception message).
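A sketch of such a conversion in Python 3, preserving the missing name and chaining the original exception (the function and message are hypothetical):

```python
def get_setting(settings, name):
    try:
        return settings[name]
    except KeyError as exc:
        # Carry the missing name into the new exception, and keep
        # the original KeyError available as __cause__.
        raise AttributeError('no setting named %r' % name) from exc
```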
When raising an exception in Python 2, use raise ValueError('message') instead of the older form raise ValueError, 'message'.
The latter form is not legal Python 3 syntax.
The paren-using form also means that when the exception arguments are long or include string formatting, you don't need to use line continuation characters thanks to the containing parentheses.
When catching exceptions, mention specific exceptions whenever possible instead of using a bare except: clause.
For example, use:
    try:
        import platform_specific_module
    except ImportError:
        platform_specific_module = None

A bare except: clause will catch SystemExit and KeyboardInterrupt exceptions, making it harder to interrupt a program with Control-C, and can disguise other problems. If you want to catch all exceptions that signal program errors, use except Exception: (bare except is equivalent to except BaseException:).
A good rule of thumb is to limit use of bare 'except' clauses to two cases:
- If the exception handler will be printing out or logging the traceback; at least the user will be aware that an error has occurred.
- If the code needs to do some cleanup work, but then lets the exception propagate upwards with raise. try...finally can be a better way to handle this case.
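The cleanup-then-re-raise case above can be sketched as follows; acquire, release, and run_task are stand-in helpers invented for this example, not real APIs:

```python
import logging

def acquire():
    # Stand-in for acquiring some real resource.
    return {"open": True}

def release(resource):
    resource["open"] = False

def run_task(task):
    resource = acquire()
    try:
        return task(resource)
    except Exception:
        # Log so the user knows an error occurred, then re-raise rather
        # than swallowing the exception.
        logging.exception("task failed")
        raise
    finally:
        # try...finally handles the cleanup case without a broad handler.
        release(resource)
```

The finally clause guarantees release() runs whether the task succeeds or raises, which is usually preferable to cleanup inside a bare except.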
When binding caught exceptions to a name, prefer the explicit name binding syntax added in Python 2.6:
    try:
        process_data()
    except Exception as exc:
        raise DataProcessingFailedError(str(exc))

This is the only syntax supported in Python 3, and avoids the ambiguity problems associated with the older comma-based syntax.
When catching operating system errors, prefer the explicit exception hierarchy introduced in Python 3.3 over introspection of errno values.
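A sketch of the contrast; read_config and the file name are invented for illustration:

```python
import errno

def read_config(path):
    # Preferred (Python 3.3+): catch the specific OSError subclass.
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        return ""

def read_config_old(path):
    # Older, now-discouraged style: catch broadly and inspect errno.
    try:
        with open(path) as f:
            return f.read()
    except OSError as exc:
        if exc.errno == errno.ENOENT:
            return ""
        raise
```

Both return an empty string for a missing file, but the first version states its intent directly and cannot accidentally mishandle an unrelated errno value.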
Additionally, for all try/except clauses, limit the try clause to the absolute minimum amount of code necessary. Again, this avoids masking bugs.
Yes:
    try:
        value = collection[key]
    except KeyError:
        return key_not_found(key)
    else:
        return handle_value(value)

No:

    try:
        # Too broad!
        return handle_value(collection[key])
    except KeyError:
        # Will also catch KeyError raised by handle_value()
        return key_not_found(key)

When a resource is local to a particular section of code, use a with statement to ensure it is cleaned up promptly and reliably after use. A try/finally statement is also acceptable.
Context managers should be invoked through separate functions or methods whenever they do something other than acquire and release resources. For example:
Yes:
    with conn.begin_transaction():
        do_stuff_in_transaction(conn)

No:

    with conn:
        do_stuff_in_transaction(conn)

The latter example doesn't provide any information to indicate that the __enter__ and __exit__ methods are doing something other than closing the connection after a transaction. Being explicit is important in this case.
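To make the distinction concrete, here is a minimal, invented Connection whose begin_transaction() method names the side effect that a bare "with conn:" would hide (this is a toy, not a real database API):

```python
from contextlib import contextmanager

class Connection:
    """Toy connection used only to illustrate the point above."""

    def __init__(self):
        self.log = []

    @contextmanager
    def begin_transaction(self):
        # The method name tells the reader a transaction is starting;
        # __enter__/__exit__ on the connection itself would not.
        self.log.append("BEGIN")
        try:
            yield self
            self.log.append("COMMIT")
        except BaseException:
            self.log.append("ROLLBACK")
            raise

conn = Connection()
with conn.begin_transaction():
    conn.log.append("INSERT")
print(conn.log)  # ['BEGIN', 'INSERT', 'COMMIT']
```

Reading "with conn.begin_transaction():" at the call site makes the commit/rollback behaviour obvious without consulting the class definition.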
Be consistent in return statements. Either all return statements in a function should return an expression, or none of them should. If any return statement returns an expression, any return statements where no value is returned should explicitly state this as return None, and an explicit return statement should be present at the end of the function (if reachable).
Yes:
    def foo(x):
        if x >= 0:
            return math.sqrt(x)
        else:
            return None

    def bar(x):
        if x < 0:
            return None
        return math.sqrt(x)

No:

    def foo(x):
        if x >= 0:
            return math.sqrt(x)

    def bar(x):
        if x < 0:
            return
        return math.sqrt(x)

Use string methods instead of the string module.
String methods are always much faster and share the same API with unicode strings. Override this rule if backward compatibility with Pythons older than 2.0 is required.
Use ''.startswith() and ''.endswith() instead of string slicing to check for prefixes or suffixes.
startswith() and endswith() are cleaner and less error prone. For example:
Yes: if foo.startswith('bar'):
No: if foo[:3] == 'bar':

Object type comparisons should always use isinstance() instead of comparing types directly.
Yes: if isinstance(obj, int):
No: if type(obj) is type(1):
When checking if an object is a string, keep in mind that it might be a unicode string too! In Python 2, str and unicode have a common base class, basestring, so you can do:
if isinstance(obj, basestring):
Note that in Python 3, unicode and basestring no longer exist (there is only str) and a bytes object is no longer a kind of string (it is a sequence of integers instead).
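A quick illustration of the Python 3 behaviour just described:

```python
text = "héllo"
data = text.encode("utf-8")

print(isinstance(text, str))   # True: str is the only string type
print(isinstance(data, str))   # False: bytes is not a kind of string
print(data[0])                 # 104: indexing bytes yields an integer
```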
For sequences (strings, lists, tuples), use the fact that empty sequences are false.
Yes: if not seq:
     if seq:
No: if len(seq):
    if not len(seq):

Don't write string literals that rely on significant trailing whitespace. Such trailing whitespace is visually indistinguishable and some editors (or more recently, reindent.py) will trim them.
Don't compare boolean values to True or False using ==.
Yes: if greeting:
No: if greeting == True:
Worse: if greeting is True:
The Python standard library will not use function annotations as that would result in a premature commitment to a particular annotation style. Instead, the annotations are left for users to discover and experiment with useful annotation styles.
It is recommended that third party experiments with annotations use an associated decorator to indicate how the annotation should be interpreted.
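One sketch of such a decorator follows; the marker attribute and its name are invented for illustration, not an established convention:

```python
def annotations_are_docs(func):
    # Hypothetical marker: tools that know about this decorator would
    # treat the function's annotations as documentation, not as types.
    func.annotation_style = "documentation"
    return func

@annotations_are_docs
def seek(whence: "0, 1, or 2, as for io.IOBase.seek") -> "the new position":
    """Toy function carrying documentation-style annotations."""
```

A tool could then check seek.annotation_style before deciding how to interpret seek.__annotations__.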
Early core developer attempts to use function annotations revealed inconsistent, ad-hoc annotation styles. For example:
- [str] was ambiguous as to whether it represented a list of strings or a value that could be either str or None.
- The notation open(file:(str,bytes)) was used for a value that could be either bytes or str rather than a 2-tuple containing a str value followed by a bytes value.
- The annotation seek(whence:int) exhibited a mix of over-specification and under-specification: int is too restrictive (anything with __index__ would be allowed) and it is not restrictive enough (only the values 0, 1, and 2 are allowed). Likewise, the annotation write(b: bytes) was also too restrictive (anything supporting the buffer protocol would be allowed).
- Annotations such as read1(n: int=None) were self-contradictory since None is not an int. Annotations such as source_path(self, fullname:str) -> object were confusing about what the return type should be.
- In addition to the above, annotations were inconsistent in the use of concrete types versus abstract types: int versus Integral and set/frozenset versus MutableSet/Set.
- Some annotations in the abstract base classes were incorrect specifications. For example, set-to-set operations require other to be another instance of Set rather than just an Iterable.
- A further issue was that annotations become part of the specification but weren't being tested.
- In most cases, the docstrings already included the type specifications and did so with greater clarity than the function annotations. In the remaining cases, the docstrings were improved once the annotations were removed.
- The observed function annotations were too ad-hoc and inconsistent to work with a coherent system of automatic type checking or argument validation. Leaving these annotations in the code would have made it more difficult to make changes later so that automated utilities could be supported.
Copyright
This document has been placed in the public domain.
pep-0009 Sample Plaintext PEP Template
| PEP: | 9 |
|---|---|
| Title: | Sample Plaintext PEP Template |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Barry Warsaw <barry at python.org> |
| Status: | Active |
| Type: | Process |
| Content-Type: | text/plain |
| Created: | 14-Aug-2001 |
| Post-History: |
Abstract
This PEP provides a boilerplate or sample template for creating
your own plaintext PEPs. In conjunction with the content
guidelines in PEP 1 [1], this should make it easy for you to
conform your own PEPs to the format outlined below.
Note: if you are reading this PEP via the web, you should first
grab the plaintext source of this PEP in order to complete the
steps below. DO NOT USE THE HTML FILE AS YOUR TEMPLATE!
To get the source this (or any) PEP, look at the top of the HTML
page and click on the date & time on the "Last-Modified" line. It
is a link to the source text in the Python repository.
If you would prefer to use lightweight markup in your PEP, please
see PEP 12, "Sample reStructuredText PEP Template" [2].
Rationale
PEP submissions come in a wide variety of forms, not all adhering
to the format guidelines set forth below. Use this template, in
conjunction with the content guidelines in PEP 1, to ensure that
your PEP submission won't get automatically rejected because of
form.
How to Use This Template
To use this template you must first decide whether your PEP is
going to be an Informational or Standards Track PEP. Most PEPs
are Standards Track because they propose a new feature for the
Python language or standard library. When in doubt, read PEP 1
for details or contact the PEP editors <peps@python.org>.
Once you've decided which type of PEP yours is going to be, follow
the directions below.
- Make a copy of this file (.txt file, not HTML!) and perform the
following edits.
- Replace the "PEP: 9" header with "PEP: XXX" since you don't yet
have a PEP number assignment.
- Change the Title header to the title of your PEP.
- Leave the Version and Last-Modified headers alone; we'll take
care of those when we check your PEP into Python's Subversion
repository. These headers consist of keywords ("Revision" and
"Date" enclosed in "$"-signs) which are automatically expanded
by the repository. Please do not edit the expanded date or
revision text.
- Change the Author header to include your name, and optionally
your email address. Be sure to follow the format carefully:
your name must appear first, and it must not be contained in
parentheses. Your email address may appear second (or it can be
omitted) and if it appears, it must appear in angle brackets.
It is okay to obfuscate your email address.
- If there is a mailing list for discussion of your new feature,
add a Discussions-To header right after the Author header. You
should not add a Discussions-To header if the mailing list to be
used is either python-list@python.org or python-dev@python.org,
or if discussions should be sent to you directly. Most
Informational PEPs don't have a Discussions-To header.
- Change the Status header to "Draft".
- For Standards Track PEPs, change the Type header to "Standards
Track".
- For Informational PEPs, change the Type header to
"Informational".
- For Standards Track PEPs, if your feature depends on the
acceptance of some other currently in-development PEP, add a
Requires header right after the Type header. The value should
be the PEP number of the PEP yours depends on. Don't add this
header if your dependent feature is described in a Final PEP.
- Change the Created header to today's date. Be sure to follow
the format carefully: it must be in dd-mmm-yyyy format, where
the mmm is the 3 English letter month abbreviation, e.g. one of
Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec.
- For Standards Track PEPs, after the Created header, add a
Python-Version header and set the value to the next planned
version of Python, i.e. the one your new feature will hopefully
make its first appearance in. Do not use an alpha or beta
release designation here. Thus, if the last version of Python
was 2.2 alpha 1 and you're hoping to get your new feature into
Python 2.2, set the header to:
Python-Version: 2.2
- Leave Post-History alone for now; you'll add dates to this
header each time you post your PEP to python-list@python.org or
python-dev@python.org. E.g. if you posted your PEP to the lists
on August 14, 2001 and September 3, 2001, the Post-History
header would look like:
Post-History: 14-Aug-2001, 03-Sep-2001
You must manually add new dates and check them in. If you don't
have check-in privileges, send your changes to the PEP editor.
- Add a Replaces header if your PEP obsoletes an earlier PEP. The
value of this header is the number of the PEP that your new PEP
is replacing. Only add this header if the older PEP is in
"final" form, i.e. is either Accepted, Final, or Rejected. You
aren't replacing an older open PEP if you're submitting a
competing idea.
- Now write your Abstract, Rationale, and other content for your
PEP, replacing all this gobbledygook with your own text. Be sure
to adhere to the format guidelines below, specifically on the
prohibition of tab characters and the indentation requirements.
- Update your References and Copyright section. Usually you'll
place your PEP into the public domain, in which case just leave
the "Copyright" section alone. Alternatively, you can use the
Open Publication License[3], but public domain is still strongly
preferred.
- Leave the little Emacs turd at the end of this file alone,
including the formfeed character ("^L", or \f).
- Send your PEP submission to the PEP editors (peps@python.org),
along with $100k in unmarked pennies. (Just kidding, I wanted
to see if you were still awake. :)
Plaintext PEP Formatting Requirements
PEP headings must begin in column zero and the initial letter of
each word must be capitalized as in book titles. Acronyms should
be in all capitals. The body of each section must be indented 4
spaces. Code samples inside body sections should be indented a
further 4 spaces, and other indentation can be used as required to
make the text readable. You must use two blank lines between the
last line of a section's body and the next section heading.
You must adhere to the Emacs convention of adding two spaces at
the end of every sentence. You should fill your paragraphs to
column 70, but under no circumstances should your lines extend
past column 79. If your code samples spill over column 79, you
should rewrite them.
Tab characters must never appear in the document at all. A PEP
should include the standard Emacs stanza included by example at
the bottom of this PEP.
When referencing an external web page in the body of a PEP, you
should include the title of the page in the text, with a
footnote reference to the URL. Do not include the URL in the body
text of the PEP. E.g.
Refer to the Python Language web site [1] for more details.
...
[1] http://www.python.org
When referring to another PEP, include the PEP number in the body
text, such as "PEP 1". The title may optionally appear. Add a
footnote reference, a number in square brackets. The footnote
body should include the PEP's title and author. It may optionally
include the explicit URL on a separate line, but only in the
References section. Note that the pep2html.py script will
calculate URLs automatically. For example:
...
Refer to PEP 1 [7] for more information about PEP style
...
References
[7] PEP 1, PEP Purpose and Guidelines, Warsaw, Hylton
http://www.python.org/dev/peps/pep-0001/
If you decide to provide an explicit URL for a PEP, please use
this as the URL template:
http://www.python.org/dev/peps/pep-xxxx/
PEP numbers in URLs must be padded with zeros from the left, so as
to be exactly 4 characters wide, however PEP numbers in the text
are never padded.
References
[1] PEP 1, PEP Purpose and Guidelines, Warsaw, Hylton
http://www.python.org/dev/peps/pep-0001/
[2] PEP 12, Sample reStructuredText PEP Template, Goodger, Warsaw
http://www.python.org/dev/peps/pep-0012/
[3] http://www.opencontent.org/openpub/
Copyright
This document has been placed in the public domain.
pep-0010 Voting Guidelines
| PEP: | 10 |
|---|---|
| Title: | Voting Guidelines |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Barry Warsaw <barry at python.org> |
| Status: | Active |
| Type: | Process |
| Created: | 07-Mar-2002 |
| Post-History: | 07-Mar-2002 |
Abstract
This PEP outlines the python-dev voting guidelines. These
guidelines serve to provide feedback or gauge the "wind direction"
on a particular proposal, idea, or feature. They don't have a
binding force.
Rationale
When a new idea, feature, patch, etc. is floated in the Python
community, either through a PEP or on the mailing lists (most
likely on python-dev [1]), it is sometimes helpful to gauge the
community's general sentiment. Sometimes people just want to
register their opinion of an idea. Sometimes the BDFL wants to
take a straw poll. Whatever the reason, these guidelines have
been adopted so as to provide a common language for developers.
While opinions are (sometimes) useful, they are never binding.
Opinions that are accompanied by rationales are always valued
higher than bare scores (this is especially true with -1 votes).
Voting Scores
The scoring guidelines are loosely derived from the Apache voting
procedure [2], with of course our own spin on things. There are 4
possible vote scores:
+1 I like it
+0 I don't care, but go ahead
-0 I don't care, so why bother?
-1 I hate it
You may occasionally see wild flashes of enthusiasm (either for or
against) with vote scores like +2, +1000, or -1000. These aren't
really valued much beyond the above scores, but it's nice to see
people get excited about such geeky stuff.
References
[1] Python Developer's Guide,
http://www.python.org/dev/
[2] Apache Project Guidelines and Voting Rules
http://httpd.apache.org/dev/guidelines.html
Copyright
This document has been placed in the public domain.
pep-0011 Removing support for little used platforms
| PEP: | 11 |
|---|---|
| Title: | Removing support for little used platforms |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Martin von Löwis <martin at v.loewis.de>, Brett Cannon <brett at python.org> |
| Status: | Active |
| Type: | Process |
| Content-Type: | text/x-rst |
| Created: | 07-Jul-2002 |
| Post-History: | 18-Aug-2007, 16-May-2014, 20-Feb-2015 |
Contents
Abstract
This PEP documents how an operating system (platform) becomes supported in CPython and documents past support.
Rationale
Over time, the CPython source code has collected various pieces of platform-specific code, which, at some point in time, was considered necessary to use Python on a specific platform. Without access to this platform, it is not possible to determine whether this code is still needed. As a result, this code may either break during Python's evolution, or it may become unnecessary as the platforms evolve as well.
The growing amount of these fragments poses the risk of unmaintainability: without having experts for a large number of platforms, it is not possible to determine whether a certain change to the CPython source code will work on all supported platforms.
To reduce this risk, this PEP specifies what is required for a platform to be considered supported by Python as well as providing a procedure to remove code for platforms with few or no Python users.
Supporting platforms
Gaining official platform support requires two things. First, a core developer needs to volunteer to maintain platform-specific code. This core developer can either already be a member of the Python development team or be given contributor rights on the basis of maintaining platform support (it is at the discretion of the Python development team to decide if a person is ready to have such rights even if it is just for supporting a specific platform).
Second, a stable buildbot must be provided [2]. This guarantees that platform support will not be accidentally broken by a Python core developer who does not have personal access to the platform. For a buildbot to be considered stable it requires that the machine be reliably up and functioning (but it is up to the Python core developers to decide whether to promote a buildbot to being considered stable).
This policy does not disqualify supporting other platforms indirectly. Patches which are not platform-specific but still done to add platform support will be considered for inclusion. For example, if platform-independent changes were necessary in the configure script which were motivated to support a specific platform that could be accepted. Patches which add platform-specific code such as the name of a specific platform to the configure script will generally not be accepted without the platform having official support.
CPU architecture and compiler support are viewed in a similar manner as platforms. For example, to consider the ARM architecture supported a buildbot running on ARM would be required along with support from the Python development team. In general it is not required to have a CPU architecture run under every possible platform in order to be considered supported.
Unsupporting platforms
If a certain platform that currently has special code in CPython is deemed to be without enough Python users or lacks proper support from the Python development team and/or a buildbot, a note must be posted in this PEP that this platform is no longer actively supported. This note must include:
- the name of the system
- the first release number that does not support this platform anymore, and
- the first release where the historical support code is actively removed
In some cases, it is not possible to identify the specific list of systems for which some code is used (e.g. when autoconf tests for absence of some feature which is considered present on all supported systems). In this case, the name will give the precise condition (usually a preprocessor symbol) that will become unsupported.
At the same time, the CPython source code must be changed to produce a build-time error if somebody tries to install Python on this platform. On platforms using autoconf, configure must fail. This gives potential users of the platform a chance to step forward and offer maintenance.
Re-supporting platforms
If a user of a platform wants to see this platform supported again, he may volunteer to maintain the platform support. Such an offer must be recorded in the PEP, and the user can submit patches to remove the build-time errors, and perform any other maintenance work for the platform.
Microsoft Windows
Microsoft has established a policy called product support lifecycle [1]. Each product's lifecycle has a mainstream support phase, where the product is generally commercially available, and an extended support phase, where paid support is still available, and certain bug fixes are released (in particular security fixes).
CPython's Windows support now follows this lifecycle. A new feature release X.Y.0 will support all Windows releases whose extended support phase is not yet expired. Subsequent bug fix releases will support the same Windows releases as the original feature release (even if the extended support phase has ended).
Because of this policy, no further Windows releases need to be listed in this PEP.
Each feature release is built by a specific version of Microsoft Visual Studio. That version should have mainstream support when the release is made. Developers of extension modules will generally need to use the same Visual Studio release; they are concerned both with the availability of the versions they need to use, and with keeping the zoo of versions small. The CPython source tree will keep unmaintained build files for older Visual Studio releases, for which patches will be accepted. Such build files will be removed from the source tree 3 years after the extended support for the compiler has ended (but continue to remain available in revision control).
No-longer-supported platforms
- Name: MS-DOS, MS-Windows 3.x; Unsupported in: Python 2.0; Code removed in: Python 2.1
- Name: SunOS 4; Unsupported in: Python 2.3; Code removed in: Python 2.4
- Name: DYNIX; Unsupported in: Python 2.3; Code removed in: Python 2.4
- Name: dgux; Unsupported in: Python 2.3; Code removed in: Python 2.4
- Name: Minix; Unsupported in: Python 2.3; Code removed in: Python 2.4
- Name: Irix 4 and --with-sgi-dl; Unsupported in: Python 2.3; Code removed in: Python 2.4
- Name: Linux 1; Unsupported in: Python 2.3; Code removed in: Python 2.4
- Name: Systems defining __d6_pthread_create (configure.in); Unsupported in: Python 2.3; Code removed in: Python 2.4
- Name: Systems defining PY_PTHREAD_D4, PY_PTHREAD_D6, or PY_PTHREAD_D7 in thread_pthread.h; Unsupported in: Python 2.3; Code removed in: Python 2.4
- Name: Systems using --with-dl-dld; Unsupported in: Python 2.3; Code removed in: Python 2.4
- Name: Systems using --without-universal-newlines; Unsupported in: Python 2.3; Code removed in: Python 2.4
- Name: MacOS 9; Unsupported in: Python 2.4; Code removed in: Python 2.4
- Name: Systems using --with-wctype-functions; Unsupported in: Python 2.6; Code removed in: Python 2.6
- Name: Win9x, WinME, NT4; Unsupported in: Python 2.6 (warning in 2.5 installer); Code removed in: Python 2.6
- Name: AtheOS; Unsupported in: Python 2.6 (with "AtheOS" changed to "Syllable"); Build broken in: Python 2.7 (edit configure to re-enable); Code removed in: Python 3.0
- Name: BeOS; Unsupported in: Python 2.6 (warning in configure); Build broken in: Python 2.7 (edit configure to re-enable); Code removed in: Python 3.0
- Name: Systems using Mach C Threads; Unsupported in: Python 3.2; Code removed in: Python 3.3
- Name: SunOS lightweight processes (LWP); Unsupported in: Python 3.2; Code removed in: Python 3.3
- Name: Systems using --with-pth (GNU pth threads); Unsupported in: Python 3.2; Code removed in: Python 3.3
- Name: Systems using Irix threads; Unsupported in: Python 3.2; Code removed in: Python 3.3
- Name: OSF* systems (issue 8606); Unsupported in: Python 3.2; Code removed in: Python 3.3
- Name: OS/2 (issue 16135); Unsupported in: Python 3.3; Code removed in: Python 3.4
- Name: VMS (issue 16136); Unsupported in: Python 3.3; Code removed in: Python 3.4
- Name: Windows 2000; Unsupported in: Python 3.3; Code removed in: Python 3.4
- Name: Windows systems where COMSPEC points to command.com; Unsupported in: Python 3.3; Code removed in: Python 3.4
- Name: RISC OS; Unsupported in: Python 3.0 (some code actually removed); Code removed in: Python 3.4
Copyright
This document has been placed in the public domain.
Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End:
pep-0012 Sample reStructuredText PEP Template
| PEP: | 12 |
|---|---|
| Title: | Sample reStructuredText PEP Template |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | David Goodger <goodger at python.org>, Barry Warsaw <barry at python.org> |
| Status: | Active |
| Type: | Process |
| Content-Type: | text/x-rst |
| Created: | 05-Aug-2002 |
| Post-History: | 30-Aug-2002 |
Contents
Abstract
This PEP provides a boilerplate or sample template for creating your own reStructuredText PEPs. In conjunction with the content guidelines in PEP 1 [1], this should make it easy for you to conform your own PEPs to the format outlined below.
Note: if you are reading this PEP via the web, you should first grab the text (reStructuredText) source of this PEP in order to complete the steps below. DO NOT USE THE HTML FILE AS YOUR TEMPLATE!
The source for this (or any) PEP can be found in the PEPs repository, viewable on the web at https://hg.python.org/peps/file/tip.
If you would prefer not to use markup in your PEP, please see PEP 9, "Sample Plaintext PEP Template" [2].
Rationale
PEP submissions come in a wide variety of forms, not all adhering to the format guidelines set forth below. Use this template, in conjunction with the format guidelines below, to ensure that your PEP submission won't get automatically rejected because of form.
ReStructuredText is offered as an alternative to plaintext PEPs, to allow PEP authors more functionality and expressivity, while maintaining easy readability in the source text. The processed HTML form makes the functionality accessible to readers: live hyperlinks, styled text, tables, images, and automatic tables of contents, among other advantages. For an example of a PEP marked up with reStructuredText, see PEP 287.
How to Use This Template
To use this template you must first decide whether your PEP is going to be an Informational or Standards Track PEP. Most PEPs are Standards Track because they propose a new feature for the Python language or standard library. When in doubt, read PEP 1 for details or contact the PEP editors <peps@python.org>.
Once you've decided which type of PEP yours is going to be, follow the directions below.
Make a copy of this file (.txt file, not HTML!) and perform the following edits.
Replace the "PEP: 12" header with "PEP: XXX" since you don't yet have a PEP number assignment.
Change the Title header to the title of your PEP.
Leave the Version and Last-Modified headers alone; we'll take care of those when we check your PEP into Python's Subversion repository. These headers consist of keywords ("Revision" and "Date" enclosed in "$"-signs) which are automatically expanded by the repository. Please do not edit the expanded date or revision text.
Change the Author header to include your name, and optionally your email address. Be sure to follow the format carefully: your name must appear first, and it must not be contained in parentheses. Your email address may appear second (or it can be omitted) and if it appears, it must appear in angle brackets. It is okay to obfuscate your email address.
If there is a mailing list for discussion of your new feature, add a Discussions-To header right after the Author header. You should not add a Discussions-To header if the mailing list to be used is either python-list@python.org or python-dev@python.org, or if discussions should be sent to you directly. Most Informational PEPs don't have a Discussions-To header.
Change the Status header to "Draft".
For Standards Track PEPs, change the Type header to "Standards Track".
For Informational PEPs, change the Type header to "Informational".
For Standards Track PEPs, if your feature depends on the acceptance of some other currently in-development PEP, add a Requires header right after the Type header. The value should be the PEP number of the PEP yours depends on. Don't add this header if your dependent feature is described in a Final PEP.
Change the Created header to today's date. Be sure to follow the format carefully: it must be in dd-mmm-yyyy format, where the mmm is the 3 English letter month abbreviation, i.e. one of Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec.
For Standards Track PEPs, after the Created header, add a Python-Version header and set the value to the next planned version of Python, i.e. the one your new feature will hopefully make its first appearance in. Do not use an alpha or beta release designation here. Thus, if the last version of Python was 2.2 alpha 1 and you're hoping to get your new feature into Python 2.2, set the header to:
Python-Version: 2.2
Leave Post-History alone for now; you'll add dates to this header each time you post your PEP to python-list@python.org or python-dev@python.org. If you posted your PEP to the lists on August 14, 2001 and September 3, 2001, the Post-History header would look like:
Post-History: 14-Aug-2001, 03-Sep-2001
You must manually add new dates and check them in. If you don't have check-in privileges, send your changes to the PEP editors.
Add a Replaces header if your PEP obsoletes an earlier PEP. The value of this header is the number of the PEP that your new PEP is replacing. Only add this header if the older PEP is in "final" form, i.e. is either Accepted, Final, or Rejected. You aren't replacing an older open PEP if you're submitting a competing idea.
Now write your Abstract, Rationale, and other content for your PEP, replacing all this gobbledygook with your own text. Be sure to adhere to the format guidelines below, specifically on the prohibition of tab characters and the indentation requirements.
Update your References and Copyright section. Usually you'll place your PEP into the public domain, in which case just leave the Copyright section alone. Alternatively, you can use the Open Publication License [6], but public domain is still strongly preferred.
Leave the Emacs stanza at the end of this file alone, including the formfeed character ("^L", or \f).
Send your PEP submission to the PEP editors at peps@python.org.
ReStructuredText PEP Formatting Requirements
The following is a PEP-specific summary of reStructuredText syntax. For the sake of simplicity and brevity, much detail is omitted. For more detail, see Resources below. Literal blocks (in which no markup processing is done) are used for examples throughout, to illustrate the plaintext markup.
General
You must adhere to the Emacs convention of adding two spaces at the end of every sentence. You should fill your paragraphs to column 70, but under no circumstances should your lines extend past column 79. If your code samples spill over column 79, you should rewrite them.
Tab characters must never appear in the document at all. A PEP should include the standard Emacs stanza included by example at the bottom of this PEP.
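Filling paragraphs to column 70 is normally done by your editor (e.g. Emacs fill commands), but as an illustration it can also be sketched with the standard `textwrap` module (the helper name `fill_pep_paragraph` is hypothetical):

```python
import textwrap

def fill_pep_paragraph(text):
    # Collapse internal whitespace, then re-wrap so that no line
    # extends past column 70.
    return textwrap.fill(" ".join(text.split()), width=70)

para = ("We intend PEPs to be the primary mechanisms for proposing "
        "major new features and for collecting community input on "
        "an issue.")
wrapped = fill_pep_paragraph(para)
```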
Section Headings
PEP headings must begin in column zero and the initial letter of each word must be capitalized as in book titles. Acronyms should be in all capitals. Section titles must be adorned with an underline, a single repeated punctuation character, which begins in column zero and must extend at least as far as the right edge of the title text (4 characters minimum). First-level section titles are underlined with "=" (equals signs), second-level section titles with "-" (hyphens), and third-level section titles with "'" (single quotes or apostrophes). For example:
First-Level Title
=================

Second-Level Title
------------------

Third-Level Title
'''''''''''''''''
If there are more than three levels of sections in your PEP, you may insert overline/underline-adorned titles for the first and second levels as follows:
============================
First-Level Title (optional)
============================

-----------------------------
Second-Level Title (optional)
-----------------------------

Third-Level Title
=================

Fourth-Level Title
------------------

Fifth-Level Title
'''''''''''''''''
You shouldn't have more than five levels of sections in your PEP. If you do, you should consider rewriting it.
You must use two blank lines between the last line of a section's body and the next section heading. If a subsection heading immediately follows a section heading, a single blank line in-between is sufficient.
The body of each section is not normally indented, although some constructs do use indentation, as described below. Blank lines are used to separate constructs.
Paragraphs
Paragraphs are left-aligned text blocks separated by blank lines. Paragraphs are not indented unless they are part of an indented construct (such as a block quote or a list item).
Inline Markup
Portions of text within paragraphs and other text blocks may be styled. For example:
Text may be marked as *emphasized* (single asterisk markup, typically shown in italics) or **strongly emphasized** (double asterisks, typically boldface). ``Inline literals`` (using double backquotes) are typically rendered in a monospaced typeface. No further markup recognition is done within the double backquotes, so they're safe for any kind of code snippets.
Block Quotes
Block quotes consist of indented body elements. For example:
This is a paragraph.
This is a block quote.
A block quote may contain many paragraphs.
Block quotes are used to quote extended passages from other sources. Block quotes may be nested inside other body elements. Use 4 spaces per indent level.
Literal Blocks
Literal blocks are used for code samples or preformatted ASCII art. To indicate a literal block, preface the indented text block with "::" (two colons). The literal block continues until the end of the indentation. Indent the text block by 4 spaces. For example:
This is a typical paragraph. A literal block follows.
::
for a in [5,4,3,2,1]: # this is program code, shown as-is
print a
print "it's..."
# a literal block continues until the indentation ends
The paragraph containing only "::" will be completely removed from the output; no empty paragraph will remain. "::" is also recognized at the end of any paragraph. If immediately preceded by whitespace, both colons will be removed from the output. When text immediately precedes the "::", one colon will be removed from the output, leaving only one colon visible (i.e., "::" will be replaced by ":"). For example, one colon will remain visible here:
Paragraph::
Literal block
Lists
Bullet list items begin with one of "-", "*", or "+" (hyphen, asterisk, or plus sign), followed by whitespace and the list item body. List item bodies must be left-aligned and indented relative to the bullet; the text immediately after the bullet determines the indentation. For example:
This paragraph is followed by a list.
* This is the first bullet list item. The blank line above the
first list item is required; blank lines between list items
(such as below this paragraph) are optional.
* This is the first paragraph in the second item in the list.
This is the second paragraph in the second item in the list.
The blank line above this paragraph is required. The left edge
of this paragraph lines up with the paragraph above, both
indented relative to the bullet.
- This is a sublist. The bullet lines up with the left edge of
the text blocks above. A sublist is a new list so requires a
blank line above and below.
* This is the third item of the main list.
This paragraph is not part of the list.
Enumerated (numbered) list items are similar, but use an enumerator instead of a bullet. Enumerators are numbers (1, 2, 3, ...), letters (A, B, C, ...; uppercase or lowercase), or Roman numerals (i, ii, iii, iv, ...; uppercase or lowercase), formatted with a period suffix ("1.", "2."), parentheses ("(1)", "(2)"), or a right-parenthesis suffix ("1)", "2)"). For example:
1. As with bullet list items, the left edge of paragraphs must
   align.

2. Each list item may contain multiple paragraphs, sublists, etc.

   This is the second paragraph of the second list item.

   a) Enumerated lists may be nested.
   b) Blank lines may be omitted between list items.
Definition lists are written like this:
what
Definition lists associate a term with a definition.
how
The term is a one-line phrase, and the definition is one
or more paragraphs or body elements, indented relative to
the term.
Tables
Simple tables are easy and compact:
=====  =====  =======
  A      B    A and B
=====  =====  =======
False  False  False
True   False  False
False  True   False
True   True   True
=====  =====  =======
There must be at least two columns in a table (to differentiate from section titles). Column spans use underlines of hyphens ("Inputs" spans the first two columns):
=====  =====  ======
   Inputs     Output
------------  ------
  A      B    A or B
=====  =====  ======
False  False  False
True   False  True
False  True   True
True   True   True
=====  =====  ======
Text in a first-column cell starts a new row. No text in the first column indicates a continuation line; the rest of the cells may consist of multiple lines. For example:
===== =========================
col 1 col 2
===== =========================
1 Second column of row 1.
2 Second column of row 2.
Second line of paragraph.
3 - Second column of row 3.
- Second item in bullet
list (row 3, column 2).
===== =========================
Hyperlinks
When referencing an external web page in the body of a PEP, you should include the title of the page in the text, with either an inline hyperlink reference to the URL or a footnote reference (see Footnotes below). Do not include the URL in the body text of the PEP.
Hyperlink references use backquotes and a trailing underscore to mark up the reference text; backquotes are optional if the reference text is a single word. For example:
In this paragraph, we refer to the `Python web site`_.
An explicit target provides the URL. Put targets in a References section at the end of the PEP, or immediately after the reference. Hyperlink targets begin with two periods and a space (the "explicit markup start"), followed by a leading underscore, the reference text, a colon, and the URL (absolute or relative):
.. _Python web site: http://www.python.org/
The reference text and the target text must match (although the match is case-insensitive and ignores differences in whitespace). Note that the underscore trails the reference text but precedes the target text. If you think of the underscore as a right-pointing arrow, it points away from the reference and toward the target.
The same mechanism can be used for internal references. Every unique section title implicitly defines an internal hyperlink target. We can make a link to the Abstract section like this:
Here is a hyperlink reference to the `Abstract`_ section. The backquotes are optional since the reference text is a single word; we can also just write: Abstract_.
Footnotes containing the URLs from external targets will be generated automatically at the end of the References section of the PEP, along with footnote references linking the reference text to the footnotes.
Text of the form "PEP x" or "RFC x" (where "x" is a number) will be linked automatically to the appropriate URLs.
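The auto-linking of such references could be implemented with a pattern along these lines (an illustrative sketch only; the real PEP-processing transform is more involved, and `find_refs` is a hypothetical name):

```python
import re

# Matches "PEP 8", "RFC 2396", etc.: the keyword, one space, digits.
REF = re.compile(r"\b(PEP|RFC) (\d+)\b")

def find_refs(text):
    """Return (kind, number) pairs for every PEP/RFC reference."""
    return REF.findall(text)
```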
Footnotes
Footnote references consist of a left square bracket, a number, a right square bracket, and a trailing underscore:
This sentence ends with a footnote reference [1]_.
Whitespace must precede the footnote reference. Leave a space between the footnote reference and the preceding word.
When referring to another PEP, include the PEP number in the body text, such as "PEP 1". The title may optionally appear. Add a footnote reference following the title. For example:
Refer to PEP 1 [2]_ for more information.
Add a footnote that includes the PEP's title and author. It may optionally include the explicit URL on a separate line, but only in the References section. Footnotes begin with ".. " (the explicit markup start), followed by the footnote marker (no underscores), followed by the footnote body. For example:
References
==========

.. [2] PEP 1, "PEP Purpose and Guidelines", Warsaw, Hylton
   (http://www.python.org/dev/peps/pep-0001)
If you decide to provide an explicit URL for a PEP, please use this as the URL template:
http://www.python.org/dev/peps/pep-xxxx
PEP numbers in URLs must be padded with zeros from the left, so as to be exactly 4 characters wide; however, PEP numbers in the text are never padded.
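The padding rule is exactly what a zero-padded format specification produces (the helper name `pep_url` is hypothetical):

```python
def pep_url(number):
    # Zero-pad the PEP number to exactly four digits for the URL;
    # body text always uses the unpadded number.
    return "http://www.python.org/dev/peps/pep-%04d" % number
```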
During the course of developing your PEP, you may have to add, remove, and rearrange footnote references, possibly resulting in mismatched references, obsolete footnotes, and confusion. Auto-numbered footnotes allow more freedom. Instead of a number, use a label of the form "#word", where "word" is a mnemonic consisting of alphanumerics plus internal hyphens, underscores, and periods (no whitespace or other characters are allowed). For example:
Refer to PEP 1 [#PEP-1]_ for more information.

References
==========

.. [#PEP-1] PEP 1, "PEP Purpose and Guidelines", Warsaw, Hylton

   http://www.python.org/dev/peps/pep-0001
Footnotes and footnote references will be numbered automatically, and the numbers will always match. Once a PEP is finalized, auto-numbered labels should be replaced by numbers for simplicity.
Images
If your PEP contains a diagram, you may include it in the processed output using the "image" directive:
.. image:: diagram.png
Any browser-friendly graphics format is possible: .png, .jpeg, .gif, .tiff, etc.
Since this image will not be visible to readers of the PEP in source text form, you should consider including a description or ASCII art alternative, using a comment (below).
Comments
A comment block is an indented block of arbitrary text immediately following an explicit markup start: two periods and whitespace. Leave the ".." on a line by itself to ensure that the comment is not misinterpreted as another explicit markup construct. Comments are not visible in the processed document. For the benefit of those reading your PEP in source form, please consider including descriptions of or ASCII art alternatives to any images you include. For example:
.. image:: dataflow.png

..
   Data flows from the input module, through the "black box"
   module, and finally into (and through) the output module.
The Emacs stanza at the bottom of this document is inside a comment.
Escaping Mechanism
reStructuredText uses backslashes ("\") to override the special meaning given to markup characters and get the literal characters themselves. To get a literal backslash, use an escaped backslash ("\\"). There are two contexts in which backslashes have no special meaning: literal blocks and inline literals (see Inline Markup above). In these contexts, no markup recognition is done, and a single backslash represents a literal backslash, without having to double up.
If you find that you need to use a backslash in your text, consider using inline literals or a literal block instead.
Habits to Avoid
Many programmers who are familiar with TeX often write quotation marks like this:
`single-quoted' or ``double-quoted''
Backquotes are significant in reStructuredText, so this practice should be avoided. For ordinary text, use ordinary 'single-quotes' or "double-quotes". For inline literal text (see Inline Markup above), use double-backquotes:
``literal text: in here, anything goes!``
Resources
Many other constructs and variations are possible. For more details about the reStructuredText markup, in increasing order of thoroughness, please see:
- A ReStructuredText Primer [7], a gentle introduction.
- Quick reStructuredText [8], a users' quick reference.
- reStructuredText Markup Specification [9], the final authority.
The processing of reStructuredText PEPs is done using Docutils [3]. If you have a question or require assistance with reStructuredText or Docutils, please post a message [4] to the Docutils-users mailing list [5]. The Docutils project web site [3] has more information.
References
| [1] | PEP 1, PEP Purpose and Guidelines, Warsaw, Hylton (http://www.python.org/dev/peps/pep-0001) |
| [2] | PEP 9, Sample Plaintext PEP Template, Warsaw (http://www.python.org/dev/peps/pep-0009) |
| [3] | (1, 2) http://docutils.sourceforge.net/ |
| [4] | mailto:docutils-users@lists.sourceforge.net?subject=PEPs |
| [5] | http://docutils.sf.net/docs/user/mailing-lists.html#docutils-users |
| [6] | http://www.opencontent.org/openpub/ |
| [7] | http://docutils.sourceforge.net/docs/rst/quickstart.html |
| [8] | http://docutils.sourceforge.net/docs/rst/quickref.html |
| [9] | http://docutils.sourceforge.net/spec/rst/reStructuredText.html |
Copyright
This document has been placed in the public domain.
pep-0020 The Zen of Python
| PEP: | 20 |
|---|---|
| Title: | The Zen of Python |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Tim Peters <tim at zope.com> |
| Status: | Active |
| Type: | Informational |
| Content-Type: | text/plain |
| Created: | 19-Aug-2004 |
| Post-History: | 22-Aug-2004 |
Abstract
Long time Pythoneer Tim Peters succinctly channels the BDFL's
guiding principles for Python's design into 20 aphorisms, only 19
of which have been written down.
The Zen of Python
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
Easter Egg
>>> import this
Copyright
This document has been placed in the public domain.
pep-0042 Feature Requests
| PEP: | 42 |
|---|---|
| Title: | Feature Requests |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Jeremy Hylton <jeremy at alum.mit.edu> |
| Status: | Final |
| Type: | Process |
| Created: | 12-Sep-2000 |
| Post-History: |
Introduction
This PEP contains a list of feature requests that may be
considered for future versions of Python. Large feature requests
should not be included here, but should be described in separate
PEPs; however a large feature request that doesn't have its own
PEP can be listed here until its own PEP is created. See
PEP 0 for details.
This PEP was created to allow us to close bug reports that are really
feature requests. Marked as Open, they distract from the list of real
bugs (which should ideally be less than a page). Marked as Closed, they
tend to be forgotten. The procedure now is: if a bug report is really
a feature request, add the feature request to this PEP; mark the bug as
"feature request", "later", and "closed"; and add a comment to the bug
saying that this is the case (mentioning the PEP explicitly). It is
also acceptable to move large feature requests directly from the bugs
database to a separate PEP.
This PEP should really be separated into four different categories
(categories due to Laura Creighton):
1. BDFL rejects as a bad idea. Don't come back with it.
2. BDFL will put in if somebody writes the code. (Or at any rate,
BDFL will say 'change this and I will put it in' if you show up
with code.)
(possibly divided into:
2a) BDFL would really like to see some code!
2b) BDFL is never going to be enthusiastic about this, but
will work it in when it's easy.
)
3. If you show up with code, BDFL will make a pronouncement. It
might be ICK.
4. This is too vague. This is rejected, but only on the grounds
of vagueness. If you like this enhancement, make a new PEP.
Core Language / Builtins
- The parser should handle more deeply nested parse trees.
The following will fail -- eval("["*50 + "]"*50) -- because the
parser has a hard-coded limit on stack size. This limit should
be raised or removed. Removal would be hard because the
current compiler can overflow the C stack if the nesting is too
deep.
http://www.python.org/sf/215555
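The nesting limit can be probed without crashing the interpreter (a sketch; the depth at which parsing fails, and the exception raised, vary by Python version, and `parses_at_depth` is a hypothetical name):

```python
def parses_at_depth(n):
    """Return True if the parser accepts n levels of nested lists."""
    # Older parsers raised MemoryError from a fixed stack limit;
    # newer ones may raise RecursionError instead.
    try:
        eval("[" * n + "]" * n)
        return True
    except (SyntaxError, MemoryError, RecursionError):
        return False
```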
- Non-accidental IEEE-754 support (Infs, NaNs, settable traps, etc).
Big project.
- Windows: Trying to create (or even access) files with certain magic
names can hang or crash Windows systems. This is really a bug in the
OSes, but some apps try to shield users from it. When it happens,
the symptoms are very confusing.
Hang using files named prn.txt, etc
http://www.python.org/sf/481171
- eval and free variables: It might be useful if there was a way
to pass bindings for free variables to eval when a code object
with free variables is passed.
http://www.python.org/sf/443866
Standard Library
- The urllib module should support proxies which require
authentication. See SourceForge bug #210619 for information:
http://www.python.org/sf/210619
- os.rename() should be modified to handle EXDEV errors on
platforms that don't allow rename() to operate across filesystem
boundaries by copying the file over and removing the original.
Linux is one system that requires this treatment.
http://www.python.org/sf/212317
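The requested fallback can be sketched as follows (the helper name `rename_portable` is hypothetical; in the modern standard library, `shutil.move()` implements essentially this strategy):

```python
import errno
import os
import shutil

def rename_portable(src, dst):
    # Fall back to copy-and-remove when rename() fails because src
    # and dst live on different filesystems (EXDEV).
    try:
        os.rename(src, dst)
    except OSError as e:
        if e.errno != errno.EXDEV:
            raise
        shutil.copy2(src, dst)
        os.remove(src)
```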
- signal handling doesn't always work as expected. E.g. if
sys.stdin.readline() is interrupted by a (returning) signal
handler, it returns "". It would be better to make it raise an
exception (corresponding to EINTR) or to restart. But these
changes would have to be applied to all places that can do blocking
interruptible I/O. So it's a big project.
http://www.python.org/sf/210599
- Extend Windows utime to accept directory paths.
http://www.python.org/sf/214245
- Extend copy.py to module & function types.
http://www.python.org/sf/214553
- Better checking for bad input to marshal.load*().
http://www.python.org/sf/214754
- rfc822.py should be more lenient than the spec in the types of
address fields it parses. Specifically, an invalid address of
the form "From: Amazon.com <delivers-news2@amazon.com>" should
be parsed correctly.
http://www.python.org/sf/210678
- cgi.py's FieldStorage class should be more conservative with
memory in the face of large binary file uploads.
http://www.python.org/sf/210674
There are two issues here: first, because
read_lines_to_outerboundary() uses readline() it is possible
that a large amount of data will be read into memory for a
binary file upload. This should probably look at the
Content-Type header of the section and do a chunked read if it's
a binary type.
The second issue was related to the self.lines attribute, which
was removed in revision 1.56 of cgi.py (see also):
http://www.python.org/sf/219806
- urllib should support proxy definitions that contain just the
host and port
http://www.python.org/sf/210849
- urlparse should be updated to comply with RFC 2396, which
defines optional parameters for each segment of the path.
http://www.python.org/sf/210834
- The exceptions raised by pickle and cPickle are currently
different; these should be unified (probably the exceptions
should be defined in a helper module that's imported by both).
[No bug report; I just thought of this.]
- More standard library routines should support Unicode. For
example, urllib.quote() could convert Unicode strings to UTF-8
and then do the usual %HH conversion. But this is not the only
one!
http://www.python.org/sf/216716
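For what it's worth, this is the behavior Python 3's `urllib.parse.quote()` eventually adopted: str input is encoded as UTF-8 by default before the usual %HH conversion:

```python
from urllib.parse import quote

# 'é' is encoded to UTF-8 (bytes 0xC3 0xA9) and then %HH-escaped.
escaped = quote("café")
```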
- There should be a way to say that you don't mind if str() or
__str__() return a Unicode string object. Or a different
function -- ustr() has been proposed. Or something...
http://sf.net/patch/?func=detailpatch&patch_id=101527&group_id=5470
- Killing a thread from another thread. Or maybe sending a
signal. Or maybe raising an asynchronous exception.
http://www.python.org/sf/221115
- The debugger (pdb) should understand packages.
http://www.python.org/sf/210631
- Jim Fulton suggested the following:
I wonder if it would be a good idea to have a new kind of
temporary file that stored data in memory unless:
- The data exceeds some size, or
- Somebody asks for a fileno.
Then the cgi module (and other apps) could use this thing in a
uniform way.
http://www.python.org/sf/415692
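This idea later landed in the standard library as `tempfile.SpooledTemporaryFile`, which keeps data in memory until `max_size` is exceeded or `fileno()` is requested:

```python
from tempfile import SpooledTemporaryFile

# Data stays in memory while under max_size; it is rolled over to a
# real temporary file once the threshold is crossed or fileno() is
# asked for.
with SpooledTemporaryFile(max_size=1024) as f:
    f.write(b"small payload")
    f.seek(0)
    data = f.read()
```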
- Jim Fulton pointed out that binascii's b2a_base64() function
has situations where it makes sense not to append a newline,
or to append something other than a newline.
Proposal:
- add an optional argument giving the delimiter string to be
appended, defaulting to "\n"
- possibly special-case None as the delimiter string to avoid
adding the pad bytes too???
http://www.python.org/sf/415694
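The first half of this proposal was eventually addressed: modern `binascii.b2a_base64()` (since Python 3.6) accepts a `newline` keyword argument:

```python
import binascii

with_newline = binascii.b2a_base64(b"abc")                 # trailing b"\n"
without_newline = binascii.b2a_base64(b"abc", newline=False)
```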
- pydoc should be integrated with the HTML docs, or at least
be able to link to them.
http://www.python.org/sf/405554
- Distutils should deduce dependencies for .c and .h files.
http://www.python.org/sf/472881
- asynchat is buggy in the face of multithreading.
http://www.python.org/sf/595217
- It would be nice if the higher level modules (httplib, smtplib,
nntplib, etc.) had options for setting socket timeouts.
http://www.python.org/sf/723287
- The curses library is missing two important calls: newterm() and
delscreen()
http://www.python.org/sf/665572, http://bugs.debian.org/175590
- It would be nice if the built-in SSL socket type could be used
for non-blocking SSL I/O. Currently packages such as Twisted
which implement async servers using SSL have to require third-party
packages such as pyopenssl.
- reST as a standard library module
- The import lock could use some redesign.
http://www.python.org/sf/683658
- A nicer API to open text files, replacing the ugly (in some
people's eyes) "U" mode flag. There's a proposal out there to
have a new built-in type textfile(filename, mode, encoding).
(Shouldn't it have a bufsize argument too?)
- Support new widgets and/or parameters for Tkinter
- For a class defined inside another class, the __name__ should be
"outer.inner", and pickling should work. (GvR is no longer certain
this is easy or even right.)
http://www.python.org/sf/633930
- Decide on a clearer deprecation policy (especially for modules)
and act on it.
http://mail.python.org/pipermail/python-dev/2002-April/023165.html
- Provide alternatives for common uses of the types module;
Skip Montanaro has posted a proto-PEP for this idea:
http://mail.python.org/pipermail/python-dev/2002-May/024346.html
- Use pending deprecation for the types and string modules. This
requires providing alternatives for the parts that aren't
covered yet (e.g. string.whitespace and types.TracebackType).
It seems we can't get consensus on this.
- Lazily tracking tuples?
http://mail.python.org/pipermail/python-dev/2002-May/023926.html
http://www.python.org/sf/558745
- Make 'as' a keyword. It has been a pseudo-keyword long enough.
(It's deprecated in 2.5, and will become a keyword in 2.6.)
C API wishes
- Add C API functions to help Windows users who are building
embedded applications where the FILE * structure does not match
the FILE * the interpreter was compiled with.
http://www.python.org/sf/210821
See this bug report for a specific suggestion that will allow a
Borland C++ builder application to interact with a python.dll
build with MSVC.
Tools
- Python could use a GUI builder.
http://www.python.org/sf/210820
Building and Installing
- Modules/makesetup should make sure the 'config.c' file it
generates from the various Setup files, is valid C. It currently
accepts module names with characters that are not allowable in
Python or C identifiers.
http://www.python.org/sf/216326
- Building from source should not attempt to overwrite the
Include/graminit.h and Parser/graminit.c files, at least for
people downloading a source release rather than working from
Subversion or snapshots. Some people find this a problem in unusual
build environments.
http://www.python.org/sf/219221
- The configure script has probably grown a bit crufty with age and may
not track autoconf's more recent features very well. It should be
looked at and possibly cleaned up.
http://mail.python.org/pipermail/python-dev/2004-January/041790.html
- Make Python compliant to the FHS (the Filesystem Hierarchy Standard)
http://bugs.python.org/issue588756
pep-0100 Python Unicode Integration
| PEP: | 100 |
|---|---|
| Title: | Python Unicode Integration |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Marc-AndrĂŠ Lemburg <mal at lemburg.com> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 10-Mar-2000 |
| Python-Version: | 2.0 |
| Post-History: |
Historical Note
This document was first written by Marc-Andre in the pre-PEP days,
and was originally distributed as Misc/unicode.txt in Python
distributions up to and including Python 2.1. The last revision of
the proposal in that location was labeled version 1.7 (CVS
revision 3.10). Because the document clearly serves the purpose
of an informational PEP in the post-PEP era, it has been moved
here and reformatted to comply with PEP guidelines. Future
revisions will be made to this document, while Misc/unicode.txt
will contain a pointer to this PEP.
-Barry Warsaw, PEP editor
Introduction
The idea of this proposal is to add native Unicode 3.0 support to
Python in a way that makes use of Unicode strings as simple as
possible without introducing too many pitfalls along the way.
Since this goal is not easy to achieve -- strings being one of the
most fundamental objects in Python -- we expect this proposal to
undergo some significant refinements.
Note that the current version of this proposal is still a bit
unsorted due to the many different aspects of the Unicode-Python
integration.
The latest version of this document is always available at:
http://starship.python.net/~lemburg/unicode-proposal.txt
Older versions are available as:
http://starship.python.net/~lemburg/unicode-proposal-X.X.txt
[ed. note: new revisions should be made to this PEP document,
while the historical record previous to version 1.7 should be
retrieved from MAL's url, or Misc/unicode.txt]
Conventions
- In examples we use u = Unicode object and s = Python string
- 'XXX' markings indicate points of discussion (PODs)
General Remarks
- Unicode encoding names should be lower case on output and
case-insensitive on input (they will be converted to lower case
by all APIs taking an encoding name as input).
- Encoding names should follow the name conventions as used by the
Unicode Consortium: spaces are converted to hyphens, e.g. 'utf
16' is written as 'utf-16'.
- Codec modules should use the same names, but with hyphens
converted to underscores, e.g. utf_8, utf_16, iso_8859_1.
Unicode Default Encoding
The Unicode implementation has to make some assumption about the
encoding of 8-bit strings passed to it for coercion and about the
encoding to use as the default for conversion of Unicode to strings when
no specific encoding is given. This encoding is called <default
encoding> throughout this text.
For this, the implementation maintains a global which can be set
in the site.py Python startup script. Subsequent changes are not
possible. The <default encoding> can be set and queried using the
two sys module APIs:
sys.setdefaultencoding(encoding)
--> Sets the <default encoding> used by the Unicode implementation.
encoding has to be an encoding which is supported by the
Python installation, otherwise, a LookupError is raised.
Note: This API is only available in site.py! It is
removed from the sys module by site.py after usage.
sys.getdefaultencoding()
--> Returns the current <default encoding>.
If not otherwise defined or set, the <default encoding> defaults
to 'ascii'. This encoding is also the startup default of Python
(and in effect before site.py is executed).
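For comparison, modern Python 3 removed the settable half of this design entirely: `sys.setdefaultencoding()` no longer exists outside site initialization, and the default encoding is fixed:

```python
import sys

# Python 3 hard-wires the default encoding to UTF-8 and exposes only
# the query side of the old API.
enc = sys.getdefaultencoding()
settable = hasattr(sys, "setdefaultencoding")
```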
Note that the default site.py startup module contains disabled
optional code which can set the <default encoding> according to
the encoding defined by the current locale. The locale module is
used to extract the encoding from the locale default settings
defined by the OS environment (see locale.py). If the encoding
cannot be determined, is unknown or unsupported, the code defaults
to setting the <default encoding> to 'ascii'. To enable this
code, edit the site.py file or place the appropriate code into the
sitecustomize.py module of your Python installation.
Unicode Constructors
Python should provide a built-in constructor for Unicode strings
which is available through __builtins__:
u = unicode(encoded_string[,encoding=<default encoding>][,errors="strict"])
u = u'<unicode-escape encoded Python string>'
u = ur'<raw-unicode-escape encoded Python string>'
With the 'unicode-escape' encoding being defined as:
- all non-escape characters represent themselves as Unicode
ordinal (e.g. 'a' -> U+0061).
- all existing defined Python escape sequences are interpreted as
Unicode ordinals; note that \xXXXX can represent all Unicode
ordinals, and \OOO (octal) can represent Unicode ordinals up to
U+01FF.
- a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax
error to have fewer than 4 digits after \u.
For an explanation of possible values for errors see the Codec
section below.
Examples:
u'abc' -> U+0061 U+0062 U+0063
u'\u1234' -> U+1234
u'abc\u1234\n' -> U+0061 U+0062 U+0063 U+1234 U+000A
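The 'unicode-escape' codec survives in Python 3 under the same name; `str.encode()` and `bytes.decode()` round-trip the escape notation shown above:

```python
# Encoding produces the backslash-escape notation; decoding reverses it.
encoded = "abc\u1234\n".encode("unicode-escape")
decoded = b"abc\\u1234".decode("unicode-escape")
```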
The 'raw-unicode-escape' encoding is defined as follows:
- \uXXXX sequences represent the U+XXXX Unicode character if and
only if the number of leading backslashes is odd
- all other characters represent themselves as Unicode ordinal
(e.g. 'b' -> U+0062)
Note that you should provide some hint to the encoding you used to
write your programs as a pragma line in one of the first few comment
lines of the source file (e.g. '# source file encoding: latin-1').
If you only use 7-bit ASCII then everything is fine and no such
notice is needed, but if you include Latin-1 characters not
defined in ASCII, it may well be worthwhile including a hint since
people in other countries will want to be able to read your source
strings too.
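In modern Python 3 the `unicode` type became `str`, and `bytes.decode()` plays the role of the `unicode(encoded_string, encoding, errors)` constructor described above:

```python
# bytes.decode() is the Python 3 spelling of the unicode() constructor;
# errors="strict" (the default) raises UnicodeDecodeError on bad input.
s = b"abc\xe2\x82\xac".decode("utf-8")   # 'abc\u20ac'
try:
    b"\xff".decode("utf-8", errors="strict")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True
```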
Unicode Type Object
Unicode objects should have the type UnicodeType with type name
'unicode', made available through the standard types module.
Unicode Output
Unicode objects have a method .encode([encoding=<default encoding>])
which returns a Python string encoding the Unicode string using the
given scheme (see Codecs).
print u := print u.encode() # using the <default encoding>
str(u) := u.encode() # using the <default encoding>
repr(u) := "u%s" % repr(u.encode('unicode-escape'))
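A minimal sketch of the .encode() behavior in modern Python 3 terms (assumption: the unicode type became str and .encode() returns bytes rather than a Python 2 string):

```python
# Sketch in modern Python 3 terms, where the unicode type became str
# and .encode() returns bytes in the requested encoding.
u = "abc\u1234"
data = u.encode("utf-8")
assert data == b"abc\xe1\x88\xb4"   # U+1234 as three UTF-8 bytes
assert data.decode("utf-8") == u    # round trip
```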
Also see Internal Argument Parsing and Buffer Interface for
details on how other APIs written in C will treat Unicode objects.
Unicode Ordinals
Since Unicode 3.0 has a 32-bit ordinal character set, the
implementation should provide 32-bit aware ordinal conversion
APIs:
ord(u[:1]) (this is the standard ord() extended to work with Unicode
objects)
--> Unicode ordinal number (32-bit)
unichr(i)
--> Unicode object for character i (provided it is 32-bit);
ValueError otherwise
Both APIs should go into __builtins__ just like their string
counterparts ord() and chr().
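In modern Python 3 (where unichr() was eventually folded into chr()), the proposed ordinal conversions behave as specified:

```python
# Sketch: in modern Python 3, unichr() was folded into chr(); both
# builtins now cover the full Unicode range.
assert ord("a") == 0x61
assert chr(0x1234) == "\u1234"
```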
Note that Unicode provides space for private encodings. Usage of
these can cause different output representations on different
machines. This problem is not a Python or Unicode problem, but a
machine setup and maintenance one.
Comparison & Hash Value
Unicode objects should compare equal to other objects after these
other objects have been coerced to Unicode. For strings this
means that they are interpreted as a Unicode string using the
<default encoding>.
Unicode objects should return the same hash value as their ASCII
equivalent strings. Unicode strings holding non-ASCII values are
not guaranteed to return the same hash values as the default
encoded equivalent string representation.
When compared using cmp() (or PyObject_Compare()) the
implementation should mask TypeErrors raised during the conversion
to remain in sync with the string behavior. All other errors,
such as ValueErrors raised during coercion of strings to Unicode,
should not be masked but should be passed through to the user.
In containment tests ('a' in u'abc' and u'a' in 'abc') both sides
should be coerced to Unicode before applying the test. Errors
occurring during coercion (e.g. None in u'abc') should not be
masked.
Coercion
Using Python strings and Unicode objects to form new objects
should always coerce to the more precise format, i.e. Unicode
objects.
u + s := u + unicode(s)
s + u := unicode(s) + u
All string methods should delegate the call to an equivalent
Unicode object method call by converting all involved strings to
Unicode and then applying the arguments to the Unicode method of
the same name, e.g.
string.join((s,u),sep) := (s + sep) + u
sep.join((s,u)) := (s + sep) + u
For a discussion of %-formatting w/r to Unicode objects, see
Formatting Markers.
Exceptions
UnicodeError is defined in the exceptions module as a subclass of
ValueError. It is available at the C level via
PyExc_UnicodeError. All exceptions related to Unicode
encoding/decoding should be subclasses of UnicodeError.
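The proposed hierarchy can be sketched against the modern exceptions (assumption: the concrete subclasses UnicodeDecodeError/UnicodeEncodeError are the later spellings of this rule):

```python
# Sketch of the exception hierarchy as it exists in modern Python 3.
assert issubclass(UnicodeError, ValueError)
assert issubclass(UnicodeDecodeError, UnicodeError)

try:
    b"\xff".decode("ascii")
except UnicodeError:
    pass   # decoding errors are UnicodeError subclasses
```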
Codecs (Coder/Decoders) Lookup
A Codec (see Codec Interface Definition) search registry should be
implemented by a module "codecs":
codecs.register(search_function)
Search functions are expected to take one argument, the encoding
name in all lower case letters and with hyphens and spaces
converted to underscores, and return a tuple of functions
(encoder, decoder, stream_reader, stream_writer) taking the
following arguments:
encoder and decoder:
These must be functions or methods which have the same
interface as the .encode/.decode methods of Codec instances
(see Codec Interface). The functions/methods are expected to
work in a stateless mode.
stream_reader and stream_writer:
These need to be factory functions with the following
interface:
factory(stream,errors='strict')
The factory functions must return objects providing the
interfaces defined by StreamWriter/StreamReader resp. (see
Codec Interface). Stream codecs can maintain state.
Possible values for errors are defined in the Codec section
below.
In case a search function cannot find a given encoding, it should
return None.
Aliasing support for encodings is left to the search functions to
implement.
The codecs module will maintain an encoding cache for performance
reasons. Encodings are first looked up in the cache. If not
found, the list of registered search functions is scanned. If no
codecs tuple is found, a LookupError is raised. Otherwise, the
codecs tuple is stored in the cache and returned to the caller.
To query the Codec instance the following API should be used:
codecs.lookup(encoding)
This will either return the found codecs tuple or raise a
LookupError.
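A sketch of the lookup machinery using today's codecs module (assumption: the modern CodecInfo return value still behaves as the 4-tuple described above):

```python
import codecs

# codecs.lookup() today returns a CodecInfo object, which still
# behaves as the (encoder, decoder, stream_reader, stream_writer)
# tuple described above.
encoder, decoder, stream_reader, stream_writer = codecs.lookup("utf-8")[:4]

data, consumed = encoder("abc")   # stateless encoder: (output, consumed)
assert data == b"abc" and consumed == 3

# Unknown encodings raise LookupError once all search functions return None.
try:
    codecs.lookup("no-such-encoding")
except LookupError:
    pass
```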
Standard Codecs
Standard codecs should live inside an encodings/ package directory
in the Standard Python Code Library. The __init__.py file of that
directory should include a Codec Lookup compatible search function
implementing a lazy module based codec lookup.
Python should provide a few standard codecs for the most relevant
encodings, e.g.
'utf-8': 8-bit variable length encoding
'utf-16': 16-bit variable length encoding (little/big endian)
'utf-16-le': utf-16 but explicitly little endian
'utf-16-be': utf-16 but explicitly big endian
'ascii': 7-bit ASCII codepage
'iso-8859-1': ISO 8859-1 (Latin 1) codepage
'unicode-escape': See Unicode Constructors for a definition
'raw-unicode-escape': See Unicode Constructors for a definition
'native': Dump of the Internal Format used by Python
Common aliases should also be provided per default, e.g.
'latin-1' for 'iso-8859-1'.
Note: 'utf-16' should be implemented by using and requiring byte
order marks (BOM) for file input/output.
All other encodings such as the CJK ones to support Asian scripts
should be implemented in separate packages which do not get
included in the core Python distribution and are not a part of
this proposal.
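The standard codecs and aliases listed above can be exercised with today's str/bytes types (a sketch, not part of the original proposal text):

```python
# A few of the standard codecs, exercised with modern Python 3 types.
s = "abc"
assert s.encode("ascii") == b"abc"
assert s.encode("latin-1") == s.encode("iso-8859-1")   # common alias
enc = s.encode("utf-16")            # starts with a byte order mark (BOM)
assert enc[:2] in (b"\xfe\xff", b"\xff\xfe")
assert enc.decode("utf-16") == s    # the BOM selects the byte order
```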
Codec Interface Definition
The following base classes should be defined in the module
"codecs". They provide not only templates for use by encoding module
implementors, but also define the interface which is expected by
the Unicode implementation.
Note that the Codec Interface defined here is suitable for a much
larger range of applications. The Unicode implementation expects
Unicode objects on input for .encode() and .write() and character
buffer compatible objects on input for .decode(). Output of
.encode() and .read() should be a Python string and .decode() must
return a Unicode object.
First, we have the stateless encoders/decoders. These do not work
in chunks as the stream codecs (see below) do, because all
components are expected to be available in memory.
class Codec:
"""Defines the interface for stateless encoders/decoders.
The .encode()/.decode() methods may implement different
error handling schemes by providing the errors argument.
These string values are defined:
'strict' - raise an error (or a subclass)
'ignore' - ignore the character and continue with the next
'replace' - replace with a suitable replacement character;
Python will use the official U+FFFD
REPLACEMENT CHARACTER for the builtin Unicode
codecs.
"""
def encode(self,input,errors='strict'):
"""Encodes the object input and returns a tuple (output
object, length consumed).
errors defines the error handling to apply. It
defaults to 'strict' handling.
The method may not store state in the Codec instance.
Use StreamCodec for codecs which have to keep state in
order to make encoding/decoding efficient.
"""
def decode(self,input,errors='strict'):
"""Decodes the object input and returns a tuple (output
object, length consumed).
input must be an object which provides the
bf_getreadbuf buffer slot. Python strings, buffer
objects and memory mapped files are examples of objects
providing this slot.
errors defines the error handling to apply. It
defaults to 'strict' handling.
The method may not store state in the Codec instance.
Use StreamCodec for codecs which have to keep state in
order to make encoding/decoding efficient.
"""
StreamWriter and StreamReader define the interface for stateful
encoders/decoders which work on streams. These allow processing
of the data in chunks to efficiently use memory. If you have
large strings in memory, you may want to wrap them with cStringIO
objects and then use these codecs on them to be able to do chunk
processing as well, e.g. to provide progress information to the
user.
class StreamWriter(Codec):
def __init__(self,stream,errors='strict'):
"""Creates a StreamWriter instance.
stream must be a file-like object open for writing
(binary) data.
The StreamWriter may implement different error handling
schemes by providing the errors keyword argument.
These parameters are defined:
'strict' - raise a ValueError (or a subclass)
'ignore' - ignore the character and continue with the next
'replace'- replace with a suitable replacement character
"""
self.stream = stream
self.errors = errors
def write(self,object):
"""Writes the object's contents encoded to self.stream.
"""
data, consumed = self.encode(object,self.errors)
self.stream.write(data)
def writelines(self, list):
"""Writes the concatenated list of strings to the stream
using .write().
"""
self.write(''.join(list))
def reset(self):
"""Flushes and resets the codec buffers used for keeping state.
Calling this method should ensure that the data on the
output is put into a clean state, that allows appending
of new fresh data without having to rescan the whole
stream to recover state.
"""
pass
def __getattr__(self,name, getattr=getattr):
"""Inherit all other methods from the underlying stream.
"""
return getattr(self.stream,name)
class StreamReader(Codec):
def __init__(self,stream,errors='strict'):
"""Creates a StreamReader instance.
stream must be a file-like object open for reading
(binary) data.
The StreamReader may implement different error handling
schemes by providing the errors keyword argument.
These parameters are defined:
'strict' - raise a ValueError (or a subclass)
'ignore' - ignore the character and continue with the next
'replace'- replace with a suitable replacement character;
"""
self.stream = stream
self.errors = errors
def read(self,size=-1):
"""Decodes data from the stream self.stream and returns the
resulting object.
size indicates the approximate maximum number of bytes
to read from the stream for decoding purposes. The
decoder can modify this setting as appropriate. The
default value -1 indicates to read and decode as much
as possible. size is intended to prevent having to
decode huge files in one step.
The method should use a greedy read strategy meaning
that it should read as much data as is allowed within
the definition of the encoding and the given size, e.g.
if optional encoding endings or state markers are
available on the stream, these should be read too.
"""
# Unsliced reading:
if size < 0:
return self.decode(self.stream.read())[0]
# Sliced reading:
read = self.stream.read
decode = self.decode
data = read(size)
i = 0
while 1:
try:
object, decodedbytes = decode(data)
except ValueError:
# This method is slow but should work under pretty
# much all conditions; at most 10 tries are made
i = i + 1
newdata = read(1)
if not newdata or i > 10:
raise
data = data + newdata
else:
return object
def readline(self, size=None):
"""Read one line from the input stream and return the
decoded data.
Note: Unlike the .readlines() method, this method
inherits the line breaking knowledge from the
underlying stream's .readline() method -- there is
currently no support for line breaking using the codec
decoder due to lack of line buffering. Subclasses
should however, if possible, try to implement this
method using their own knowledge of line breaking.
size, if given, is passed as size argument to the
stream's .readline() method.
"""
if size is None:
line = self.stream.readline()
else:
line = self.stream.readline(size)
return self.decode(line)[0]
def readlines(self, sizehint=0):
"""Read all lines available on the input stream
and return them as list of lines.
Line breaks are implemented using the codec's decoder
method and are included in the list entries.
sizehint, if given, is passed as size argument to the
stream's .read() method.
"""
if sizehint is None:
data = self.stream.read()
else:
data = self.stream.read(sizehint)
return self.decode(data)[0].splitlines(1)
def reset(self):
"""Resets the codec buffers used for keeping state.
Note that no stream repositioning should take place.
This method is primarily intended to be able to recover
from decoding errors.
"""
pass
def __getattr__(self,name, getattr=getattr):
""" Inherit all other methods from the underlying stream.
"""
return getattr(self.stream,name)
Stream codec implementors are free to combine the StreamWriter and
StreamReader interfaces into one class. Even combining all these
with the Codec class should be possible.
Implementors are free to add additional methods to enhance the
codec functionality or provide extra state information needed for
them to work. The internal codec implementation will only use the
above interfaces, though.
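A sketch of the stream interfaces using today's codecs factories (assumption: codecs.getwriter/getreader are the modern access path to the StreamWriter/StreamReader classes specified above):

```python
import codecs, io

# Sketch using the modern codecs.getwriter/getreader factories, which
# return StreamWriter/StreamReader classes as specified here.
buf = io.BytesIO()
writer = codecs.getwriter("utf-8")(buf, errors="strict")
writer.write("abc\u1234")           # encodes and writes to the stream
assert buf.getvalue() == b"abc\xe1\x88\xb4"

buf.seek(0)
reader = codecs.getreader("utf-8")(buf)
assert reader.read() == "abc\u1234" # reads and decodes from the stream
```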
It is not required by the Unicode implementation to use these base
classes, only the interfaces must match; this allows writing
Codecs as extension types.
As guideline, large mapping tables should be implemented using
static C data in separate (shared) extension modules. That way
multiple processes can share the same data.
A tool to auto-convert Unicode mapping files to mapping modules
should be provided to simplify support for additional mappings
(see References).
Whitespace
The .split() method will have to know about what is considered
whitespace in Unicode.
Case Conversion
Case conversion is rather complicated with Unicode data, since
there are many different conditions to respect. See
http://www.unicode.org/unicode/reports/tr13/
for some guidelines on implementing case conversion.
For Python, we should only implement the 1-1 conversions included
in Unicode. Locale dependent and other special case conversions
(see the Unicode standard file SpecialCasing.txt) should be left
to user land routines and not go into the core interpreter.
The methods .capitalize() and .iscapitalized() should follow the
case mapping algorithm defined in the above technical report as
closely as possible.
Line Breaks
Line breaking should be done for all Unicode characters having the
B property as well as the combinations CRLF, CR, LF (interpreted
in that order) and other special line separators defined by the
standard.
The Unicode type should provide a .splitlines() method which
returns a list of lines according to the above specification. See
Unicode Methods.
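The specified behavior matches .splitlines() in modern Python 3 (which recognizes CR, LF, CRLF and the other Unicode line separators such as U+2028):

```python
# .splitlines() in modern Python 3 follows this specification.
s = "one\ntwo\r\nthree\u2028four"
assert s.splitlines() == ["one", "two", "three", "four"]
assert s.splitlines(True) == ["one\n", "two\r\n", "three\u2028", "four"]
```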
Unicode Character Properties
A separate module "unicodedata" should provide a compact interface
to all Unicode character properties defined in the standard's
UnicodeData.txt file.
Among other things, these properties provide ways to recognize
numbers, digits, spaces, whitespace, etc.
Since this module will have to provide access to all Unicode
characters, it will eventually have to contain the data from
UnicodeData.txt which takes up around 600kB. For this reason, the
data should be stored in static C data. This enables compilation
as a shared module which the underlying OS can share between
processes (unlike normal Python code modules).
There should be a standard Python interface for accessing this
information so that other implementors can plug in their own
possibly enhanced versions, e.g. ones that do decompressing of the
data on-the-fly.
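The unicodedata module as it ships in modern Python exposes exactly these UnicodeData.txt properties; a short sketch:

```python
import unicodedata

# The modern unicodedata module exposes the UnicodeData.txt
# properties described above.
assert unicodedata.name("a") == "LATIN SMALL LETTER A"
assert unicodedata.category("A") == "Lu"          # uppercase letter
assert unicodedata.decimal("\u0661") == 1         # ARABIC-INDIC DIGIT ONE
```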
Private Code Point Areas
Support for these is left to user land Codecs and not explicitly
integrated into the core. Note that due to the Internal Format
being implemented, only the area between \uE000 and \uF8FF is
usable for private encodings.
Internal Format
The internal format for Unicode objects should use a Python
specific fixed format <PythonUnicode> implemented as 'unsigned
short' (or another unsigned numeric type having 16 bits). Byte
order is platform dependent.
This format will hold UTF-16 encodings of the corresponding
Unicode ordinals. The Python Unicode implementation will address
these values as if they were UCS-2 values. UCS-2 and UTF-16 are
the same for all currently defined Unicode character points.
UTF-16 without surrogates provides access to about 64k characters
and covers all characters in the Basic Multilingual Plane (BMP) of
Unicode.
It is the Codec's responsibility to ensure that the data they pass
to the Unicode object constructor respects this assumption. The
constructor does not check the data for Unicode compliance or use
of surrogates.
Future implementations can extend the 16-bit restriction to the
full set of all UTF-16 addressable characters (around 1M
characters).
The Unicode API should provide interface routines from
<PythonUnicode> to the compiler's wchar_t which can be 16 or 32
bit depending on the compiler/libc/platform being used.
Unicode objects should have a pointer to a cached Python string
object <defenc> holding the object's value using the <default
encoding>. This is needed for performance and internal parsing
(see Internal Argument Parsing) reasons. The buffer is filled
when the first conversion request to the <default encoding> is
issued on the object.
Interning is not needed (for now), since Python identifiers are
defined as being ASCII only.
codecs.BOM should return the byte order mark (BOM) for the format
used internally. The codecs module should provide the following
additional constants for convenience and reference (codecs.BOM
will either be BOM_BE or BOM_LE depending on the platform):
BOM_BE: '\376\377'
(corresponds to Unicode U+0000FEFF in UTF-16 on big endian
platforms == ZERO WIDTH NO-BREAK SPACE)
BOM_LE: '\377\376'
(corresponds to Unicode U+0000FFFE in UTF-16 on little endian
platforms == defined as being an illegal Unicode character)
BOM4_BE: '\000\000\376\377'
(corresponds to Unicode U+0000FEFF in UCS-4)
BOM4_LE: '\377\376\000\000'
(corresponds to Unicode U+0000FFFE in UCS-4)
Note that Unicode sees big endian byte order as being "correct".
The swapped order is taken to be an indicator for a "wrong"
format, hence the illegal character definition.
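These constants exist in the modern codecs module under the same BOM_BE/BOM_LE names; a quick sketch of the values given above:

```python
import codecs

# The BOM constants as they exist in the modern codecs module.
assert codecs.BOM_BE == b"\xfe\xff"     # '\376\377'
assert codecs.BOM_LE == b"\xff\xfe"     # '\377\376'
assert codecs.BOM in (codecs.BOM_BE, codecs.BOM_LE)   # platform byte order
```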
The configure script should provide aid in deciding whether Python
can use the native wchar_t type or not (it has to be a 16-bit
unsigned type).
Buffer Interface
Implement the buffer interface using the <defenc> Python string
object as basis for bf_getcharbuf and the internal buffer for
bf_getreadbuf. If bf_getcharbuf is requested and the <defenc>
object does not yet exist, it is created first.
Note that as special case, the parser marker "s#" will not return
raw Unicode UTF-16 data (which the bf_getreadbuf returns), but
instead tries to encode the Unicode object using the default
encoding and then returns a pointer to the resulting string object
(or raises an exception in case the conversion fails). This was
done in order to prevent accidentally writing binary data to an
output stream which the other end might not recognize.
This has the advantage of being able to write to output streams
(which typically use this interface) without additional
specification of the encoding to use.
If you need to access the read buffer interface of Unicode
objects, use the PyObject_AsReadBuffer() interface.
The internal format can also be accessed using the
'unicode-internal' codec, e.g. via u.encode('unicode-internal').
Pickle/Marshalling
Should have native Unicode object support. The objects should be
encoded using platform independent encodings.
Marshal should use UTF-8 and Pickle should either choose
Raw-Unicode-Escape (in text mode) or UTF-8 (in binary mode) as
encoding. Using UTF-8 instead of UTF-16 has the advantage of
eliminating the need to store a BOM mark.
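Both modules round-trip Unicode strings in modern Python; a sketch (assumption: modern CPython uses UTF-8 internally for both, per this recommendation):

```python
import pickle, marshal

# Sketch: both modules round-trip Unicode strings using platform
# independent encodings internally.
s = "abc\u1234"
assert pickle.loads(pickle.dumps(s)) == s
assert marshal.loads(marshal.dumps(s)) == s
```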
Regular Expressions
Secret Labs AB is working on a Unicode-aware regular expression
machinery. It works on plain 8-bit, UCS-2, and (optionally) UCS-4
internal character buffers.
Also see
http://www.unicode.org/unicode/reports/tr18/
for some remarks on how to treat Unicode REs.
Formatting Markers
Format markers are used in Python format strings. If Python
strings are used as format strings, the following interpretations
should be in effect:
'%s': For Unicode objects this will cause coercion of the
whole format string to Unicode. Note that you should use
a Unicode format string to start with for performance
reasons.
In case the format string is an Unicode object, all parameters are
coerced to Unicode first and then put together and formatted
according to the format string. Numbers are first converted to
strings and then to Unicode.
'%s': Python strings are interpreted as a Unicode
string using the <default encoding>. Unicode objects are
taken as is.
All other string formatters should work accordingly.
Example:
u"%s %s" % (u"abc", "abc") == u"abc abc"
Internal Argument Parsing
These markers are used by the PyArg_ParseTuple() APIs:
"U": Check for Unicode object and return a pointer to it
"s": For Unicode objects: return a pointer to the object's
<defenc> buffer (which uses the <default encoding>).
"s#": Access to the default encoded version of the Unicode object
(see Buffer Interface); note that the length relates to
the length of the default encoded string rather than the
Unicode object length.
"t#": Same as "s#".
"es":
Takes two parameters: encoding (const char *) and buffer
(char **).
The input object is first coerced to Unicode in the usual
way and then encoded into a string using the given
encoding.
On output, a buffer of the needed size is allocated and
returned through *buffer as a NULL-terminated string. The
encoded string may not contain embedded NULL characters. The
caller is responsible for calling PyMem_Free() to free the
allocated *buffer after usage.
"es#":
Takes three parameters: encoding (const char *), buffer
(char **) and buffer_len (int *).
The input object is first coerced to Unicode in the usual
way and then encoded into a string using the given
encoding.
If *buffer is non-NULL, *buffer_len must be set to
sizeof(buffer) on input. Output is then copied to *buffer.
If *buffer is NULL, a buffer of the needed size is
allocated and output copied into it. *buffer is then
updated to point to the allocated memory area. The caller
is responsible for calling PyMem_Free() to free the
allocated *buffer after usage.
In both cases *buffer_len is updated to the number of
characters written (excluding the trailing NULL-byte).
The output buffer is assured to be NULL-terminated.
Examples:
Using "es#" with auto-allocation:
static PyObject *
test_parser(PyObject *self,
PyObject *args)
{
PyObject *str;
const char *encoding = "latin-1";
char *buffer = NULL;
int buffer_len = 0;
if (!PyArg_ParseTuple(args, "es#:test_parser",
encoding, &buffer, &buffer_len))
return NULL;
if (!buffer) {
PyErr_SetString(PyExc_SystemError,
"buffer is NULL");
return NULL;
}
str = PyString_FromStringAndSize(buffer, buffer_len);
PyMem_Free(buffer);
return str;
}
Using "es" with auto-allocation returning a NULL-terminated string:
static PyObject *
test_parser(PyObject *self,
PyObject *args)
{
PyObject *str;
const char *encoding = "latin-1";
char *buffer = NULL;
if (!PyArg_ParseTuple(args, "es:test_parser",
encoding, &buffer))
return NULL;
if (!buffer) {
PyErr_SetString(PyExc_SystemError,
"buffer is NULL");
return NULL;
}
str = PyString_FromString(buffer);
PyMem_Free(buffer);
return str;
}
Using "es#" with a pre-allocated buffer:
static PyObject *
test_parser(PyObject *self,
PyObject *args)
{
PyObject *str;
const char *encoding = "latin-1";
char _buffer[10];
char *buffer = _buffer;
int buffer_len = sizeof(_buffer);
if (!PyArg_ParseTuple(args, "es#:test_parser",
encoding, &buffer, &buffer_len))
return NULL;
if (!buffer) {
PyErr_SetString(PyExc_SystemError,
"buffer is NULL");
return NULL;
}
str = PyString_FromStringAndSize(buffer, buffer_len);
return str;
}
File/Stream Output
Since file.write(object) and most other stream writers use the
"s#" or "t#" argument parsing marker for querying the data to
write, the default encoded string version of the Unicode object
will be written to the streams (see Buffer Interface).
For explicit handling of files using Unicode, the standard stream
codecs as available through the codecs module should be used.
The codecs module should provide a short-cut
open(filename,mode,encoding) which also assures that mode
contains the 'b' character when needed.
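The proposed shortcut exists in the modern codecs module; a sketch (the file path here is a throwaway example, not from the proposal):

```python
import codecs, os, tempfile

# Sketch of the proposed codecs.open() shortcut as it exists in the
# modern codecs module; the path below is a hypothetical example.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with codecs.open(path, "w", encoding="utf-8") as f:   # binary mode is implied
    f.write("abc\u1234")
with codecs.open(path, "r", encoding="utf-8") as f:
    assert f.read() == "abc\u1234"
```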
File/Stream Input
Only the user knows what encoding the input data uses, so no
special magic is applied. The user will have to explicitly
convert the string data to Unicode objects as needed or use the
file wrappers defined in the codecs module (see File/Stream
Output).
Unicode Methods & Attributes
All Python string methods, plus:
.encode([encoding=<default encoding>][,errors="strict"])
--> see Unicode Output
.splitlines([include_breaks=0])
--> breaks the Unicode string into a list of (Unicode) lines;
returns the lines with line breaks included, if
include_breaks is true. See Line Breaks for a
specification of how line breaking is done.
Code Base
We should use Fredrik Lundh's Unicode object implementation as
basis. It already implements most of the string methods needed
and provides a well written code base which we can build upon.
The object sharing implemented in Fredrik's implementation should
be dropped.
Test Cases
Test cases should follow those in Lib/test/test_string.py and
include additional checks for the Codec Registry and the Standard
Codecs.
References
Unicode Consortium:
http://www.unicode.org/
Unicode FAQ:
http://www.unicode.org/unicode/faq/
Unicode 3.0:
http://www.unicode.org/unicode/standard/versions/Unicode3.0.html
Unicode-TechReports:
http://www.unicode.org/unicode/reports/techreports.html
Unicode-Mappings:
ftp://ftp.unicode.org/Public/MAPPINGS/
Introduction to Unicode (a little outdated but still nice to read):
http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html
For comparison:
Introducing Unicode to ECMAScript (aka JavaScript) --
http://www-4.ibm.com/software/developer/library/internationalization-support.html
IANA Character Set Names:
ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
Discussion of UTF-8 and Unicode support for POSIX and Linux:
http://www.cl.cam.ac.uk/~mgk25/unicode.html
Encodings:
Overview:
http://czyborra.com/utf/
UCS-2:
http://www.uazone.com/multiling/unicode/ucs2.html
UTF-7:
Defined in RFC2152, e.g.
http://www.uazone.com/multiling/ml-docs/rfc2152.txt
UTF-8:
Defined in RFC2279, e.g.
http://info.internet.isi.edu/in-notes/rfc/files/rfc2279.txt
UTF-16:
http://www.uazone.com/multiling/unicode/wg2n1035.html
History of this Proposal
[ed. note: revisions prior to 1.7 are available in the CVS history
of Misc/unicode.txt from the standard Python distribution. All
subsequent history is available via the CVS revisions on this
file.]
1.7: Added note about the changed behaviour of "s#".
1.6: Changed <defencstr> to <defenc> since this is the name used in the
implementation. Added notes about the usage of <defenc> in
the buffer protocol implementation.
1.5: Added notes about setting the <default encoding>. Fixed some
typos (thanks to Andrew Kuchling). Changed <defencstr> to
<utf8str>.
1.4: Added note about mixed type comparisons and contains tests.
Changed treating of Unicode objects in format strings (if
used with '%s' % u they will now cause the format string to
be coerced to Unicode, thus producing a Unicode object on
return). Added link to IANA charset names (thanks to Lars
Marius Garshol). Added new codec methods .readline(),
.readlines() and .writelines().
1.3: Added new "es" and "es#" parser markers
1.2: Removed POD about codecs.open()
1.1: Added note about comparisons and hash values. Added note about
case mapping algorithms. Changed stream codecs .read() and
.write() method to match the standard file-like object
methods (bytes consumed information is no longer returned by
the methods)
1.0: changed encode Codec method to be symmetric to the decode method
(they both return (object, data consumed) now and thus become
interchangeable); removed __init__ method of Codec class (the
methods are stateless) and moved the errors argument down to
the methods; made the Codec design more generic w/r to type
of input and output objects; changed StreamWriter.flush to
StreamWriter.reset in order to avoid overriding the stream's
.flush() method; renamed .breaklines() to .splitlines();
renamed the module unicodec to codecs; modified the File I/O
section to refer to the stream codecs.
0.9: changed errors keyword argument definition; added 'replace' error
handling; changed the codec APIs to accept buffer like
objects on input; some minor typo fixes; added Whitespace
section and included references for Unicode characters that
have the whitespace and the line break characteristic; added
note that search functions can expect lower-case encoding
names; dropped slicing and offsets in the codec APIs
0.8: added encodings package and raw unicode escape encoding; untabified
the proposal; added notes on Unicode format strings; added
.breaklines() method
0.7: added a whole new set of codec APIs; added a different
encoder lookup scheme; fixed some names
0.6: changed "s#" to "t#"; changed <defencbuf> to <defencstr> holding
a real Python string object; changed Buffer Interface to
delegate requests to <defencstr>'s buffer interface; removed
the explicit reference to the unicodec.codecs dictionary (the
module can implement this in way fit for the purpose);
removed the settable default encoding; moved UnicodeError from
unicodec to exceptions; "s#" now returns the internal data;
passed the UCS-2/UTF-16 checking from the Unicode constructor
to the Codecs
0.5: moved sys.bom to unicodec.BOM; added sections on case mapping,
private use encodings and Unicode character properties
0.4: added Codec interface, notes on %-formatting, changed some encoding
details, added comments on stream wrappers, fixed some
discussion points (most important: Internal Format),
clarified the 'unicode-escape' encoding, added encoding
references
0.3: added references, comments on codec modules, the internal format,
bf_getcharbuffer and the RE engine; added 'unicode-escape'
encoding proposed by Tim Peters and fixed repr(u) accordingly
0.2: integrated Guido's suggestions, added stream codecs and file
wrapping
0.1: first version
pep-0101 Doing Python Releases 101
| PEP: | 101 |
|---|---|
| Title: | Doing Python Releases 101 |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Barry Warsaw <barry at python.org>, Guido van Rossum <guido at python.org> |
| Status: | Active |
| Type: | Informational |
| Created: | 22-Aug-2001 |
| Post-History: |
Abstract
Making a Python release is a thrilling and crazy process. You've heard
the expression "herding cats"? Imagine trying to also saddle those
purring little creatures up, and ride them into town, with some of their
buddies firmly attached to your bare back, anchored by newly sharpened
claws. At least they're cute, you remind yourself.
Actually, no that's a slight exaggeration <wink>. The Python release
process has steadily improved over the years and now, with the help of our
amazing community, is really not too difficult. This PEP attempts to
collect, in one place, all the steps needed to make a Python release. It
is organized as a recipe and you can actually print this out and check
items off as you complete them.
Things You'll Need
As a release manager there are a lot of resources you'll need to access.
Here's a hopefully-complete list.
* A GPG key.
Python releases are digitally signed with GPG; you'll need a key,
which hopefully will be on the "web of trust" with at least one of
the other release managers.
* Access to ``dl-files.iad1.psf.io``, the server that hosts download files.
You'll be uploading files directly here.
* Shell access to ``hg.python.org``, the Python Mercurial host. You'll
have to adapt repository configuration there.
* Write access to the PEP repository.
If you're reading this, you probably already have this--the first
task of any release manager is to draft the release schedule. But
in case you just signed up... sucker! I mean, uh, congratulations!
* Posting access to http://blog.python.org, a Blogger-hosted weblog.
The RSS feed from this blog is used for the 'Python News' section
on www.python.org.
* A subscription to the super secret release manager mailing list, which may
or may not be called ``python-cabal``. Bug Barry about this.
How to Make A Release
Here are the steps taken to make a Python release. Some steps are more
fuzzy than others because there's little that can be automated (e.g.
writing the NEWS entries). Where a step is usually performed by An
Expert, the role of that expert is given. Otherwise, assume the step is
done by the Release Manager (RM), the designated person performing the
release. The roles and their current experts are:
* RM = Release Manager: Larry Hastings <larry@hastings.org> (US)
* WE = Windows: Martin von Loewis <martin@v.loewis.de> (Central Europe) and Steve Dower <steve.dower@microsoft.com>
* ME = Mac: Ned Deily <nad@acm.org> (US)
* DE = Docs: Georg Brandl <georg@python.org> (Central Europe)
* IE = Idle Expert: ??
NOTE: It is highly recommended that the RM contact the Experts the day
before the release. Because the world is round and everyone lives
in different timezones, the RM must ensure that the release tag is
created in enough time for the Experts to cut binary releases.
You should not make the release public (by updating the website and
sending announcements) before all experts have updated their bits.
In rare cases where the expert for Windows or Mac is MIA, you may add
a message "(Platform) binaries will be provided shortly" and proceed.
XXX: We should include a dependency graph to illustrate the steps that can
be taken in parallel, or those that depend on other steps.
As much as possible, the release steps are automated and guided by the
release script, which is available in a separate repository:
https://hg.python.org/release/
We use the following conventions in the examples below. Where a release
number is given, it is of the form X.Y.ZaN, e.g. 3.3.0a3 for Python 3.3.0
alpha 3, where "a" == alpha, "b" == beta, "rc" == release candidate.
Release tags are named "vX.Y.ZaN". The branch name for minor release
maintenance branches is "X.Y".
The release script helps by performing several automatic editing steps,
and guides you through the manual editing steps.
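The version, tag and branch naming conventions above are mechanical enough to sketch in a few lines of Python. This is purely illustrative (it is not part of the release script); the regex and function name are inventions for this example:

```python
import re

# Matches release strings of the form X.Y.Z, X.Y.ZaN, X.Y.ZbN or X.Y.ZrcN.
VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:(?:a|b|rc)(\d+))?$")

def tag_and_branch(version):
    """Return the release tag and maintenance branch for a release string."""
    m = VERSION_RE.match(version)
    if m is None:
        raise ValueError("not a valid release string: %r" % version)
    # Tags are "v" + the full string; maintenance branches are just "X.Y".
    return "v" + version, "%s.%s" % (m.group(1), m.group(2))

print(tag_and_branch("3.3.0a3"))  # ('v3.3.0a3', '3.3')
```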
___ Log into irc.freenode.net and join the #python-dev channel.
You probably need to coordinate with other people around the world.
This IRC channel is where we've arranged to meet.
___ Check to see if there are any showstopper bugs.
Go to http://bugs.python.org and look for any open bugs that can block
this release. You're looking at the Priority of the open bugs for the
release you're making; here are the relevant definitions:
release blocker - Stops the release dead in its tracks. You may not
make any release with any open release blocker bugs.
deferred blocker - Doesn't block this release, but it will block a
future release. You may not make a final or
candidate release with any open deferred blocker
bugs.
critical - Important bugs that should be fixed, but which do not block
a release.
Review the release blockers and either resolve them, bump them down to
deferred, or stop the release and ask for community assistance. If
you're making a final or candidate release, do the same with any open
deferred.
___ Check the stable buildbots.
Go to http://www.python.org/dev/buildbot/stable/
(the trailing slash is required). Look at the buildbots for the release
you're making. Ignore any that are offline (or inform the community so
they can be restarted). If what remains are (mostly) green buildbots,
you're good to go. If you have non-offline red buildbots, you may want
to hold up the release until they are fixed. Review the problems and
use your judgement, taking into account whether you are making an alpha,
beta, or final release.
___ Make a release clone.
Create a local clone of the cpython repository (called the "release
clone" from now on).
Also clone the repo at http://hg.python.org/cpython using the
server-side clone feature. The name of the new clone should preferably
have a "releasing/" prefix. The other experts will use the release
clone for making the binaries, so it is important that they have access
to it!
It's best to set up your local release clone to push to the remote
release clone by default (by editing .hg/hgrc to that effect).
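For example, the push default can be set with a couple of lines in the clone's .hg/hgrc (the remote path shown is a placeholder; use the name of the server-side clone you actually created):

```ini
[paths]
# Push to the server-side release clone by default instead of the main repo.
default-push = ssh://hg.python.org/releasing/cpython-X.Y
```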
___ Notify all committers by sending email to python-committers@python.org.
Since we're now working with a distributed version control system, there
is no need to stop everyone from pushing to the main repo; you'll just
work in your own clone. Therefore, there won't be any checkin freezes.
However, all committers should know the point at which your release
clone was made, as later commits won't make it into the release without
extra effort.
___ Make sure the current branch of your release clone is the branch you
want to release from.
___ Check the docs for markup errors.
cd to the Doc directory and run ``make suspicious``. If any markup
errors are found, fix them.
___ Regenerate Lib/pydoc_data/topics.py.
While still in the Doc directory, run ``make pydoc-topics``. Then copy
``build/pydoc-topics/topics.py`` to ``../Lib/pydoc_data/topics.py``.
___ Commit your changes to Lib/pydoc_data/topics.py
(and any fixes you made in the docs).
___ Make sure the SOURCE_URI in ``Doc/tools/pyspecific.py``
points to the right branch in the hg repository (or ``default`` for
unstable releases of the default branch).
___ Bump version numbers via the release script.
$ .../release/release.py --bump X.Y.ZaN
This automates updating various release numbers, but you will have to
modify a few files manually. If your $EDITOR environment variable is
set up correctly, release.py will pop up editor windows with the files
you need to edit.
It is important to update the Misc/NEWS file; however, in recent years
this has become easier, as the community is responsible for most of the
content of this file. You should only need to review the text for
sanity, and update the release date with today's date.
___ Make sure all changes have been committed. (``release.py --bump``
doesn't check in its changes for you.)
___ Check the years on the copyright notice. If the last release
was some time last year, add the current year to the copyright
notice in several places:
___ README
___ LICENSE (make sure to change on trunk and the branch)
___ Python/getcopyright.c
___ Doc/README.txt (at the end)
___ Doc/copyright.rst
___ Doc/license.rst
___ PC/python_nt.rc sets up the DLL version resource for Windows
(displayed when you right-click on the DLL and select
Properties).
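A quick sanity check along these lines can be scripted. This sketch is illustrative only (the helper is not part of any release tooling); it simply reports which of the files listed above do not mention the current year:

```python
import datetime
import os

# The files listed above that carry a copyright year.
COPYRIGHT_FILES = [
    "README",
    "LICENSE",
    "Python/getcopyright.c",
    "Doc/README.txt",
    "Doc/copyright.rst",
    "Doc/license.rst",
    "PC/python_nt.rc",
]

def files_missing_year(root, year=None):
    """Return the listed files under root that don't mention the year."""
    year = str(year or datetime.date.today().year)
    stale = []
    for name in COPYRIGHT_FILES:
        path = os.path.join(root, name)
        if not os.path.exists(path):
            continue  # not every checkout has every file
        with open(path, errors="replace") as f:
            if year not in f.read():
                stale.append(name)
    return stale

# e.g. run print(files_missing_year(".")) from the top of the release clone
```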
___ Check with the IE (if there is one <wink>) to be sure that
Lib/idlelib/NEWS.txt has been similarly updated.
___ For a final major release, edit the first paragraph of
Doc/whatsnew/X.Y.rst to include the actual release date; e.g. "Python
2.5 was released on August 1, 2003." There's no need to edit this for
alpha or beta releases. Note that Andrew Kuchling often takes care of
this.
___ Tag the release for X.Y.ZaN.
$ .../release/release.py --tag X.Y.ZaN
NOTE: when forward-merging, i.e. "null" merging, your changes to newer
branches, e.g. 2.6 -> 2.7, do *not* revert your changes to the .hgtags file or you
will not be able to run the --export command below. Revert everything
else but leave .hgtags alone.
___ If this is a final major release, branch the tree for X.Y.
When making a major release (e.g., for 3.2), you must create the
long-lived maintenance branch.
___ Note down the current revision ID of the tree.
$ hg identify
___ First, set the original trunk up to be the next release.
$ .../release/release.py --bump 3.3a0
___ Edit all version references in the README
___ Move any historical "what's new" entries from Misc/NEWS to
Misc/HISTORY.
___ Edit Doc/tutorial/interpreter.rst (2 references to '[Pp]ython3x',
one to 'Python 3.x', also make the date in the banner consistent).
___ Edit Doc/tutorial/stdlib.rst and Doc/tutorial/stdlib2.rst, which
have each one reference to '[Pp]ython3x'.
___ Add a new whatsnew/3.x.rst file (with the comment near the top
and the toplevel sections copied from the previous file) and
add it to the toctree in whatsnew/index.rst.
___ Update the version number in configure.ac and re-run autoconf.
___ Update the version numbers for the Windows builds in PC/ and
PCbuild/, which have references to python32.
$ find PC/ PCbuild/ -type f | xargs sed -i 's/python32/python33/g'
$ hg mv -f PC/os2emx/python32.def PC/os2emx/python33.def
$ hg mv -f PC/python32stub.def PC/python33stub.def
$ hg mv -f PC/python32gen.py PC/python33gen.py
___ Commit these changes to the default branch.
___ Now, go back to the previously noted revision and make the
maintenance branch *from there*.
$ hg update deadbeef # revision ID noted down before
$ hg branch 3.2
___ When you want to push back your new branch to the main CPython
repository, the new branch name must be added to the "allow-branches"
hook configuration, which protects against stray named branches being
pushed. Login to hg.python.org and edit (as the "hg" user)
``/data/hg/repos/cpython/.hg/hgrc`` to that effect.
___ For a final major release, Doc/tools/static/version_switch.js
must be updated in all maintained branches, so that the new maintenance
branch is not "dev" anymore and there is a new "dev" version.
___ Push your commits to the remote release clone.
$ hg push ssh://hg.python.org/releasing/...
___ Notify the experts that they can start building binaries.
___ STOP STOP STOP STOP STOP STOP STOP STOP
At this point you must receive the "green light" from other experts in
order to create the release. There are things you can do while you wait
though, so keep reading until you hit the next STOP.
___ The WE builds the Windows helpfile, using (in Doc/)
> make.bat htmlhelp (on Windows)
to create suitable input for HTML Help Workshop in build/htmlhelp. HTML
Help Workshop is then fired up on the created python33.hhp file, finally
resulting in a python33.chm file.
___ The WE then generates Windows installer files for each Windows
target architecture (for Python 3.3, this means x86 and AMD64).
- He has one checkout tree per target architecture, and builds the
pcbuild.sln project for the appropriate architecture.
- PC\icons.mak must have been run with nmake.
- The cmd.exe window in which this is run must have Cygwin/bin in its
path (at least for x86).
- The cmd.exe window must have MS compiler tools for the target
architecture in its path (VS 2010 for Python 3.3).
- The WE then edits Tools/msi/config.py (a file only present locally)
to update full_current_version and sets snapshot to false. Currently
for a release config.py looks like
snapshot=0
full_current_version="3.3.5rc2"
certname="Python Software Foundation"
PCBUILD='PCbuild\\amd64'
The last line is only present for the amd64 checkout.
- Now he runs msi.py with ActivePython or Python with pywin32.
The WE checksums the files (*.msi, *.chm, *-pdb.zip), uploads them to
dl-files.iad1.psf.io together with gpg signature files, and emails you the
location and md5sums.
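The checksumming part of this step amounts to producing md5sum-compatible lines for each artifact. A minimal illustrative helper (the file name in the usage comment is an example only):

```python
import hashlib

def md5_line(path):
    """Return an md5sum-compatible line ("<digest>  <path>") for a file."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large installers don't have to fit in memory.
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return "%s  %s" % (h.hexdigest(), path)

# e.g.: print(md5_line("python-3.3.5rc2.msi"))
```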
___ The ME builds Mac installer packages and uploads them to
dl-files.iad1.psf.io together with gpg signature files.
___ Time to build the source tarball. Be sure to update your clone to the
correct branch. E.g.
$ hg update 3.2
___ Do a "hg status" in this directory.
You should not see any files, i.e. you'd better not have any uncommitted
changes in your working directory.
___ Make sure you have virtualenv installed (for Python 2.7). The release
script installs Sphinx in a virtualenv when building the docs.
For building the PDF docs, you also need a fairly complete installation
of a recent TeX distribution such as texlive.
___ Use the release script to create the source gzip and xz tarballs,
documentation tar and zip files, and gpg signature files.
$ .../release/release.py --export X.Y.ZaN
This can take a while for final releases, and it will leave all the
tarballs and signatures in a subdirectory called 'X.Y.ZaN/src', and the
built docs in 'X.Y.ZaN/docs' (for final releases).
___ scp or rsync all the files to your home directory on dl-files.iad1.psf.io.
While you're waiting for the files to finish uploading, you can continue
on with the remaining tasks. You can also ask folks on #python-dev
and/or python-committers to download the files as they finish uploading
so that they can test them on their platforms as well.
___ Now you want to perform the very important step of checking the
tarball you just created, to make sure a completely clean,
virgin build passes the regression test. Here are the best
steps to take:
$ cd /tmp
$ tar xvf ~/Python-3.2rc2.tgz
$ cd Python-3.2rc2
$ ls
(Do things look reasonable?)
$ ls Lib
(Are there stray .pyc files?)
$ ./configure
(Loads of configure output)
$ make test
(Do all the expected tests pass?)
If you're feeling lucky and have some time to kill, or if you are making
a release candidate or final release, run the full test suite:
$ make testall
If the tests pass, then you can feel good that the tarball is
fine. If some of the tests fail, or anything else about the
freshly unpacked directory looks weird, you better stop now and
figure out what the problem is.
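The unpack-configure-test sequence above is easy to script. This sketch is an illustration, not official tooling; it assumes the tarball already sits in the working directory and simply mirrors the manual steps:

```python
import subprocess

def smoke_test_commands(version, workdir="/tmp"):
    """The commands (with their working directories) for the smoke test."""
    srcdir = "%s/Python-%s" % (workdir, version)
    return [
        (["tar", "xvf", "Python-%s.tgz" % version], workdir),
        (["./configure"], srcdir),
        (["make", "test"], srcdir),
    ]

def run_smoke_test(version):
    # check_call raises CalledProcessError on the first non-zero exit.
    for cmd, cwd in smoke_test_commands(version):
        subprocess.check_call(cmd, cwd=cwd)
```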
___ Now you need to go to dl-files.iad1.psf.io and move all the files in place
over there. Our policy is that every Python version gets its own
directory, but each directory contains all releases of that version.
___ On dl-files.iad1.psf.io, cd /srv/www.python.org/ftp/python/X.Y.Z
creating it if necessary. Make sure it is owned by group 'downloads'
and group-writable.
___ Move the release .tgz, and .tar.xz files into place, as well as the
.asc GPG signature files. The Win/Mac binaries are usually put there
by the experts themselves.
Make sure they are world readable. They should also be group
writable, and group-owned by downloads.
___ Use ``gpg --verify`` to make sure they got uploaded intact.
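Each detached .asc file is checked against the artifact sitting next to it. An illustrative helper that builds the ``gpg --verify`` invocation for one signature file:

```python
import subprocess

def verify_command(asc_path):
    """gpg --verify invocation for a detached signature and its artifact."""
    if not asc_path.endswith(".asc"):
        raise ValueError("expected a .asc signature file: %r" % asc_path)
    signed = asc_path[:-len(".asc")]
    return ["gpg", "--verify", asc_path, signed]

# e.g.: subprocess.check_call(verify_command("Python-3.2rc2.tgz.asc"))
```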
___ If this is a final release: Move the doc zips and tarballs to
/srv/www.python.org/ftp/python/doc/X.Y.Z, creating the directory
if necessary, and adapt the "current" symlink in .../doc to point to
that directory. Note though that if you're releasing a maintenance
release for an older version, don't change the current link.
___ If this is a final release (even a maintenance release), also unpack
the HTML docs to /srv/docs.python.org/release/X.Y.Z on
docs.iad1.psf.io. Make sure the files are in group "docs". If it is a
release of a security-fix-only version, tell the DE to build a version
with the "version switcher" and put it there.
___ Let the DE check if the docs are built and work all right.
___ If this is a final major release: Tell the DE to adapt redirects for
docs.python.org/X.Y in the Apache config for docs.python.org, update
the script Doc/tools/dailybuild.py to point to the right
stable/development branches, and to install it and make the initial
checkout. The Doc's version_switcher.js script also needs to be
updated. In general, please don't touch things in the toplevel
/srv/docs.python.org/ directory unless you know what you're doing.
___ Note both the documentation and downloads are behind a caching CDN. If
you change archives after downloading them through the website, you'll
need to purge the stale data in the CDN like this:
$ curl -X PURGE https://www.python.org/ftp/python/2.7.5/Python-2.7.5.tar.xz
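The same purge can be issued from Python with only the standard library; this sketch mirrors the curl command above (it assumes the CDN honors the HTTP PURGE method, as the curl example does):

```python
import urllib.request

def purge_request(url):
    """Build the same cache-invalidation request the curl example sends."""
    return urllib.request.Request(url, method="PURGE")

def purge_all(urls):
    # Purge several stale entries in one go.
    for url in urls:
        with urllib.request.urlopen(purge_request(url)) as resp:
            print(url, resp.status)
```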
___ For the extra paranoid, do a completely clean test of the release.
This includes downloading the tarball from www.python.org.
Make sure the md5 checksums match. Then unpack the tarball,
and do a clean make test.
$ make distclean
$ ./configure
$ make test
To ensure that the regression test suite passes. If not, you
screwed up somewhere!
Now it's time to twiddle the web site.
To do these steps, you must have the permission to edit the website. If you
don't have that, ask someone on pydotorg@python.org for the proper
permissions. It's insane for you not to have it.
XXX This is completely out of date for Django based python.org.
This page will probably come in handy:
http://docutils.sourceforge.net/docs/user/rst/quickref.html
None of the web site updates are automated by release.py.
___ Build the basic site.
In the top directory, do an `svn update` to get the latest code. In the
build subdirectory, do `make` to build the site. Do `make serve` to
start serving the pages on localhost:8005. Hit that url to see the site
as it is right now. At any time you can re-run `make` to update the
local site. You don't have to restart the server.
Don't `svn commit` until you're all done!
___ If this is the first release for this version (even a new patch
version), you'll need to create a subdirectory inside download/releases
to hold the new version files. It's probably a good idea to copy an
existing recent directory and twiddle the files in there for the new
version number.
___ Update the version specific pages.
___ cd to `download/releases/X.Y.Z`
___ Edit the version numbers in content.ht
___ Update the md5 checksums
___ Comment out the "This is a preview release" or the "This is a
production release" paragraph as appropriate
Note, you don't have to copy any release files into this directory;
they only live on dl-files.iad1.psf.io in the ftp directory.
___ Edit `download/releases/content.ht` to update the version numbers for
this release. There are a bunch of places you need to touch:
___ The subdirectory name as the first element in the Nav rows.
___ Possibly the Releases section, and possibly in the experimental
releases section if this is an alpha, beta or release candidate.
___ Update the download page, editing `download/content.ht`. Pre-releases are
added only to the "Testing versions" list.
___ If this is a final release...
___ Update the 'Quick Links' section on the front page. Edit the
top-level `content.ht` file.
___ For X.Y.Z, edit all the previous X.Y releases' content.ht page to
point to the new release.
___ Update `doc/content.ht` to indicate the new current documentation
version, and remove the current version from any 'in development'
section. Update the version in the "What's New" link.
___ Add the new version to `doc/versions/content.ht`.
___ Add a news section item to the front page by editing newsindex.yml. The
format should be pretty self-evident.
___ When everything looks good, `svn commit` in the data directory. This
will trigger the live site to update itself, and at that point the
release is live.
___ If this is a final release, create a new python.org/X.Y Apache alias
(or ask pydotorg to do so for you).
Now it's time to write the announcement for the mailing lists. This is the
fuzzy bit because not much can be automated. You can use an earlier
announcement as a template, but edit it for content!
___ STOP STOP STOP STOP STOP STOP STOP STOP
___ Have you gotten the green light from the WE?
___ Have you gotten the green light from the DE?
___ Once the announcement is ready, send it to the following
addresses:
python-list@python.org
python-announce@python.org
python-dev@python.org
___ Also post the announcement to `The Python Insider blog
<http://blog.python.org>`_. To add a new entry, go to
`your Blogger home page, here. <https://www.blogger.com/home>`_
Now it's time to do some cleaning up. These steps are very important!
___ Do the guided post-release steps with the release script.
$ .../release/release.py --done X.Y.ZaN
Review and commit these changes.
___ Merge your release clone into the main development repo:
$ cd ../cpython # your clone of the main repo
$ hg pull ssh://hg.python.org/cpython # update from remote first
$ hg pull ../cpython-releaseX.Y # now pull from release clone
Now merge your release clone's changes in every branch you touched
(usually only one, except if you made a new maintenance release).
Easily resolvable conflicts may appear in Misc/NEWS.
___ If releasing from other than the default branch, remember to carefully
merge any touched branches with higher level branches, up to default. For
example:
$ hg update -C default
$ hg resolve --list
$ hg merge --tool "internal:fail" 3.4
... here, revert changes that are not relevant for the default branch...
$ hg resolve --mark
___ Commit and push to the main repo.
___ You can delete the remote release clone, or simply reuse it for the next
release.
___ Send email to python-committers informing them that the release has been
published.
___ Update any release PEPs (e.g. 361) with the release dates.
___ Update the tracker at http://bugs.python.org:
___ Flip all the deferred blocker issues back to release blocker
for the next release.
___ Add version X.Y+1 as when version X.Y enters alpha.
___ Change non-doc RFEs to version X.Y+1 when version X.Y enters beta.
___ Update 'behavior' issues from versions that your release makes
unsupported to the next supported version.
___ Review open issues, as this might find lurking showstopper bugs,
besides reminding people to fix the easy ones they forgot about.
What Next?
___ Verify! Pretend you're a user: download the files from python.org, and
build Python from them. This step is too easy to overlook, and on several
occasions we've had useless release files. Once a general server problem
caused mysterious corruption of all files; once the source tarball got
built incorrectly; more than once the file upload process on SF truncated
files; and so on.
___ Rejoice. Drink. Be Merry. Write a PEP like this one. Or be
like unto Guido and take A Vacation.
You've just made a Python release!
Windows Notes
Windows has a MSI installer, various flavors of Windows have
"special limitations", and the Windows installer also packs
precompiled "foreign" binaries (Tcl/Tk, expat, etc). So Windows
testing is tiresome but very necessary.
Concurrent with uploading the installer, the WE installs Python
from it twice: once into the default directory suggested by the
installer, and later into a directory with embedded spaces in its
name. For each installation, he runs the full regression suite
from a DOS box, both with and without -O. For maintenance
releases, he also tests whether upgrade installations succeed.
He also tries *every* shortcut created under Start -> Menu -> the
Python group. When trying IDLE this way, you need to verify that
Help -> Python Documentation works. When trying pydoc this way
(the "Module Docs" Start menu entry), make sure the "Start
Browser" button works, and make sure you can search for a random
module (like "random" <wink>) and then that the "go to selected"
button works.
It's amazing how much can go wrong here -- and even more amazing
how often last-second checkins break one of these things. If
you're "the Windows geek", keep in mind that you're likely the
only person routinely testing on Windows, and that Windows is
simply a mess.
Repeat the testing for each target architecture. Try both an
Admin and a plain User (not Power User) account.
Copyright
This document has been placed in the public domain.
pep-0102 Doing Python Micro Releases
| PEP: | 102 |
|---|---|
| Title: | Doing Python Micro Releases |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Anthony Baxter <anthony at interlink.com.au>, Barry Warsaw <barry at python.org>, Guido van Rossum <guido at python.org> |
| Status: | Superseded |
| Type: | Informational |
| Created: | 22-Aug-2001 (edited down on 9-Jan-2002 to become PEP 102) |
| Post-History: | |
| Superseded-By: | 101 |
Replacement Note
Although the size of the to-do list in this PEP is much less scary
than that in PEP 101, it turns out not to be enough justification
for the duplication of information, and with it, the danger of one
of the copies to become out of date. Therefore, this PEP is not
maintained anymore, and micro releases are fully covered by PEP 101.
Abstract
Making a Python release is an arduous process that takes a
minimum of half a day's work even for an experienced releaser.
Until recently, most -- if not all -- of that burden was borne by
Guido himself. But several recent releases have been performed by
other folks, so this PEP attempts to collect, in one place, all
the steps needed to make a Python bugfix release.
The major Python release process is covered in PEP 101 - this PEP
is just PEP 101, trimmed down to only include the bits that are
relevant for micro releases, a.k.a. patch, or bug fix releases.
It is organized as a recipe and you can actually print this out and
check items off as you complete them.
How to Make A Release
Here are the steps taken to make a Python release. Some steps are
more fuzzy than others because there's little that can be
automated (e.g. writing the NEWS entries). Where a step is
usually performed by An Expert, the name of that expert is given.
Otherwise, assume the step is done by the Release Manager (RM),
the designated person performing the release. Almost every place
the RM is mentioned below, this step can also be done by the BDFL
of course!
XXX: We should include a dependency graph to illustrate the steps
that can be taken in parallel, or those that depend on other
steps.
We use the following conventions in the examples below. Where a
release number is given, it is of the form X.Y.MaA, e.g. 2.1.2c1
for Python 2.1.2 release candidate 1, where "a" == alpha, "b" ==
beta, "c" == release candidate. Final releases are tagged with
"releaseXYZ" in CVS. The micro releases are made from the
maintenance branch of the major release, e.g. Python 2.1.2 is made
from the release21-maint branch.
___ Send an email to python-dev@python.org indicating the release is
about to start.
___ Put a freeze on check ins into the maintenance branch. At this
point, nobody except the RM should make any commits to the branch
(or his duly assigned agents, i.e. Guido the BDFL, Fred Drake for
documentation, or Thomas Heller for Windows). If the RM screwed up
and some desperate last minute change to the branch is
necessary, it can mean extra work for Fred and Thomas. So try to
avoid this!
___ On the branch, change Include/patchlevel.h in two places, to
reflect the new version number you've just created. You'll want
to change the PY_VERSION macro, and one or several of the
version subpart macros just above PY_VERSION, as appropriate.
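One way to double-check the edit is to read PY_VERSION back out of the file afterwards. An illustrative snippet (not part of the release process):

```python
import re

def read_py_version(patchlevel_text):
    """Extract the PY_VERSION string from Include/patchlevel.h contents."""
    m = re.search(r'^#define\s+PY_VERSION\s+"([^"]+)"',
                  patchlevel_text, re.MULTILINE)
    if m is None:
        raise ValueError("no PY_VERSION definition found")
    return m.group(1)

# e.g.: read_py_version(open("Include/patchlevel.h").read())
```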
___ Change the "%define version" line of Misc/RPM/python-2.3.spec to the
same string as PY_VERSION was changed to above. E.g:
%define version 2.3.1
You also probably want to reset the %define release line
to '1pydotorg' if it's not already that.
___ If you're changing the version number for Python (e.g. from
Python 2.1.1 to Python 2.1.2), you also need to update the
README file, which has a big banner at the top proclaiming its
identity. Don't do this if you're just releasing a new alpha or
beta release, but /do/ do this if you're releasing a new micro,
minor or major release.
___ The LICENSE file also needs to be changed, due to several
references to the release number. As with the README file, these
changes are necessary for a new micro, minor or major release.
The LICENSE file contains a table that describes the legal
heritage of Python; you should add an entry for the X.Y.Z
release you are now making. You should update this table in the
LICENSE file on the CVS trunk too.
___ When the year changes, copyright legends need to be updated in
many places, including the README and LICENSE files.
___ For the Windows build, additional files have to be updated.
PCbuild/BUILDno.txt contains the Windows build number; see the
instructions in this file for how to change it. Saving the project
file PCbuild/pythoncore.dsp results in a change to
PCbuild/pythoncore.dsp as well.
PCbuild/python20.wse sets up the Windows installer version
resource (displayed when you right-click on the installer .exe
and select Properties), and also contains the Python version
number.
(Before version 2.3.2, it was required to manually edit
PC/python_nt.rc, this step is now automated by the build
process.)
___ After starting the process, the most important thing to do next
is to update the Misc/NEWS file. Thomas will need this in order to
do the Windows release and he likes to stay up late. This step
can be pretty tedious, so it's best to get to it immediately
after making the branch, or even before you've made the branch.
The sooner the better (but again, watch for new checkins up
until the release is made!)
Add high level items new to this release. E.g. if we're
releasing 2.2a3, there must be a section at the top of the file
explaining "What's new in Python 2.2a3". It will be followed by
a section entitled "What's new in Python 2.2a2".
Note that you /hope/ that as developers add new features to the
trunk, they've updated the NEWS file accordingly. You can't be
positive, so double check. If you're a Unix weenie, it helps to
verify with Thomas about changes on Windows, and Jack Jansen
about changes on the Mac.
This command should help you (but substitute the correct -r tag!):
% cvs log -rr22a1: | python Tools/scripts/logmerge.py > /tmp/news.txt
IOW, you're printing out all the cvs log entries from the
previous release until now. You can then troll through the
news.txt file looking for interesting things to add to NEWS.
___ Check your NEWS changes into the maintenance branch. It's easy
to forget to update the release date in this file!
___ Check in any changes to IDLE's NEWS.txt. Update the header in
Lib/idlelib/NEWS.txt to reflect its release version and date.
Update the IDLE version in Lib/idlelib/idlever.py to match.
___ Once the release process has started, the documentation needs to
be built and posted on python.org according to the instructions
in PEP 101.
Note that Fred is responsible both for merging doc changes from
the trunk to the branch AND for merging any branch changes from
the branch to the trunk during the cleaning up phase.
Basically, if it's in Doc/ Fred will take care of it.
___ Thomas compiles everything with MSVC 6.0 SP5, and moves the
python23.chm file into the src/chm directory. The installer
executable is then generated with Wise Installation System.
The installer includes the MSVC 6.0 runtime in the files
MSVCRT.DLL and MSVCIRT.DLL. It leads to disaster if these files
are taken from the system directory of the machine where the
installer is built, instead it must be absolutely made sure that
these files come from the VCREDIST.EXE redistributable package
contained in the MSVC SP5 CD. VCREDIST.EXE must be unpacked
with winzip, and the Wise Installation System prompts for the
directory.
After building the installer, it should be opened with winzip,
the MS dlls extracted again, and checked for the same version
number as those unpacked from VCREDIST.EXE.
Thomas uploads this file to the starship. He then sends the RM
a notice which includes the location and MD5 checksum of the
Windows executable.
Note that Thomas's creation of the Windows executable may generate
a few more commits on the branch. Thomas will be responsible for
merging Windows-specific changes from trunk to branch, and from
branch to trunk.
___ Sean performs his Red Hat magic, generating a set of RPMs. He
uploads these files to python.org. He then sends the RM a notice
which includes the location and MD5 checksum of the RPMs.
___ It's Build Time!
Now, you're ready to build the source tarball. First cd to your
working directory for the branch. E.g.
% cd .../python-22a3
___ Do a "cvs update" in this directory. Do NOT include the -A flag!
You should not see any "M" files, but you may see several "P"
and/or "U" files. I.e. you better not have any uncommitted
changes in your working directory, but you may pick up some of
Fred's or Thomas's last minute changes.
___ Now tag the branch using a symbolic name like "rXYMaZ",
e.g. r212
% cvs tag r212
Be sure to tag only the python/dist/src subdirectory of the
Python CVS tree!
___ Change to a neutral directory, i.e. one in which you can do a
fresh, virgin, cvs export of the branch. You will be creating a
new directory at this location, to be named "Python-X.Y.M". Do
a CVS export of the tagged branch.
% cd ~
% cvs -d cvs.sf.net:/cvsroot/python export -rr212 \
-d Python-2.1.2 python/dist/src
___ Generate the tarball. Note that we're not using the `z' option
on the tar command because 1) that's only supported by GNU tar
as far as we know, and 2) we're going to max out the compression
level, which isn't a supported option. We generate both tar.gz
and tar.bz2 formats, as the latter is about 1/6th smaller.
% tar -cf - Python-2.1.2 | gzip -9 > Python-2.1.2.tgz
% tar -cf - Python-2.1.2 | bzip2 -9 > Python-2.1.2.tar.bz2
___ Calculate the MD5 checksum of the tgz and tar.bz2 files you
just created
% md5sum Python-2.1.2.tgz
Note that if you don't have the md5sum program, there is a
Python replacement in the Tools/scripts/md5sum.py file.
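For illustration, the checksum computation such a script performs can be
sketched in a few lines of modern Python using the standard hashlib
module (a hypothetical equivalent, not the actual contents of
Tools/scripts/md5sum.py):

```python
import hashlib

def md5sum(path, chunk_size=64 * 1024):
    """Return the hex MD5 digest of a file, read in binary chunks
    so large release tarballs need not be loaded into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()
```

Called as md5sum("Python-2.1.2.tgz"), this returns the same hex digest
that the md5sum program reports for the file.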
___ Create GPG keys for each of the files.
% gpg -ba Python-2.1.2.tgz
% gpg -ba Python-2.1.2.tar.bz2
% gpg -ba Python-2.1.2.exe
___ Now you want to perform the very important step of checking the
tarball you just created, to make sure a completely clean,
virgin build passes the regression test. Here are the best
steps to take:
% cd /tmp
% tar zxvf ~/Python-2.1.2.tgz
% cd Python-2.1.2
% ls
(Do things look reasonable?)
% ./configure
(Loads of configure output)
% make test
(Do all the expected tests pass?)
If the tests pass, then you can feel good that the tarball is
fine. If some of the tests fail, or anything else about the
freshly unpacked directory looks weird, you better stop now and
figure out what the problem is.
___ You need to upload the tgz and the exe file to creosote.python.org.
This step can take a long time depending on your network
bandwidth. scp both files from your own machine to creosote.
___ While you're waiting, you can start twiddling the web pages to
include the announcement.
___ In the top of the python.org web site CVS tree, create a
subdirectory for the X.Y.Z release. You can actually copy an
earlier patch release's subdirectory, but be sure to delete
the X.Y.Z/CVS directory and "cvs add X.Y.Z", for example:
% cd .../pydotorg
% cp -r 2.2.2 2.2.3
% rm -rf 2.2.3/CVS
% cvs add 2.2.3
% cd 2.2.3
___ Edit the files for content: usually you can globally replace
X.Ya(Z-1) with X.YaZ. However, you'll need to think about the
"What's New?" section.
___ Copy the Misc/NEWS file to NEWS.txt in the X.Y.Z directory for
python.org; this contains the "full scoop" of changes to
Python since the previous release for this version of Python.
___ Copy the .asc GPG signatures you created earlier here as well.
___ Also, update the MD5 checksums.
___ Preview the web page by doing a "make" or "make install" (as
long as you've created a new directory for this release!)
___ Similarly, edit the ../index.ht file, i.e. the python.org home
page. In the Big Blue Announcement Block, move the paragraph
for the new version up to the top and boldify the phrase
"Python X.YaZ is out". Edit for content, and preview locally,
but do NOT do a "make install" yet!
___ Now we're waiting for the scp to creosote to finish. Da de da,
da de dum, hmm, hmm, dum de dum.
___ Once that's done you need to go to creosote.python.org and move
all the files in place over there. Our policy is that every
Python version gets its own directory, but each directory may
contain several releases. We keep all old releases, moving them
into a "prev" subdirectory when we have a new release.
So, there's a directory called "2.2" which contains
Python-2.2a2.exe and Python-2.2a2.tgz, along with a "prev"
subdirectory containing Python-2.2a1.exe and Python-2.2a1.tgz.
So...
___ On creosote, cd to ~ftp/pub/python/X.Y creating it if
necessary.
___ Move the previous release files to a directory called "prev"
creating the directory if necessary (make sure the directory
has g+ws bits on). If this is the first alpha release of a
new Python version, skip this step.
___ Move the .tgz file and the .exe file to this directory. Make
sure they are world readable. They should also be group
writable, and group-owned by webmaster.
___ md5sum the files and make sure they got uploaded intact.
___ Update the X.Y/bugs.ht file if necessary. It is best to get
BDFL input for this step.
___ Go up to the parent directory (i.e. the root of the web page
hierarchy) and do a "make install" there. Your release is now
live!
___ Now it's time to write the announcement for the mailing lists.
This is the fuzzy bit because not much can be automated. You
can use one of Guido's earlier announcements as a template, but
please edit it for content!
Once the announcement is ready, send it to the following
addresses:
python-list@python.org
python-announce@python.org
python-dev@python.org
___ Send a SourceForge News Item about the release. From the
project's "menu bar", select the "News" link; once in News,
select the "Submit" link. Type a suitable subject (e.g. "Python
2.2c1 released" :-) in the Subject box, add some text to the
Details box (at the very least including the release URL at
www.python.org and the fact that you're happy with the release)
and click the SUBMIT button.
Feel free to remove any old news items.
Now it's time to do some cleanup. These steps are very important!
___ Edit the file Include/patchlevel.h so that the PY_VERSION
string says something like "X.YaZ+". Note the trailing `+'
indicating that the trunk is going to be moving forward with
development. E.g. the line should look like:
#define PY_VERSION "2.1.2+"
Make sure that the other PY_ version macros contain the
correct values. Commit this change.
___ For the extra paranoid, do a completely clean test of the
release. This includes downloading the tarball from
www.python.org.
___ Make sure the md5 checksums match. Then unpack the tarball,
and do a clean make test.
% make distclean
% ./configure
% make test
This ensures that the regression test suite passes. If it
doesn't, you screwed up somewhere!
Step 5 ...
Verify! This can be interleaved with Step 4. Pretend you're a
user: download the files from python.org, and make Python from
it. This step is too easy to overlook, and on several occasions
we've had useless release files. Once a general server problem
caused mysterious corruption of all files; once the source tarball
got built incorrectly; more than once the file upload process on
SF truncated files; and so on.
What Next?
Rejoice. Drink. Be Merry. Write a PEP like this one. Or be
like unto Guido and take A Vacation.
You've just made a Python release!
Actually, there is one more step. You should turn over ownership
of the branch to Jack Jansen. All this means is that now he will
be responsible for making commits to the branch. He's going to
use this to build the MacOS versions. He may send you information
about the Mac release that should be merged into the informational
pages on www.python.org. When he's done, he'll tag the branch
something like "rX.YaZ-mac". He'll also be responsible for
merging any Mac-related changes back into the trunk.
Final Release Notes
The Final release of any major release, e.g. Python 2.2 final, has
special requirements, specifically because it will be one of the
longest lived releases (i.e. betas don't last more than a couple
of weeks, but final releases can last for years!).
For this reason we want to have a higher coordination between the
three major releases: Windows, Mac, and source. The Windows and
source releases benefit from the close proximity of the respective
release-bots. But the Mac-bot, Jack Jansen, is 6 hours away. So
we add this extra step to the release process for a final
release:
___ Hold up the final release until Jack approves, or until we
lose patience <wink>.
The python.org site also needs some tweaking when a new bugfix release
is issued.
___ The documentation should be installed at doc/<version>/.
___ Add a link from doc/<previous-minor-release>/index.ht to the
documentation for the new version.
___ All older doc/<old-release>/index.ht files should be updated to
point to the documentation for the new version.
___ /robots.txt should be modified to prevent the old version's
documentation from being crawled by search engines.
Windows Notes
Windows has a GUI installer, various flavors of Windows have
"special limitations", and the Windows installer also packs
precompiled "foreign" binaries (Tcl/Tk, expat, etc). So Windows
testing is tiresome but very necessary.
Concurrent with uploading the installer, Thomas installs Python
from it twice: once into the default directory suggested by the
installer, and later into a directory with embedded spaces in its
name. For each installation, he runs the full regression suite
from a DOS box, both with and without the -O flag.
He also tries *every* shortcut created under Start -> Menu -> the
Python group. When trying IDLE this way, you need to verify that
Help -> Python Documentation works. When trying pydoc this way
(the "Module Docs" Start menu entry), make sure the "Start
Browser" button works, and make sure you can search for a random
module (Thomas uses "random" <wink>) and then that the "go to
selected" button works.
It's amazing how much can go wrong here -- and even more amazing
how often last-second checkins break one of these things. If
you're "the Windows geek", keep in mind that you're likely the
only person routinely testing on Windows, and that Windows is
simply a mess.
Repeat all of the above on at least one flavor of Win9x, and one
of NT/2000/XP. On NT/2000/XP, try both an Admin and a plain User
(not Power User) account.
WRT Step 5 above (verify the release media), since by the time
release files are ready to download Thomas has generally run many
Windows tests on the installer he uploaded, he usually doesn't do
anything for Step 5 except a full byte-comparison ("fc /b" if
using a Windows shell) of the downloaded file against the file he
uploaded.
Copyright
This document has been placed in the public domain.
pep-0160 Python 1.6 Release Schedule
| PEP: | 160 |
|---|---|
| Title: | Python 1.6 Release Schedule |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Fred L. Drake, Jr. <fdrake at acm.org> |
| Status: | Final |
| Type: | Informational |
| Created: | 25-Jul-2000 |
| Python-Version: | 1.6 |
| Post-History: |
Introduction
This PEP describes the Python 1.6 release schedule. The CVS
revision history of this file contains the definitive historical
record.
This release will be produced by BeOpen PythonLabs staff for the
Corporation for National Research Initiatives (CNRI).
Schedule
August 1 1.6 beta 1 release (planned).
August 3 1.6 beta 1 release (actual).
August 15 1.6 final release (planned).
September 5 1.6 final release (actual).
Features
A number of features are required for Python 1.6 in order to
fulfill the various promises that have been made. The following
are required to be fully operational, documented, and forward
compatible with the plans for Python 2.0:
* Unicode support: The Unicode object defined for Python 2.0 must
be provided, including all methods and codec support.
* SRE: Fredrik Lundh's new regular expression engine will be used
to provide support for both 8-bit strings and Unicode strings.
It must pass the regression test used for the pcre-based version
of the re module.
* The curses module was in the middle of a transformation to a
package, so the final form was adopted.
Mechanism
The release will be created as a branch from the development tree
rooted at CNRI's close of business on 16 May 2000. Patches
required from more recent checkins will be merged in by moving the
branch tag on individual files whenever possible in order to
reduce mailing list clutter and avoid divergent and incompatible
implementations.
The branch tag is "cnri-16-start".
Patches and features will be merged to the extent required to pass
regression tests in effect on 16 May 2000.
The beta release is tagged "r16b1" in the CVS repository, and the
final Python 1.6 release is tagged "release16" in the repository.
Copyright
This document has been placed in the public domain.
pep-0200 Python 2.0 Release Schedule
| PEP: | 200 |
|---|---|
| Title: | Python 2.0 Release Schedule |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Jeremy Hylton <jeremy at alum.mit.edu> |
| Status: | Final |
| Type: | Informational |
| Created: | |
| Python-Version: | 2.0 |
| Post-History: |
Introduction
This PEP describes the Python 2.0 release schedule, tracking the
status and ownership of the major new features, summarizes
discussions held in mailing list forums, and provides URLs for
further information, patches, and other outstanding issues. The
CVS revision history of this file contains the definitive
historical record.
Release Schedule
[revised 5 Oct 2000]
26-Sep-2000: 2.0 beta 2
9-Oct-2000: 2.0 release candidate 1 (2.0c1)
16-Oct-2000: 2.0 final
Previous milestones
14-Aug-2000: All 2.0 PEPs finished / feature freeze
5-Sep-2000: 2.0 beta 1
What is release candidate 1?
We believe that release candidate 1 will fix all known bugs that
we intend to fix for the 2.0 final release. This release should
be a bit more stable than the previous betas. We would like to
see even more widespread testing before the final release, so we
are producing this release candidate. The final release will be
exactly the same unless any show-stopping (or brown bag) bugs are
found by testers of the release candidate.
Guidelines for submitting patches and making changes
Use good sense when committing changes. You should know what we
mean by good sense or we wouldn't have given you commit privileges
<0.5 wink>. Some specific examples of good sense include:
- Do whatever the dictator tells you.
- Discuss any controversial changes on python-dev first. If you
get a lot of +1 votes and no -1 votes, make the change. If you
get some -1 votes, think twice; consider asking Guido what he
thinks.
- If the change is to code you contributed, it probably makes
sense for you to fix it.
- If the change affects code someone else wrote, it probably makes
sense to ask him or her first.
- You can use the SF Patch Manager to submit a patch and assign it
to someone for review.
Any significant new feature must be described in a PEP and
approved before it is checked in.
Any significant code addition, such as a new module or large
patch, must include test cases for the regression test and
documentation. A patch should not be checked in until the tests
and documentation are ready.
If you fix a bug, you should write a test case that would have
caught the bug.
If you commit a patch from the SF Patch Manager or fix a bug from
the Jitterbug database, be sure to reference the patch/bug number
in the CVS log message. Also be sure to change the status in the
patch manager or bug database (if you have access to the bug
database).
It is not acceptable for any checked in code to cause the
regression test to fail. If a checkin causes a failure, it must
be fixed within 24 hours or it will be backed out.
All contributed C code must be ANSI C. If possible check it with
two different compilers, e.g. gcc and MSVC.
All contributed Python code must follow Guido's Python style
guide. http://www.python.org/doc/essays/styleguide.html
It is understood that any code contributed will be released under
an Open Source license. Do not contribute code if it can't be
released this way.
Failing test cases need to get fixed
We need to resolve errors in the regression test suite quickly.
Changes should not be committed to the CVS tree unless the
regression test runs cleanly with the changes applied. If it
fails, there may be bugs lurking in the code. (There may be bugs
anyway, but that's another matter.) If the test cases are known
to fail, they serve no useful purpose.
test case platform date reported
--------- -------- -------------
test_mmap Win ME 03-Sep-2000 Windows 2b1p2 prelease
[04-Sep-2000 tim
reported by Audun S. Runde mailto:audun@mindspring.com
the mmap constructor fails w/
WindowsError: [Errno 6] The handle is invalid
since there are no reports of this failing on other
flavors of Windows, this looks to be an ME bug
]
Open items -- Need to be resolved before 2.0 final release
Decide whether cycle-gc should be enabled by default.
Resolve compatibility issues between core xml package and the
XML-SIG XML package.
Update Tools/compiler so that it is compatible with list
comprehensions, import as, and any other new language features.
Improve code coverage of test suite.
Finish writing the PEPs for the features that went out with
2.0b1 (sad, but realistic -- we'll get better with practice).
Major effort to whittle the bug database down to size. I've (tim)
seen this before: if you can keep all the open bugs fitting on one
screen, people will generally keep it that way. But let it
slobber over a screen for a month, & it just goes to hell (no
"visible progress" indeed!).
Accepted and in progress
* Currently none left. [4-Sep-2000 guido]
Open: proposed but not accepted or rejected
* There are a number of open patches again. We need to clear
these out soon.
Previously failing test cases
If you find a test bouncing between this section and the previous one,
the code it's testing is in trouble!
test case platform date reported
--------- -------- -------------
test_fork1 Linux 26-Jul-2000
[28-aug-2000 fixed by cgw; solution is to create copies of
lock in child process]
[19-Aug-2000 tim
Charles Waldman whipped up a patch to give child processes a new
"global lock":
http://sourceforge.net/patch/?func=detailpatch&patch_id=101226&group_id=5470
While this doesn't appear to address the symptoms we *saw*, it
*does* so far appear to be fixing the failing cases anyway
]
test_parser all 22-Aug-2000
test_posixpath all 22-Aug-2000
test_popen2 Win32 26-Jul-2000
[31-Aug-2000 tim
This died again, but for an entirely different reason: it uses a
dict to map file pointers to process handles, and calls a dict
access function during popen.close(). But .close releases threads,
which left the internal popen code accessing the dict without a
valid thread state. The dict implementation changed so that's no
longer accepted. Fixed by creating a temporary thread state in the
guts of popen's close routine, and grabbing the global lock with
it for the duration]
[20-Aug-2000 tim
changed the popen2.py _test function to use the "more" cmd
when os.name == "nt". This makes test_popen2 pass under
Win98SE.
HOWEVER, the Win98 "more" invents a leading newline out
of thin air, and I'm not sure that the other Windows flavors
of "more" also do that.
So, somebody please try under other Windows flavors!
]
[still fails 15-Aug-2000 for me, on Win98 - tim
test test_popen2 crashed -- exceptions.AssertionError :
The problem is that the test uses "cat", but there is
no such thing under Windows (unless you install it).
So it's the test that's broken here, not (necessarily)
the code.
]
test_winreg Win32 26-Jul-2000
[works 15-Aug-2000 for me, on Win98 - tim]
test_mmap Win32 26-Jul-2000
[believe that was fixed by Mark H.]
[works 15-Aug-2000 for me, on Win98 - tim]
test_longexp Win98+? 15-Aug-2000
[fails in release build,
passes in release build under verbose mode but doesn't
look like it should pass,
passes in debug build,
passes in debug build under verbose mode and looks like
it should pass
]
[18-Aug-2000, tim: can't reproduce, and nobody else
saw it. I believe there *is* a subtle bug in
regrtest.py when using -v, and I'll pursue that,
but can't provoke anything wrong with test_longexp
anymore; eyeballing Fred's changes didn't turn up
a suspect either
19-Aug-2000, tim: the "subtle bug" in regrtest.py -v is
actually a feature: -v masks *some* kinds of failures,
since it doesn't compare test output with the canned
output; this is what makes it say "test passed" even
in some cases where the test fails without -v
]
test_winreg2 Win32 26-Jul-2000
[20-Aug-2000 tim - the test has been removed from the project]
[19-Aug-2000 tim
This test will never work on Win98, because it's looking for
a part of registry that doesn't exist under W98.
The module (winreg.py) and this test case will be removed
before 2.0 for other reasons, though.
]
[still fails 15-Aug-2000 for me, on Win98 - tim
test test_winreg2 failed -- Writing: 'Test Failed: testHives',
expected: 'HKEY_PERFORMANCE_DATA\012'
]
Open items -- completed/fixed
[4-Sep-2000 guido: Fredrik finished this on 1-Sep]
* PyErr_Format - Fredrik Lundh
Make this function safe from buffer overflows.
[4-Sep-2000 guido: Fred has added popen2, popen3 on 28-Sep]
Add popen2 support for Linux -- Fred Drake
[4-Sep-2000 guido: done on 1-Sep]
Deal with buffering problem with SocketServer
[04-Sep-2000 tim: done; installer runs; w9xpopen not an issue]
[01-Sep-2000 tim: make a prerelease available]
Windows ME: Don't know anything about it. Will the installer
even run? Does it need the w9xpopen hack?
[04-Sep-2000 tim: done; tested on several Windows flavors now]
[01-Sep-2000 tim: completed but untested except on Win98SE]
Windows installer: If HKLM isn't writable, back off to HKCU (so
Python can be installed on NT & 2000 without admin privileges).
[01-Sep-2000 tim - as Guido said, runtime code in posixmodule.c doesn't
call this on NT/2000, so no need to avoid installing it everywhere.
Added code to the installer *to* install it, though.]
Windows installer: Install w9xpopen.exe only under Win95/98.
[23-Aug-2000 jeremy - tim reports "completed recently"]
Windows: Look for registry info in HKCU before HKLM - Mark
Hammond.
[20-Aug-2000 tim - done]
Remove winreg.py and test_winreg2.py. Paul Prescod (the author)
now wants to make a registry API more like the MS .NET API. Unclear
whether that can be done in time for 2.0, but, regardless, if we
let winreg.py out the door we'll be stuck with it forever, and not
even Paul wants it anymore.
[24-Aug-2000 tim+guido - done]
Win98 Guido: popen is hanging on Guido, and even freezing the
whole machine. Was caused by Norton Antivirus 2000 (6.10.20) on
Windows 9x. Resolution: disable virus protection.
Accepted and completed
* Change meaning of \x escapes - PEP 223 - Fredrik Lundh
* Add \U12345678 escapes in u"" strings - Fredrik Lundh
* Support for opcode arguments > 2**16 - Charles Waldman
SF Patch 100893
* "import as" - Thomas Wouters
Extend the 'import' and 'from ... import' mechanism to enable
importing a symbol as another name. (Without adding a new keyword.)
* List comprehensions - Skip Montanaro
Tim Peters still needs to do PEP.
* Restore old os.path.commonprefix behavior
Do we have test cases that work on all platforms?
* Tim O'Malley's cookie module with good license
* Lockstep iteration ("zip" function) - Barry Warsaw
* SRE - Fredrik Lundh
[at least I *think* it's done, as of 15-Aug-2000 - tim]
* Fix xrange printing behavior - Fred Drake
Remove the tp_print handler for the xrange type; it produced a
list display instead of 'xrange(...)'. The new code produces a
minimal call to xrange(), enclosed in (... * N) when N != 1.
This makes the repr() more human readable while making it do
what reprs are advertised as doing. It also makes the xrange
objects obvious when working in the interactive interpreter.
* Extended print statement - Barry Warsaw
PEP 214
http://www.python.org/dev/peps/pep-0214/
SF Patch #100970
http://sourceforge.net/patch/?func=detailpatch&patch_id=100970&group_id=5470
* interface to poll system call - Andrew Kuchling
SF Patch 100852
* Augmented assignment - Thomas Wouters
Add += and family, plus Python and C hooks, and API functions.
* gettext.py module - Barry Warsaw
Postponed
* Extended slicing on lists - Michael Hudson
Make lists (and other builtin types) handle extended slices.
* Compression of Unicode database - Fredrik Lundh
SF Patch 100899
At least for 2.0b1. May be included in 2.0 as a bug fix.
* Range literals - Thomas Wouters
SF Patch 100902
We ended up having a lot of doubt about the proposal.
* Eliminated SET_LINENO opcode - Vladimir Marangozov
Small optimization achieved by using the code object's lnotab
instead of the SET_LINENO instruction. Uses code rewriting
technique (that Guido frowns on) to support the debugger, which
uses SET_LINENO.
http://starship.python.net/~vlad/lineno/
for (working at the time) patches
Discussions on python-dev:
- http://www.python.org/pipermail/python-dev/2000-April/subject.html
Subject: "Why do we need Traceback Objects?"
- http://www.python.org/pipermail/python-dev/1999-August/002252.html
* test harness for C code - Trent Mick
Rejected
* 'indexing-for' - Thomas Wouters
Special syntax to give Python code access to the loop-counter in 'for'
loops. (Without adding a new keyword.)
pep-0201 Lockstep Iteration
| PEP: | 201 |
|---|---|
| Title: | Lockstep Iteration |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Barry Warsaw <barry at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 13-Jul-2000 |
| Python-Version: | 2.0 |
| Post-History: | 27-Jul-2000 |
Introduction
This PEP describes the `lockstep iteration' proposal. This PEP
tracks the status and ownership of this feature, slated for
introduction in Python 2.0. It contains a description of the
feature and outlines changes necessary to support the feature.
This PEP summarizes discussions held in mailing list forums, and
provides URLs for further information, where appropriate. The CVS
revision history of this file contains the definitive historical
record.
Motivation
Standard for-loops in Python iterate over every element in a
sequence until the sequence is exhausted[1]. However, for-loops
iterate over only a single sequence, and it is often desirable to
loop over more than one sequence in a lock-step fashion. In other
words, in a way such that the i-th iteration through the loop
returns an object containing the i-th element from each sequence.
The common idioms used to accomplish this are unintuitive. This
PEP proposes a standard way of performing such iterations by
introducing a new builtin function called `zip'.
While the primary motivation for zip() comes from lock-step
iteration, by implementing zip() as a built-in function, it has
additional utility in contexts other than for-loops.
Lockstep For-Loops
Lockstep for-loops are non-nested iterations over two or more
sequences, such that at each pass through the loop, one element
from each sequence is taken to compose the target. This behavior
can already be accomplished in Python through the use of the map()
built-in function:
>>> a = (1, 2, 3)
>>> b = (4, 5, 6)
>>> for i in map(None, a, b): print i
...
(1, 4)
(2, 5)
(3, 6)
>>> map(None, a, b)
[(1, 4), (2, 5), (3, 6)]
The for-loop simply iterates over this list as normal.
While the map() idiom is a common one in Python, it has several
disadvantages:
- It is non-obvious to programmers without a functional
programming background.
- The use of the magic `None' first argument is non-obvious.
- It has arbitrary, often unintended, and inflexible semantics
when the lists are not of the same length: the shorter sequences
are padded with `None'.
>>> c = (4, 5, 6, 7)
>>> map(None, a, c)
[(1, 4), (2, 5), (3, 6), (None, 7)]
For these reasons, several proposals were floated in the Python
2.0 beta time frame for syntactic support of lockstep for-loops.
Here are two suggestions:
for x in seq1, y in seq2:
# stuff
for x, y in seq1, seq2:
# stuff
Neither of these forms would work, since they both already mean
something in Python and changing the meanings would break existing
code. All other suggestions for new syntax suffered the same
problem, or were in conflict with another proposed feature
called `list comprehensions' (see PEP 202).
The Proposed Solution
The proposed solution is to introduce a new built-in sequence
generator function, available in the __builtin__ module. This
function is to be called `zip' and has the following signature:
zip(seqa, [seqb, [...]])
zip() takes one or more sequences and weaves their elements
together, just as map(None, ...) does with sequences of equal
length. The weaving stops when the shortest sequence is
exhausted.
Return Value
zip() returns a real Python list, the same way map() does.
Examples
Here are some examples, based on the reference implementation
below.
>>> a = (1, 2, 3, 4)
>>> b = (5, 6, 7, 8)
>>> c = (9, 10, 11)
>>> d = (12, 13)
>>> zip(a, b)
[(1, 5), (2, 6), (3, 7), (4, 8)]
>>> zip(a, d)
[(1, 12), (2, 13)]
>>> zip(a, b, c, d)
[(1, 5, 9, 12), (2, 6, 10, 13)]
Note that when the sequences are of the same length, zip() is
reversible:
>>> a = (1, 2, 3)
>>> b = (4, 5, 6)
>>> x = zip(a, b)
>>> y = zip(*x) # alternatively, apply(zip, x)
>>> z = zip(*y) # alternatively, apply(zip, y)
>>> x
[(1, 4), (2, 5), (3, 6)]
>>> y
[(1, 2, 3), (4, 5, 6)]
>>> z
[(1, 4), (2, 5), (3, 6)]
>>> x == z
1
It is not possible to reverse zip this way when the sequences are
not all the same length.
Reference Implementation
Here is a reference implementation, in Python, of the zip()
built-in function. This will be replaced with a C implementation
after final approval.
def zip(*args):
if not args:
raise TypeError('zip() expects one or more sequence arguments')
ret = []
i = 0
try:
while 1:
item = []
for s in args:
item.append(s[i])
ret.append(tuple(item))
i = i + 1
except IndexError:
return ret
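In modern Python the built-in zip() provides exactly these truncating
semantics, except that it now returns a lazy iterator rather than a
list; the behavior described above can be checked directly:

```python
a = (1, 2, 3, 4)
b = (5, 6, 7)

# Weaving stops when the shortest sequence is exhausted.
# list() is needed because the modern built-in zip() is lazy.
assert list(zip(a, b)) == [(1, 5), (2, 6), (3, 7)]

# A single argument yields a list of 1-tuples.
assert list(zip(a)) == [(1,), (2,), (3,), (4,)]
```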
BDFL Pronouncements
Note: the BDFL refers to Guido van Rossum, Python's Benevolent
Dictator For Life.
- The function's name. An earlier version of this PEP included an
open issue listing 20+ proposed alternative names to zip(). In
the face of no overwhelmingly better choice, the BDFL strongly
prefers zip() due to its Haskell[2] heritage. See version 1.7
of this PEP for the list of alternatives.
- zip() shall be a built-in function.
- Optional padding. An earlier version of this PEP proposed an
optional `pad' keyword argument, which would be used when the
argument sequences were not the same length. This is similar
behavior to the map(None, ...) semantics except that the user
would be able to specify the pad object. This has been rejected by
the BDFL in favor of always truncating to the shortest sequence,
because of the KISS principle. If there's a true need, it is
easier to add later. If it turned out not to be needed, it would
still be impossible to delete in the future.
- Lazy evaluation. An earlier version of this PEP proposed that
zip() return a built-in object that performed lazy evaluation
using __getitem__() protocol. This has been strongly rejected
by the BDFL in favor of returning a real Python list. If lazy
evaluation is desired in the future, the BDFL suggests an xzip()
function be added.
- zip() with no arguments. The BDFL strongly prefers that this
raise a TypeError exception.
- zip() with one argument. The BDFL strongly prefers that this
return a list of 1-tuples.
- Inner and outer container control. An earlier version of this
PEP contains a rather lengthy discussion on a feature that some
people wanted, namely the ability to control what the inner and
outer container types were (they are tuples and lists
respectively in this version of the PEP). Given the simplified
API and implementation, this elaboration is rejected. For a
more detailed analysis, see version 1.7 of this PEP.
Subsequent Change to zip()
In Python 2.4, zip() with no arguments was modified to return an
empty list rather than raising a TypeError exception. The rationale
for the original behavior was that the absence of arguments was
thought to indicate a programming error. However, that thinking
did not anticipate the use of zip() with the * operator for unpacking
variable length argument lists. For example, the inverse of zip
could be defined as: unzip = lambda s: zip(*s). That transformation
also defines a matrix transpose or an equivalent row/column swap for
tables defined as lists of tuples. The latter transformation is
commonly used when reading data files with records as rows and fields
as columns. For example, the code:
date, rain, high, low = zip(*csv.reader(file("weather.csv")))
rearranges columnar data so that each field is collected into
individual tuples for straightforward looping and summarization:
print "Total rainfall", sum(rain)
Using zip(*args) is more easily coded if zip(*[]) is handled as an
allowable case rather than an exception. This is especially helpful
when data is either built up from or recursed down to a null case
with no records.
Seeing this possibility, the BDFL agreed (with some misgivings) to
have the behavior changed for Py2.4.
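The transpose idiom described above can be sketched in modern Python
(where zip() is lazy, so a list() call is added; the transpose helper
name is ours, not from the PEP):

```python
def transpose(rows):
    # zip(*rows) swaps rows and columns.  With rows == [] this
    # becomes zip(), whose empty result is exactly the allowable
    # null case that motivated the Py2.4 change.
    return list(zip(*rows))

table = [("2000-10-09", 0.0), ("2000-10-16", 1.2)]
columns = transpose(table)  # -> [("2000-10-09", "2000-10-16"), (0.0, 1.2)]
```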
Other Changes
- The xzip() function discussed above was implemented in Py2.3 in
the itertools module as itertools.izip(). This function provides
lazy behavior, consuming single elements and producing a single
tuple on each pass. The "just-in-time" style saves memory and
runs faster than its list based counterpart, zip().
- The itertools module also added itertools.repeat() and
itertools.chain(). These tools can be used together to pad
sequences with None (to match the behavior of map(None, seqn)):
zip(firstseq, chain(secondseq, repeat(None)))
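The padding idiom still works in today's Python, where itertools.izip has become the builtin zip; since zip() stops at its shortest input, the infinite repeat(None) tail is consumed only as far as the first sequence reaches. A sketch:

```python
from itertools import chain, repeat

first = [1, 2, 3, 4]
second = ['a', 'b']

# chain() appends an unbounded tail of None to the shorter sequence;
# zip() stops when `first` is exhausted, so only two Nones are drawn.
padded = list(zip(first, chain(second, repeat(None))))
# [(1, 'a'), (2, 'b'), (3, None), (4, None)]
```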
References
[1] http://docs.python.org/reference/compound_stmts.html#for
[2] http://www.haskell.org/onlinereport/standard-prelude.html#$vzip
Greg Wilson's questionnaire on proposed syntax, given to some CS grad students
http://www.python.org/pipermail/python-dev/2000-July/013139.html
Copyright
This document has been placed in the public domain.
pep-0202 List Comprehensions
| PEP: | 202 |
|---|---|
| Title: | List Comprehensions |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Barry Warsaw <barry at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 13-Jul-2000 |
| Python-Version: | 2.0 |
| Post-History: |
Introduction
This PEP describes a proposed syntactical extension to Python,
list comprehensions.
The Proposed Solution
It is proposed to allow conditional construction of list literals
using for and if clauses. They would nest in the same way that for
loops and if statements nest now.
Rationale
List comprehensions provide a more concise way to create lists in
situations where map() and filter() and/or nested loops would
currently be used.
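The equivalences the rationale refers to can be sketched directly (in today's syntax, where map() and filter() return iterators):

```python
# A mapping pass, written with map() and as a comprehension.
squares_map = list(map(lambda x: x * x, range(5)))
squares_comp = [x * x for x in range(5)]

# A filtering pass, written with filter() and as a comprehension.
evens_filter = list(filter(lambda x: x % 2 == 0, range(10)))
evens_comp = [x for x in range(10) if x % 2 == 0]
```

Both pairs produce identical lists; the comprehension avoids the lambda entirely.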
Examples
>>> print [i for i in range(10)]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> print [i for i in range(20) if i%2 == 0]
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
>>> nums = [1,2,3,4]
>>> fruit = ["Apples", "Peaches", "Pears", "Bananas"]
>>> print [(i,f) for i in nums for f in fruit]
[(1, 'Apples'), (1, 'Peaches'), (1, 'Pears'), (1, 'Bananas'),
(2, 'Apples'), (2, 'Peaches'), (2, 'Pears'), (2, 'Bananas'),
(3, 'Apples'), (3, 'Peaches'), (3, 'Pears'), (3, 'Bananas'),
(4, 'Apples'), (4, 'Peaches'), (4, 'Pears'), (4, 'Bananas')]
>>> print [(i,f) for i in nums for f in fruit if f[0] == "P"]
[(1, 'Peaches'), (1, 'Pears'),
(2, 'Peaches'), (2, 'Pears'),
(3, 'Peaches'), (3, 'Pears'),
(4, 'Peaches'), (4, 'Pears')]
>>> print [(i,f) for i in nums for f in fruit if f[0] == "P" if i%2 == 1]
[(1, 'Peaches'), (1, 'Pears'), (3, 'Peaches'), (3, 'Pears')]
>>> print [i for i in zip(nums,fruit) if i[0]%2==0]
[(2, 'Peaches'), (4, 'Bananas')]
Reference Implementation
List comprehensions become part of the Python language with
release 2.0, documented in [1].
BDFL Pronouncements
- The syntax proposed above is the Right One.
- The form [x, y for ...] is disallowed; one is required to write
[(x, y) for ...].
- The form [... for x... for y...] nests, with the last index
varying fastest, just like nested for loops.
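The nesting pronouncement can be checked directly; a sketch in modern syntax:

```python
nums = [1, 2]
fruit = ['Peaches', 'Pears']

comp = [(i, f) for i in nums for f in fruit]

# The equivalent nested loops: the last `for` clause is the innermost
# loop, so its index varies fastest.
nested = []
for i in nums:
    for f in fruit:
        nested.append((i, f))
```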
References
[1] http://docs.python.org/reference/expressions.html#list-displays
pep-0203 Augmented Assignments
| PEP: | 203 |
|---|---|
| Title: | Augmented Assignments |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Thomas Wouters <thomas at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 13-Jul-2000 |
| Python-Version: | 2.0 |
| Post-History: | 14-Aug-2000 |
Introduction
This PEP describes the `augmented assignment' proposal for Python
2.0. This PEP tracks the status and ownership of this feature,
slated for introduction in Python 2.0. It contains a description
of the feature and outlines changes necessary to support the
feature. This PEP summarizes discussions held in mailing list
forums, and provides URLs for further information where
appropriate. The CVS revision history of this file contains the
definitive historical record.
Proposed semantics
The proposed patch that adds augmented assignment to Python
introduces the following new operators:
+= -= *= /= %= **= <<= >>= &= ^= |=
They implement the same operator as their normal binary form,
except that the operation is done `in-place' when the left-hand
side object supports it, and that the left-hand side is only
evaluated once.
They truly behave as augmented assignment, in that they perform
all of the normal load and store operations, in addition to the
binary operation they are intended to do. So, given the expression:
x += y
The object `x' is loaded, then `y' is added to it, and the
resulting object is stored back in the original place. The precise
action performed on the two arguments depends on the type of `x',
and possibly of `y'.
The idea behind augmented assignment in Python is that it isn't
just an easier way to write the common practice of storing the
result of a binary operation in its left-hand operand, but also a
way for the left-hand operand in question to know that it should
operate `on itself', rather than creating a modified copy of
itself.
To make this possible, a number of new `hooks' are added to Python
classes and C extension types, which are called when the object in
question is used as the left hand side of an augmented assignment
operation. If the class or type does not implement the `in-place'
hooks, the normal hooks for the particular binary operation are
used.
So, given an instance object `x', the expression
x += y
tries to call x.__iadd__(y), which is the `in-place' variant of
__add__. If __iadd__ is not present, x.__add__(y) is attempted,
and finally y.__radd__(x) if __add__ is missing too. There is no
`right-hand-side' variant of __iadd__, because that would require
for `y' to know how to in-place modify `x', which is unsafe to say
the least. The __iadd__ hook should behave similarly to __add__,
returning the result of the operation (which could be `self')
which is to be assigned to the variable `x'.
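The hook protocol can be sketched at the Python level (the Accumulator class below is hypothetical, for illustration only):

```python
class Accumulator:
    """Hypothetical class illustrating the in-place hook."""

    def __init__(self):
        self.total = 0

    def __iadd__(self, n):
        # Mutate self and return it; the interpreter assigns the
        # return value back to the left-hand name.
        self.total += n
        return self

acc = Accumulator()
original = acc
acc += 5
acc += 7
# acc.total is 12, and acc is still the very same object as `original`,
# because __iadd__ returned self instead of a new object.
```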
For C extension types, the `hooks' are members of the
PyNumberMethods and PySequenceMethods structures. Some special
semantics apply to make the use of these methods, and the mixing
of Python instance objects and C types, as unsurprising as
possible.
In the generic case of `x <augop> y' (or a similar case using the
PyNumber_InPlace API functions) the principal object being
operated on is `x'. This differs from normal binary operations,
where `x' and `y' could be considered `co-operating', because
unlike in binary operations, the operands in an in-place operation
cannot be swapped. However, in-place operations do fall back to
normal binary operations when in-place modification is not
supported, resulting in the following rules:
- If the left-hand object (`x') is an instance object, and it
has a `__coerce__' method, call that function with `y' as the
argument. If coercion succeeds, and the resulting left-hand
object is a different object than `x', stop processing it as
in-place and call the appropriate function for the normal binary
operation, with the coerced `x' and `y' as arguments. The result
of the operation is whatever that function returns.
If coercion does not yield a different object for `x', or `x'
does not define a `__coerce__' method, and `x' has the
appropriate `__ihook__' for this operation, call that method
with `y' as the argument, and the result of the operation is
whatever that method returns.
- Otherwise, if the left-hand object is not an instance object,
but its type does define the in-place function for this
operation, call that function with `x' and `y' as the arguments,
and the result of the operation is whatever that function
returns.
Note that no coercion on either `x' or `y' is done in this case,
and it's perfectly valid for a C type to receive an instance
object as the second argument; that is something that cannot
happen with normal binary operations.
- Otherwise, process it exactly as a normal binary operation (not
in-place), including argument coercion. In short, if either
argument is an instance object, resolve the operation through
`__coerce__', `__hook__' and `__rhook__'. Otherwise, both
objects are C types, and they are coerced and passed to the
appropriate function.
- If no way to process the operation can be found, raise a
TypeError with an error message specific to the operation.
- Some special casing exists to account for the case of `+' and
`*', which have a special meaning for sequences: for `+',
sequence concatenation, no coercion whatsoever is done if a C
type defines sq_concat or sq_inplace_concat. For `*', sequence
repeating, `y' is converted to a C integer before calling either
sq_inplace_repeat or sq_repeat. This is done even if `y' is an
instance, though not if `x' is an instance.
The in-place function should always return a new reference, either
to the old `x' object if the operation was indeed performed
in-place, or to a new object.
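The fallback behavior is observable from Python code: mutable built-ins implement the in-place slot, while immutable ones fall back to the normal binary operation. A sketch (identity checks assume CPython):

```python
xs = [1, 2]
alias = xs
xs += [3]            # lists define the in-place slot: same object mutated
# `alias` sees the change because no new list was created.

t = (1, 2)
old_id = id(t)
t += (3,)            # tuples have no in-place slot: falls back to the
                     # normal binary add, producing a new tuple that is
                     # rebound to the name `t`
new_id = id(t)
```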
Rationale
There are two main reasons for adding this feature to Python:
simplicity of expression, and support for in-place operations. The
end result is a tradeoff between simplicity of syntax and
simplicity of expression; like most new features, augmented
assignment doesn't add anything that was previously impossible. It
merely makes these things easier to do.
Adding augmented assignment will make Python's syntax more complex.
Instead of a single assignment operation, there are now twelve
assignment operations, eleven of which also perform a binary
operation. However, these eleven new forms of assignment are easy
to grasp as the coupling of assignment and the binary operation,
and they require no large conceptual leap to understand.
Furthermore, languages that do have augmented assignment have
shown that it is a popular, much-used feature.
Expressions of the form
<x> = <x> <operator> <y>
are common enough in those languages to make the extra syntax
worthwhile, and Python does not have significantly fewer of those
expressions. Quite the opposite, in fact, since in Python you can
also concatenate lists with a binary operator, something that is
done quite frequently. Writing the above expression as
<x> <operator>= <y>
is both more readable and less error prone, because it is
instantly obvious to the reader that it is <x> that is being
changed, and not <x> that is being replaced by something almost,
but not quite, entirely unlike <x>.
The new in-place operations are especially useful to matrix
calculation and other applications that require large objects. In
order to efficiently deal with the available program memory, such
packages cannot blindly use the current binary operations. Because
these operations always create a new object, adding a single item
to an existing (large) object would result in copying the entire
object (which may cause the application to run out of memory),
adding the single item, and then possibly deleting the original
object, depending on the reference count.
To work around this problem, the packages currently have to use
methods or functions to modify an object in-place, which is
definitely less readable than an augmented assignment expression.
Augmented assignment won't solve all the problems for these
packages, since some operations cannot be expressed in the limited
set of binary operators to start with, but it is a start. A
different PEP[2] is looking at adding new operators.
New methods
The proposed implementation adds the following 11 possible `hooks'
which Python classes can implement to overload the augmented
assignment operations:
__iadd__
__isub__
__imul__
__idiv__
__imod__
__ipow__
__ilshift__
__irshift__
__iand__
__ixor__
__ior__
The `i' in `__iadd__' stands for `in-place'.
For C extension types, the following struct members are added:
To PyNumberMethods:
binaryfunc nb_inplace_add;
binaryfunc nb_inplace_subtract;
binaryfunc nb_inplace_multiply;
binaryfunc nb_inplace_divide;
binaryfunc nb_inplace_remainder;
binaryfunc nb_inplace_power;
binaryfunc nb_inplace_lshift;
binaryfunc nb_inplace_rshift;
binaryfunc nb_inplace_and;
binaryfunc nb_inplace_xor;
binaryfunc nb_inplace_or;
To PySequenceMethods:
binaryfunc sq_inplace_concat;
intargfunc sq_inplace_repeat;
In order to keep binary compatibility, the tp_flags TypeObject
member is used to determine whether the TypeObject in question has
allocated room for these slots. Until a clean break in binary
compatibility is made (which may or may not happen before 2.0)
code that wants to use one of the new struct members must first
check that they are available with the `PyType_HasFeature()'
macro:
if (PyType_HasFeature(x->ob_type, Py_TPFLAGS_HAVE_INPLACE_OPS) &&
    x->ob_type->tp_as_number && x->ob_type->tp_as_number->nb_inplace_add) {
    /* ... */
}
This check must be made even before testing the method slots for
NULL values! The macro only tests whether the slots are available,
not whether they are filled with methods or not.
Implementation
The current implementation of augmented assignment[1] adds, in
addition to the methods and slots already covered, 13 new bytecodes
and 13 new API functions.
The API functions are simply in-place versions of the current
binary-operation API functions:
PyNumber_InPlaceAdd(PyObject *o1, PyObject *o2);
PyNumber_InPlaceSubtract(PyObject *o1, PyObject *o2);
PyNumber_InPlaceMultiply(PyObject *o1, PyObject *o2);
PyNumber_InPlaceDivide(PyObject *o1, PyObject *o2);
PyNumber_InPlaceRemainder(PyObject *o1, PyObject *o2);
PyNumber_InPlacePower(PyObject *o1, PyObject *o2);
PyNumber_InPlaceLshift(PyObject *o1, PyObject *o2);
PyNumber_InPlaceRshift(PyObject *o1, PyObject *o2);
PyNumber_InPlaceAnd(PyObject *o1, PyObject *o2);
PyNumber_InPlaceXor(PyObject *o1, PyObject *o2);
PyNumber_InPlaceOr(PyObject *o1, PyObject *o2);
PySequence_InPlaceConcat(PyObject *o1, PyObject *o2);
PySequence_InPlaceRepeat(PyObject *o, int count);
They call either the Python class hooks (if either of the objects
is a Python class instance) or the C type's number or sequence
methods.
The new bytecodes are:
INPLACE_ADD
INPLACE_SUBTRACT
INPLACE_MULTIPLY
INPLACE_DIVIDE
INPLACE_REMAINDER
INPLACE_POWER
INPLACE_LEFTSHIFT
INPLACE_RIGHTSHIFT
INPLACE_AND
INPLACE_XOR
INPLACE_OR
ROT_FOUR
DUP_TOPX
The INPLACE_* bytecodes mirror the BINARY_* bytecodes, except that
they are implemented as calls to the `InPlace' API functions. The
other two bytecodes are `utility' bytecodes: ROT_FOUR behaves like
ROT_THREE except that the four topmost stack items are rotated.
DUP_TOPX is a bytecode that takes a single argument, which should
be an integer between 1 and 5 (inclusive) which is the number of
items to duplicate in one block. Given a stack like this (where
the right side of the list is the `top' of the stack):
[1, 2, 3, 4, 5]
"DUP_TOPX 3" would duplicate the top 3 items, resulting in this
stack:
[1, 2, 3, 4, 5, 3, 4, 5]
DUP_TOPX with an argument of 1 is the same as DUP_TOP. The limit
of 5 is purely an implementation limit. The implementation of
augmented assignment requires only DUP_TOPX with an argument of 2
and 3, and could do without this new opcode at the cost of a fair
number of DUP_TOP and ROT_*.
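Neither bytecode exists in modern CPython, but their stack effects are easy to model; a pure-Python sketch (the helper names are ours, and the end of the list is treated as the top of the stack):

```python
def dup_topx(stack, x):
    """Model of DUP_TOPX: duplicate the top x items as one block."""
    stack.extend(stack[-x:])

def rot_four(stack):
    """Model of ROT_FOUR: the top item moves down to position four,
    lifting the next three items one position up."""
    stack.insert(-3, stack.pop())

s = [1, 2, 3, 4, 5]
dup_topx(s, 3)
# s is now [1, 2, 3, 4, 5, 3, 4, 5], as in the example above.
```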
Open Issues
The PyNumber_InPlace API is only a subset of the normal PyNumber
API: only those functions that are required to support the
augmented assignment syntax are included. If other in-place API
functions are needed, they can be added later.
The DUP_TOPX bytecode is a convenience bytecode, and is not
actually necessary. It should be considered whether this bytecode
is worth having. There seems to be no other possible use for this
bytecode at this time.
Copyright
This document has been placed in the public domain.
References
[1] http://www.python.org/pipermail/python-list/2000-June/059556.html
[2] http://sourceforge.net/patch?func=detailpatch&patch_id=100699&group_id=5470
[3] PEP 211, Adding A New Outer Product Operator, Wilson
http://www.python.org/dev/peps/pep-0211/
pep-0204 Range Literals
| PEP: | 204 |
|---|---|
| Title: | Range Literals |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Thomas Wouters <thomas at python.org> |
| Status: | Rejected |
| Type: | Standards Track |
| Created: | 14-Jul-2000 |
| Python-Version: | 2.0 |
| Post-History: |
Introduction
This PEP describes the `range literal' proposal for Python 2.0.
This PEP tracks the status and ownership of this feature, slated
for introduction in Python 2.0. It contains a description of the
feature and outlines changes necessary to support the feature.
This PEP summarizes discussions held in mailing list forums, and
provides URLs for further information, where appropriate. The CVS
revision history of this file contains the definitive historical
record.
List ranges
Ranges are sequences of numbers of a fixed stepping, often used in
for-loops. The Python for-loop is designed to iterate over a
sequence directly:
>>> l = ['a', 'b', 'c', 'd']
>>> for item in l:
... print item
a
b
c
d
However, this solution is not always prudent. Firstly, problems
arise when altering the sequence in the body of the for-loop,
resulting in the for-loop skipping items. Secondly, it is not
possible to iterate over, say, every second element of the
sequence. And thirdly, it is sometimes necessary to process an
element based on its index, which is not readily available in the
above construct.
For these instances, and others where a range of numbers is
desired, Python provides the `range' builtin function, which
creates a list of numbers. The `range' function takes three
arguments, `start', `end' and `step'. `start' and `step' are
optional, and default to 0 and 1, respectively.
The `range' function creates a list of numbers, starting at
`start', with a step of `step', up to, but not including `end', so
that `range(10)' produces a list that has exactly 10 items, the
numbers 0 through 9.
Using the `range' function, the above example would look like
this:
>>> for i in range(len(l)):
... print l[i]
a
b
c
d
Or, to start at the second element of `l' and process only
every second element from then on:
>>> for i in range(1, len(l), 2):
... print l[i]
b
d
There are several disadvantages with this approach:
- Clarity of purpose: Adding another function call, possibly with
extra arithmetic to determine the desired length and step of the
list, does not improve readability of the code. Also, it is
possible to `shadow' the builtin `range' function by supplying a
local or global variable with the same name, effectively
replacing it. This may or may not be a desired effect.
- Efficiency: because the `range' function can be overridden, the
Python compiler cannot make assumptions about the for-loop, and
has to maintain a separate loop counter.
- Consistency: There already is a syntax that is used to denote
ranges, as shown below. This syntax uses the exact same
arguments, though all optional, in the exact same way. It seems
logical to extend this syntax to ranges, to form `range
literals'.
Slice Indices
In Python, a sequence can be indexed in one of two ways:
retrieving a single item, or retrieving a range of items.
Retrieving a range of items results in a new object of the same
type as the original sequence, containing zero or more items from
the original sequence. This is done using a `range notation':
>>> l[2:4]
['c', 'd']
This range notation consists of zero, one or two indices separated
by a colon. The first index is the `start' index, the second the
`end'. When either is left out, they default to respectively the
start and the end of the sequence.
There is also an extended range notation, which incorporates
`step' as well. Though this notation is not currently supported
by most builtin types, if it were, it would work as follows:
>>> l[1:4:2]
['b', 'd']
The third `argument' to the slice syntax is exactly the same as
the `step' argument to range(). The underlying mechanisms of the
standard, and these extended slices, are sufficiently different
and inconsistent that many classes and extensions outside of
mathematical packages do not implement support for the extended
variant. While this should be resolved, it is beyond the scope of
this PEP.
Extended slices do show, however, that there is already a
perfectly valid and applicable syntax to denote ranges in a way
that solves all of the earlier stated disadvantages of the use of
the range() function:
- It is clearer, more concise syntax, which has already proven to
be both intuitive and easy to learn.
- It is consistent with the other use of ranges in Python
(e.g. slices).
- Because it is built-in syntax, instead of a builtin function, it
cannot be overridden. This means both that a viewer can be
certain about what the code does, and that an optimizer will not
have to worry about range() being `shadowed'.
The Proposed Solution
The proposed implementation of range-literals combines the syntax
for list literals with the syntax for (extended) slices, to form
range literals:
>>> [1:10]
[1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> [:5]
[0, 1, 2, 3, 4]
>>> [5:1:-1]
[5, 4, 3, 2]
There is one minor difference between range literals and the slice
syntax: though it is possible to omit all of `start', `end' and
`step' in slices, it does not make sense to omit `end' in range
literals. In slices, `end' would default to the end of the list,
but this has no meaning in range literals.
Reference Implementation
The proposed implementation can be found on SourceForge[1]. It
adds a new bytecode, BUILD_RANGE, that takes three arguments from
the stack and builds a list on the basis of those. The list is
pushed back onto the stack.
The use of a new bytecode is necessary to be able to build ranges
based on other calculations, whose outcome is not known at compile
time.
The code introduces two new functions to listobject.c, which are
currently hovering between private functions and full-fledged API
calls.
PyList_FromRange() builds a list from start, end and step,
returning NULL if an error occurs. Its prototype is:
PyObject * PyList_FromRange(long start, long end, long step)
PyList_GetLenOfRange() is a helper function used to determine the
length of a range. Previously, it was a static function in
bltinmodule.c, but is now necessary in both listobject.c and
bltinmodule.c (for xrange). It is made non-static solely to avoid
code duplication. Its prototype is:
long PyList_GetLenOfRange(long start, long end, long step)
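A Python rendering of the two helpers may clarify their contracts (the function names mirror the C ones, but the rendering is ours, and it assumes a nonzero step):

```python
def get_len_of_range(start, end, step):
    """Number of items the range contains, i.e. what
    len(range(start, end, step)) reports today."""
    if step > 0 and start < end:
        return (end - start + step - 1) // step
    if step < 0 and start > end:
        return (start - end - step - 1) // -step
    return 0

def list_from_range(start, end, step):
    """Build the list a range literal such as [start:end:step]
    would have produced."""
    n = get_len_of_range(start, end, step)
    return [start + i * step for i in range(n)]
```

For example, list_from_range(5, 1, -1) yields the [5:1:-1] example above.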
Open issues
- One possible solution to the discrepancy of requiring the `end'
argument in range literals is to allow the range syntax to
create a `generator', rather than a list, such as the `xrange'
builtin function does. However, a generator would not be a
list, and it would be impossible, for instance, to assign to
items in the generator, or append to it.
The range syntax could conceivably be extended to include tuples
(i.e. immutable lists), which could then be safely implemented
as generators. This may be a desirable solution, especially for
large number arrays: generators require very little in the way
of storage and initialization, and there is only a small
performance impact in calculating and creating the appropriate
number on request. (TBD: is there any at all? Cursory testing
suggests equal performance even in the case of ranges of length
1)
However, even if this idea were adopted, would it be wise to
`special case' the second argument, making it optional in one
instance of the syntax, and non-optional in other cases?
- Should it be possible to mix range syntax with normal list
literals, creating a single list? E.g.:
>>> [5, 6, 1:6, 7, 9]
to create
[5, 6, 1, 2, 3, 4, 5, 7, 9]
- How should range literals interact with another proposed new
feature, `list comprehensions'[2]? Specifically, should it be
possible to create lists in list comprehensions? E.g.:
>>> [x:y for x in (1, 2) for y in (3, 4)]
Should this example return a single list with multiple ranges:
[1, 2, 1, 2, 3, 2, 2, 3]
Or a list of lists, like so:
[[1, 2], [1, 2, 3], [2], [2, 3]]
However, as the syntax and semantics of list comprehensions are
still subject of hot debate, these issues are probably best
addressed by the `list comprehensions' PEP.
- Range literals accept objects other than integers: they perform
PyInt_AsLong() on the objects passed in, so as long as the
objects can be coerced into integers, they will be accepted.
The resulting list, however, is always composed of standard
integers.
Should range literals create a list of the passed-in type? It
might be desirable in the cases of other builtin types, such as
longs and strings:
>>> [ 1L : 2L<<64 : 2<<32L ]
>>> ["a":"z":"b"]
>>> ["a":"z":2]
However, this might be too much `magic' to be obvious. It might
also present problems with user-defined classes: even if the
base class can be found and a new instance created, the instance
may require additional arguments to __init__, causing the
creation to fail.
- The PyList_FromRange() and PyList_GetLenOfRange() functions need
to be classified: are they part of the API, or should they be
made private functions?
Rejection
After careful consideration, and a period of meditation, this
proposal has been rejected. The open issues, as well as some
confusion between ranges and slice syntax, raised enough questions
for Guido not to accept it for Python 2.0, and later to reject the
proposal altogether. The new syntax and its intentions were deemed
not obvious enough.
[ TBD: Guido, amend/confirm this, please. Preferably both; this
is a PEP, it should contain *all* the reasons for rejection
and/or reconsideration, for future reference. ]
Copyright
This document has been placed in the Public Domain.
References:
[1] http://sourceforge.net/patch/?func=detailpatch&patch_id=100902&group_id=5470
[2] PEP 202, List Comprehensions
pep-0205 Weak References
| PEP: | 205 |
|---|---|
| Title: | Weak References |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Fred L. Drake, Jr. <fdrake at acm.org> |
| Status: | Final |
| Type: | Standards Track |
| Created: | |
| Python-Version: | 2.1 |
| Post-History: | 11-Jan-2001 |
Motivation
There are two basic applications for weak references which have
been noted by Python programmers: object caches and reduction of
pain from circular references.
Caches (weak dictionaries)
There is a need to allow objects to be maintained that represent
external state, mapping a single instance to the external
reality, where allowing multiple instances to be mapped to the
same external resource would create unnecessary difficulty
maintaining synchronization among instances. In these cases,
a common idiom is to support a cache of instances; a factory
function is used to return either a new or existing instance.
The difficulty in this approach is that one of two things must
be tolerated: either the cache grows without bound, or there
needs to be explicit management of the cache elsewhere in the
application. The latter can be very tedious and leads to more
code than is really necessary to solve the problem at hand,
and the former can be unacceptable for long-running processes
or even relatively short processes with substantial memory
requirements.
- External objects that need to be represented by a single
instance, no matter how many internal users there are. This
can be useful for representing files that need to be written
back to disk in whole rather than locked & modified for
every use.
- Objects that are expensive to create, but may be needed by
multiple internal consumers. Similar to the first case, but
not necessarily bound to external resources, and possibly
not an issue for shared state. Weak references are only
useful in this case if there is some flavor of "soft"
references or if there is a high likelihood that users of
individual objects will overlap in lifespan.
Circular references
- DOMs require a huge amount of circular (to parent & document
nodes) references, but these could be eliminated using a weak
dictionary mapping from each node to its parent. This
might be especially useful in the context of something like
xml.dom.pulldom, allowing the .unlink() operation to become
a no-op.
This proposal is divided into the following sections:
- Proposed Solution
- Implementation Strategy
- Possible Applications
- Previous Weak Reference Work in Python
- Weak References in Java
The full text of one early proposal is included as an appendix
since it does not appear to be available on the net.
Aspects of the Solution Space
There are two distinct aspects to the weak references problem:
- Invalidation of weak references
- Presentation of weak references to Python code
Invalidation:
Past approaches to weak reference invalidation have often hinged
on storing a strong reference and being able to examine all the
instances of weak reference objects, and invalidating them when
the reference count of their referent goes to one (indicating that
the reference stored by the weak reference is the last remaining
reference). This has the advantage that the memory management
machinery in Python need not change, and that any type can be
weakly referenced.
The disadvantage of this approach to invalidation is that it
assumes that the management of the weak references is called
sufficiently frequently that weakly-referenced objects are noticed
within a reasonably short time frame; since this means a scan over
some data structure to invalidate references, an operation which
is O(N) on the number of weakly referenced objects, this is not
effectively amortized for any single object which is weakly
referenced. This also assumes that the application is calling
into code which handles weakly-referenced objects with some
frequency, which makes weak-references less attractive for library
code.
An alternate approach to invalidation is for the de-allocation
code to be aware of the possibility of weak references and to make
a specific call into the weak-reference management code to perform
invalidation whenever an object is deallocated. This requires a
change in the tp_dealloc handler for weakly-referencable objects;
an additional call is needed at the "top" of the handler for
objects which support weak-referencing, and an efficient way to
map from an object to a chain of weak references for that object
is needed as well.
Presentation:
Two ways that weak references are presented to the Python layer
have been as explicit reference objects upon which some operation
is required in order to retrieve a usable reference to the
underlying object, and proxy objects which masquerade as the
original objects as much as possible.
Reference objects are easy to work with when some additional layer
of object management is being added in Python; references can be
checked for liveness explicitly, without having to invoke
operations on the referents and catching some special exception
raised when an invalid weak reference is used.
However, a number of users favor the proxy approach simply because
the weak reference looks so much like the original object.
Proposed Solution
Weak references should be able to point to any Python object that
may have substantial memory size (directly or indirectly), or hold
references to external resources (database connections, open
files, etc.).
A new module, weakref, will contain new functions used to create
weak references. weakref.ref() will create a "weak reference
object" and optionally attach a callback which will be called when
the object is about to be finalized. weakref.mapping() will
create a "weak dictionary". A third function, weakref.proxy(),
will create a proxy object that behaves somewhat like the original
object.
A weak reference object will allow access to the referenced object
if it hasn't been collected and to determine if the object still
exists in memory. Retrieving the referent is done by calling the
reference object. If the referent is no longer alive, this will
return None instead.
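The weakref module as shipped kept this interface for reference objects (the mapping() function of the proposal became the weak dictionary classes); a sketch, relying on CPython's immediate reference-counting collection:

```python
import weakref

class Resource:
    """Hypothetical stand-in for an object worth tracking weakly."""

obj = Resource()
events = []

# The optional callback fires when the referent is about to be
# finalized; it receives the (now dead) reference object itself.
r = weakref.ref(obj, lambda ref: events.append('finalized'))

alive = r() is obj     # calling the reference retrieves the referent
del obj                # last strong reference gone; collected at once
                       # under CPython's reference counting
dead = r() is None     # a dead reference returns None
```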
A weak dictionary maps arbitrary keys to values, but does not own
a reference to the values. When the values are finalized, the
(key, value) pairs for which it is a value are removed from all
the mappings containing such pairs. Like dictionaries, weak
dictionaries are not hashable.
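In the module as shipped, weak dictionaries are spelled WeakValueDictionary (values held weakly) and WeakKeyDictionary (keys held weakly); a sketch of the value-side behavior, again assuming CPython's immediate collection:

```python
import weakref

class Value:
    """Hypothetical cached object."""

cache = weakref.WeakValueDictionary()
v = Value()
cache['key'] = v

present = 'key' in cache   # entry is visible while the value is alive
del v                      # last strong reference dropped
gone = 'key' not in cache  # the (key, value) pair was removed
```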
Proxy objects are weak references that attempt to behave like the
object they proxy, as much as they can. Regardless of the
underlying type, proxies are not hashable since their ability to
act as a weak reference relies on a fundamental mutability that
will cause failures when used as dictionary keys -- even if the
proper hash value is computed before the referent dies, the
resulting proxy cannot be used as a dictionary key since it cannot
be compared once the referent has expired, and comparability is
necessary for dictionary keys. Operations on proxy objects after
the referent dies cause weakref.ReferenceError to be raised in
most cases. "is" comparisons, type(), and id() will continue to
work, but always refer to the proxy and not the referent.
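A short sketch of the proxy behavior described above, using the weakref module as shipped (Node is a hypothetical class):

```python
import weakref

class Node:
    def __init__(self, name):
        self.name = name

n = Node('root')
p = weakref.proxy(n)

assert p.name == 'root'      # attribute access is forwarded to the referent
assert type(p) is not Node   # type() reports the proxy, not the referent

del n                        # the referent dies
try:
    p.name                   # operations on a dead proxy now fail
except ReferenceError:       # weakref.ReferenceError is an alias for this
    dead = True
assert dead
```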
The callbacks registered with weak references must accept a single
parameter, which will be the weak reference or proxy object
itself. The object cannot be accessed or resurrected in the
callback.
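The single-parameter callback convention can be sketched as follows; note that inside the callback the referent is already unreachable, so the reference yields None:

```python
import weakref

notified = []

def on_finalize(ref):
    # The callback receives the weak reference itself; the referent
    # cannot be accessed or resurrected here, so ref() is already None.
    notified.append(ref() is None)

class Obj:
    pass

o = Obj()
r = weakref.ref(o, on_finalize)   # attach the callback at creation time
del o                             # finalization triggers the callback

assert notified == [True]         # fired once; referent was inaccessible
```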
Implementation Strategy
The implementation of weak references will include a list of
reference containers that must be cleared for each weakly-
referencable object. If the reference is from a weak dictionary,
the dictionary entry is cleared first. Then, any associated
callback is called with the object passed as a parameter. Once
all callbacks have been called, the object is finalized and
deallocated.
Many built-in types will participate in the weak-reference
management, and any extension type can elect to do so. The type
structure will contain an additional field which provides an
offset into the instance structure which contains a list of weak
reference structures. If the value of the field is <= 0, the
object does not participate. In this case, weakref.ref(),
<weakdict>.__setitem__() and .setdefault(), and item assignment will
raise TypeError. If the value of the field is > 0, a new weak
reference can be generated and added to the list.
This approach is taken to allow arbitrary extension types to
participate, without taking a memory hit for numbers or other
small types.
Standard types which support weak references include instances,
functions, and bound & unbound methods. With the addition of
class types ("new-style classes") in Python 2.2, types grew
support for weak references. Instances of class types are weakly
referencable if they have a base type which is weakly referencable,
if the class does not specify __slots__, or if a slot is named
__weakref__.
Generators also support weak references.
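These rules are visible from Python as shipped: a __slots__ class participates only if a __weakref__ slot is present, and small built-in types such as ints opt out to avoid the per-object memory hit:

```python
import weakref

class Plain:
    pass

weakref.ref(Plain())            # ordinary instances participate

class Slotted:
    __slots__ = ('x',)          # no __weakref__ slot: does not participate

try:
    weakref.ref(Slotted())
except TypeError:
    blocked = True
assert blocked

class SlottedWeak:
    __slots__ = ('x', '__weakref__')   # opting back in via a named slot

weakref.ref(SlottedWeak())      # works again

try:
    weakref.ref(42)             # numbers do not participate
except TypeError:
    blocked_int = True
assert blocked_int
```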
Possible Applications
PyGTK+ bindings?
Tkinter -- could avoid circular references by using weak
references from widgets to their parents. Objects won't be
discarded any sooner in the typical case, but there won't be so
much dependence on the programmer calling .destroy() before
releasing a reference. This would mostly benefit long-running
applications.
DOM trees.
Previous Weak Reference Work in Python
Dianne Hackborn has proposed something called "virtual references".
'vref' objects are very similar to java.lang.ref.WeakReference
objects, except there is no equivalent to the invalidation
queues. Implementing a "weak dictionary" would be just as
difficult as using only weak references (without the invalidation
queue) in Java. Information on this has disappeared from the Web,
but is included below as an Appendix.
Marc-AndrĂŠ Lemburg's mx.Proxy package:
http://www.lemburg.com/files/python/mxProxy.html
The weakdict module by Dieter Maurer is implemented in C and
Python. It appears that the Web pages have not been updated since
Python 1.5.2a, so I'm not yet sure if the implementation is
compatible with Python 2.0.
http://www.handshake.de/~dieter/weakdict.html
PyWeakReference by Alex Shindich:
http://sourceforge.net/projects/pyweakreference/
Eric Tiedemann has a weak dictionary implementation:
http://www.hyperreal.org/~est/python/weak/
Weak References in Java
http://java.sun.com/j2se/1.3/docs/api/java/lang/ref/package-summary.html
Java provides three forms of weak references, and one interesting
helper class. The three forms are called "weak", "soft", and
"phantom" references. The relevant classes are defined in the
java.lang.ref package.
For each of the reference types, there is an option to add the
reference to a queue when it is invalidated by the memory
allocator. The primary purpose of this facility seems to be that
it allows larger structures to be composed to incorporate
weak-reference semantics without having to impose substantial
additional locking requirements. For instance, it would not be
difficult to use this facility to create a "weak" hash table which
removes keys and referents when a reference is no longer used
elsewhere. Using weak references for the objects without some
sort of notification queue for invalidations leads to much more
tedious implementation of the various operations required on hash
tables. This can be a performance bottleneck if deallocations of
the stored objects are infrequent.
Java's "weak" references are most like Dianne Hackborn's old vref
proposal: a reference object refers to a single Python object,
but does not own a reference to that object. When that object is
deallocated, the reference object is invalidated. Users of the
reference object can easily determine that the reference has been
invalidated, or a NullObjectDereferenceError can be raised when
an attempt is made to use the referred-to object.
The "soft" references are similar, but are not invalidated as soon
as all other references to the referred-to object have been
released. The "soft" reference does own a reference, but allows
the memory allocator to free the referent if the memory is needed
elsewhere. It is not clear whether this means soft references are
released before the malloc() implementation calls sbrk() or its
equivalent, or if soft references are only cleared when malloc()
returns NULL.
"Phantom" references are a little different; unlike weak and soft
references, the referent is not cleared when the reference is
added to its queue. When all phantom references for an object
are dequeued, the object is cleared. This can be used to keep an
object alive until some additional cleanup is performed which
needs to happen before the objects .finalize() method is called.
Unlike the other two reference types, "phantom" references must be
associated with an invalidation queue.
Appendix -- Dianne Hackborn's vref proposal (1995)
[This has been indented and paragraphs reflowed, but there have been
no content changes. --Fred]
Proposal: Virtual References
In an attempt to partly address the recurring discussion
concerning reference counting vs. garbage collection, I would like
to propose an extension to Python which should help in the
creation of "well structured" cyclic graphs. In particular, it
should allow at least trees with parent back-pointers and
doubly-linked lists to be created without worry about cycles.
The basic mechanism I'd like to propose is that of a "virtual
reference," or a "vref" from here on out. A vref is essentially a
handle on an object that does not increment the object's reference
count. This means that holding a vref on an object will not keep
the object from being destroyed. This would allow the Python
programmer, for example, to create the aforementioned tree
structure, which is automatically destroyed when it
is no longer in use -- by making all of the parent back-references
into vrefs, they no longer create reference cycles which keep the
tree from being destroyed.
In order to implement this mechanism, the Python core must ensure
that no -real- pointers are ever left referencing objects that no
longer exist. The implementation I would like to propose involves
two basic additions to the current Python system:
1. A new "vref" type, through which the Python programmer creates
and manipulates virtual references. Internally, it is
basically a C-level Python object with a pointer to the Python
object it is a reference to. Unlike all other Python code,
however, it does not change the reference count of this object.
In addition, it includes two pointers to implement a
doubly-linked list, which is used below.
2. The addition of a new field to the basic Python object
[PyObject_Head in object.h], which is either NULL, or points to
the head of a list of all vref objects that reference it. When
a vref object attaches itself to another object, it adds itself
to this linked list. Then, if an object with any vrefs on it
is deallocated, it may walk this list and ensure that all of
the vrefs on it point to some safe value, e.g. Nothing.
This implementation should hopefully have a minimal impact on the
current Python core -- when no vrefs exist, it should only add one
pointer to all objects, and a check for a NULL pointer every time
an object is deallocated.
Back at the Python language level, I have considered two possible
semantics for the vref object --
==> Pointer semantics:
In this model, a vref behaves essentially like a Python-level
pointer; the Python program must explicitly dereference the vref
to manipulate the actual object it references.
An example vref module using this model could include the
function "new"; when used as 'MyVref = vref.new(MyObject)', it
returns a new vref object such that MyVref.object ==
MyObject. MyVref.object would then change to Nothing if
MyObject is ever deallocated.
For a concrete example, we may introduce some new C-style syntax:
& -- unary operator, creates a vref on an object, same as vref.new().
* -- unary operator, dereference a vref, same as VrefObject.object.
We can then define:
1. type(&MyObject) == vref.VrefType
2. *(&MyObject) == MyObject
3. (*(&MyObject)).attr == MyObject.attr
4. &&MyObject == Nothing
5. *MyObject -> exception
Rule #4 is subtle, but comes about because we have made a vref
to (a vref with no real references). Thus the outer vref is
cleared to Nothing when the inner one inevitably disappears.
==> Proxy semantics:
In this model, the Python programmer manipulates vref objects
just as if she were manipulating the object it is a reference
of. This is accomplished by implementing the vref so that all
operations on it are redirected to its referenced object. With
this model, the dereference operator (*) no longer makes sense;
instead, we have only the reference operator (&), and define:
1. type(&MyObject) == type(MyObject)
2. &MyObject == MyObject
3. (&MyObject).attr == MyObject.attr
4. &&MyObject == MyObject
Again, rule #4 is important -- here, the outer vref is in fact a
reference to the original object, and -not- the inner vref.
This is because all operations applied to a vref actually apply
to its object, so that creating a vref of a vref actually
results in creating a vref of the latter's object.
The first, pointer semantics, has the advantage that it would be
very easy to implement; the vref type is extremely simple,
requiring at minimum a single attribute, object, and a function to
create a reference.
However, I really like the proxy semantics. Not only does it put
less of a burden on the Python programmer, but it allows you to do
nice things like use a vref anywhere you would use the actual
object. Unfortunately, it would probably be an extreme pain, if not
practically impossible, to implement in the current Python
implementation. I do have some thoughts, though, on how to do
this, if it seems interesting; one possibility is to introduce new
type-checking functions which handle the vref. This would
hopefully cause older C modules which don't expect vrefs to simply
return a type error, until they can be fixed.
Finally, there are some other additional capabilities that this
system could provide. One that seems particularly interesting to
me involves allowing the Python programmer to add a "destructor"
function to a vref -- this Python function would be called
immediately prior to the referenced object being deallocated,
allowing a Python program to invisibly attach itself to another
object and watch for it to disappear. This seems neat, though I
haven't actually come up with any practical uses for it, yet... :)
-- Dianne
Copyright
This document has been placed in the public domain.
pep-0206 Python Advanced Library
| PEP: | 206 |
|---|---|
| Title: | Python Advanced Library |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | A.M. Kuchling <amk at amk.ca> |
| Status: | Withdrawn |
| Type: | Informational |
| Created: | |
| Post-History: | |
Introduction
This PEP describes the Python Advanced Library, a collection of
high-quality and frequently-used third party extension modules.
Batteries Included Philosophy
The Python source distribution has long maintained the philosophy
of "batteries included" -- having a rich and versatile standard
library which is immediately available, without making the user
download separate packages. This gives the Python language a head
start in many projects.
However, the standard library modules aren't always the best
choices for a job. Some library modules were quick hacks
(e.g. calendar, commands), some were designed poorly and are now
near-impossible to fix (cgi), and some have been rendered obsolete
by other, more complete modules (binascii offers the same features
as the binhex, uu, base64 modules). This PEP describes a list of
third-party modules that make Python more competitive for various
application domains, forming the Python Advanced Library.
The deliverable is a set of scripts that will retrieve, build, and
install the packages for a particular application domain. The
Python Package Index now contains enough information to let
software automatically find packages and download them, so the
time is ripe to implement this.
Currently this document doesn't suggest *removing* modules from
the standard library that are superseded by a third-party module.
That's difficult to do because it entails many backward-compatibility
problems, so it's not worth bothering with now.
Please suggest additional domains of interest.
Domain: Web tasks
XML parsing: ElementTree + SAX.
URL retrieval: libcurl? other possibilities?
HTML parsing: mxTidy? HTMLParser?
Async network I/O: Twisted
RDF parser: ???
HTTP serving: ???
HTTP cookie processing: ???
Web framework: A WSGI gateway, perhaps? Paste?
Graphics: PIL, Chaco.
Domain: Scientific Programming
Numeric: Numeric, SciPy
Graphics: PIL, Chaco.
Domain: Application Development
GUI toolkit: ???
Graphics: Reportlab for PDF generation.
Domain: Education
Graphics: PyGame
Software covered by the GNU General Public License
Some of these third-party modules are covered by the GNU General
Public License and the GNU Lesser General Public License.
Providing a script to download and install such packages, or even
assembling all these packages into a single tarball or CD-ROM,
shouldn't cause any difficulties with the GPL, under the "mere
aggregation" clause of the license.
Open Issues
What other application domains are important?
Should this just be a set of Ubuntu or Debian packages? Compiling
things such as PyGame can be very complicated and may be too
difficult to automate.
Acknowledgements
The PEP is based on an earlier draft PEP by Moshe Zadka, titled
"2.0 Batteries Included."
pep-0207 Rich Comparisons
| PEP: | 207 |
|---|---|
| Title: | Rich Comparisons |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Guido van Rossum <guido at python.org>, David Ascher <DavidA at ActiveState.com> |
| Status: | Final |
| Type: | Standards Track |
| Created: | |
| Python-Version: | 2.1 |
| Post-History: | |
Abstract
This PEP proposes several new features for comparisons:
- Allow separate overloading of <, >, <=, >=, ==, !=, both in
classes and in C extensions.
- Allow any of those overloaded operators to return something else
besides a Boolean result.
Motivation
The main motivation comes from NumPy, whose users agree that A<B
should return an array of elementwise comparison outcomes; they
currently have to spell this as less(A,B) because A<B can only
return a Boolean result or raise an exception.
An additional motivation is that frequently, types don't have a
natural ordering, but still need to be compared for equality.
Currently such a type *must* implement comparison and thus define
an arbitrary ordering, just so that equality can be tested.
Also, for some object types an equality test can be implemented
much more efficiently than an ordering test; for example, lists
and dictionaries that differ in length are unequal, but the
ordering requires inspecting some (potentially all) items.
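A sketch of the situation this motivation describes, under Python 3 semantics (which grew out of this PEP's machinery): a hypothetical class defines only equality, and ordering comparisons fail cleanly instead of requiring an arbitrary order:

```python
class Account:
    """Hypothetical type with a natural equality but no natural ordering."""
    def __init__(self, number):
        self.number = number
    def __eq__(self, other):
        if not isinstance(other, Account):
            return NotImplemented
        return self.number == other.number

a, b = Account(7), Account(7)
assert a == b            # equality works without defining an ordering

try:
    a < b                # no __lt__ was defined...
except TypeError:        # ...so ordering fails instead of being arbitrary
    unordered = True
assert unordered
```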
Previous Work
Rich Comparisons have been proposed before; in particular by David
Ascher, after experience with Numerical Python:
http://starship.python.net/crew/da/proposals/richcmp.html
It is also included below as an Appendix. Most of the material in
this PEP is derived from David's proposal.
Concerns
1 Backwards compatibility, both at the Python level (classes using
__cmp__ need not be changed) and at the C level (extensions
defining tp_compare need not be changed, code using
PyObject_Compare() must work even if the compared objects use
the new rich comparison scheme).
2 When A<B returns a matrix of elementwise comparisons, an easy
mistake to make is to use this expression in a Boolean context.
Without special precautions, it would always be true. This use
should raise an exception instead.
3 If a class overrides x==y but nothing else, should x!=y be
computed as not(x==y), or fail? What about the similar
relationship between < and >=, or between > and <=?
4 Similarly, should we allow x<y to be calculated from y>x? And
x<=y from not(x>y)? And x==y from y==x, or x!=y from y!=x?
5 When comparison operators return elementwise comparisons, what
to do about shortcut operators like A<B<C, "A<B and C<D",
"A<B or C<D"?
6 What to do about min() and max(), the 'in' and 'not in'
operators, list.sort(), dictionary key comparison, and other
uses of comparisons by built-in operations?
Proposed Resolutions
1 Full backwards compatibility can be achieved as follows. When
an object defines tp_compare() but not tp_richcompare(), and a
rich comparison is requested, the outcome of tp_compare() is
used in the obvious way. E.g. if "<" is requested, the outcome is
an exception if tp_compare() raises an exception, 1 if
tp_compare() is negative, and 0 if it is zero or positive. Etc.
Full forward compatibility can be achieved as follows. When a
classic comparison is requested on an object that implements
tp_richcompare(), up to three comparisons are used: first == is
tried, and if it returns true, 0 is returned; next, < is tried
and if it returns true, -1 is returned; next, > is tried and if
it returns true, +1 is returned. If any operator tried returns
a non-Boolean value (see below), the exception raised by
conversion to Boolean is passed through. If none of the
operators tried returns true, the classic comparison fallbacks
are tried next.
(I thought long and hard about the order in which the three
comparisons should be tried. At one point I had a convincing
argument for doing it in this order, based on the behavior of
comparisons for cyclical data structures. But since that code
has changed again, I'm not so sure that it makes a difference
any more.)
2 Any type that returns a collection of Booleans instead of a
single boolean should define nb_nonzero() to raise an exception.
Such a type is considered a non-Boolean.
3 The == and != operators are not assumed to be each other's
complement (e.g. IEEE 754 floating point numbers do not satisfy
this). It is up to the type to implement this if desired.
Similar for < and >=, or > and <=; there are lots of examples
where these assumptions aren't true (e.g. tabnanny).
4 The reflexivity rules *are* assumed by Python. Thus, the
interpreter may swap y>x with x<y, y>=x with x<=y, and may swap
the arguments of x==y and x!=y. (Note: Python currently assumes
that x==x is always true and x!=x is never true; this should not
be assumed.)
5 In the current proposal, when A<B returns an array of
elementwise comparisons, this outcome is considered non-Boolean,
and its interpretation as Boolean by the shortcut operators
raises an exception. David Ascher's proposal tries to deal
with this; I don't think this is worth the additional complexity
in the code generator. Instead of A<B<C, you can write
(A<B)&(B<C).
6 The min() and list.sort() operations will only use the
< operator; max() will only use the > operator. The 'in' and
'not in' operators and dictionary lookup will only use the ==
operator.
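Resolution 6 means a class that defines only __lt__ is enough for min() and sorting; a minimal sketch with a hypothetical Card class:

```python
class Card:
    """Hypothetical type defining only the < operator."""
    def __init__(self, rank):
        self.rank = rank
    def __lt__(self, other):
        return self.rank < other.rank

hand = [Card(5), Card(2), Card(9)]
assert min(hand).rank == 2                        # min() uses < only
assert [c.rank for c in sorted(hand)] == [2, 5, 9]  # sort uses < only
```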
Implementation Proposal
This closely follows David Ascher's proposal.
C API
- New functions:
PyObject *PyObject_RichCompare(PyObject *, PyObject *, int)
This performs the requested rich comparison, returning a Python
object or raising an exception. The 3rd argument must be one of
Py_LT, Py_LE, Py_EQ, Py_NE, Py_GT or Py_GE.
int PyObject_RichCompareBool(PyObject *, PyObject *, int)
This performs the requested rich comparison, returning a
Boolean: -1 for exception, 0 for false, 1 for true. The 3rd
argument must be one of Py_LT, Py_LE, Py_EQ, Py_NE, Py_GT or
Py_GE. Note that when PyObject_RichCompare() returns a
non-Boolean object, PyObject_RichCompareBool() will raise an
exception.
- New typedef:
typedef PyObject *(*richcmpfunc) (PyObject *, PyObject *, int);
- New slot in type object, replacing spare tp_xxx7:
richcmpfunc tp_richcompare;
This should be a function with the same signature as
PyObject_RichCompare(), and performing the same comparison.
At least one of the arguments is of the type whose
tp_richcompare slot is being used, but the other may have a
different type. If the function cannot compare the particular
combination of objects, it should return a new reference to
Py_NotImplemented.
- PyObject_Compare() is changed to try rich comparisons if they
are defined (but only if classic comparisons aren't defined).
Changes to the interpreter
- Whenever PyObject_Compare() is called with the intent of getting
the outcome of a particular comparison (e.g. in list.sort(), and
of course for the comparison operators in ceval.c), the code is
changed to call PyObject_RichCompare() or
PyObject_RichCompareBool() instead; if the C code needs to know
the outcome of the comparison, PyObject_IsTrue() is called on
the result (which may raise an exception).
- Most built-in types that currently define a comparison will be
modified to define a rich comparison instead. (This is
optional; I've converted lists, tuples, complex numbers, and
arrays so far, and am not sure whether I will convert others.)
Classes
- Classes can define new special methods __lt__, __le__, __eq__,
__ne__, __gt__, __ge__ to override the corresponding operators.
(I.e., <, <=, ==, !=, >, >=. You gotta love the Fortran
heritage.) If a class defines __cmp__ as well, it is only used
when __lt__ etc. have been tried and return NotImplemented.
Copyright
This document has been placed in the public domain.
Appendix
Here is most of David Ascher's original proposal (version 0.2.1,
dated Wed Jul 22 16:49:28 1998; I've left the Contents, History
and Patches sections out). It addresses almost all concerns
above.
Abstract
A new mechanism allowing comparisons of Python objects to return
values other than -1, 0, or 1 (or raise exceptions) is
proposed. This mechanism is entirely backwards compatible, and can
be controlled at the level of the C PyObject type or of the Python
class definition. There are three cooperating parts to the
proposed mechanism:
- the use of the last slot in the type object structure to store a
pointer to a rich comparison function
- the addition of special methods for classes
- the addition of an optional argument to the builtin cmp()
function.
Motivation
The current comparison protocol for Python objects assumes that
any two Python objects can be compared (as of Python 1.5, object
comparisons can raise exceptions), and that the return value for
any comparison should be -1, 0 or 1. -1 indicates that the first
argument to the comparison function is less than the right one, +1
indicating the contrapositive, and 0 indicating that the two
objects are equal. While this mechanism allows the establishment
of an order relationship (e.g. for use by the sort() method of list
objects), it has proven to be limited in the context of Numeric
Python (NumPy).
Specifically, NumPy allows the creation of multidimensional
arrays, which support most of the numeric operators. Thus:
x = array((1,2,3,4))
y = array((2,2,4,4))
are two NumPy arrays. While they can be added elementwise:
z = x + y # z == array((3,4,7,8))
they cannot be compared in the current framework - the released
version of NumPy compares the pointers, (thus yielding junk
information) which was the only solution before the recent
addition of the ability (in 1.5) to raise exceptions in comparison
functions.
Even with the ability to raise exceptions, the current protocol
makes array comparisons useless. To deal with this fact, NumPy
includes several functions which perform the comparisons: less(),
less_equal(), greater(), greater_equal(), equal(),
not_equal(). These functions return arrays with the same shape as
their arguments (modulo broadcasting), filled with 0's and 1's
depending on whether the comparison is true or not for each
element pair. Thus, for example, using the arrays x and y defined
above:
less(x,y)
would be an array containing the numbers (1,0,1,0).
The current proposal is to modify the Python object interface to
allow the NumPy package to make it so that x < y returns the same
thing as less(x,y). The exact return value is up to the NumPy
package -- what this proposal really asks for is changing the
Python core so that extension objects have the ability to return
something other than -1, 0, 1, should their authors choose to do
so.
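The kind of extension this asks for can be sketched in pure Python (using the machinery as it later shipped) with a hypothetical Vec class whose < returns elementwise results rather than a single Boolean:

```python
class Vec:
    """Hypothetical sequence type with elementwise comparison."""
    def __init__(self, items):
        self.items = list(items)
    def __lt__(self, other):
        # Return something other than a single Boolean: a list of
        # per-element outcomes, as NumPy's less() does for arrays.
        return [a < b for a, b in zip(self.items, other.items)]

x = Vec((1, 2, 3, 4))
y = Vec((2, 2, 4, 4))
assert (x < y) == [True, False, True, False]
```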
Current State of Affairs
The current protocol is, at the C level, that each object type
defines a tp_compare slot, which is a pointer to a function which
takes two PyObject* references and returns -1, 0, or 1. This
function is called by the PyObject_Compare() function defined in
the C API. PyObject_Compare() is also called by the builtin
function cmp() which takes two arguments.
Proposed Mechanism
1. Changes to the C structure for type objects
The last available slot in the PyTypeObject, reserved up to now
for future expansion, is used to optionally store a pointer to a
new comparison function, of type richcmpfunc defined by:
typedef PyObject *(*richcmpfunc)
Py_PROTO((PyObject *, PyObject *, int));
This function takes three arguments. The first two are the objects
to be compared, and the third is an integer corresponding to an
opcode (one of LT, LE, EQ, NE, GT, GE). If this slot is left NULL,
then rich comparison for that object type is not supported (except
for class instances whose class provides the special methods
described below).
The above opcodes need to be added to the published Python/C API
(probably under the names Py_LT, Py_LE, etc.)
2. Additions of special methods for classes
Classes wishing to support the rich comparison mechanisms must add
one or more of the following new special methods:
def __lt__(self, other):
...
def __le__(self, other):
...
def __gt__(self, other):
...
def __ge__(self, other):
...
def __eq__(self, other):
...
def __ne__(self, other):
...
Each of these is called when the class instance is on the
left-hand side of the corresponding operators (<, <=, >, >=, ==,
and != or <>). The argument other is set to the object on the
right side of the operator. The return value of these methods is
up to the class implementor (after all, that's the entire point of
the proposal).
If the object on the left side of the operator does not define an
appropriate rich comparison operator (either at the C level or
with one of the special methods), then the comparison is reversed,
and the right hand operator is called with the opposite operator,
and the two objects are swapped. This assumes that a < b and b > a
are equivalent, as are a <= b and b >= a, and that == and != are
commutative (e.g. a == b if and only if b == a).
For example, if obj1 is an object which supports the rich
comparison protocol and x and y are objects which do not support
the rich comparison protocol, then obj1 < x will call the __lt__
method of obj1 with x as the second argument. x < obj1 will call
obj1's __gt__ method with x as a second argument, and x < y will
just use the existing (non-rich) comparison mechanism.
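The swap rule in the example above can be sketched with two hypothetical classes, one supporting rich comparisons and one not:

```python
calls = []

class Plain:
    pass                           # supports no rich comparisons

class Rich:
    def __lt__(self, other):
        calls.append('__lt__')
        return True
    def __gt__(self, other):
        calls.append('__gt__')
        return True

obj1, x = Rich(), Plain()
assert (obj1 < x) is True          # obj1.__lt__(x) is called directly
assert (x < obj1) is True          # Plain can't compare, so the operands
                                   # are swapped and obj1.__gt__(x) is used
assert calls == ['__lt__', '__gt__']
```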
The above mechanism is such that classes can get away with not
implementing either __lt__ and __le__ or __gt__ and
__ge__. Further smarts could have been added to the comparison
mechanism, but this limited set of allowed "swaps" was chosen
because it doesn't require the infrastructure to do any processing
(negation) of return values. The choice of six special methods was
made over a single (e.g. __richcmp__) method to allow the
dispatching on the opcode to be performed at the level of the C
implementation rather than the user-defined method.
3. Addition of an optional argument to the builtin cmp()
The builtin cmp() is still used for simple comparisons. For rich
comparisons, it is called with a third argument, one of "<", "<=",
">", ">=", "==", "!=", "<>" (the last two have the same
meaning). When called with one of these strings as the third
argument, cmp() can return any Python object. Otherwise, it can
only return -1, 0 or 1 as before.
Chained Comparisons
Problem
It would be nice to allow objects for which the comparison returns
something other than -1, 0, or 1 to be used in chained
comparisons, such as:
x < y < z
Currently, this is interpreted by Python as:
temp1 = x < y
if temp1:
return y < z
else:
return temp1
Note that this requires testing the truth value of the result of
comparisons, with potential "shortcutting" of the right-side
comparison testings. In other words, the truth-value of the result
of the result of the comparison determines the result of a chained
operation. This is problematic in the case of arrays, since if x,
y and z are three arrays, then the user expects:
x < y < z
to be an array of 0's and 1's where 1's are in the locations
corresponding to the elements of y which are between the
corresponding elements in x and z. In other words, the right-hand
side must be evaluated regardless of the result of x < y, which is
incompatible with the mechanism currently in use by the parser.
Solution
Guido mentioned that one possible way out would be to change the
code generated by chained comparisons to allow arrays to be
chained-compared intelligently. What follows is a mixture of his
idea and my suggestions. The code generated for x < y < z would be
equivalent to:
temp1 = x < y
if temp1:
temp2 = y < z
return boolean_combine(temp1, temp2)
else:
return temp1
where boolean_combine is a new function which does something like
the following:
def boolean_combine(a, b):
if hasattr(a, '__boolean_and__') or \
hasattr(b, '__boolean_and__'):
try:
return a.__boolean_and__(b)
except:
return b.__boolean_and__(a)
else: # standard behavior
if a:
return b
else:
return 0
where the __boolean_and__ special method is implemented for
C-level types by another value of the third argument to the
richcmp function. This method would perform a boolean comparison
of the arrays (currently implemented in the umath module as the
logical_and ufunc).
Thus, objects returned by rich comparisons should always test
true, but should define another special method which creates
boolean combinations of them and their argument.
This solution has the advantage of allowing chained comparisons to
work for arrays, but the disadvantage that it requires comparison
arrays to always return true (in an ideal world, I'd have them
always raise an exception on truth testing, since the meaning of
testing "if a>b:" is massively ambiguous).
The inlining already present which deals with integer comparisons
would still apply, resulting in no performance cost for the most
common cases.
pep-0208 Reworking the Coercion Model
| PEP: | 208 |
|---|---|
| Title: | Reworking the Coercion Model |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Neil Schemenauer <nas at arctrix.com>, Marc-AndrĂŠ Lemburg <mal at lemburg.com> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 04-Dec-2000 |
| Python-Version: | 2.1 |
| Post-History: | |
Abstract
Many Python types implement numeric operations. When the arguments of
a numeric operation are of different types, the interpreter tries to
coerce the arguments into a common type. The numeric operation is
then performed using this common type. This PEP proposes a new type
flag to indicate that arguments to a type's numeric operations should
not be coerced. Operations that do not support the supplied types
indicate it by returning a new singleton object. Types which do not
set the type flag are handled in a backwards compatible manner.
Allowing operations to handle different types is often simpler, more
flexible, and faster than having the interpreter do coercion.
Rationale
When implementing numeric or other related operations, it is often
desirable to provide not only operations between operands of one type
only, e.g. integer + integer, but to generalize the idea behind the
operation to other type combinations as well, e.g. integer + float.
A common approach to this mixed type situation is to provide a method
of "lifting" the operands to a common type (coercion) and then use
that type's operand method as execution mechanism. Yet, this strategy
has a few drawbacks:
* the "lifting" process creates at least one new (temporary)
operand object,
* since the coercion method is not being told about the operation
that is to follow, it is not possible to implement operation
specific coercion of types,
* there is no elegant way to solve situations where a common type
is not at hand, and
* the coercion method will always have to be called prior to the
operation's method itself.
A fix for this situation is obviously needed, since these drawbacks
make implementations of types needing these features very cumbersome,
if not impossible. As an example, have a look at the DateTime and
DateTimeDelta[1] types, the first being absolute, the second
relative. You can always add a relative value to an absolute one,
giving a new absolute value. Yet, there is no common type which the
existing coercion mechanism could use to implement that operation.
Currently, PyInstance types are treated specially by the interpreter
in that their numeric methods are passed arguments of different types.
Removing this special case simplifies the interpreter and allows other
types to implement numeric methods that behave like instance types.
This is especially useful for extension types like ExtensionClass.
Specification
Instead of using a central coercion method, the process of handling
different operand types is simply left to the operation. If the
operation finds that it cannot handle the given operand type
combination, it may return a special singleton as indicator.
Note that "numbers" (anything that implements the number protocol, or
part of it) written in Python already use the first part of this
strategy - it is the C level API that we focus on here.
To maintain nearly 100% backward compatibility we have to be very
careful to make numbers that don't know anything about the new
strategy (old style numbers) work just as well as those that expect
the new scheme (new style numbers). Furthermore, binary compatibility
is a must, meaning that the interpreter may only access and use new
style operations if the number indicates the availability of these.
A new style number is considered by the interpreter as such if and
only if it sets the type flag Py_TPFLAGS_CHECKTYPES. The main
difference between an old style number and a new style one is that the
numeric slot functions can no longer assume to be passed arguments of
identical type. New style slots must check all arguments for proper
type and implement the necessary conversions themselves. This may seem
to cause more work for the type implementor, but is in
fact no more difficult than writing the same kind of routines for an
old style coercion slot.
If a new style slot finds that it cannot handle the passed argument
type combination, it may return a new reference of the special
singleton Py_NotImplemented to the caller. This will cause the caller
to try the other operand's operation slots until it finds a slot that
does implement the operation for the specific type combination. If
none of the possible slots succeed, it raises a TypeError.
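This fallback protocol survives in modern Python at the Python level as the NotImplemented singleton returned from reflected operator methods. A minimal sketch of the idea (the Meters and Feet classes are invented here purely for illustration):

```python
class Meters:
    """Toy numeric type that only understands other Meters."""
    def __init__(self, value):
        self.value = value
    def __add__(self, other):
        if isinstance(other, Meters):
            return Meters(self.value + other.value)
        return NotImplemented      # decline; let the other operand try
    __radd__ = __add__

class Feet:
    """Toy numeric type that also understands Meters."""
    def __init__(self, value):
        self.value = value
    def __add__(self, other):
        if isinstance(other, Feet):
            return Feet(self.value + other.value)
        if isinstance(other, Meters):
            return Feet(self.value + other.value / 0.3048)
        return NotImplemented
    __radd__ = __add__

# Meters.__add__ declines, so the interpreter tries Feet.__radd__:
total = Meters(1) + Feet(3)
print(type(total).__name__)   # Feet
```

If every slot declines, the interpreter raises the TypeError described above.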
To make the implementation easy to understand (the whole topic is
esoteric enough), a new layer in the handling of numeric operations is
introduced. This layer takes care of all the different cases that need
to be taken into account when dealing with all the possible
combinations of old and new style numbers. It is implemented by the
two static functions binary_op() and ternary_op(), which are both
internal functions that only the functions in Objects/abstract.c
have access to. The numeric API (PyNumber_*) is easy to adapt to
this new layer.
As a side-effect all numeric slots can be NULL-checked (this has to be
done anyway, so the added feature comes at no extra cost).
The scheme used by the layer to execute a binary operation is as
follows:
v | w | Action taken
---------+------------+----------------------------------
new | new | v.op(v,w), w.op(v,w)
new | old | v.op(v,w), coerce(v,w), v.op(v,w)
old | new | w.op(v,w), coerce(v,w), v.op(v,w)
old | old | coerce(v,w), v.op(v,w)
The indicated action sequence is executed from left to right until
either the operation succeeds and a valid result (!=
Py_NotImplemented) is returned or an exception is raised. Exceptions
are returned to the calling function as-is. If a slot returns
Py_NotImplemented, the next item in the sequence is executed.
Note that coerce(v,w) will use the old style nb_coerce slot methods
via a call to PyNumber_Coerce().
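The left-to-right action sequence can be modeled at the Python level. In the sketch below, the NOT_IMPLEMENTED sentinel and the two slot functions are invented stand-ins for the C-level machinery:

```python
NOT_IMPLEMENTED = object()   # stand-in for the Py_NotImplemented singleton

def binary_op(v, w, slots):
    """Try each candidate slot left to right until one returns a
    valid result; raise TypeError if every slot declines."""
    for slot in slots:
        result = slot(v, w)
        if result is not NOT_IMPLEMENTED:
            return result
    raise TypeError("unsupported operand types")

def int_add(v, w):           # a slot that only handles int + int
    if isinstance(v, int) and isinstance(w, int):
        return v + w
    return NOT_IMPLEMENTED

def float_add(v, w):         # a slot that also accepts mixed int/float
    if isinstance(v, (int, float)) and isinstance(w, (int, float)):
        return float(v) + float(w)
    return NOT_IMPLEMENTED

# int_add declines the mixed case, so dispatch falls through:
print(binary_op(2, 3.5, [int_add, float_add]))   # 5.5
```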
Ternary operations have a few more cases to handle:
v | w | z | Action taken
----+-----+-----+------------------------------------
new | new | new | v.op(v,w,z), w.op(v,w,z), z.op(v,w,z)
new | old | new | v.op(v,w,z), z.op(v,w,z), coerce(v,w,z), v.op(v,w,z)
old | new | new | w.op(v,w,z), z.op(v,w,z), coerce(v,w,z), v.op(v,w,z)
old | old | new | z.op(v,w,z), coerce(v,w,z), v.op(v,w,z)
new | new | old | v.op(v,w,z), w.op(v,w,z), coerce(v,w,z), v.op(v,w,z)
new | old | old | v.op(v,w,z), coerce(v,w,z), v.op(v,w,z)
old | new | old | w.op(v,w,z), coerce(v,w,z), v.op(v,w,z)
old | old | old | coerce(v,w,z), v.op(v,w,z)
The same notes as above, except that coerce(v,w,z) actually does:
    if z != Py_None:
        coerce(v,w), coerce(v,z), coerce(w,z)
    else:
        # treat z as absent variable
        coerce(v,w)
The current implementation uses this scheme already (there's only one
ternary slot: nb_pow(a,b,c)).
Note that the numeric protocol is also used for some other related
tasks, e.g. sequence concatenation. These can also benefit from the
new mechanism by implementing right-hand operations for type
combinations that would otherwise fail to work. As an example, take
string concatenation: currently you can only do string + string. With
the new mechanism, a new string-like type could implement new_type +
string and string + new_type, even though strings don't know anything
about new_type.
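In today's Python the same effect is achieved through reflected methods; a sketch with an invented string-like Tag type that str knows nothing about:

```python
class Tag:
    """Hypothetical string-like type, invented for illustration."""
    def __init__(self, text):
        self.text = text
    def __add__(self, other):       # Tag + str
        if isinstance(other, str):
            return Tag(self.text + other)
        return NotImplemented
    def __radd__(self, other):      # str + Tag: str declines, we handle it
        if isinstance(other, str):
            return Tag(other + self.text)
        return NotImplemented

print(("<" + Tag("b") + ">").text)  # <b>
```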
Since comparisons also rely on coercion (every time you compare an
integer to a float, the integer is first converted to float and then
compared...), a new slot to handle numeric comparisons is needed:
PyObject *nb_cmp(PyObject *v, PyObject *w)
This slot should compare the two objects and return an integer object
stating the result. Currently, this result integer may only be -1, 0,
1. If the slot cannot handle the type combination, it may return a
reference to Py_NotImplemented. [XXX Note that this slot is still
in flux since it should take into account rich comparisons
(i.e. PEP 207).]
Numeric comparisons are handled by a new numeric protocol API:
PyObject *PyNumber_Compare(PyObject *v, PyObject *w)
This function compares the two objects as "numbers" and returns an
integer object stating the result. Currently, this result integer may
only be -1, 0, 1. In case the operation cannot be handled by the given
objects, a TypeError is raised.
The PyObject_Compare() API needs to be adjusted accordingly to make use
of this new API.
Other changes include adapting some of the built-in functions (e.g.
cmp()) to use this API as well. Also, PyNumber_CoerceEx() will need to
check for new style numbers before calling the nb_coerce slot. New
style numbers don't provide a coercion slot and thus cannot be
explicitly coerced.
Reference Implementation
A preliminary patch for the CVS version of Python is available through
the Source Forge patch manager[2].
Credits
This PEP and the patch are heavily based on work done by Marc-André
Lemburg[3].
Copyright
This document has been placed in the public domain.
References
[1] http://www.lemburg.com/files/python/mxDateTime.html
[2] http://sourceforge.net/patch/?func=detailpatch&patch_id=102652&group_id=5470
[3] http://www.lemburg.com/files/python/CoercionProposal.html
pep-0209 Multi-dimensional Arrays
| PEP: | 209 |
|---|---|
| Title: | Multi-dimensional Arrays |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Paul Barrett <barrett at stsci.edu>, Travis Oliphant <oliphant at ee.byu.edu> |
| Status: | Withdrawn |
| Type: | Standards Track |
| Created: | 03-Jan-2001 |
| Python-Version: | 2.2 |
| Post-History: |
Abstract
This PEP proposes a redesign and re-implementation of the multi-
dimensional array module, Numeric, to make it easier to add new
features and functionality to the module. Aspects of Numeric 2
that will receive special attention are efficient access to arrays
exceeding a gigabyte in size and composed of inhomogeneous data
structures or records. The proposed design uses four Python
classes: ArrayType, UFunc, Array, and ArrayView; and a low-level
C-extension module, _ufunc, to handle the array operations
efficiently. In addition, each array type has its own C-extension
module which defines the coercion rules, operations, and methods
for that type. This design enables new types, features, and
functionality to be added in a modular fashion. The new version
will introduce some incompatibilities with the current Numeric.
Motivation
Multi-dimensional arrays are commonly used to store and manipulate
data in science, engineering, and computing. Python currently has
an extension module, named Numeric (henceforth called Numeric 1),
which provides a satisfactory set of functionality for users
manipulating homogeneous arrays of data of moderate size (of order
10 MB). For access to larger arrays (of order 100 MB or more) of
possibly inhomogeneous data, the implementation of Numeric 1 is
inefficient and cumbersome. In the future, requests by the
Numerical Python community for additional functionality are also
likely, as PEP 211 (Adding New Linear Operators to Python) and
PEP 225 (Elementwise/Objectwise Operators) illustrate.
Proposal
This proposal recommends a re-design and re-implementation of
Numeric 1, henceforth called Numeric 2, which will enable new
types, features, and functionality to be added in an easy and
modular manner. The initial design of Numeric 2 should focus on
providing a generic framework for manipulating arrays of various
types and should enable a straightforward mechanism for adding new
array types and UFuncs. Functional methods that are more specific
to various disciplines can then be layered on top of this core.
This new module will still be called Numeric and most of the
behavior found in Numeric 1 will be preserved.
The proposed design uses four Python classes: ArrayType, UFunc,
Array, and ArrayView; and a low-level C-extension module to handle
the array operations efficiently. In addition, each array type
has its own C-extension module which defines the coercion rules,
operations, and methods for that type. At a later date, when core
functionality is stable, some Python classes can be converted to
C-extension types.
Some planned features are:
1. Improved memory usage
This feature is particularly important when handling large arrays
and can produce significant improvements in performance as well as
memory usage. We have identified several areas where memory usage
can be improved:
a. Use a local coercion model
Instead of using Python's global coercion model which creates
temporary arrays, Numeric 2, like Numeric 1, will implement a
local coercion model as described in PEP 208 which defers the
responsibility of coercion to the operator. By using internal
buffers, a coercion operation can be done for each array
(including output arrays), if necessary, at the time of the
operation. Benchmarks [1] have shown that performance is at
most degraded only slightly and is improved in cases where the
internal buffers are less than the L2 cache size and the
processor is under load. To avoid array coercion altogether,
C functions having arguments of mixed type are allowed in
Numeric 2.
b. Avoid creation of temporary arrays
In complex array expressions (i.e. having more than one
operation), each operation will create a temporary array which
will be used and then deleted by the succeeding operation. A
better approach would be to identify these temporary arrays
and reuse their data buffers when possible, namely when the
array shape and type are the same as the temporary array being
created. This can be done by checking the temporary array's
reference count. If it is 1, then it will be deleted once the
operation is done and is a candidate for reuse.
c. Optional use of memory-mapped files
Numeric users sometimes need to access data from very large
files or to handle data that is greater than the available
memory. Memory-mapped arrays provide a mechanism to do this
by storing the data on disk while making it appear to be in
memory. Memory-mapped arrays should improve access to all
files by eliminating one of two copy steps during a file
access. Numeric should be able to access in-memory and
memory-mapped arrays transparently.
d. Record access
In some fields of science, data is stored in files as binary
records. For example, in astronomy, photon data is stored as a
1-dimensional list of photons in order of arrival time. These
records or C-like structures contain information about the
detected photon, such as its arrival time, its position on the
detector, and its energy. Each field may be of a different
type, such as char, int, or float. Such arrays introduce new
issues that must be dealt with, in particular byte alignment
or byte swapping may need to be performed for the numeric
values to be properly accessed (though byte swapping is also
an issue for memory mapped data). Numeric 2 is designed to
automatically handle alignment and representational issues
when data is accessed or operated on. There are two
approaches to implementing records; as either a derived array
class or a special array type, depending on your point-of-
view. We defer this discussion to the Open Issues section.
2. Additional array types
Numeric 1 has 11 defined types: char, ubyte, sbyte, short, int,
long, float, double, cfloat, cdouble, and object. There are no
ushort, uint, or ulong types, nor are there more complex types
such as a bit type which is of use to some fields of science and
possibly for implementing masked-arrays. The design of Numeric 1
makes the addition of these and other types a difficult and
error-prone process. To enable the easy addition (and deletion)
of new array types such as a bit type described below, a re-design
of Numeric is necessary.
a. Bit type
The result of a rich comparison between arrays is an array of
boolean values. The result can be stored in an array of type
char, but this is an unnecessary waste of memory. A better
implementation would use a bit or boolean type, compressing
the array size by a factor of eight. This is currently being
implemented for Numeric 1 (by Travis Oliphant) and should be
included in Numeric 2.
3. Enhanced array indexing syntax
The extended slicing syntax was added to Python to provide greater
flexibility when manipulating Numeric arrays by allowing
step-sizes greater than 1. This syntax works well as a shorthand
for a list of regularly spaced indices. For those situations
where a list of irregularly spaced indices are needed, an enhanced
array indexing syntax would allow 1-D arrays to be arguments.
4. Rich comparisons
The implementation of PEP 207: Rich Comparisons in Python 2.1
provides additional flexibility when manipulating arrays. We
intend to implement this feature in Numeric 2.
5. Array broadcasting rules
When an operation between a scalar and an array is done, the
implied behavior is to create a new array having the same shape as
the array operand containing the scalar value. This is called
array broadcasting. It also works with arrays of lesser rank,
such as vectors. This implicit behavior is implemented in Numeric
1 and will also be implemented in Numeric 2.
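The scalar case of these broadcasting rules can be sketched with plain Python lists standing in for arrays (the function name is invented):

```python
def broadcast_add(a, b):
    """Add two operands, treating a scalar operand as if it were
    expanded to the shape of the list operand (toy model of
    array broadcasting)."""
    a_is_list = isinstance(a, list)
    b_is_list = isinstance(b, list)
    if a_is_list and not b_is_list:
        return [x + b for x in a]       # scalar on the right
    if b_is_list and not a_is_list:
        return [a + x for x in b]       # scalar on the left
    if len(a) != len(b):
        raise ValueError("shape mismatch")
    return [x + y for x, y in zip(a, b)]

print(broadcast_add([1, 2, 3], 10))   # [11, 12, 13]
```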
Design and Implementation
The design of Numeric 2 has four primary classes:
1. ArrayType:
This is a simple class that describes the fundamental properties
of an array-type, e.g. its name, its size in bytes, its coercion
relations with respect to other types, etc., e.g.
> Int32 = ArrayType('Int32', 4, 'doc-string')
Its relation to the other types is defined when the C-extension
module for that type is imported. The corresponding Python code
is:
> Int32.astype[Real64] = Real64
This says that the Real64 array-type has higher priority than the
Int32 array-type.
The following attributes and methods are proposed for the core
implementation. Additional attributes can be added on an
individual basis, e.g. .bitsize or .bitstrides for the bit type.
Attributes:
.name: e.g. "Int32", "Float64", etc.
.typecode: e.g. 'i', 'f', etc.
(for backward compatibility)
.size (in bytes): e.g. 4, 8, etc.
.array_rules (mapping): rules between array types
.pyobj_rules (mapping): rules between array and python types
.doc: documentation string
Methods:
__init__(): initialization
__del__(): destruction
__repr__(): representation
C-API:
This still needs to be fleshed-out.
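Following the examples given above, a minimal Python sketch of such a class (the .astype mapping name is taken from the example; the rest of the layout is assumed):

```python
class ArrayType:
    """Minimal sketch of the proposed ArrayType class."""
    def __init__(self, name, size, doc):
        self.name = name          # e.g. "Int32"
        self.size = size          # element size in bytes
        self.doc = doc
        self.astype = {}          # coercion rules relative to other types
    def __repr__(self):
        return "ArrayType(%r, %d)" % (self.name, self.size)

Int32 = ArrayType('Int32', 4, 'doc-string')
Real64 = ArrayType('Real64', 8, 'doc-string')
# Real64 has higher priority than Int32:
Int32.astype[Real64] = Real64
print(Int32.astype[Real64].name)   # Real64
```

In the actual proposal this registration would happen when each type's C-extension module is imported.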
2. UFunc:
This class is the heart of Numeric 2. Its design is similar to
that of ArrayType in that the UFunc creates a singleton callable
object whose attributes are name, total and input number of
arguments, a document string, and an empty CFunc dictionary; e.g.
> add = UFunc('add', 3, 2, 'doc-string')
When defined, the add instance has no C functions associated with
it and therefore can do no work. The CFunc dictionary is
populated or registered later when the C-extension module for an
array-type is imported. The arguments of the register method are:
function name, function descriptor, and the CUFunc object. The
corresponding Python code is
> add.register('add', (Int32, Int32, Int32), cfunc-add)
In the initialization function of an array type module, e.g.
Int32, there are two C API functions: one to initialize the
coercion rules and the other to register the CFunc objects.
When an operation is applied to some arrays, the __call__ method
is invoked. It gets the type of each array (if the output array
is not given, it is created from the coercion rules) and checks
the CFunc dictionary for a key that matches the argument types.
If it exists the operation is performed immediately, otherwise the
coercion rules are used to search for a related operation and set
of conversion functions. The __call__ method then invokes a
compute method written in C to iterate over slices of each array,
namely:
> _ufunc.compute(slice, data, func, swap, conv)
The 'func' argument is a CFuncObject, while the 'swap' and 'conv'
arguments are lists of CFuncObjects for those arrays needing pre-
or post-processing, otherwise None is used. The data argument is
a list of buffer objects, and the slice argument gives the number
of iterations for each dimension along with the buffer offset and
step size for each array and each dimension.
We have predefined several UFuncs for use by the __call__ method:
cast, swap, getobj, and setobj. The cast and swap functions do
coercion and byte-swapping, respectively, and the getobj and setobj
functions do coercion between Numeric arrays and Python sequences.
The following attributes and methods are proposed for the core
implementation.
Attributes:
.name: e.g. "add", "subtract", etc.
.nargs: number of total arguments
.iargs: number of input arguments
.cfuncs (mapping): the set of C functions
.doc: documentation string
Methods:
__init__(): initialization
__del__(): destruction
__repr__(): representation
__call__(): look-up and dispatch method
initrule(): initialize coercion rule
uninitrule(): uninitialize coercion rule
register(): register a CUFunc
unregister(): unregister a CUFunc
C-API:
This still needs to be fleshed-out.
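The registration and look-up behaviour described above can be sketched in Python. The real proposal keys the CFunc dictionary on full signatures like (Int32, Int32, Int32) registered from C-extension modules; this simplified sketch keys on input types only and uses plain Python callables in place of CFunc objects:

```python
class UFunc:
    """Sketch of the proposed UFunc dispatch: a name, argument
    counts, and a dictionary mapping type signatures to functions."""
    def __init__(self, name, nargs, iargs, doc):
        self.name, self.nargs, self.iargs, self.doc = name, nargs, iargs, doc
        self.cfuncs = {}
    def register(self, name, signature, cfunc):
        self.cfuncs[signature] = cfunc
    def __call__(self, *args):
        # Look up a function matching the argument types.
        signature = tuple(type(a) for a in args)
        try:
            cfunc = self.cfuncs[signature]
        except KeyError:
            raise TypeError("no registered function for %r" % (signature,))
        return cfunc(*args)

add = UFunc('add', 3, 2, 'doc-string')
add.register('add', (int, int), lambda a, b: a + b)
print(add(2, 3))   # 5
```

The real __call__ would additionally consult the coercion rules when no exact match exists, as described above.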
3. Array:
This class contains information about the array, such as shape,
type, endian-ness of the data, etc. Its operators, '+', '-',
etc. just invoke the corresponding UFunc function, e.g.
> def __add__(self, other):
> return ufunc.add(self, other)
The following attributes, methods, and functions are proposed for
the core implementation.
Attributes:
.shape: shape of the array
.format: type of the array
.real (only complex): real part of a complex array
.imag (only complex): imaginary part of a complex array
Methods:
__init__(): initialization
__del__(): destruction
__repr__(): representation
__str__(): pretty representation
__cmp__(): rich comparison
__len__():
__getitem__():
__setitem__():
__getslice__():
__setslice__():
numeric methods:
copy(): copy of array
aslist(): create list from array
asstring(): create string from array
Functions:
fromlist(): create array from sequence
fromstring(): create array from string
array(): create array with shape and value
concat(): concatenate two arrays
resize(): resize array
C-API:
This still needs to be fleshed-out.
4. ArrayView
This class is similar to the Array class except that the reshape
and flat methods will raise exceptions, since non-contiguous
arrays cannot be reshaped or flattened using just pointer and
step-size information.
C-API:
This still needs to be fleshed-out.
5. C-extension modules:
Numeric 2 will have several C-extension modules.
a. _ufunc:
The primary module of this set is the _ufuncmodule.c. The
intention of this module is to do the bare minimum,
i.e. iterate over arrays using a specified C function. The
interface of these functions is the same as Numeric 1, i.e.
int (*CFunc)(char *data, int *steps, int repeat, void *func);
and their functionality is expected to be the same, i.e. they
iterate over the inner-most dimension.
The following attributes and methods are proposed for the core
implementation.
Attributes:
Methods:
compute():
C-API:
This still needs to be fleshed-out.
b. _int32, _real64, etc.:
There will also be C-extension modules for each array type,
e.g. _int32module.c, _real64module.c, etc. As mentioned
previously, when these modules are imported by the UFunc
module, they will automatically register their functions and
coercion rules. New or improved versions of these modules can
be easily implemented and used without affecting the rest of
Numeric 2.
Open Issues
1. Does slicing syntax default to copy or view behavior?
The default behavior of Python is to return a copy of a sub-list
or tuple when slicing syntax is used, whereas Numeric 1 returns a
view into the array. The choice made for Numeric 1 is apparently
for reasons of performance: the developers wish to avoid the
penalty of allocating and copying the data buffer during each
array operation and feel that the need for a deep copy of an array
is rare. Yet, some have argued that Numeric's slice notation
should also have copy behavior to be consistent with Python lists.
In this case the performance penalty associated with copy behavior
can be minimized by implementing copy-on-write. This scheme has
both arrays sharing one data buffer (as in view behavior) until
either array is assigned new data at which point a copy of the
data buffer is made. View behavior would then be implemented by
an ArrayView class, whose behavior would be similar to Numeric 1 arrays,
i.e. .shape is not settable for non-contiguous arrays. The use of
an ArrayView class also makes explicit what type of data the array
contains.
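The copy-on-write scheme mentioned above can be sketched in a few lines of Python. The COWArray class is invented for illustration: two handles share one buffer until the first write, at which point the writer takes a private copy:

```python
class COWArray:
    """Toy copy-on-write array: copies share one data buffer until
    the first write to either object."""
    def __init__(self, data):
        self._data = data
        self._refs = [self]            # handles sharing this buffer
    def copy(self):
        other = COWArray.__new__(COWArray)
        other._data = self._data       # share the buffer; no copy yet
        other._refs = self._refs
        self._refs.append(other)
        return other
    def __getitem__(self, i):
        return self._data[i]
    def __setitem__(self, i, value):
        if len(self._refs) > 1:        # buffer is shared:
            self._refs.remove(self)
            self._data = list(self._data)   # make a private copy now
            self._refs = [self]
        self._data[i] = value

a = COWArray([0, 1, 2])
b = a.copy()          # no data copied yet
b[0] = 99             # triggers the copy
print(a[0], b[0])     # 0 99
```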
2. Does item syntax default to copy or view behavior?
A similar question arises with the item syntax. For example, if a
= [[0,1,2], [3,4,5]] and b = a[0], then changing b[0] also changes
a[0][0], because a[0] is a reference or view of the first row of
a. Therefore, if c is a 2-d array, it would appear that c[i]
should return a 1-d array which is a view into, instead of a copy
of, c for consistency. Yet, c[i] can be considered just a
shorthand for c[i,:] which would imply copy behavior assuming
slicing syntax returns a copy. Should Numeric 2 behave the same
way as lists and return a view, or should it return a copy?
3. How is scalar coercion implemented?
Python has fewer numeric types than Numeric, which can cause
coercion problems. For example when multiplying a Python scalar
of type float and a Numeric array of type float, the Numeric array
is converted to a double, since the Python float type is actually
a double. This is often not the desired behavior, since the
Numeric array will be doubled in size, which is likely to be
annoying, particularly for very large arrays. We prefer that the
array type trumps the Python type for the same type class, namely
integer, float, and complex. Therefore an operation between a
Python integer and an Int16 (short) array will return an Int16
array, whereas an operation between a Python float and an Int16
array would return a Float64 (double) array. Operations between
two arrays use normal coercion rules.
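The preferred rule can be sketched as a small type-resolution function. The type names follow the text; the function name and table are invented for illustration:

```python
# Type class of each array type (names taken from the text).
TYPE_CLASS = {'Int16': 'integer', 'Int32': 'integer',
              'Float32': 'float', 'Float64': 'float'}

def result_type(array_type, py_scalar):
    """Within the same type class the array type wins; a Python
    float promotes an integer array to Float64."""
    scalar_class = 'integer' if isinstance(py_scalar, int) else 'float'
    if TYPE_CLASS[array_type] == scalar_class:
        return array_type          # array type trumps the Python type
    return 'Float64'               # cross-class: promote to double

print(result_type('Int16', 3))     # Int16
print(result_type('Int16', 3.0))   # Float64
```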
4. How is integer division handled?
In a future version of Python, the behavior of integer division
will change. The operands will be converted to floats, so the
result will be a float. If we implement the proposed scalar
coercion rules where arrays have precedence over Python scalars,
then dividing an array by an integer will return an integer array
and will not be consistent with a future version of Python which
would return an array of type double. Scientific programmers are
familiar with the distinction between integer and floating-point
division, so should Numeric 2 continue with this behavior?
5. How should records be implemented?
There are two approaches to implementing records depending on your
point-of-view. The first is to divide arrays into separate
classes depending on the behavior of their types. For example
numeric arrays are one class, strings a second, and records a
third, because the range and type of operations of each class
differ. As such, a record array is not a new type, but a
mechanism for a more flexible form of array. To easily access and
manipulate such complex data, the class is comprised of numeric
arrays having different byte offsets into the data buffer. For
example, one might have a table consisting of an array of Int16,
Real32 values. Two numeric arrays, one with an offset of 0 bytes
and a stride of 6 bytes to be interpreted as Int16, and one with an
offset of 2 bytes and a stride of 6 bytes to be interpreted as
Real32 would represent the record array. Both numeric arrays
would refer to the same data buffer, but have different offset and
stride attributes, and a different numeric type.
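The layout just described (an Int16 field at offset 0 and a Real32 field at offset 2, both with a 6-byte stride into one shared buffer) can be sketched with the struct module; the record values are invented:

```python
import struct

# Build a buffer of interleaved (Int16, Real32) records, 6 bytes each.
records = [(1, 1.5), (2, 2.5), (3, 3.5)]
buf = b''.join(struct.pack('<hf', i, x) for i, x in records)

stride = 6
# "Int16 field": offset 0, stride 6 into the shared buffer.
int_field = [struct.unpack_from('<h', buf, off)[0]
             for off in range(0, len(buf), stride)]
# "Real32 field": offset 2, stride 6 into the same buffer.
real_field = [struct.unpack_from('<f', buf, off + 2)[0]
              for off in range(0, len(buf), stride)]
print(int_field)   # [1, 2, 3]
```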
The second approach is to consider a record as one of many array
types, albeit with fewer, and possibly different, array operations
than for numeric arrays. This approach considers an array type to
be a mapping of a fixed-length string. The mapping can either be
simple, like integer and floating-point numbers, or complex, like
a complex number, a byte string, and a C-structure. The record
type effectively merges the struct and Numeric modules into a
multi-dimensional struct array. This approach implies certain
changes to the array interface. For example, the 'typecode'
keyword argument should probably be changed to the more
descriptive 'format' keyword.
a. How are record semantics defined and implemented?
Whichever implementation approach is taken for records, the
syntax and semantics of how they are to be accessed and
manipulated must be decided, if one wishes to have access to
sub-fields of records. In this case, the record type can
essentially be considered an inhomogeneous list, like a tuple
returned by the unpack method of the struct module; and a 1-d
array of records may be interpreted as a 2-d array with the
second dimension being the index into the list of fields.
These enhanced array semantics make access to an array of one
or more of the fields easy and straightforward. It also
allows a user to do array operations on a field in a natural
and intuitive way. If we assume that records are implemented
as an array type, then the last dimension defaults to 0 and can
therefore be neglected for arrays comprised of simple types,
like numeric.
6. How are masked-arrays implemented?
Masked-arrays in Numeric 1 are implemented as a separate array
class. With the ability to add new array types to Numeric 2, it
is possible that masked-arrays in Numeric 2 could be implemented
as a new array type instead of an array class.
7. How are numerical errors handled (IEEE floating-point errors in
particular)?
It is not clear to the proposers (Paul Barrett and Travis
Oliphant) what is the best or preferred way of handling errors.
Most of the C functions that do the operation iterate over
the inner-most (last) dimension of the array. This dimension
could contain a thousand or more items having one or more errors
of differing type, such as divide-by-zero, underflow, and
overflow. Additionally, keeping track of these errors may come at
the expense of performance. Therefore, we suggest several
options:
a. Print a message of the most severe error, leaving it to
the user to locate the errors.
b. Print a message of all errors that occurred and the number
of occurrences, leaving it to the user to locate the errors.
c. Print a message of all errors that occurred and a list of
where they occurred.
d. Or use a hybrid approach, printing only the most severe
error, yet keeping track of what and where the errors
occurred. This would allow the user to locate the errors
while keeping the error message brief.
8. What features are needed to ease the integration of FORTRAN
libraries and code?
It would be a good idea at this stage to consider how to ease the
integration of FORTRAN libraries and user code in Numeric 2.
Implementation Steps
1. Implement basic UFunc capability
a. Minimal Array class:
Necessary class attributes and methods, e.g. .shape, .data,
.type, etc.
b. Minimal ArrayType class:
Int32, Real64, Complex64, Char, Object
c. Minimal UFunc class:
UFunc instantiation, CFunction registration, UFunc call for
1-D arrays including the rules for doing alignment,
byte-swapping, and coercion.
d. Minimal C-extension module:
_UFunc, which does the innermost array loop in C.
This step implements whatever is needed to do: 'c = add(a, b)'
where a, b, and c are 1-D arrays. It teaches us how to add
new UFuncs, to coerce the arrays, to pass the necessary
information to a C iterator method, and to do the actual
computation.
2. Continue enhancing the UFunc iterator and Array class
a. Implement some access methods for the Array class:
print, repr, getitem, setitem, etc.
b. Implement multidimensional arrays
c. Implement some of the basic Array methods using UFuncs:
+, -, *, /, etc.
d. Enable UFuncs to use Python sequences.
3. Complete the standard UFunc and Array class behavior
a. Implement getslice and setslice behavior
b. Work on Array broadcasting rules
c. Implement Record type
4. Add additional functionality
a. Add more UFuncs
b. Implement buffer or mmap access
Incompatibilities
The following is a list of incompatibilities in behavior between
Numeric 1 and Numeric 2.
1. Scalar coercion rules
Numeric 1 has a single set of coercion rules for array and Python
numeric types. This can cause unexpected and annoying problems
during the calculation of an array expression. Numeric 2 intends
to overcome these problems by having two sets of coercion rules:
one for arrays and Python numeric types, and another just for
arrays.
2. No savespace attribute
The savespace attribute in Numeric 1 makes arrays with this
attribute set take precedence over those that do not have it set.
Numeric 2 will not have such an attribute and therefore normal
array coercion rules will be in effect.
3. Slicing syntax returns a copy
The slicing syntax in Numeric 1 returns a view into the original
array. The slicing behavior for Numeric 2 will be a copy. You
should use the ArrayView class to get a view into an array.
4. Boolean comparisons return a boolean array
A comparison between arrays in Numeric 1 results in a Boolean
scalar, because of current limitations in Python. The advent of
Rich Comparisons in Python 2.1 will allow an array of Booleans to
be returned.
5. Type characters are deprecated
Numeric 2 will have an ArrayType class composed of Type instances,
for example Int8, Int16, Int32, and Int for signed integers. The
typecode scheme in Numeric 1 will be available for backward
compatibility, but will be deprecated.
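Incompatibility 4 hinges on rich comparisons (PEP 207). A minimal sketch, using a hypothetical ElemArray class, of a comparison that returns a Boolean array rather than a single scalar:

```python
# Sketch of incompatibility 4: with rich comparisons (PEP 207, Python
# 2.1), __gt__ can return an array of Booleans instead of one scalar.
# ElemArray is an illustrative stand-in for a Numeric 2 array.
class ElemArray:
    def __init__(self, data):
        self.data = list(data)
    def __gt__(self, other):
        # Elementwise: one Boolean per pair of corresponding elements.
        return ElemArray([x > y for x, y in zip(self.data, other.data)])
```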
Appendices
A. Implicit sub-arrays iteration
A computer animation is composed of a number of 2-D images or
frames of identical shape. By stacking these images into a single
block of memory, a 3-D array is created. Yet the operations to be
performed are not meant for the entire 3-D array, but on the set
of 2-D sub-arrays. In most array languages, each frame has to be
extracted, operated on, and then reinserted into the output array
using a for-like loop. The J language allows the programmer to
perform such operations implicitly by having a rank for the frame
and array. By default these ranks will be the same during the
creation of the array. It was the intention of the Numeric 1
developers to implement this feature, since it is based on the
language J. The Numeric 1 code has the required variables for
implementing this behavior, but the feature was never implemented. We intend
to implement implicit sub-array iteration in Numeric 2, if the
array broadcasting rules found in Numeric 1 do not fully support
this behavior.
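The frame-wise iteration described above can be sketched with plain nested lists; apply_framewise is a hypothetical helper standing in for the implicit iteration Numeric 2 would perform:

```python
# Sketch of Appendix A: treat a "3-D array" as a stack of 2-D frames
# and apply a 2-D operation to each frame, hiding the outer for-loop
# from the caller. apply_framewise is illustrative, not a Numeric API.
def apply_framewise(stack, op):
    return [op(frame) for frame in stack]

def scale_frame(frame, factor=2):
    # A 2-D operation applied to one frame at a time.
    return [[x * factor for x in row] for row in frame]
```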
Copyright
This document is placed in the public domain.
Related PEPs
PEP 207: Rich Comparisons
by Guido van Rossum and David Ascher
PEP 208: Reworking the Coercion Model
by Neil Schemenauer and Marc-André Lemburg
PEP 211: Adding New Linear Algebra Operators to Python
by Greg Wilson
PEP 225: Elementwise/Objectwise Operators
by Huaiyu Zhu
PEP 228: Reworking Python's Numeric Model
by Moshe Zadka
References
[1] P. Greenfield 2000. private communication.
pep-0210 Decoupling the Interpreter Loop
| PEP: | 210 |
|---|---|
| Title: | Decoupling the Interpreter Loop |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | David Ascher <davida at activestate.com> |
| Status: | Rejected |
| Type: | Standards Track |
| Created: | 15-Jul-2000 |
| Python-Version: | 2.1 |
| Post-History: |
pep-0211 Adding A New Outer Product Operator
| PEP: | 211 |
|---|---|
| Title: | Adding A New Outer Product Operator |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Greg Wilson <gvwilson at ddj.com> |
| Status: | Deferred |
| Type: | Standards Track |
| Created: | 15-Jul-2000 |
| Python-Version: | 2.1 |
| Post-History: |
Introduction
This PEP describes a proposal to define "@" (pronounced "across")
as a new outer product operator in Python 2.2. When applied to
sequences (or other iterable objects), this operator will combine
their iterators, so that:
for (i, j) in S @ T:
pass
will be equivalent to:
for i in S:
for j in T:
pass
Classes will be able to overload this operator using the special
methods "__across__", "__racross__", and "__iacross__". In
particular, the new Numeric module (PEP 0209) will overload this
operator for multi-dimensional arrays to implement matrix
multiplication.
Background
Number-crunching is now just a small part of computing, but many
programmers --- including many Python users --- still need to
express complex mathematical operations in code. Most numerical
languages, such as APL, Fortran-90, MATLAB, IDL, and Mathematica,
therefore provide two forms of the common arithmetic operators.
One form works element-by-element, e.g. multiplies corresponding
elements of its matrix arguments. The other implements the
"mathematical" definition of that operation, e.g. performs
row-column matrix multiplication.
Zhu and Lielens have proposed doubling up Python's operators in
this way [1]. Their proposal would create six new binary infix
operators, and six new in-place operators.
The original version of this proposal was much more conservative.
The author consulted the developers of GNU Octave [2], an open
source clone of MATLAB. Its developers agreed that providing an
infix operator for matrix multiplication was important: numerical
programmers really do care whether they have to write "mmul(A,B)"
instead of "A op B".
On the other hand, when asked how important it was to have infix
operators for matrix solution and other operations, Prof. James
Rawlings replied [3]:
I DON'T think it's a must have, and I do a lot of matrix
inversion. I cannot remember if its A\b or b\A so I always
write inv(A)*b instead. I recommend dropping \.
Based on this discussion, and feedback from students at the US
national laboratories and elsewhere, we recommended adding only
one new operator, for matrix multiplication, to Python.
Iterators
The planned addition of iterators to Python 2.2 opens up a broader
scope for this proposal. As part of the discussion of PEP 201,
Lockstep Iteration[4], the author of this proposal conducted an
informal usability experiment[5]. The results showed that users
are psychologically receptive to "cross-product" loop syntax. For
example, most users expected:
S = [10, 20, 30]
T = [1, 2, 3]
for x in S; y in T:
print x+y,
to print "11 12 13 21 22 23 31 32 33". We believe that users will
have the same reaction to:
for (x, y) in S @ T:
print x+y
i.e. that they will naturally interpret this as a tidy way to
write loop nests.
This is where iterators come in. Actually constructing the
cross-product of two (or more) sequences before executing the loop
would be very expensive. On the other hand, "@" could be defined
to get its arguments' iterators, and then create an outer iterator
which returns tuples of the values returned by the inner
iterators.
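That lazy outer iterator can be sketched with a generator (later standard libraries provide essentially this as itertools.product); `across` is a hypothetical name for the "@" operation:

```python
# Lazy "across" outer iterator: cache the right operand's values once
# (so a non-restartable iterator still works), then yield pairs
# without ever materializing the full cross-product.
def across(s, t):
    cached = list(t)        # one pass over t, reused per element of s
    for i in s:
        for j in cached:
            yield (i, j)
```

Caching the right operand also illustrates the non-restartable-iterator strategy raised later in the Discussion section.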
Discussion
1. Adding a named function "across" would have less impact on
Python than a new infix operator. However, this would not make
Python more appealing to numerical programmers, who really do
care whether they can write matrix multiplication using an
operator, or whether they have to write it as a function call.
2. "@" would have be chainable in the same way as comparison
operators, i.e.:
(1, 2) @ (3, 4) @ (5, 6)
would have to return (1, 3, 5) ... (2, 4, 6), and *not*
((1, 3), 5) ... ((2, 4), 6). This should not require special
support from the parser, as the outer iterator created by the
first "@" could easily be taught how to combine itself with
ordinary iterators.
3. There would have to be some way to distinguish restartable
iterators from ones that couldn't be restarted. For example,
if S is an input stream (e.g. a file), and L is a list, then "S
@ L" is straightforward, but "L @ S" is not, since iteration
through the stream cannot be repeated. This could be treated
as an error, or by having the outer iterator detect
non-restartable inner iterators and cache their values.
4. Whiteboard testing of this proposal in front of three novice
Python users (all of them experienced programmers) indicates
that users will expect:
"ab" @ "cd"
to return four strings, not four tuples of pairs of
characters. Opinion was divided on what:
("a", "b") @ "cd"
ought to return...
Alternatives
1. Do nothing --- keep Python simple.
This is always the default choice.
2. Add a named function instead of an operator.
Python is not primarily a numerical language; it may not be worth
complexifying it for this special case. However, support for real
matrix multiplication *is* frequently requested, and the proposed
semantics for "@" for built-in sequence types would simplify
expression of a very common idiom (nested loops).
3. Introduce prefixed forms of all existing operators, such as
"~*" and "~+", as proposed in PEP 225 [1].
Our objections to this are that there isn't enough demand to
justify the additional complexity (see Rawlings' comments [3]),
and that the proposed syntax fails the "low toner" readability
test.
Acknowledgments
I am grateful to Huaiyu Zhu for initiating this discussion, and to
James Rawlings and students in various Python courses for their
discussions of what numerical programmers really care about.
References
[1] PEP 225, Elementwise/Objectwise Operators, Zhu, Lielens
http://www.python.org/dev/peps/pep-0225/
[2] http://bevo.che.wisc.edu/octave/
[3] http://www.egroups.com/message/python-numeric/4
[4] PEP 201, Lockstep Iteration, Warsaw
http://www.python.org/dev/peps/pep-0201/
[5] http://mail.python.org/pipermail/python-dev/2000-July/006427.html
pep-0212 Loop Counter Iteration
| PEP: | 212 |
|---|---|
| Title: | Loop Counter Iteration |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Peter Schneider-Kamp <nowonder at nowonder.de> |
| Status: | Deferred |
| Type: | Standards Track |
| Created: | 22-Aug-2000 |
| Python-Version: | 2.1 |
| Post-History: |
Introduction
This PEP describes the often proposed feature of exposing the loop
counter in for-loops. This PEP tracks the status and ownership of
this feature. It contains a description of the feature and
outlines changes necessary to support the feature. This PEP
summarizes discussions held in mailing list forums, and provides
URLs for further information, where appropriate. The CVS revision
history of this file contains the definitive historical record.
Motivation
Standard for-loops in Python iterate over the elements of a
sequence[1]. Often it is desirable to loop over the indices or
both the elements and the indices instead.
The common idioms used to accomplish this are unintuitive. This
PEP proposes two different ways of exposing the indices.
Loop counter iteration
The current idiom for looping over the indices makes use of the
built-in 'range' function:
for i in range(len(sequence)):
# work with index i
Looping over both elements and indices can be achieved either by the
old idiom or by using the new 'zip' built-in function[2]:
for i in range(len(sequence)):
e = sequence[i]
# work with index i and element e
or
for i, e in zip(range(len(sequence)), sequence):
# work with index i and element e
The Proposed Solutions
There are three solutions that have been discussed. One adds a
non-reserved keyword, another adds two built-in functions.
A third solution adds methods to sequence objects.
Non-reserved keyword 'indexing'
This solution would extend the syntax of the for-loop by adding
an optional '<variable> indexing' clause which can also be used
instead of the '<variable> in' clause.
Looping over the indices of a sequence would thus become:
for i indexing sequence:
# work with index i
Looping over both indices and elements would similarly be:
for i indexing e in sequence:
# work with index i and element e
Built-in functions 'indices' and 'irange'
This solution adds two built-in functions 'indices' and 'irange'.
The semantics of these can be described as follows:
def indices(sequence):
return range(len(sequence))
def irange(sequence):
return zip(range(len(sequence)), sequence)
These functions could be implemented either eagerly or lazily and
should be easy to extend in order to accept more than one sequence
argument.
The use of these functions would simplify the idioms for looping
over the indices and over both elements and indices:
for i in indices(sequence):
# work with index i
for i, e in irange(sequence):
# work with index i and element e
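As a historical footnote, the 'irange' semantics were eventually provided by the enumerate() built-in (PEP 279, Python 2.3); a sketch of the equivalence:

```python
# The PEP's irange(sequence) yields the same (index, element) pairs as
# the enumerate() built-in that Python 2.3 later added (PEP 279).
def irange(sequence):
    return list(zip(range(len(sequence)), sequence))

def irange_modern(sequence):
    return list(enumerate(sequence))
```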
Methods for sequence objects
This solution proposes the addition of 'indices', 'items'
and 'values' methods to sequences, which enable looping over
indices only, both indices and elements, and elements only
respectively.
This would immensely simplify the idioms for looping over indices
and for looping over both elements and indices:
for i in sequence.indices():
# work with index i
for i, e in sequence.items():
# work with index i and element e
Additionally it would allow looping over the elements
of sequences and dictionaries in a consistent way:
for e in sequence_or_dict.values():
# do something with element e
Implementations
For all three solutions some more or less rough patches exist
as patches at SourceForge:
'for i indexing a in l': exposing the for-loop counter[3]
add indices() and irange() to built-ins[4]
add items() method to listobject[5]
All of them have been pronounced on and rejected by the BDFL.
Note that the 'indexing' keyword is only a NAME in the
grammar and so does not hinder the general use of 'indexing'.
Backward Compatibility Issues
As no keywords are added and the semantics of existing code
remains unchanged, all three solutions can be implemented
without breaking existing code.
Copyright
This document has been placed in the public domain.
References
[1] http://docs.python.org/reference/compound_stmts.html#for
[2] Lockstep Iteration, PEP 201
[3] http://sourceforge.net/patch/download.php?id=101138
[4] http://sourceforge.net/patch/download.php?id=101129
[5] http://sourceforge.net/patch/download.php?id=101178
pep-0213 Attribute Access Handlers
| PEP: | 213 |
|---|---|
| Title: | Attribute Access Handlers |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Paul Prescod <paul at prescod.net> |
| Status: | Deferred |
| Type: | Standards Track |
| Created: | 21-Jul-2000 |
| Python-Version: | 2.1 |
| Post-History: |
Introduction
It is possible (and even relatively common) in Python code and
in extension modules to "trap" when an instance's client code
attempts to set an attribute and execute code instead. In other
words it is possible to allow users to use attribute assignment/
retrieval/deletion syntax even though the underlying implementation
is doing some computation rather than directly modifying a
binding.
This PEP describes a feature that makes it easier, more efficient
and safer to implement these handlers for Python instances.
Justification
Scenario 1:
You have a deployed class that works on an attribute named
"stdout". After a while, you think it would be better to
check that stdout is really an object with a "write" method
at the moment of assignment. Rather than change to a
setstdout method (which would be incompatible with deployed
code) you would rather trap the assignment and check the
object's type.
Scenario 2:
You want to be as compatible as possible with an object
model that has a concept of attribute assignment. It could
be the W3C Document Object Model or a particular COM
interface (e.g. the PowerPoint interface). In that case
you may well want attributes in the model to show up as
attributes in the Python interface, even though the
underlying implementation may not use attributes at all.
Scenario 3:
A user wants to make an attribute read-only.
In short, this feature allows programmers to separate the
interface of their module from the underlying implementation
for whatever purpose. Again, this is not a new feature but
merely a new syntax for an existing convention.
Current Solution
To make some attributes read-only:
class foo:
def __setattr__( self, name, val ):
if name=="readonlyattr":
raise TypeError
elif name=="readonlyattr2":
raise TypeError
...
else:
self.__dict__[name]=val
This has the following problems:
1. The creator of the method must be intimately aware of whether
somewhere else in the class hierarchy __setattr__ has also been
trapped for any particular purpose. If so, she must specifically
call that method rather than assigning to the dictionary. There
are many different reasons to overload __setattr__ so there is a
decent potential for clashes. For instance object database
implementations often overload setattr for an entirely unrelated
purpose.
2. The string-based switch statement forces all attribute handlers
to be specified in one place in the code. They may then dispatch
to task-specific methods (for modularity) but this could cause
performance problems.
3. Logic for the setting, getting and deleting must live in
__getattr__, __setattr__ and __delattr__. Once again, this can be
mitigated through an extra level of method call but this is
inefficient.
Proposed Syntax
Special methods should declare themselves with declarations of the
following form:
class x:
def __attr_XXX__(self, op, val ):
if op=="get":
return someComputedValue(self.internal)
elif op=="set":
self.internal=someComputedValue(val)
elif op=="del":
del self.internal
Client code looks like this:
fooval=x.foo
x.foo=fooval+5
del x.foo
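The proposed dispatch can be emulated today with the existing __getattr__/__setattr__ hooks; a minimal sketch, without the PEP's C-level implementation or its precedence guarantees:

```python
# Emulation sketch of the __attr_XXX__ protocol via __getattr__ and
# __setattr__; the real proposal would do this dispatch in C at
# class-creation time.
class HandlerBase:
    def __getattr__(self, name):
        handler = getattr(type(self), '__attr_%s__' % name, None)
        if handler is None:
            raise AttributeError(name)
        return handler(self, 'get', None)

    def __setattr__(self, name, val):
        handler = getattr(type(self), '__attr_%s__' % name, None)
        if handler is not None:
            handler(self, 'set', val)
        else:
            object.__setattr__(self, name, val)

class X(HandlerBase):
    def __attr_foo__(self, op, val):
        # Store the real value in a private attribute, as the PEP's
        # Caveats recommend, to avoid infinite recursion.
        if op == 'get':
            return self._internal
        elif op == 'set':
            object.__setattr__(self, '_internal', val)
```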
Semantics
Attribute references of all three kinds should call the method.
The op parameter can be "get"/"set"/"del". Of course this string
will be interned so the actual checks for the string will be
very fast.
It is disallowed to actually have an attribute named XXX in the
same instance as a method named __attr_XXX__.
An implementation of __attr_XXX__ takes precedence over an
implementation of __getattr__ based on the principle that
__getattr__ is supposed to be invoked only after finding an
appropriate attribute has failed.
An implementation of __attr_XXX__ takes precedence over an
implementation of __setattr__ in order to be consistent. The
opposite choice seems fairly feasible also, however. The same
goes for __del_y__.
Proposed Implementation
There is a new object type called an attribute access handler.
Objects of this type have the following attributes:
name (e.g. XXX, not __attr_XXX__)
method (pointer to a method object)
In PyClass_New, methods of the appropriate form will be detected and
converted into objects (just like unbound method objects). These are
stored in the class __dict__ under the name XXX. The original method
is stored as an unbound method under its original name.
If there are any attribute access handlers in an instance at all,
a flag is set. Let's call it "I_have_computed_attributes" for
now. Derived classes inherit the flag from base classes. Instances
inherit the flag from classes.
A get proceeds as usual until just before the object is returned.
In addition to the current check whether the returned object is a
method it would also check whether a returned object is an access
handler. If so, it would invoke the getter method and return
the value. To remove an attribute access handler you could directly
fiddle with the dictionary.
A set proceeds by checking the "I_have_computed_attributes" flag. If
it is not set, everything proceeds as it does today. If it is set
then we must do a dictionary get on the requested object name. If it
returns an attribute access handler then we call the setter function
with the value. If it returns any other object then we discard the
result and continue as we do today. Note that having an attribute
access handler will mildly affect attribute "setting" performance for
all sets on a particular instance, but no more so than today, using
__setattr__. Gets are more efficient than they are today with
__getattr__.
The I_have_computed_attributes flag is intended to eliminate the
performance degradation of an extra "get" per "set" for objects not
using this feature. Checking this flag should have miniscule
performance implications for all objects.
The implementation of delete is analogous to the implementation
of set.
Caveats
1. You might note that I have not proposed any logic to keep
the I_have_computed_attributes flag up to date as attributes
are added and removed from the instance's dictionary. This is
consistent with current Python. If you add a __setattr__ method
to an object after it is in use, that method will not behave as
it would if it were available at "compile" time. The dynamism is
arguably not worth the extra implementation effort. This snippet
demonstrates the current behavior:
>>> def prn(*args):print args
>>> class a:
... __setattr__=prn
>>> a().foo=5
(<__main__.a instance at 882890>, 'foo', 5)
>>> class b: pass
>>> bi=b()
>>> bi.__setattr__=prn
>>> bi.foo=5
2. Assignment to __dict__["XXX"] can overwrite the attribute
access handler for __attr_XXX__. Typically the access handlers will
store information away in private __XXX variables
3. An attribute access handler that attempts to call setattr or getattr
on the object itself can cause an infinite loop (as with __getattr__)
Once again, the solution is to use a special (typically private)
variable such as __XXX.
Note
The descriptor mechanism described in PEP 252 is powerful enough
to support this more directly. A 'getset' constructor may be
added to the language making this possible:
class C:
def get_x(self):
return self.__x
def set_x(self, v):
self.__x = v
x = getset(get_x, set_x)
Additional syntactic sugar might be added, or a naming convention
could be recognized.
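The 'getset' constructor sketched above became the property() built-in of Python 2.2's new-style classes; a modern rendering (using a single-underscore attribute rather than the Note's mangled __x, for brevity):

```python
# Modern equivalent of the Note's getset sketch: the property() built-in
# (Python 2.2 and later) wires getter and setter into attribute syntax.
class C:
    def get_x(self):
        return self._x
    def set_x(self, v):
        self._x = v
    x = property(get_x, set_x)
```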
pep-0214 Extended Print Statement
| PEP: | 214 |
|---|---|
| Title: | Extended Print Statement |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Barry Warsaw <barry at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 24-Jul-2000 |
| Python-Version: | 2.0 |
| Post-History: | 16-Aug-2000 |
Introduction
This PEP describes a syntax to extend the standard `print'
statement so that it can be used to print to any file-like object,
instead of the default sys.stdout. This PEP tracks the status and
ownership of this feature. It contains a description of the
feature and outlines changes necessary to support the feature.
This PEP summarizes discussions held in mailing list forums, and
provides URLs for further information, where appropriate. The CVS
revision history of this file contains the definitive historical
record.
Proposal
This proposal introduces a syntax extension to the print
statement, which allows the programmer to optionally specify the
output file target. An example usage is as follows:
print >> mylogfile, 'this message goes to my log file'
Formally, the syntax of the extended print statement is
print_stmt: ... | '>>' test [ (',' test)+ [','] ]
where the ellipsis indicates the original print_stmt syntax
unchanged. In the extended form, the expression just after >>
must yield an object with a write() method (i.e. a file-like
object). Thus these two statements are equivalent:
print 'hello world'
print >> sys.stdout, 'hello world'
As are these two statements:
print
print >> sys.stdout
These two statements are syntax errors:
print ,
print >> sys.stdout,
Justification
`print' is a Python keyword and introduces the print statement as
described in section 6.6 of the language reference manual[1].
The print statement has a number of features:
- it auto-converts the items to strings
- it inserts spaces between items automatically
- it appends a newline unless the statement ends in a comma
The formatting that the print statement performs is limited; for
more control over the output, a combination of sys.stdout.write(),
and string interpolation can be used.
The print statement by definition outputs to sys.stdout. More
specifically, sys.stdout must be a file-like object with a write()
method, but it can be rebound to redirect output to files other
than specifically standard output. A typical idiom is
save_stdout = sys.stdout
try:
sys.stdout = mylogfile
print 'this message goes to my log file'
finally:
sys.stdout = save_stdout
The problem with this approach is that the binding is global, and
so affects every statement inside the try: clause. For example,
if we added a call to a function that actually did want to print
to stdout, this output too would get redirected to the logfile.
This approach is also very inconvenient for interleaving prints to
various output streams, and complicates coding in the face of
legitimate try/except or try/finally clauses.
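Modern Python packages this save/restore idiom as contextlib.redirect_stdout (and print became a function, PEP 3105). The rebinding is still global while the with-block runs, so the criticism above still applies, but the bookkeeping is automatic:

```python
import io
from contextlib import redirect_stdout

# contextlib.redirect_stdout performs the save/rebind/restore dance for
# the duration of the with-block only.
buf = io.StringIO()
with redirect_stdout(buf):
    print('this message goes to my log file')
captured = buf.getvalue()
```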
Reference Implementation
A reference implementation, in the form of a patch against the
Python 2.0 source tree, is available on SourceForge's patch
manager[2]. This approach adds two new opcodes, PRINT_ITEM_TO and
PRINT_NEWLINE_TO, which simply pop the file-like object off the
top of the stack and use it instead of sys.stdout as the output
stream.
(This reference implementation has been adopted in Python 2.0.)
Alternative Approaches
An alternative to this syntax change has been proposed (originally
by Moshe Zadka) which requires no syntax changes to Python. A
writeln() function could be provided (possibly as a builtin), that
would act much like extended print, with a few additional
features.
def writeln(*args, **kws):
import sys
file = sys.stdout
sep = ' '
end = '\n'
if kws.has_key('file'):
file = kws['file']
del kws['file']
if kws.has_key('nl'):
if not kws['nl']:
end = ' '
del kws['nl']
if kws.has_key('sep'):
sep = kws['sep']
del kws['sep']
if kws:
raise TypeError('unexpected keywords')
file.write(sep.join(map(str, args)) + end)
writeln() takes three optional keyword arguments. In the
context of this proposal, the relevant argument is `file' which
can be set to a file-like object with a write() method. Thus
print >> mylogfile, 'this goes to my log file'
would be written as
writeln('this goes to my log file', file=mylogfile)
writeln() has the additional functionality that the keyword
argument `nl' is a flag specifying whether to append a newline or
not, and an argument `sep' which specifies the separator to output
in between each item.
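In hindsight, writeln() is essentially the signature that Python 3's built-in print() function later adopted (PEP 3105), with `end` generalizing the `nl` flag:

```python
import io

# Python 3's print() provides writeln()'s keywords: file=, sep=, and
# end= (a generalization of the 'nl' flag).
out = io.StringIO()
print('this', 'goes', 'to', 'my', 'log', 'file', sep=' ', end='\n', file=out)
result = out.getvalue()
```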
More Justification by the BDFL
The proposal has been challenged on the newsgroup. One series of
challenges doesn't like '>>' and would rather see some other
symbol.
Challenge: Why not one of these?
print in stderr items,....
print + stderr items,.......
print[stderr] items,.....
print to stderr items,.....
Response: If we want to use a special symbol (print <symbol>
expression), the Python parser requires that it is not already a
symbol that can start an expression -- otherwise it can't decide
which form of print statement is used. (The Python parser is a
simple LL(1) or recursive descent parser.)
This means that we can't use the "keyword only in context trick"
that was used for "import as", because an identifier can start an
expression. This rules out +stderr, [stderr], and to stderr. It
leaves us with binary operator symbols and other miscellaneous
symbols that are currently illegal here, such as 'import'.
If I had to choose between 'print in file' and 'print >> file' I
would definitely choose '>>'. In part because 'in' would be a new
invention (I know of no other language that uses it, while '>>' is
used in sh, awk, Perl, and C++), in part because '>>', being
non-alphabetic, stands out more so is more likely to catch the
reader's attention.
Challenge: Why does there have to be a comma between the file and
the rest?
Response: The comma separating the file from the following expression is
necessary! Of course you want the file to be an arbitrary
expression, not just a single word. (You definitely want to be
able to write print >>sys.stderr.) Without the comma the
parser wouldn't be able to distinguish where that expression ends
and where the next one begins, e.g.
print >>i +1, 2
print >>a [1], 2
print >>f (1), 2
Challenge: Why do you need a syntax extension? Why not
writeln(file, item, ...)?
Response: First of all, this is lacking a feature of the print
statement: the trailing comma to print which suppresses the final
newline. Note that 'print a,' still isn't equivalent to
'sys.stdout.write(a)' -- print inserts a space between items, and
takes arbitrary objects as arguments; write() doesn't insert a
space and requires a single string.
When you are considering an extension for the print statement,
it's not right to add a function or method that adds a new feature
in one dimension (where the output goes) but takes away in another
dimension (spaces between items, and the choice of trailing
newline or not). We could add a whole slew of methods or
functions to deal with the various cases but that seems to add
more confusion than necessary, and would only make sense if we
were to deprecate the print statement altogether.
I feel that this debate is really about whether print should have
been a function or method rather than a statement. If you are in
the function camp, of course adding special syntax to the existing
print statement is not something you like. I suspect the
objection to the new syntax comes mostly from people who already
think that the print statement was a bad idea. Am I right?
About 10 years ago I debated with myself whether to make the most
basic form of output a function or a statement; basically I was
trying to decide between "print(item, ...)" and "print item, ...".
I chose to make it a statement because printing needs to be taught
very early on, and is very important in the programs that
beginners write. Also, because ABC, which led the way for so
many things, made it a statement. In a move that's typical for
the interaction between ABC and Python, I changed the name from
WRITE to print, and reversed the convention for adding newlines
from requiring extra syntax to add a newline (ABC used trailing
slashes to indicate newlines) to requiring extra syntax (the
trailing comma) to suppress the newline. I kept the feature that
items are separated by whitespace on output.
Full example: in ABC,
WRITE 1
WRITE 2/
has the same effect as
print 1,
print 2
has in Python, outputting in effect "1 2\n".
I'm not 100% sure that the choice for a statement was right (ABC
had the compelling reason that it used statement syntax for
anything with side effects, but Python doesn't have this
convention), but I'm also not convinced that it's wrong. I
certainly like the economy of the print statement. (I'm a rabid
Lisp-hater -- syntax-wise, not semantics-wise! -- and excessive
parentheses in syntax annoy me. Don't ever write return(i) or
if(x==y): in your Python code! :-)
Anyway, I'm not ready to deprecate the print statement, and over
the years we've had many requests for an option to specify the
file.
Challenge: Why not > instead of >>?
Response: To DOS and Unix users, >> suggests "append", while >
suggests "overwrite"; the semantics are closest to append. Also,
for C++ programmers, >> and << are I/O operators.
Challenge: But in C++, >> is input and << is output!
Response: doesn't matter; C++ clearly took it from Unix and
reversed the arrows. The important thing is that for output, the
arrow points to the file.
Challenge: Surely you can design a println() function that can do
all that print>>file can do; why isn't that enough?
Response: I think of this in terms of a simple programming
exercise. Suppose a beginning programmer is asked to write a
function that prints the tables of multiplication. A reasonable
solution is:
def tables(n):
for j in range(1, n+1):
for i in range(1, n+1):
print i, 'x', j, '=', i*j
print
Now suppose the second exercise is to add printing to a different
file. With the new syntax, the programmer only needs to learn one
new thing: print >> file, and the answer can be like this:
def tables(n, file=sys.stdout):
for j in range(1, n+1):
for i in range(1, n+1):
print >> file, i, 'x', j, '=', i*j
print >> file
With only a print statement and a println() function, the
programmer first has to learn about println(), transforming the
original program to using println():
def tables(n):
for j in range(1, n+1):
for i in range(1, n+1):
println(i, 'x', j, '=', i*j)
println()
and *then* about the file keyword argument:
def tables(n, file=sys.stdout):
for j in range(1, n+1):
for i in range(1, n+1):
println(i, 'x', j, '=', i*j, file=sys.stdout)
println(file=sys.stdout)
Thus, the transformation path is longer:
(1) print
(2) print >> file
vs.
(1) print
(2) println()
(3) println(file=...)
Note: defaulting the file argument to sys.stdout at compile time
is wrong, because it doesn't work right when the caller assigns to
sys.stdout and then uses tables() without specifying the file.
This is a common problem (and would occur with a println()
function too). The standard solution so far has been:
def tables(n, file=None):
if file is None:
file = sys.stdout
for j in range(1, n+1):
for i in range(1, n+1):
print >> file, i, 'x', j, '=', i*j
print >> file
I've added a feature to the implementation (which I would also
recommend to println()) whereby if the file argument is None,
sys.stdout is automatically used. Thus,
print >> None, foo, bar
(or, of course, print >> x where x is a variable whose value is
None) means the same as
print foo, bar
and the tables() function can be written as follows:
    def tables(n, file=None):
        for j in range(1, n+1):
            for i in range(1, n+1):
                print >> file, i, 'x', j, '=', i*j
            print >> file
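The call-time-default idiom described here survives unchanged with
today's print() function. A minimal Python 3 rendering of the final
tables() (not the PEP's own code, which uses the Python 2 print
statement):

```python
import io
import sys

def tables(n, file=None):
    # Resolve the default at call time, so a later rebinding of
    # sys.stdout is respected -- the pitfall the note above describes.
    if file is None:
        file = sys.stdout
    for j in range(1, n + 1):
        for i in range(1, n + 1):
            print(i, 'x', j, '=', i * j, file=file)
        print(file=file)

buf = io.StringIO()
tables(2, file=buf)
print(buf.getvalue().splitlines()[0])   # 1 x 1 = 1
```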
[XXX this needs more justification, and a section of its own]
References
[1] http://docs.python.org/reference/simple_stmts.html#print
[2] http://sourceforge.net/patch/download.php?id=100970
pep-0215 String Interpolation
| PEP: | 215 |
|---|---|
| Title: | String Interpolation |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Ka-Ping Yee <ping at zesty.ca> |
| Status: | Superseded |
| Type: | Standards Track |
| Created: | 24-Jul-2000 |
| Python-Version: | 2.1 |
| Post-History: | |
| Superseded-By: | 292 |
Abstract
This document proposes a string interpolation feature for Python
to allow easier string formatting. The suggested syntax change
is the introduction of a '$' prefix that triggers the special
interpretation of the '$' character within a string, in a manner
reminiscent of the variable interpolation found in Unix shells,
awk, Perl, or Tcl.
Copyright
This document is in the public domain.
Specification
Strings may be preceded with a '$' prefix that comes before the
leading single or double quotation mark (or triplet) and before
any of the other string prefixes ('r' or 'u'). Such a string is
processed for interpolation after the normal interpretation of
backslash-escapes in its contents. The processing occurs just
before the string is pushed onto the value stack, each time the
string is pushed. In short, Python behaves exactly as if '$'
were a unary operator applied to the string. The operation
performed is as follows:
The string is scanned from start to end for the '$' character
(\x24 in 8-bit strings or \u0024 in Unicode strings). If there
are no '$' characters present, the string is returned unchanged.
Any '$' found in the string, followed by one of the two kinds of
expressions described below, is replaced with the value of the
expression as evaluated in the current namespaces. The value is
converted with str() if the containing string is an 8-bit string,
or with unicode() if it is a Unicode string.
1. A Python identifier optionally followed by any number of
trailers, where a trailer consists of:
- a dot and an identifier,
- an expression enclosed in square brackets, or
- an argument list enclosed in parentheses
(This is exactly the pattern expressed in the Python grammar
by "NAME trailer*", using the definitions in Grammar/Grammar.)
2. Any complete Python expression enclosed in curly braces.
Two dollar-signs ("$$") are replaced with a single "$".
Examples
Here is an example of an interactive session exhibiting the
expected behaviour of this feature.
>>> a, b = 5, 6
>>> print $'a = $a, b = $b'
a = 5, b = 6
>>> $u'uni${a}ode'
u'uni5ode'
>>> print $'\$a'
5
>>> print $r'\$a'
\5
>>> print $'$$$a.$b'
$5.6
>>> print $'a + b = ${a + b}'
a + b = 11
>>> import sys
>>> print $'References to $a: $sys.getrefcount(a)'
References to 5: 15
>>> print $"sys = $sys, sys = $sys.modules['sys']"
sys = <module 'sys' (built-in)>, sys = <module 'sys' (built-in)>
>>> print $'BDFL = $sys.copyright.split()[4].upper()'
BDFL = GUIDO
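For comparison, PEP 292 (which superseded this proposal) eventually
delivered a restricted form of the idea as string.Template: plain
$name and ${name} substitution with "$$" escaping, but none of the
trailers or arbitrary expressions shown above. A brief sketch of that
later API:

```python
from string import Template

a, b = 5, 6
t = Template('a = $a, b = $b, total = ${total}')
# Template only substitutes names, so expressions like a + b
# must be evaluated by the caller and passed in.
print(t.substitute(a=a, b=b, total=a + b))   # a = 5, b = 6, total = 11
print(Template('$$$a').substitute(a=5))      # $5
```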
Discussion
'$' is chosen as the interpolation character within the
string for the sake of familiarity, since it is already used
for this purpose in many other languages and contexts.
It is then natural to choose '$' as a prefix, since it is a
mnemonic for the interpolation character.
Trailers are permitted to give this interpolation mechanism
even more power than the interpolation available in most other
languages, while the expression to be interpolated remains
clearly visible and free of curly braces.
'$' works like an operator and could be implemented as an
operator, but that prevents the compile-time optimization
and presents security issues. So, it is only allowed as a
string prefix.
Security Issues
"$" has the power to eval, but only to eval a literal. As
described here (a string prefix rather than an operator), it
introduces no new security issues since the expressions to be
evaluated must be literally present in the code.
Implementation
The Itpl module at http://www.lfw.org/python/Itpl.py provides a
prototype of this feature. It uses the tokenize module to find
the end of an expression to be interpolated, then calls eval()
on the expression each time a value is needed. In the prototype,
the expression is parsed and compiled again each time it is
evaluated.
As an optimization, interpolated strings could be compiled
directly into the corresponding bytecode; that is,
$'a = $a, b = $b'
could be compiled as though it were the expression
('a = ' + str(a) + ', b = ' + str(b))
so that it only needs to be compiled once.
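To illustrate the run-time (unoptimized) behaviour, a toy
interpolator in the spirit of the Itpl prototype can be sketched with
a regular expression instead of the tokenize module. The names
interp and _PATTERN are hypothetical; unlike the specification, this
version handles only $name and ${expr}, no trailers, and re-evaluates
on every call:

```python
import re

# Matches $name or ${expr}; group 2 is set only for the braced form.
_PATTERN = re.compile(r'\$(\{([^}]*)\}|[A-Za-z_][A-Za-z0-9_]*)')

def interp(s, namespace):
    # '$$' escapes to a literal '$'; everything else is eval'd
    # in the given namespace and converted with str().
    def repl(m):
        expr = m.group(2) if m.group(2) is not None else m.group(1)
        return str(eval(expr, namespace))
    masked = s.replace('$$', '\x00')       # hide escapes from the regex
    return _PATTERN.sub(repl, masked).replace('\x00', '$')

a, b = 5, 6
print(interp('a = $a, b = $b, a + b = ${a + b}', {'a': a, 'b': b}))
```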
pep-0216 Docstring Format
| PEP: | 216 |
|---|---|
| Title: | Docstring Format |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Moshe Zadka <moshez at zadka.site.co.il> |
| Status: | Rejected |
| Type: | Informational |
| Created: | 31-Jul-2000 |
| Post-History: | |
| Superseded-By: | 287 |
Notice
This PEP is rejected by the author. It has been superseded by PEP
287.
Abstract
Named Python objects, such as modules, classes and functions, have a
string attribute called __doc__. If the first expression inside
the definition is a literal string, that string is assigned
to the __doc__ attribute.
The __doc__ attribute is called a documentation string, or docstring.
It is often used to summarize the interface of the module, class or
function. However, since there is no common format for documentation
strings, tools for extracting docstrings and transforming them into
documentation in a standard format (e.g., DocBook) have not sprung
up in abundance, and those that do exist are for the most part
unmaintained and unused.
Perl Documentation
In Perl, most modules are documented in a format called POD -- Plain
Old Documentation. This is an easy-to-type, very low level format
which integrates well with the Perl parser. Many tools exist to turn
POD documentation into other formats: info, HTML and man pages, among
others. However, in Perl, the information is not available at run-time.
Java Documentation
In Java, special comments before classes and functions serve to
document the code. A program to extract these and turn them into
HTML documentation is called javadoc, and is part of the standard
Java distribution. However, the only output format that is supported
is HTML, and JavaDoc has a very intimate relationship with HTML.
Python Docstring Goals
Python documentation strings are easy to spot during parsing, and are
also available to the runtime interpreter. This double purpose is
a bit problematic, sometimes: for example, some are reluctant to have
overly long docstrings, because they do not want to take up much space
in the runtime. In addition, because of the current lack of tools,
people read objects' docstrings by "print"ing them, so a tendency to
make them brief and free of markup has sprung up. This tendency
hinders writing better documentation-extraction tools, since it causes
docstrings to contain little information, which is hard to parse.
High Level Solutions
To counter the objection that the strings take up space in the running
program, it is suggested that documentation extraction tools will
concatenate a maximum prefix of string literals which appear in the
beginning of a definition. The first of these will also be available
in the interactive interpreter, so it should contain a few summary
lines.
Docstring Format Goals
These are the goals for the docstring format, as discussed ad nauseam
in the doc-sig.
1. It must be easy to type with any standard text editor.
2. It must be readable to the casual observer.
3. It must not contain information which can be deduced from parsing
the module.
4. It must contain sufficient information so it can be converted
to any reasonable markup format.
5. It must be possible to write a module's entire documentation in
docstrings, without feeling hampered by the markup language.
Docstring Contents
For requirement 5 above, we need to specify what must be
in docstrings.
At least the following must be available:
a. A tag that means "this is a Python ``something'', guess what"
Example: In the sentence "The POP3 class", we need to markup "POP3"
so. The parser will be able to guess it is a class from the contents
of the poplib module, but we need to make it guess.
b. Tags that mean "this is a Python class/module/class var/instance var..."
Example: The usual Python idiom for singleton class A is to have _A
as the class, and A a function which returns _A objects. It's usual
to document the class, nonetheless, as being A. This requires the
strength to say "The class A" and have A hyperlinked and marked-up
as a class.
c. An easy way to include Python source code/Python interactive sessions
d. Emphasis/bold
e. List/tables
Docstring Basic Structure
The documentation strings will be in StructuredTextNG
(http://www.zope.org/Members/jim/StructuredTextWiki/StructuredTextNG)
Since StructuredText is not yet strong enough to handle (a) and (b)
above, we will need to extend it. I suggest using
'[<optional description>:python identifier]'.
E.g.: [class:POP3], [:POP3.list], etc. If the description is missing,
a guess will be made from the text.
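A rough sketch of how such tags might be recognized by a
documentation tool; the TAG pattern and find_tags are hypothetical
illustrations, not part of any ST-NG implementation:

```python
import re

# '[<optional description>:identifier]', e.g. [class:POP3] or [:POP3.list]
TAG = re.compile(r'\[(?:([a-z ]*):)?([A-Za-z_][\w.]*)\]')

def find_tags(docstring):
    # Return (description, identifier) pairs; the description is ''
    # when the tool is expected to guess from context.
    return [(m.group(1) or '', m.group(2)) for m in TAG.finditer(docstring)]

print(find_tags('See [class:POP3] and [:POP3.list] for details.'))
# [('class', 'POP3'), ('', 'POP3.list')]
```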
Unresolved Issues
Is there a way to escape characters in ST? If so, how?
(example: * at the beginning of a line without being bullet symbol)
Is my suggestion above for Python symbols compatible with ST-NG?
How hard would it be to extend ST-NG to support it?
How do we describe input and output types of functions?
What additional constraint do we enforce on each docstring?
(module/class/function)?
What are the guesser rules?
Rejected Suggestions
XML -- it's very hard to type, and too cluttered to read
comfortably.
pep-0217 Display Hook for Interactive Use
| PEP: | 217 |
|---|---|
| Title: | Display Hook for Interactive Use |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Moshe Zadka <moshez at zadka.site.co.il> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 31-Jul-2000 |
| Python-Version: | 2.1 |
| Post-History: |
Abstract
Python's interactive mode is one of the implementation's great
strengths -- being able to write expressions on the command line
and get back a meaningful output. However, the output function
cannot be all things to all people, and the current output
function too often falls short of this goal. This PEP describes a
way to provide alternatives to the built-in display function in
Python, so users will have control over the output from the
interactive interpreter.
Interface
The current Python solution has worked for many users, and this
should not break it. Therefore, in the default configuration,
nothing will change in the REPL loop. To change the way the
interpreter prints interactively entered expressions, users
will have to rebind sys.displayhook to a callable object.
The result of calling this object with the result of the
interactively entered expression should be print-able,
and this is what will be printed on sys.stdout.
Solution
The bytecode PRINT_EXPR will call sys.displayhook(POP())
A displayhook() will be added to the sys builtin module, which is
equivalent to
    import __builtin__
    def displayhook(o):
        if o is None:
            return
        __builtin__._ = None
        print `o`
        __builtin__._ = o
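sys.displayhook did land and still exists in Python 3. A hedged
modern rendering of the snippet above (builtins replaces __builtin__,
and repr() replaces the backquotes; the extra type display is just an
illustration of customizing the hook):

```python
import builtins
import sys

def verbose_hook(value):
    # Same contract as the PEP's displayhook: None prints nothing
    # and '_' is rebound only after a successful print.
    if value is None:
        return
    builtins._ = None
    print('%r  <%s>' % (value, type(value).__name__))
    builtins._ = value

sys.displayhook = verbose_hook
# Outside the interactive loop the hook is not called automatically;
# invoke it directly to see the effect:
verbose_hook([1, 2])        # prints: [1, 2]  <list>
```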
Jython Issues
The method Py.printResult will be similarly changed.
pep-0218 Adding a Built-In Set Object Type
| PEP: | 218 |
|---|---|
| Title: | Adding a Built-In Set Object Type |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Greg Wilson <gvwilson at ddj.com>, Raymond Hettinger <python at rcn.com> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 31-Jul-2000 |
| Python-Version: | 2.2 |
| Post-History: |
Introduction
This PEP proposes adding a Set module to the standard Python
library, and to then make sets a built-in Python type if that
module is widely used. After explaining why sets are desirable,
and why the common idiom of using dictionaries in their place is
inadequate, we describe how we intend built-in sets to work, and
then how the preliminary Set module will behave. The last
section discusses the mutability (or otherwise) of sets and set
elements, and the solution which the Set module will implement.
Rationale
Sets are a fundamental mathematical structure, and are very
commonly used in algorithm specifications. They are much less
frequently used in implementations, even when they are the "right"
structure. Programmers frequently use lists instead, even when
the ordering information in lists is irrelevant, and by-value
lookups are frequent. (Most medium-sized C programs contain a
depressing number of start-to-end searches through malloc'd
vectors to determine whether particular items are present or
not...)
Programmers are often told that they can implement sets as
dictionaries with "don't care" values. Items can be added to
these "sets" by assigning the "don't care" value to them;
membership can be tested using "dict.has_key"; and items can be
deleted using "del". However, the other main operations on sets
(union, intersection, and difference) are not directly supported
by this representation, since their meaning is ambiguous for
dictionaries containing key/value pairs.
Proposal
The long-term goal of this PEP is to add a built-in set type to
Python. This type will be an unordered collection of unique
values, just as a dictionary is an unordered collection of
key/value pairs.
Iteration and comprehension will be implemented in the obvious
ways, so that:
for x in S:
will step through the elements of S in arbitrary order, while:
set(x**2 for x in S)
will produce a set containing the squares of all elements in S.
Membership will be tested using "in" and "not in", and basic set
operations will be implemented by a mixture of overloaded
operators:
| union
& intersection
^ symmetric difference
- asymmetric difference
== != equality and inequality tests
< <= >= > subset and superset tests
and methods:
S.add(x) Add "x" to the set.
S.update(s) Add all elements of sequence "s" to the set.
S.remove(x) Remove "x" from the set. If "x" is not
present, this method raises a LookupError
exception.
S.discard(x) Remove "x" from the set if it is present, or
do nothing if it is not.
S.pop() Remove and return an arbitrary element,
raising a LookupError if the set is
empty.
S.clear() Remove all elements from this set.
S.copy() Make a new set.
s.issuperset() Check for a superset relationship.
s.issubset() Check for a subset relationship.
and two new built-in conversion functions:
set(x) Create a set containing the elements of the
collection "x".
frozenset(x) Create an immutable set containing the elements
of the collection "x".
Notes:
1. We propose using the bitwise operators "|" and "&" for union
and intersection. While "+" for union would be intuitive, "*" for
intersection is not (very few of the people asked guessed what
it did correctly).
2. We considered using "+" to add elements to a set, rather than
"add". However, Guido van Rossum pointed out that "+" is
symmetric for other built-in types (although "*" is not). Use
of "add" will also avoid confusion between that operation and
set union.
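These operators and methods all shipped (in Python 2.4) and remain in
Python 3, so the tables above can be exercised directly. A small
demonstration, using the brace literal syntax that arrived later, in
Python 2.7/3.x (note that the built-in remove()/pop() raise KeyError,
a subclass of the LookupError named above):

```python
S = {1, 2, 3}
T = {3, 4}

assert S | T == {1, 2, 3, 4}        # union
assert S & T == {3}                 # intersection
assert S ^ T == {1, 2, 4}           # symmetric difference
assert S - T == {1, 2}              # (asymmetric) difference
assert {1, 2} <= S and S >= {3}     # subset / superset tests

S.add(4)
S.discard(99)                       # discard ignores missing elements
assert S == {1, 2, 3, 4}
assert frozenset('ab') == frozenset(['a', 'b'])
```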
Set Notation
The PEP originally proposed {1,2,3} as the set notation and {-} for
the empty set. Experience with Python 2.3's sets.py showed that
the notation was not necessary. Also, there was some risk of making
dictionaries less instantly recognizable.
It was also contemplated that the braced notation would support set
comprehensions; however, Python 2.4 provided generator expressions
which fully met that need and did so in a more general way.
(See PEP 289 for details on generator expressions).
So, Guido ruled that there would not be a set syntax; however, the
issue could be revisited for Python 3000 (see PEP 3000).
History
To gain experience with sets, a pure python module was introduced
in Python 2.3. Based on that implementation, the set and frozenset
types were introduced in Python 2.4. The improvements are:
* Better hash algorithm for frozensets
* More compact pickle format (storing only an element list
instead of a dictionary of key:value pairs where the value
is always True).
* Use a __reduce__ function so that deep copying is automatic.
* The BaseSet concept was eliminated.
* The union_update() method became just update().
* Auto-conversion between mutable and immutable sets was dropped.
* The _repr method was dropped (the need is met by the new
sorted() built-in function).
Tim Peters believes that the class's constructor should take a
single sequence as an argument, and populate the set with that
sequence's elements. His argument is that in most cases,
programmers will be creating sets from pre-existing sequences, so
that this case should be the common one. However, this would
require users to remember an extra set of parentheses when
initializing a set with known values:
>>> Set((1, 2, 3, 4)) # case 1
On the other hand, feedback from a small number of novice Python
users (all of whom were very experienced with other languages)
indicates that people will find a "parenthesis-free" syntax more
natural:
>>> Set(1, 2, 3, 4) # case 2
Ultimately, we adopted the first strategy in which the initializer
takes a single iterable argument.
Mutability
The most difficult question to resolve in this proposal was
whether sets ought to be able to contain mutable elements. A
dictionary's keys must be immutable in order to support fast,
reliable lookup. While it would be easy to require set elements
to be immutable, this would preclude sets of sets (which are
widely used in graph algorithms and other applications).
Earlier drafts of PEP 218 had only a single set type, but the
sets.py implementation in Python 2.3 has two, Set and
ImmutableSet. For Python 2.4, the new built-in types were named
set and frozenset which are slightly less cumbersome.
There are two classes implemented in the "sets" module. Instances
of the Set class can be modified by the addition or removal of
elements, and the ImmutableSet class is "frozen", with an
unchangeable collection of elements. Therefore, an ImmutableSet
may be used as a dictionary key or as a set element, but cannot be
updated. Both types of set require that their elements are
immutable, hashable objects. Parallel comments apply to the "set"
and "frozenset" built-in types.
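The rule is easy to demonstrate with the built-in types as they
eventually shipped:

```python
# Sets of sets require the frozen variant for the inner elements.
groups = {frozenset({1, 2}), frozenset({2, 3})}
assert frozenset({2, 1}) in groups           # element order never matters

# frozensets are hashable and work as dictionary keys; mutable sets do not.
colour = {frozenset({1, 2}): 'red'}
try:
    colour[{1, 2}] = 'blue'                  # unhashable type: 'set'
    rejected = False
except TypeError:
    rejected = True
assert rejected
```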
Copyright
This document has been placed in the Public Domain.
pep-0219 Stackless Python
| PEP: | 219 |
|---|---|
| Title: | Stackless Python |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Gordon McMillan <gmcm at hypernet.com> |
| Status: | Deferred |
| Type: | Standards Track |
| Created: | 14-Aug-2000 |
| Python-Version: | 2.1 |
| Post-History: |
Introduction
This PEP discusses changes required to core Python in order to
efficiently support generators, microthreads and coroutines. It is
related to PEP 220, which describes how Python should be extended
to support these facilities. The focus of this PEP is strictly on
the changes required to allow these extensions to work.
While these PEPs are based on Christian Tismer's Stackless[1]
implementation, they do not regard Stackless as a reference
implementation. Stackless (with an extension module) implements
continuations, and from continuations one can implement
coroutines, microthreads (as has been done by Will Ware[2]) and
generators. But in more than a year, no one has found any other
productive use of continuations, so there seems to be no demand
for their support.
However, Stackless support for continuations is a relatively minor
piece of the implementation, so one might regard it as "a"
reference implementation (rather than "the" reference
implementation).
Background
Generators and coroutines have been implemented in a number of
languages in a number of ways. Indeed, Tim Peters has done pure
Python implementations of generators[3] and coroutines[4] using
threads (and a thread-based coroutine implementation exists for
Java). However, the horrendous overhead of a thread-based
implementation severely limits the usefulness of this approach.
Microthreads (a.k.a. "green" or "user" threads) and coroutines
involve transfers of control that are difficult to accommodate in
a language implementation based on a single stack. (Generators can
be done on a single stack, but they can also be regarded as a very
simple case of coroutines.)
Real threads allocate a full-sized stack for each thread of
control, and this is the major source of overhead. However,
coroutines and microthreads can be implemented in Python in a way
that involves almost no overhead. This PEP, therefore, offers a
way of making Python able to realistically manage thousands of
separate "threads" of activity (vs. today's limit of perhaps dozens
of separate threads of activity).
Another justification for this PEP (explored in PEP 220) is that
coroutines and generators often allow a more direct expression of
an algorithm than is possible in today's Python.
Discussion
The first thing to note is that while Python mingles interpreter
data (normal C stack usage) with Python data (the state of the
interpreted program) on the stack, the two are logically separate.
They just happen to use the same stack.
A real thread gets something approaching a process-sized stack
because the implementation has no way of knowing how much stack
space the thread will require. The stack space required for an
individual frame is likely to be reasonable, but stack switching
is an arcane and non-portable process, not supported by C.
Once Python stops putting Python data on the C stack, however,
stack switching becomes easy.
The fundamental approach of the PEP is based on these two
ideas. First, separate C's stack usage from Python's stack
usage. Secondly, associate with each frame enough stack space to
handle that frame's execution.
In the normal usage, Stackless Python has a normal stack
structure, except that it is broken into chunks. But in the
presence of a coroutine / microthread extension, this same
mechanism supports a stack with a tree structure. That is, an
extension can support transfers of control between frames outside
the normal "call / return" path.
Problems
The major difficulty with this approach is C calling Python. The
problem is that the C stack now holds a nested execution of the
byte-code interpreter. In that situation, a coroutine /
microthread extension cannot be permitted to transfer control to a
frame in a different invocation of the byte-code interpreter. If a
frame were to complete and exit back to C from the wrong
interpreter, the C stack could be trashed.
The ideal solution is to create a mechanism where nested
executions of the byte code interpreter are never needed. The easy
solution is for the coroutine / microthread extension(s) to
recognize the situation and refuse to allow transfers outside the
current invocation.
We can categorize code that involves C calling Python into two
camps: Python's implementation, and C extensions. And hopefully we
can offer a compromise: Python's internal usage (and C extension
writers who want to go to the effort) will no longer use a nested
invocation of the interpreter. Extensions which do not go to the
effort will still be safe, but will not play well with coroutines
/ microthreads.
Generally, when a recursive call is transformed into a loop, a bit
of extra bookkeeping is required. The loop will need to keep its
own "stack" of arguments and results since the real stack can now
only hold the most recent. The code will be more verbose, because
it's not quite as obvious when we're done. While Stackless is not
implemented this way, it has to deal with the same issues.
In normal Python, PyEval_EvalCode is used to build a frame and
execute it. Stackless Python introduces the concept of a
FrameDispatcher. Like PyEval_EvalCode, it executes one frame. But
the interpreter may signal the FrameDispatcher that a new frame
has been swapped in, and the new frame should be executed. When a
frame completes, the FrameDispatcher follows the back pointer to
resume the "calling" frame.
So Stackless transforms recursions into a loop, but it is not the
FrameDispatcher that manages the frames. This is done by the
interpreter (or an extension that knows what it's doing).
The general idea is that where C code needs to execute Python
code, it creates a frame for the Python code, setting its back
pointer to the current frame. Then it swaps in the frame, signals
the FrameDispatcher and gets out of the way. The C stack is now
clean - the Python code can transfer control to any other frame
(if an extension gives it the means to do so).
In the vanilla case, this magic can be hidden from the programmer
(even, in most cases, from the Python-internals programmer). Many
situations present another level of difficulty, however.
The map builtin function involves two obstacles to this
approach. It cannot simply construct a frame and get out of the
way, not just because there's a loop involved, but because each pass
through the loop requires some "post" processing. In order to play
well with others, Stackless constructs a frame object for map
itself.
Most recursions of the interpreter are not this complex, but
fairly frequently, some "post" operations are required. Stackless
does not fix these situations because of the amount of code changes
required. Instead, Stackless prohibits transfers out of a nested
interpreter. While not ideal (and sometimes puzzling), this
limitation is hardly crippling.
Advantages
For normal Python, the advantage to this approach is that C stack
usage becomes much smaller and more predictable. Unbounded
recursion in Python code becomes a memory error, instead of a
stack error (and thus, in non-Cupertino operating systems,
something that can be recovered from). The price, of course, is
the added complexity that comes from transforming recursions of
the byte-code interpreter loop into a higher order loop (and the
attendant bookkeeping involved).
The big advantage comes from realizing that the Python stack is
really a tree, and the frame dispatcher can transfer control
freely between leaf nodes of the tree, thus allowing things like
microthreads and coroutines.
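Although this PEP was deferred, the "transfer control freely between
frames" idea can be approximated today with generators, which later
gave Python a limited form of these control transfers. A toy
round-robin dispatcher (all names hypothetical; this is an analogy
for the FrameDispatcher, not Stackless itself):

```python
def counter(name, n, out):
    # A "microthread": it does a step of work, then yields control
    # back to the dispatcher instead of returning up the C stack.
    for i in range(n):
        out.append('%s:%d' % (name, i))
        yield

def run(threads):
    # Trivial round-robin frame dispatcher over generator frames.
    while threads:
        t = threads.pop(0)
        try:
            next(t)            # resume the frame until its next yield
            threads.append(t)  # still alive: reschedule it
        except StopIteration:
            pass               # the frame completed; drop it

out = []
run([counter('a', 2, out), counter('b', 2, out)])
print(out)   # ['a:0', 'b:0', 'a:1', 'b:1']
```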
References
[1] http://www.stackless.com
[2] http://world.std.com/~wware/uthread.html
[3] Demo/threads/Generator.py in the source distribution
[4] http://www.stackless.com/coroutines.tim.peters.html
pep-0220 Coroutines, Generators, Continuations
| PEP: | 220 |
|---|---|
| Title: | Coroutines, Generators, Continuations |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Gordon McMillan <gmcm at hypernet.com> |
| Status: | Rejected |
| Type: | Informational |
| Created: | 14-Aug-2000 |
| Post-History: |
Abstract
Demonstrates why the changes described in the stackless PEP are
desirable. A low-level continuations module exists. With it,
coroutines and generators and "green" threads can be written. A
higher level module that makes coroutines and generators easy to
create is desirable (and being worked on). The focus of this PEP
is on showing how coroutines, generators, and green threads can
simplify common programming problems.
pep-0221 Import As
| PEP: | 221 |
|---|---|
| Title: | Import As |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Thomas Wouters <thomas at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 15-Aug-2000 |
| Python-Version: | 2.0 |
| Post-History: |
Introduction
This PEP describes the `import as' proposal for Python 2.0. This
PEP tracks the status and ownership of this feature. It contains
a description of the feature and outlines changes necessary to
support the feature. The CVS revision history of this file
contains the definitive historical record.
Rationale
This PEP proposes an extension of Python syntax regarding the
`import' and `from <module> import' statements. These statements
load in a module, and either bind that module to a local name, or
bind objects from that module to local names. However, it is
sometimes desirable to bind those objects to a different name, for
instance to avoid name clashes. This can currently be achieved
using the following idiom:
import os
real_os = os
del os
And similarly for the `from ... import' statement:
from os import fdopen, exit, stat
os_fdopen = fdopen
os_stat = stat
del fdopen, stat
The proposed syntax change would add an optional `as' clause to
both these statements, as follows:
import os as real_os
from os import fdopen as os_fdopen, exit, stat as os_stat
The `as' name is not intended to be a keyword, and some trickery
has to be used to convince the CPython parser it isn't one. For
more advanced parsers/tokenizers, however, this should not be a
problem.
A slightly special case exists for importing sub-modules. The
statement
import os.path
stores the module `os' locally as `os', so that the imported
submodule `path' is accessible as `os.path'. As a result,
import os.path as p
stores `os.path', not `os', in `p'. This makes it effectively the
same as
from os import path as p
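The distinction is observable in any modern Python:

```python
import os.path as p       # binds the submodule os.path, not the package os
from os import path as q  # the equivalent form noted above
import os                 # the plain form binds the package name

assert p is os.path
assert q is os.path
```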
Implementation details
This PEP has been accepted, and the suggested code change has been
checked in. The patch can still be found in the SourceForge patch
manager[1]. Currently, a NAME field is used in the grammar rather
than a bare string, to avoid the keyword issue. It introduces a
new bytecode, IMPORT_STAR, which performs the `from module import
*' behaviour, and changes the behaviour of the IMPORT_FROM
bytecode so that it loads the requested name (which is always a
single name) onto the stack, to be subsequently stored by a STORE
opcode. As a result, all names explicitly imported now follow the
`global' directives.
The special case of `from module import *' remains a special case,
in that it cannot accommodate an `as' clause, and that no STORE
opcodes are generated; the objects imported are loaded directly
into the local namespace. This also means that names imported in
this fashion are always local, and do not follow the `global'
directive.
An additional change to this syntax has also been suggested, to
generalize the expression given after the `as' clause. Rather
than a single name, it could be allowed to be any expression that
yields a valid l-value; anything that can be assigned to. The
change to accommodate this is minimal, as the patch[2] proves, and
the resulting generalization allows a number of new constructs
that run completely parallel with other Python assignment
constructs. However, this idea has been rejected by Guido, as
`hypergeneralization'.
Copyright
This document has been placed in the Public Domain.
References
[1] http://sourceforge.net/patch/?func=detailpatch&patch_id=101135&group_id=5470
[2] http://sourceforge.net/patch/?func=detailpatch&patch_id=101234&group_id=5470
pep-0222 Web Library Enhancements
| PEP: | 222 |
|---|---|
| Title: | Web Library Enhancements |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | A.M. Kuchling <amk at amk.ca> |
| Status: | Deferred |
| Type: | Standards Track |
| Created: | 18-Aug-2000 |
| Python-Version: | 2.1 |
| Post-History: | 22-Dec-2000 |
Abstract
This PEP proposes a set of enhancements to the CGI development
facilities in the Python standard library. Enhancements might be
new features, new modules for tasks such as cookie support, or
removal of obsolete code.
The original intent was to make improvements to Python 2.1.
However, there seemed little interest from the Python community,
and time was lacking, so this PEP has been deferred to some future
Python release.
Open Issues
This section lists changes that have been suggested, but about
which no firm decision has yet been made. In the final version of
this PEP, this section should be empty, as all the changes should
be classified as accepted or rejected.
cgi.py: We should not be told to create our own subclass just so
we can handle file uploads. As a practical matter, I have yet to
find the time to do this right, so I end up reading cgi.py's temp
file into, at best, another file. Some of our legacy code actually
reads it into a second temp file, then into a final destination!
And even if we did, that would mean creating yet another object
with its __init__ call and associated overhead.
cgi.py: Currently, query data with no `=' are ignored. Even if
keep_blank_values is set, queries like `...?value=&...' are
returned with blank values but queries like `...?value&...' are
completely lost. It would be great if such data were made
available through the FieldStorage interface, either as entries
with None as values, or in a separate list.
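The later urllib.parse module, which absorbed these cgi parsing helpers, lets the behavior described above be observed directly (a modern sketch, not part of the original proposal):

```python
from urllib.parse import parse_qs

# Blank values ("value=") are dropped by default...
print(parse_qs("value=&x=1"))                          # {'x': ['1']}
# ...but kept when keep_blank_values is set:
print(parse_qs("value=&x=1", keep_blank_values=True))  # {'value': [''], 'x': ['1']}
# Bare names with no '=' at all are likewise kept only with the flag:
print(parse_qs("value&x=1", keep_blank_values=True))   # {'value': [''], 'x': ['1']}
```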
Utility function: build a query string from a list of 2-tuples
Dictionary-related utility classes: NoKeyErrors (returns an empty
string, never a KeyError), PartialStringSubstitution (returns
the original key string, never a KeyError)
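Both utilities now have close stdlib analogues: urllib.parse.urlencode builds a query string from 2-tuples, and dict subclasses with __missing__ cover the two dictionary classes. The class names below come from the PEP; the implementations are an assumed sketch:

```python
from urllib.parse import urlencode

# Query string from a list of 2-tuples:
print(urlencode([("name", "amk"), ("lang", "python")]))  # name=amk&lang=python

class NoKeyErrors(dict):
    """Missing keys yield an empty string instead of raising KeyError."""
    def __missing__(self, key):
        return ""

class PartialStringSubstitution(dict):
    """Missing keys yield the original key string instead of KeyError."""
    def __missing__(self, key):
        return key

print(NoKeyErrors()["absent"])              # (empty string)
print("%(host)s" % PartialStringSubstitution())  # host
```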
New Modules
This section lists details about entire new packages or modules
that should be added to the Python standard library.
* fcgi.py : A new module adding support for the FastCGI protocol.
Robin Dunn's code needs to be ported to Windows, though.
Major Changes to Existing Modules
This section lists details of major changes to existing modules,
whether in implementation or in interface. The changes in this
section therefore carry greater degrees of risk, either in
introducing bugs or a backward incompatibility.
The cgi.py module would be deprecated. (XXX A new module or
package name hasn't been chosen yet: 'web'? 'cgilib'?)
Minor Changes to Existing Modules
This section lists details of minor changes to existing modules.
These changes should have relatively small implementations, and
have little risk of introducing incompatibilities with previous
versions.
Rejected Changes
The changes listed in this section were proposed for Python 2.1,
but were rejected as unsuitable. For each rejected change, a
rationale is given describing why the change was deemed
inappropriate.
* An HTML generation module is not part of this PEP. Several such
modules exist, ranging from HTMLgen's purely programming
interface to ASP-inspired simple templating to DTML's complex
templating. There's no indication of which templating module to
enshrine in the standard library, and that probably means that
no module should be so chosen.
* cgi.py: Allowing a combination of query data and POST data.
This doesn't seem to be standard at all, and therefore is
dubious practice.
Proposed Interface
XXX open issues: naming convention (studlycaps or
underline-separated?); need to look at the cgi.parse*() functions
and see if they can be simplified, too.
Parsing functions: carry over most of the parse* functions from
cgi.py
import os, sys   # needed by the default arguments of Request and wrapper() below
# The Response class borrows most of its methods from Zope's
# HTTPResponse class.
class Response:
"""
Attributes:
status: HTTP status code to return
headers: dictionary of response headers
body: string containing the body of the HTTP response
"""
def __init__(self, status=200, headers={}, body=""):
pass
def setStatus(self, status, reason=None):
"Set the numeric HTTP response code"
pass
def setHeader(self, name, value):
"Set an HTTP header"
pass
def setBody(self, body):
"Set the body of the response"
pass
def setCookie(self, name, value,
path = '/',
comment = None,
domain = None,
max_age = None,   # spelled max_age: "max-age" is not a valid Python identifier
expires = None,
secure = 0
):
"Set a cookie"
pass
def expireCookie(self, name):
"Remove a cookie from the user"
pass
def redirect(self, url):
"Redirect the browser to another URL"
pass
def __str__(self):
"Convert entire response to a string"
pass
def dump(self):
"Return a string representation useful for debugging"
pass
# XXX methods for specific classes of error:serverError,
# badRequest, etc.?
class Request:
"""
Attributes:
XXX should these be dictionaries, or dictionary-like objects?
.headers : dictionary containing HTTP headers
.cookies : dictionary of cookies
.fields : data from the form
.env : environment dictionary
"""
def __init__(self, environ=os.environ, stdin=sys.stdin,
keep_blank_values=1, strict_parsing=0):
"""Initialize the request object, using the provided environment
and standard input."""
pass
# Should people just use the dictionaries directly?
def getHeader(self, name, default=None):
pass
def getCookie(self, name, default=None):
pass
def getField(self, name, default=None):
"Return field's value as a string (even if it's an uploaded file)"
pass
def getUploadedFile(self, name):
"""Returns a file object that can be read to obtain the contents
of an uploaded file. XXX should this report an error if the
field isn't actually an uploaded file? Or should it wrap
a StringIO around simple fields for consistency?
"""
def getURL(self, n=0, query_string=0):
"""Return the URL of the current request, chopping off 'n' path
components from the right. Eg. if the URL is
"http://foo.com/bar/baz/quux", n=2 would return
"http://foo.com/bar". Does not include the query string (if
any)
"""
def getBaseURL(self, n=0):
"""Return the base URL of the current request, adding 'n' path
components to the end to recreate more of the whole URL.
Eg. if the request URL is
"http://foo.com/q/bar/baz/qux", n=0 would return
"http://foo.com/", and n=2 "http://foo.com/q/bar".
Returned URL does not include the query string, if any.
"""
def dump(self):
"String representation suitable for debugging output"
pass
# Possibilities? I don't know if these are worth doing in the
# basic objects.
def getBrowser(self):
"Returns Mozilla/IE/Lynx/Opera/whatever"
def isSecure(self):
"Return true if this is an SSLified request"
# Module-level function
def wrapper(func, logfile=sys.stderr):
"""
Calls the function 'func', passing it the arguments
(request, response, logfile). Exceptions are trapped and
sent to the file 'logfile'.
"""
# This wrapper will detect if it's being called from the command-line,
# and if so, it will run in a debugging mode; name=value pairs
# can be entered on standard input to set field values.
# (XXX how to do file uploads in this syntax?)
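The interface above was never implemented; as a rough feel for how it might behave, here is a minimal working sketch of the Response side only (method names from the PEP, the CGI-style serialization is an assumption):

```python
class Response:
    """Illustrative stub of the proposed Response object."""
    def __init__(self, status=200, headers=None, body=""):
        self.status = status
        self.headers = dict(headers or {})
        self.body = body
    def setHeader(self, name, value):
        self.headers[name] = value
    def setBody(self, body):
        self.body = body
    def __str__(self):
        # Serialize as a CGI-style response: status, headers, blank line, body.
        lines = ["Status: %d" % self.status]
        lines += ["%s: %s" % item for item in self.headers.items()]
        return "\r\n".join(lines) + "\r\n\r\n" + self.body

r = Response()
r.setHeader("Content-Type", "text/plain")
r.setBody("Hello, world!")
print(str(r))
```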
Copyright
This document has been placed in the public domain.
pep-0223 Change the Meaning of \x Escapes
| PEP: | 223 |
|---|---|
| Title: | Change the Meaning of \x Escapes |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Tim Peters <tim at zope.com> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 20-Aug-2000 |
| Python-Version: | 2.0 |
| Post-History: | 23-Aug-2000 |
Abstract
Change \x escapes, in both 8-bit and Unicode strings, to consume
exactly the two hex digits following. The proposal views this as
correcting an original design flaw, leading to clearer expression
in all flavors of string, a cleaner Unicode story, better
compatibility with Perl regular expressions, and with minimal risk
to existing code.
Syntax
The syntax of \x escapes, in all flavors of non-raw strings, becomes
\xhh
where h is a hex digit (0-9, a-f, A-F). The exact syntax in 1.5.2 is
not clearly specified in the Reference Manual; it says
\xhh...
implying "two or more" hex digits, but one-digit forms are also
accepted by the 1.5.2 compiler, and a plain \x is "expanded" to
itself (i.e., a backslash followed by the letter x). It's unclear
whether the Reference Manual intended either of the 1-digit or
0-digit behaviors.
Semantics
In an 8-bit non-raw string,
\xij
expands to the character
chr(int(ij, 16))
Note that this is the same as in 1.6 and before.
In a Unicode string,
\xij
acts the same as
\u00ij
i.e. it expands to the obvious Latin-1 character from the initial
segment of the Unicode space.
An \x not followed by at least two hex digits is a compile-time error,
specifically ValueError in 8-bit strings, and UnicodeError (a subclass
of ValueError) in Unicode strings. Note that if an \x is followed by
more than two hex digits, only the first two are "consumed". In 1.6
and before all but the *last* two were silently ignored.
Example
In 1.5.2:
>>> "\x123465" # same as "\x65"
'e'
>>> "\x65"
'e'
>>> "\x1"
'\001'
>>> "\x\x"
'\\x\\x'
>>>
In 2.0:
>>> "\x123465" # \x12 -> \022, "3456" left alone
'\0223456'
>>> "\x65"
'e'
>>> "\x1"
[ValueError is raised]
>>> "\x\x"
[ValueError is raised]
>>>
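The pronounced behavior carried forward into Python 3, except that the truncated-escape error now surfaces as a SyntaxError at compile time rather than a bare ValueError (a modern check, not part of the original text):

```python
# \x consumes exactly the two hex digits following:
assert "\x123465" == "\x12" + "3465"
assert "\x65" == "e"
# In Unicode strings, \xij is the Latin-1 character \u00ij:
assert "\xe9" == "\u00e9"
# Fewer than two digits is rejected when the literal is compiled:
try:
    eval(r'"\x1"')
except SyntaxError:
    print("truncated \\x escape rejected")
```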
History and Rationale
\x escapes were introduced in C as a way to specify variable-width
character encodings. Exactly which encodings those were, and how many
hex digits they required, was left up to each implementation. The
language simply stated that \x "consumed" *all* hex digits following,
and left the meaning up to each implementation. So, in effect, \x in C
is a standard hook to supply platform-defined behavior.
Because Python explicitly aims at platform independence, the \x escape
in Python (up to and including 1.6) has been treated the same way
across all platforms: all *except* the last two hex digits were
silently ignored. So the only actual use for \x escapes in Python was
to specify a single byte using hex notation.
Larry Wall appears to have realized that this was the only real use for
\x escapes in a platform-independent language, as the proposed rule for
Python 2.0 is in fact what Perl has done from the start (although you
need to run in Perl -w mode to get warned about \x escapes with fewer
than 2 hex digits following -- it's clearly more Pythonic to insist on
2 all the time).
When Unicode strings were introduced to Python, \x was generalized so
as to ignore all but the last *four* hex digits in Unicode strings.
This caused a technical difficulty for the new regular expression engine:
SRE tries very hard to allow mixing 8-bit and Unicode patterns and
strings in intuitive ways, and it no longer had any way to guess what,
for example, r"\x123456" should mean as a pattern: is it asking to match
the 8-bit character \x56 or the Unicode character \u3456?
There are hacky ways to guess, but it doesn't end there. The ISO C99
standard also introduces 8-digit \U12345678 escapes to cover the entire
ISO 10646 character space, and it's also desired that Python 2 support
that from the start. But then what are \x escapes supposed to mean?
Do they ignore all but the last *eight* hex digits then? And if less
than 8 following in a Unicode string, all but the last 4? And if less
than 4, all but the last 2?
This was getting messier by the minute, and the proposal cuts the
Gordian knot by making \x simpler instead of more complicated. Note
that the 4-digit generalization to \xijkl in Unicode strings was also
redundant, because it meant exactly the same thing as \uijkl in Unicode
strings. It's more Pythonic to have just one obvious way to specify a
Unicode character via hex notation.
Development and Discussion
The proposal was worked out among Guido van Rossum, Fredrik Lundh and
Tim Peters in email. It was subsequently explained and discussed on
Python-Dev under subject "Go \x yourself", starting 2000-08-03.
Response was overwhelmingly positive; no objections were raised.
Backward Compatibility
Changing the meaning of \x escapes does carry risk of breaking existing
code, although no instances of incompatibility have yet been discovered.
The risk is believed to be minimal.
Tim Peters verified that, except for pieces of the standard test suite
deliberately provoking end cases, there are no instances of \xabcdef...
with fewer or more than 2 hex digits following, in either the Python
CVS development tree, or in assorted Python packages sitting on his
machine.
It's unlikely there are any with fewer than 2, because the Reference
Manual implied they weren't legal (although this is debatable!). If
there are any with more than 2, Guido is ready to argue they were buggy
anyway <0.9 wink>.
Guido reported that the O'Reilly Python books *already* document that
Python works the proposed way, likely due to their Perl editing
heritage (as above, Perl worked (very close to) the proposed way from
its start).
Finn Bock reported that what JPython does with \x escapes is
unpredictable today. This proposal gives a clear meaning that can be
consistently and easily implemented across all Python implementations.
Effects on Other Tools
Believed to be none. The candidates for breakage would mostly be
parsing tools, but the author knows of none that worry about the
internal structure of Python strings beyond the approximation "when
there's a backslash, swallow the next character". Tim Peters checked
python-mode.el, the std tokenize.py and pyclbr.py, and the IDLE syntax
coloring subsystem, and believes there's no need to change any of
them. Tools like tabnanny.py and checkappend.py inherit their immunity
from tokenize.py.
Reference Implementation
The code changes are so simple that a separate patch will not be produced.
Fredrik Lundh is writing the code, is an expert in the area, and will
simply check the changes in before 2.0b1 is released.
BDFL Pronouncements
Yes, ValueError, not SyntaxError. "Problems with literal interpretations
traditionally raise 'runtime' exceptions rather than syntax errors."
Copyright
This document has been placed in the public domain.
pep-0224 Attribute Docstrings
| PEP: | 224 |
|---|---|
| Title: | Attribute Docstrings |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Marc-AndrĂŠ Lemburg <mal at lemburg.com> |
| Status: | Rejected |
| Type: | Standards Track |
| Created: | 23-Aug-2000 |
| Python-Version: | 2.1 |
| Post-History: |
Introduction
This PEP describes the "attribute docstring" proposal for Python
2.0. This PEP tracks the status and ownership of this feature.
It contains a description of the feature and outlines changes
necessary to support the feature. The CVS revision history of
this file contains the definitive historical record.
Rationale
This PEP proposes a small addition to the way Python currently
handles docstrings embedded in Python code.
Python currently only handles the case of docstrings which appear
directly after a class definition, a function definition or as
first string literal in a module. The string literals are added
to the objects in question under the __doc__ attribute and are
from then on available for introspection tools which can extract
the contained information for help, debugging and documentation
purposes.
Docstrings appearing in locations other than the ones mentioned
are simply ignored and don't result in any code generation.
Here is an example:
class C:
"class C doc-string"
a = 1
"attribute C.a doc-string (1)"
b = 2
"attribute C.b doc-string (2)"
The docstrings (1) and (2) are currently being ignored by the
Python byte code compiler, but could obviously be put to good use
for documenting the named assignments that precede them.
This PEP proposes to also make use of these cases by proposing
semantics for adding their content to the objects in which they
appear under new generated attribute names.
The original idea behind this approach which also inspired the
above example was to enable inline documentation of class
attributes, which can currently only be documented in the class's
docstring or using comments which are not available for
introspection.
Implementation
Docstrings are handled by the byte code compiler as expressions.
The current implementation special cases the few locations
mentioned above to make use of these expressions, but otherwise
ignores the strings completely.
To enable use of these docstrings for documenting named
assignments (which is the natural way of defining e.g. class
attributes), the compiler will have to keep track of the last
assigned name and then use this name to assign the content of the
docstring to an attribute of the containing object by means of
storing it in as a constant which is then added to the object's
namespace during object construction time.
In order to preserve features like inheritance and hiding of
Python's special attributes (ones with leading and trailing double
underscores), a special name mangling has to be applied which
uniquely identifies the docstring as belonging to the name
assignment and allows finding the docstring later on by inspecting
the namespace.
The following name mangling scheme achieves all of the above:
__doc_<attributename>__
To keep track of the last assigned name, the byte code compiler
stores this name in a variable of the compiling structure. This
variable defaults to NULL. When it sees a docstring, it then
checks the variable and uses the name as basis for the above name
mangling to produce an implicit assignment of the docstring to the
mangled name. It then resets the variable to NULL to avoid
duplicate assignments.
If the variable does not point to a name (i.e. is NULL), no
assignments are made. These will continue to be ignored like
before. All classical docstrings fall under this case, so no
duplicate assignments are done.
In the above example this would result in the following new class
attributes to be created:
C.__doc_a__ == "attribute C.a doc-string (1)"
C.__doc_b__ == "attribute C.b doc-string (2)"
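Although the compiler support was rejected, the mangled names themselves are legal attributes, so the PEP's example can be written out by hand in any Python version. This shows only the manual convention, not the proposed automatic behavior:

```python
class C:
    "class C doc-string"
    a = 1
    __doc_a__ = "attribute C.a doc-string (1)"
    b = 2
    __doc_b__ = "attribute C.b doc-string (2)"

# Two leading *and* two trailing underscores, so private-name
# mangling does not apply and the attributes are plainly accessible:
print(C.__doc_a__)  # attribute C.a doc-string (1)
print(C.__doc_b__)  # attribute C.b doc-string (2)
```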
A patch to the current CVS version of Python 2.0 which implements
the above is available on SourceForge at [1].
Caveats of the Implementation
Since the implementation does not reset the compiling structure
variable when processing a non-expression, e.g. a function
definition, the last assigned name remains active until either the
next assignment or the next occurrence of a docstring.
This can lead to cases where the docstring and assignment may be
separated by other expressions:
class C:
"C doc string"
b = 2
def x(self):
"C.x doc string"
y = 3
return 1
"b's doc string"
Since the definition of method "x" currently does not reset the
used assignment name variable, it is still valid when the compiler
reaches the docstring "b's doc string" and thus assigns the string
to __doc_b__.
A possible solution to this problem would be resetting the name
variable for all non-expression nodes in the compiler.
Possible Problems
Even though highly unlikely, attribute docstrings could get
accidentally concatenated to the attribute's value:
class C:
x = "text" \
"x's docstring"
The trailing slash would cause the Python compiler to concatenate
the attribute value and the docstring.
A modern syntax highlighting editor would easily make this
accident visible, though, and by simply inserting empty lines
between the attribute definition and the docstring you can avoid
the possible concatenation completely, so the problem is
negligible.
Another possible problem is that of using triple quoted strings as
a way to uncomment parts of your code.
If there happens to be an assignment just before the start of the
comment string, then the compiler will treat the comment as
docstring attribute and apply the above logic to it.
Besides generating a docstring for an otherwise undocumented
attribute there is no breakage.
Comments from our BDFL
Early comments on the PEP from Guido:
I "kinda" like the idea of having attribute docstrings (meaning
it's not of great importance to me) but there are two things I
don't like in your current proposal:
1. The syntax you propose is too ambiguous: as you say,
stand-alone string literal are used for other purposes and could
suddenly become attribute docstrings.
2. I don't like the access method either (__doc_<attrname>__).
The author's reply:
> 1. The syntax you propose is too ambiguous: as you say, stand-alone
> string literal are used for other purposes and could suddenly
> become attribute docstrings.
This can be fixed by introducing some extra checks in the
compiler to reset the "doc attribute" flag in the compiler
struct.
> 2. I don't like the access method either (__doc_<attrname>__).
Any other name will do. It will only have to match these
criteria:
* must start with two underscores (to match __doc__)
* must be extractable using some form of inspection (e.g. by using
a naming convention which includes some fixed name part)
* must be compatible with class inheritance (i.e. should be
stored as attribute)
Guido pronounced on this PEP in March 2001 (on python-dev). Here
are the reasons for rejection he gave in private mail to the
author of this PEP:
...
It might be useful, but I really hate the proposed syntax.
a = 1
"foo bar"
b = 1
I really have no way to know whether "foo bar" is a docstring
for a or for b.
...
You can use this convention:
a = 1
__doc_a__ = "doc string for a"
This makes it available at runtime.
> Are you completely opposed to adding attribute documentation
> to Python or is it just the way the implementation works ? I
> find the syntax proposed in the PEP very intuitive and many
> other users on c.l.p and in private emails have supported it
> at the time I wrote the PEP.
It's not the implementation, it's the syntax. It doesn't
convey a clear enough coupling between the variable and the
doc string.
Copyright
This document has been placed in the Public Domain.
References
[1] http://sourceforge.net/patch/?func=detailpatch&patch_id=101264&group_id=5470
pep-0225 Elementwise/Objectwise Operators
| PEP: | 225 |
|---|---|
| Title: | Elementwise/Objectwise Operators |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Huaiyu Zhu <hzhu at users.sourceforge.net>, Gregory Lielens <gregory.lielens at fft.be> |
| Status: | Deferred |
| Type: | Standards Track |
| Created: | 19-Sep-2000 |
| Python-Version: | 2.1 |
| Post-History: |
Introduction
This PEP describes a proposal to add new operators to Python which
are useful for distinguishing elementwise and objectwise
operations, and summarizes discussions in the news group
comp.lang.python on this topic. See Credits and Archives section
at end. Issues discussed here include:
- Background.
- Description of proposed operators and implementation issues.
- Analysis of alternatives to new operators.
- Analysis of alternative forms.
- Compatibility issues
- Description of wider extensions and other related ideas.
A substantial portion of this PEP describes ideas that do not go
into the proposed extension. They are presented because the
extension is essentially syntactic sugar, so its adoption must be
weighed against various possible alternatives. While many
alternatives may be better in some aspects, the current proposal
appears to be overall advantageous.
The issues concerning elementwise-objectwise operations extend to
wider areas than numerical computation. This document also
describes how the current proposal may be integrated with more
general future extensions.
Background
Python provides six binary infix math operators: + - * / % **
hereafter generically represented by "op". They can be overloaded
with new semantics for user-defined classes. However, for objects
composed of homogeneous elements, such as arrays, vectors and
matrices in numerical computation, there are two essentially
distinct flavors of semantics. The objectwise operations treat
these objects as points in multidimensional spaces. The
elementwise operations treat them as collections of individual
elements. These two flavors of operations are often intermixed in
the same formulas, thereby requiring syntactical distinction.
Many numerical computation languages provide two sets of math
operators. For example, in MatLab, the ordinary op is used for
objectwise operation while .op is used for elementwise operation.
In R, op stands for elementwise operation while %op% stands for
objectwise operation.
In Python, there are other methods of representation, some of
which are already used by available numerical packages, such as
- function: mul(a,b)
- method: a.mul(b)
- casting: a.E*b
In several aspects these are not as adequate as infix operators.
More details will be shown later, but the key points are
- Readability: Even for moderately complicated formulas, infix
operators are much cleaner than alternatives.
- Familiarity: Users are familiar with ordinary math operators.
- Implementation: New infix operators will not unduly clutter
Python syntax. They will greatly ease the implementation of
numerical packages.
While it is possible to assign the current math operators to one
flavor of semantics, there are simply not enough infix operators
to overload for the other flavor. It is also impossible to maintain
visual symmetry between these two flavors if one of them does not
contain symbols for ordinary math operators.
Proposed extension
- Six new binary infix operators ~+ ~- ~* ~/ ~% ~** are added to
core Python. They parallel the existing operators + - * / % **.
- Six augmented assignment operators ~+= ~-= ~*= ~/= ~%= ~**= are
added to core Python. They parallel the operators += -= *= /=
%= **= available in Python 2.0.
- Operator ~op retains the syntactical properties of operator op,
including precedence.
- Operator ~op retains the semantical properties of operator op on
built-in number types.
- Operator ~op raises a syntax error on non-number built-in types.
This is temporary until the proper behavior can be agreed upon.
- These operators are overloadable in classes with names that
prepend "t" (for tilde) to names of ordinary math operators.
For example, __tadd__ and __rtadd__ work for ~+ just as __add__
and __radd__ work for +.
- As with existing operators, the __r*__() methods are invoked when
the left operand does not provide the appropriate method.
It is intended that one set of op or ~op is used for elementwise
operations, the other for objectwise operations, but it is not
specified which version of operators stands for elementwise or
objectwise operations, leaving the decision to applications.
The proposed implementation is to patch several files relating to
the tokenizer, parser, grammar and compiler to duplicate the
functionality of corresponding existing operators as necessary.
All new semantics are to be implemented in the classes that
overload them.
The symbol ~ is already used in Python as the unary "bitwise not"
operator. Currently it is not allowed for binary operators. The
new operators are completely backward compatible.
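Since ~op never entered the language, it cannot be demonstrated directly. The two flavors can, however, be sketched with existing operators, using * for the elementwise product and the much later @ of PEP 465 for the objectwise (dot) product; this is an illustration only, not part of this proposal:

```python
class Vec:
    """Toy vector type distinguishing the two operator flavors."""
    def __init__(self, *xs):
        self.xs = list(xs)
    def __mul__(self, other):
        # elementwise product
        return Vec(*[a * b for a, b in zip(self.xs, other.xs)])
    def __matmul__(self, other):
        # objectwise (dot) product, via the later @ operator
        return sum(a * b for a, b in zip(self.xs, other.xs))

v, w = Vec(1, 2, 3), Vec(4, 5, 6)
print((v * w).xs)  # [4, 10, 18]
print(v @ w)       # 32
```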
Prototype Implementation
Greg Lielens implemented the infix ~op as a patch against Python
2.0b1 source[1].
To allow ~ to be part of binary operators, the tokenizer would
treat ~+ as one token. This means that currently valid expression
~+1 would be tokenized as ~+ 1 instead of ~ + 1. The parser would
then treat ~+ as a composite of ~ and +. The effect is invisible to
applications.
Notes about current patch:
- It does not include ~op= operators yet.
- The ~op behaves the same as op on lists, instead of raising
exceptions.
These should be fixed when the final version of this proposal is
ready.
- It reserves xor as an infix operator with the semantics
equivalent to:
def __xor__(a, b):
if not b: return a
elif not a: return b
else: return 0
This preserves a true value as much as possible; otherwise it
preserves the left-hand-side value if possible.
This is done so that bitwise operators could be regarded as
elementwise logical operators in the future (see below).
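The reserved semantics can be exercised as a plain function (an illustrative sketch of the definition above):

```python
def xor(a, b):
    if not b:
        return a          # false right operand: left value preserved
    elif not a:
        return b          # false left operand: right (true) value preserved
    else:
        return 0          # both true: result is false

assert xor(3, 0) == 3
assert xor(0, 5) == 5
assert xor(3, 5) == 0
assert xor(0, 0) == 0
```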
Alternatives to adding new operators
The discussions on comp.lang.python and python-dev mailing list
explored many alternatives. Some of the leading alternatives are
listed here, using the multiplication operator as an example.
1. Use function mul(a,b).
Advantage:
- No need for new operators.
Disadvantage:
- Prefix forms are cumbersome for composite formulas.
- Unfamiliar to the intended users.
- Too verbose for the intended users.
- Unable to use natural precedence rules.
2. Use method call a.mul(b)
Advantage:
- No need for new operators.
Disadvantage:
- Asymmetric for both operands.
- Unfamiliar to the intended users.
- Too verbose for the intended users.
- Unable to use natural precedence rules.
3. Use "shadow classes". For matrix class define a shadow array
class accessible through a method .E, so that for matrices a
and b, a.E*b would be a matrix object that is
elementwise_mul(a,b).
Likewise define a shadow matrix class for arrays accessible
through a method .M so that for arrays a and b, a.M*b would be
an array that is matrixwise_mul(a,b).
Advantage:
- No need for new operators.
- Benefits of infix operators with correct precedence rules.
- Clean formulas in applications.
Disadvantage:
- Hard to maintain in current Python because ordinary numbers
cannot have user defined class methods; i.e. a.E*b will fail
if a is a pure number.
- Difficult to implement, as this will interfere with existing
method calls, like .T for transpose, etc.
- Runtime overhead of object creation and method lookup.
- The shadowing class cannot replace a true class, because it
does not return its own type. So there needs to be an M class
with a shadow E class, and an E class with a shadow M class.
- Unnatural to mathematicians.
4. Implement matrixwise and elementwise classes with easy casting
to the other class. So matrixwise operations for arrays would
be like a.M*b.M and elementwise operations for matrices would
be like a.E*b.E. For error detection a.E*b.M would raise
exceptions.
Advantage:
- No need for new operators.
- Similar to infix notation with correct precedence rules.
Disadvantage:
- Similar difficulty due to lack of user-methods for pure numbers.
- Runtime overhead of object creation and method lookup.
- More cluttered formulas
- Switching of flavor of objects to facilitate operators
becomes persistent. This introduces long range context
dependencies in application code that would be extremely hard
to maintain.
5. Using mini parser to parse formulas written in arbitrary
extension placed in quoted strings.
Advantage:
- Pure Python, without new operators
Disadvantage:
- The actual syntax is within the quoted string, which does not
resolve the problem itself.
- Introducing zones of special syntax.
- Demanding on the mini-parser.
6. Introducing a single operator, such as @, for matrix
multiplication.
Advantage:
- Introduces less operators
Disadvantage:
- The distinctions for operators like + - ** are equally
important. Their meaning in matrix or array-oriented
packages would be reversed (see below).
- The new operator occupies a special character.
- This does not work well with more general object-element issues.
Among these alternatives, the first and second are used in current
applications to some extent, but found inadequate. The third is
the most favored for applications, but it would incur huge
implementation complexity. The fourth would make application
code very context-sensitive and hard to maintain. These two
alternatives also share significant implementational difficulties
due to current type/class split. The fifth appears to create more
problems than it would solve. The sixth does not cover the same
range of applications.
Alternative forms of infix operators
Two major forms and several minor variants of new infix operators
were discussed:
- Bracketed form
(op)
[op]
{op}
<op>
:op:
~op~
%op%
- Meta character form
.op
@op
~op
Alternatively the meta character is put after the operator.
- Less consistent variations of these themes. These are
considered unfavorably. For completeness some are listed here
- Use @/ and /@ for left and right division
- Use [*] and (*) for outer and inner products
- Use a single operator @ for multiplication.
- Use __call__ to simulate multiplication.
a(b) or (a)(b)
Criteria for choosing among the representations include:
- No syntactical ambiguities with existing operators.
- Higher readability in actual formulas. This makes the
bracketed forms unfavorable. See examples below.
- Visually similar to existing math operators.
- Syntactically simple, without blocking possible future
extensions.
With these criteria the overall winner in bracket form appears to
be {op}. A clear winner in the meta character form is ~op.
Comparing these it appears that ~op is the favorite among them
all.
Some analysis follows:
- The .op form is ambiguous: 1.+a would be different from 1 .+a
- The bracket type operators are most favorable when standing
alone, but not in formulas, as they interfere with visual
parsing of parenthesis for precedence and function argument.
This is so for (op) and [op], and somewhat less so for {op}
and <op>.
- The <op> form has the potential to be confused with < > and =
- The @op form is not favored because @ is visually heavy (dense,
more like a letter): a@+b is more readily read as a@ + b
than as a @+ b.
- For choosing meta-characters: most of the existing ASCII symbols
have already been used. The only three unused are @, $ and ?.
Semantics of new operators
There are convincing arguments for using either set of operators
as objectwise or elementwise. Some of them are listed here:
1. op for element, ~op for object
- Consistent with current multiarray interface of Numeric package
- Consistent with some other languages
- Perception that elementwise operations are more natural
- Perception that elementwise operations are used more frequently
2. op for object, ~op for element
- Consistent with current linear algebra interface of MatPy package
- Consistent with some other languages
- Perception that objectwise operations are more natural
- Perception that objectwise operations are used more frequently
- Consistent with the current behavior of operators on lists
- Allow ~ to be a general elementwise meta-character in future
extensions.
It is generally agreed that
- there is no absolute reason to favor one or the other;
- it is easy to cast from one representation to the other in a
sizable chunk of code, so the other flavor of operators is
always in the minority;
- there are other semantic differences that favor the existence of
array-oriented and matrix-oriented packages, even if their
operators are unified;
- whatever decision is taken, code using existing
interfaces should not be broken for a very long time.
Therefore not much is lost, and much flexibility is retained, if the
semantic flavors of these two sets of operators are not dictated
by the core language. The application packages are responsible
for making the most suitable choice. This is already the case for
NumPy and MatPy, which use opposite semantics. Adding new
operators will not break this. See also the observation after
subsection 2 in the Examples below.
The issue of numerical precision was raised, but if the semantics
are left to the applications, the actual precisions should also go
there.
Examples
Following are examples of the actual formulas that will appear
using various operators or other representations described above.
1. The matrix inversion formula:
- Using op for object and ~op for element:
b = a.I - a.I * u / (c.I + v/a*u) * v / a
b = a.I - a.I * u * (c.I + v*a.I*u).I * v * a.I
- Using op for element and ~op for object:
b = a.I @- a.I @* u @/ (c.I @+ v@/a@*u) @* v @/ a
b = a.I ~- a.I ~* u ~/ (c.I ~+ v~/a~*u) ~* v ~/ a
b = a.I (-) a.I (*) u (/) (c.I (+) v(/)a(*)u) (*) v (/) a
b = a.I [-] a.I [*] u [/] (c.I [+] v[/]a[*]u) [*] v [/] a
b = a.I <-> a.I <*> u </> (c.I <+> v</>a<*>u) <*> v </> a
b = a.I {-} a.I {*} u {/} (c.I {+} v{/}a{*}u) {*} v {/} a
Observation: For linear algebra, using op for object is preferable.
Observation: The ~op type operators look better than the (op) type
in complicated formulas.
- using named operators
b = a.I @sub a.I @mul u @div (c.I @add v @div a @mul u) @mul v @div a
b = a.I ~sub a.I ~mul u ~div (c.I ~add v ~div a ~mul u) ~mul v ~div a
Observation: Named operators are not suitable for math formulas.
2. Plotting a 3d graph
- Using op for object and ~op for element:
z = sin(x~**2 ~+ y~**2); plot(x,y,z)
- Using op for element and ~op for object:
z = sin(x**2 + y**2); plot(x,y,z)
Observation: Elementwise operations with broadcasting allow a
much more efficient implementation than MatLab's.
Observation: It is useful to have two related classes with the
semantics of op and ~op swapped. Using these, the ~op
operators would only need to appear in chunks of code where
the other flavor dominates, while maintaining consistent
semantics of the code.
3. Using + and - with automatic broadcasting
a = b - c; d = a.T*a
Observation: This would silently produce hard-to-trace bugs if
one of b or c is a row vector while the other is a column vector.
Miscellaneous issues:
- Need for the ~+ ~- operators. The objectwise + and - are important
because they provide important sanity checks as per linear
algebra. The elementwise + and - are important because they allow
broadcasting, which is very efficient in applications.
- Left division (solve). For matrices, a*x is not necessarily equal
to x*a. The solution of a*x==b, denoted x=solve(a,b), is
therefore different from the solution of x*a==b, denoted
x=div(b,a). There are discussions about finding a new symbol
for solve. [Background: MatLab uses b/a for div(b,a) and a\b for
solve(a,b).]
It is recognized that Python provides a better solution without
requiring a new symbol: the inverse method .I can be made
delayed, so that a.I*b and b*a.I are equivalent to MatLab's a\b
and b/a. The implementation is quite simple and the resulting
application code is clean.
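The delayed-.I idea can be sketched with today's operator overloading. The names below (Mat2, DelayedInverse, solve_left, solve_right) are invented for illustration and are not part of any actual package; a toy 2x2 matrix solved by Cramer's rule keeps the sketch self-contained.

```python
# Hypothetical sketch of the delayed inverse: a.I returns a lightweight
# proxy, and the actual solve happens only when the proxy is multiplied,
# so a.I*b computes solve(a, b) without ever forming the inverse.

class DelayedInverse:
    def __init__(self, mat):
        self.mat = mat

    def __mul__(self, v):          # a.I * b  ->  solve a*x == b
        return self.mat.solve_left(v)

    def __rmul__(self, v):         # b * a.I  ->  solve x*a == b
        return self.mat.solve_right(v)


class Mat2:
    """A toy 2x2 matrix [[a, b], [c, d]], just enough for the demo."""

    def __init__(self, a, b, c, d):
        self.a, self.b, self.c, self.d = a, b, c, d

    @property
    def I(self):
        return DelayedInverse(self)

    def det(self):
        return self.a * self.d - self.b * self.c

    def solve_left(self, v):
        # Cramer's rule for a*x == v with column vector v = (v0, v1).
        v0, v1 = v
        return ((self.d * v0 - self.b * v1) / self.det(),
                (self.a * v1 - self.c * v0) / self.det())

    def solve_right(self, v):
        # Cramer's rule for x*a == v with row vector v = (v0, v1).
        v0, v1 = v
        return ((self.d * v0 - self.c * v1) / self.det(),
                (self.a * v1 - self.b * v0) / self.det())


a = Mat2(2.0, 0.0, 0.0, 4.0)
print(a.I * (2.0, 8.0))      # solves a*x == (2, 8) -> (1.0, 2.0)
print((2.0, 8.0) * a.I)      # solves x*a == (2, 8) -> (1.0, 2.0)
```

Because the proxy is consumed immediately by the multiplication, the full inverse is never materialized, matching the "simple implementation, clean application code" claim above.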
- Power operator. Python's use of a**b as pow(a,b) has two
perceived disadvantages:
- Most mathematicians are more familiar with a^b for this purpose.
- It results in long augmented assignment operator ~**=.
However, this issue is distinct from the main issue here.
- Additional multiplication operators. Several forms of
multiplication are used in (multi-)linear algebra. Most can be
seen as variations of multiplication in the linear algebra sense
(such as the Kronecker product). But two forms appear to be more
fundamental: the outer product and the inner product. However, their
specification includes indices, which can be either
specification includes indices, which can be either
- associated with the operator, or
- associated with the objects.
The latter (the Einstein notation) is used extensively on paper,
and is also the easier one to implement. By implementing a
tensor-with-indices class, a general form of multiplication
would cover both outer and inner products, and would specialize to
linear algebra multiplication as well. The index rule can be
defined via class methods, like:
a = b.i(1,2,-1,-2) * c.i(4,-2,3,-1) # a_ijkl = b_ijmn c_lnkm
Therefore one objectwise multiplication is sufficient.
- Bitwise operators.
- The proposed new math operators use the symbol ~, which is the
"bitwise not" operator. This poses no compatibility problem
but somewhat complicates the implementation.
- The symbol ^ might be better used for pow than for bitwise xor.
But this depends on the future of bitwise operators. It does
not immediately impact the proposed math operators.
- The symbol | was suggested to be used for matrix solve. But
the new solution of using delayed .I is better in several
ways.
- The current proposal fits in a larger and more general
extension that will remove the need for special bitwise
operators. (See elementization below.)
- Alternative to special operator names used in definition,
def "+"(a, b) in place of def __add__(a, b)
This appears to require a greater syntactical change, and would
only be useful when arbitrary additional operators are allowed.
Impact on general elementization
The distinction between objectwise and elementwise operations is
meaningful in other contexts as well, where an object can be
conceptually regarded as a collection of elements. It is
important that the current proposal does not preclude possible
future extensions.
One general future extension is to use ~ as a meta operator to
"elementize" a given operator. Several examples are listed here:
1. Bitwise operators. Currently Python assigns six operators to
bitwise operations: and (&), or (|), xor (^), complement (~),
left shift (<<) and right shift (>>), with their own precedence
levels.
Among them, the & | ^ ~ operators can be regarded as
elementwise versions of lattice operators applied to integers
regarded as bit strings.
5 and 6 # 6
5 or 6 # 5
5 ~and 6 # 4
5 ~or 6 # 7
These can be regarded as general elementwise lattice operators,
not restricted to bits in integers.
In order to have named operators for xor ~xor, it is necessary
to make xor a reserved word.
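The observation above can be checked directly in ordinary Python: the existing bitwise operators already behave as the "elementwise" flavor on integers viewed as bit strings, while and/or are the objectwise flavor.

```python
# The existing & and | act elementwise over bits; and/or act on the
# integers as whole objects, matching the examples in the text.

print(5 and 6)   # 6 -- objectwise: returns the second operand
print(5 or 6)    # 5 -- objectwise: returns the first truthy operand
print(5 & 6)     # 4 -- elementwise over bits (101 & 110 == 100)
print(5 | 6)     # 7 -- elementwise over bits (101 | 110 == 111)
```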
2. List arithmetics.
[1, 2] + [3, 4] # [1, 2, 3, 4]
[1, 2] ~+ [3, 4] # [4, 6]
['a', 'b'] * 2 # ['a', 'b', 'a', 'b']
'ab' * 2 # 'abab'
['a', 'b'] ~* 2 # ['aa', 'bb']
[1, 2] ~* 2 # [2, 4]
It is also consistent with the Cartesian product:
[1,2]*[3,4] # [(1,3),(1,4),(2,3),(2,4)]
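Since ~+ and ~* are not valid syntax, the elementwise flavor can only be approximated today. The ElemList wrapper below is invented for illustration; it spells the proposed elementwise semantics with plain + and * on a subclass of list.

```python
# Rough approximation (invented for illustration) of the proposed
# elementwise list semantics, using a wrapper class instead of ~op.

class ElemList(list):
    def __add__(self, other):                 # stands in for ~+
        return ElemList(x + y for x, y in zip(self, other))

    def __mul__(self, n):                     # stands in for ~*
        return ElemList(x * n for x in self)

print(ElemList([1, 2]) + ElemList([3, 4]))    # [4, 6]
print(ElemList(['a', 'b']) * 2)               # ['aa', 'bb']
print(ElemList([1, 2]) * 2)                   # [2, 4]
```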
3. List comprehension.
a = [1, 2]; b = [3, 4]
~f(a,b) # [f(x,y) for x, y in zip(a,b)]
~f(a*b) # [f(x,y) for x in a for y in b]
a ~+ b # [x + y for x, y in zip(a,b)]
4. Tuple generation (the zip function in Python 2.0)
[1, 2, 3], [4, 5, 6] # ([1, 2, 3], [4, 5, 6])
[1, 2, 3]~,[4, 5, 6] # [(1, 4), (2, 5), (3, 6)]
5. Using ~ as generic elementwise meta-character to replace map
~f(a, b) # map(f, a, b)
~~f(a, b) # map(lambda *x:map(f, *x), a, b)
More generally,
def ~f(*x): return map(f, *x)
def ~~f(*x): return map(~f, *x)
...
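The ~f meta-operator itself cannot be written in Python, but its intended meaning (lifting a function to act elementwise) can be sketched with a hypothetical higher-order helper:

```python
# Hypothetical helper standing in for the proposed ~f meta-operator:
# elementwise(f) behaves like ~f, and applying it twice behaves like ~~f.

def elementwise(f):
    def lifted(*seqs):
        return list(map(f, *seqs))
    return lifted

def add(x, y):
    return x + y

print(elementwise(add)([1, 2], [3, 4]))                        # [4, 6]
print(elementwise(elementwise(add))([[1], [2]], [[3], [4]]))   # [[4], [6]]
```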
6. Elementwise format operator (with broadcasting)
a = [1,2,3,4,5]
print ["%5d "] ~% a
a = [[1,2],[3,4]]
print ["%5d "] ~~% a
7. Rich comparison
[1, 2, 3] ~< [3, 2, 1] # [1, 0, 0]
[1, 2, 3] ~== [3, 2, 1] # [0, 1, 0]
8. Rich indexing
[a, b, c, d] ~[2, 3, 1] # [c, d, b]
9. Tuple flattening
a = (1,2); b = (3,4)
f(~a, ~b) # f(1,2,3,4)
10. Copy operator
a ~= b # a = b.copy()
There can be specific levels of deep copy
a ~~= b # a = b.copy(2)
Notes:
1. There are probably many other similar situations. This general
approach seems well suited for most of them, in place of
several separated extensions for each of them (parallel and
cross iteration, list comprehension, rich comparison, etc).
2. The semantics of "elementwise" depends on applications. For
example, an element of matrix is two levels down from the
list-of-list point of view. This requires more fundamental
change than the current proposal. In any case, the current
proposal will not negatively impact on future possibilities of
this nature.
Note that this section describes a type of future extensions that
is consistent with current proposal, but may present additional
compatibility or other problems. They are not tied to the current
proposal.
Impact on named operators
The discussions made it generally clear that infix operators are a
scarce resource in Python, not only in numerical computation but
in other fields as well. Several proposals and ideas were put
forward that would allow infix operators to be introduced in ways
similar to named functions. We show here that the current
extension does not negatively impact future extensions in this
regard.
1. Named infix operators.
Choose a meta character, say @, so that for any identifier
"opname", the combination "@opname" would be a binary infix
operator, and
a @opname b == opname(a,b)
Other representations mentioned include .name ~name~ :name:
(.name) %name% and similar variations. The pure bracket based
operators cannot be used this way.
This requires a change in the parser to recognize @opname and
parse it into the same structure as a function call. The
precedence of all these operators would have to be fixed at
one level, so the implementation would differ from that of the
additional math operators, which keep the precedence of
existing math operators.
The currently proposed extension does not limit possible future
extensions of this form in any way.
2. More general symbolic operators.
One additional form of future extension is to use a meta
character together with operator symbols (symbols that cannot be
used in syntactical structures other than operators). Suppose @ is
the meta character. Then
a + b, a @+ b, a @@+ b, a @+- b
would all be operators with a hierarchy of precedence, defined by
def "+"(a, b)
def "@+"(a, b)
def "@@+"(a, b)
def "@+-"(a, b)
One advantage compared with named operators is greater
flexibility for precedences based on either the meta character
or the ordinary operator symbols. This also allows operator
composition. The disadvantage is that they are more like
"line noise". In any case the current proposal does not
impact their future possibility.
These kinds of future extensions may not be necessary when
Unicode becomes generally available.
Note that this section discusses compatibility of the proposed
extension with possible future extensions. The desirability
or compatibility of these other extensions themselves are
specifically not considered here.
Credits and archives
The discussions mostly happened in July and August of 2000 on the
newsgroup comp.lang.python and the mailing list python-dev. There
are altogether several hundred postings, most of which can be
retrieved from these two pages (by searching for the word "operator"):
http://www.python.org/pipermail/python-list/2000-July/
http://www.python.org/pipermail/python-list/2000-August/
The names of the contributors are too numerous to mention here;
suffice it to say that a large proportion of the ideas discussed
here are not our own.
Several key postings (from our point of view) that may help to
navigate the discussions include:
http://www.python.org/pipermail/python-list/2000-July/108893.html
http://www.python.org/pipermail/python-list/2000-July/108777.html
http://www.python.org/pipermail/python-list/2000-July/108848.html
http://www.python.org/pipermail/python-list/2000-July/109237.html
http://www.python.org/pipermail/python-list/2000-July/109250.html
http://www.python.org/pipermail/python-list/2000-July/109310.html
http://www.python.org/pipermail/python-list/2000-July/109448.html
http://www.python.org/pipermail/python-list/2000-July/109491.html
http://www.python.org/pipermail/python-list/2000-July/109537.html
http://www.python.org/pipermail/python-list/2000-July/109607.html
http://www.python.org/pipermail/python-list/2000-July/109709.html
http://www.python.org/pipermail/python-list/2000-July/109804.html
http://www.python.org/pipermail/python-list/2000-July/109857.html
http://www.python.org/pipermail/python-list/2000-July/110061.html
http://www.python.org/pipermail/python-list/2000-July/110208.html
http://www.python.org/pipermail/python-list/2000-August/111427.html
http://www.python.org/pipermail/python-list/2000-August/111558.html
http://www.python.org/pipermail/python-list/2000-August/112551.html
http://www.python.org/pipermail/python-list/2000-August/112606.html
http://www.python.org/pipermail/python-list/2000-August/112758.html
http://www.python.org/pipermail/python-dev/2000-July/013243.html
http://www.python.org/pipermail/python-dev/2000-July/013364.html
http://www.python.org/pipermail/python-dev/2000-August/014940.html
These are earlier drafts of this PEP:
http://www.python.org/pipermail/python-list/2000-August/111785.html
http://www.python.org/pipermail/python-list/2000-August/112529.html
http://www.python.org/pipermail/python-dev/2000-August/014906.html
There is an alternative PEP (officially, PEP 211) by Greg Wilson,
titled "Adding New Linear Algebra Operators to Python".
Its first (and current) version is at:
http://www.python.org/pipermail/python-dev/2000-August/014876.html
http://www.python.org/dev/peps/pep-0211/
Additional References
[1] http://MatPy.sourceforge.net/Misc/index.html
pep-0226 Python 2.1 Release Schedule
| PEP: | 226 |
|---|---|
| Title: | Python 2.1 Release Schedule |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Jeremy Hylton <jeremy at alum.mit.edu> |
| Status: | Final |
| Type: | Informational |
| Created: | 16-Oct-2000 |
| Python-Version: | 2.1 |
| Post-History: |
Abstract
This document describes the post Python 2.0 development and
release schedule. According to this schedule, Python 2.1 will be
released in April of 2001. The schedule primarily concerns
itself with PEP-size items. Small bug fixes and changes will
occur up until the first beta release.
Release Schedule
Tentative future release dates
[bugfix release dates go here]
Past release dates:
17-Apr-2001: 2.1 final release
15-Apr-2001: 2.1 release candidate 2
13-Apr-2001: 2.1 release candidate 1
23-Mar-2001: Python 2.1 beta 2 release
02-Mar-2001: First 2.1 beta release
02-Feb-2001: Python 2.1 alpha 2 release
22-Jan-2001: Python 2.1 alpha 1 release
16-Oct-2000: Python 2.0 final release
Open issues for Python 2.0 beta 2
Add a default unit testing framework to the standard library.
Guidelines for making changes for Python 2.1
The guidelines and schedule will be revised based on discussion in
the python-dev@python.org mailing list.
The PEP system was instituted late in the Python 2.0 development
cycle and many changes did not follow the process described in PEP
1. The development process for 2.1, however, will follow the PEP
process as documented.
The first eight weeks following 2.0 final will be the design and
review phase. By the end of this period, any PEP that is proposed
for 2.1 should be ready for review. This means that the PEP is
written and discussion has occurred on the python-dev@python.org
and python-list@python.org mailing lists.
The next six weeks will be spent reviewing the PEPs and
implementing and testing the accepted PEPs. When this period
stops, we will end consideration of any incomplete PEPs. Near the
end of this period, there will be a feature freeze where any small
features not worthy of a PEP will not be accepted.
Before the final release, we will have six weeks of beta testing
and a release candidate or two.
General guidelines for submitting patches and making changes
Use good sense when committing changes. You should know what we
mean by good sense or we wouldn't have given you commit privileges
<0.5 wink>. Some specific examples of good sense include:
- Do whatever the dictator tells you.
- Discuss any controversial changes on python-dev first. If you
get a lot of +1 votes and no -1 votes, make the change. If you
get some -1 votes, think twice; consider asking Guido what he
thinks.
- If the change is to code you contributed, it probably makes
sense for you to fix it.
- If the change affects code someone else wrote, it probably makes
sense to ask him or her first.
- You can use the SourceForge (SF) Patch Manager to submit a patch
and assign it to someone for review.
Any significant new feature must be described in a PEP and
approved before it is checked in.
Any significant code addition, such as a new module or large
patch, must include test cases for the regression test and
documentation. A patch should not be checked in until the tests
and documentation are ready.
If you fix a bug, you should write a test case that would have
caught the bug.
If you commit a patch from the SF Patch Manager or fix a bug from
the Jitterbug database, be sure to reference the patch/bug number
in the CVS log message. Also be sure to change the status in the
patch manager or bug database (if you have access to the bug
database).
It is not acceptable for any checked in code to cause the
regression test to fail. If a checkin causes a failure, it must
be fixed within 24 hours or it will be backed out.
All contributed C code must be ANSI C. If possible check it with
two different compilers, e.g. gcc and MSVC.
All contributed Python code must follow Guido's Python style
guide. http://www.python.org/doc/essays/styleguide.html
It is understood that any code contributed will be released under
an Open Source license. Do not contribute code if it can't be
released this way.
pep-0227 Statically Nested Scopes
| PEP: | 227 |
|---|---|
| Title: | Statically Nested Scopes |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Jeremy Hylton <jeremy at alum.mit.edu> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 01-Nov-2000 |
| Python-Version: | 2.1 |
| Post-History: |
Abstract
This PEP describes the addition of statically nested scoping
(lexical scoping) for Python 2.2, and as a source-level option
for Python 2.1. In addition, Python 2.1 will issue warnings about
constructs whose meaning may change when this feature is enabled.
The old language definition (2.0 and before) defines exactly three
namespaces that are used to resolve names -- the local, global,
and built-in namespaces. The addition of nested scopes allows
resolution of unbound local names in enclosing functions'
namespaces.
The most visible consequence of this change is that lambdas (and
other nested functions) can reference variables defined in the
surrounding namespace. Currently, lambdas must often use default
arguments to explicitly create bindings in the lambda's
namespace.
Introduction
This proposal changes the rules for resolving free variables in
Python functions. The new name resolution semantics will take
effect with Python 2.2. These semantics will also be available in
Python 2.1 by adding "from __future__ import nested_scopes" to the
top of a module. (See PEP 236.)
The Python 2.0 definition specifies exactly three namespaces to
check for each name -- the local namespace, the global namespace,
and the builtin namespace. According to this definition, if a
function A is defined within a function B, the names bound in B
are not visible in A. The proposal changes the rules so that
names bound in B are visible in A (unless A contains a name
binding that hides the binding in B).
This specification introduces rules for lexical scoping that are
common in Algol-like languages. The combination of lexical
scoping and existing support for first-class functions is
reminiscent of Scheme.
The changed scoping rules address two problems -- the limited
utility of lambda expressions (and nested functions in general),
and the frequent confusion of new users familiar with other
languages that support nested lexical scopes, e.g. the inability
to define recursive functions except at the module level.
The lambda expression yields an unnamed function that evaluates a
single expression. It is often used for callback functions. In
the example below (written using the Python 2.0 rules), any name
used in the body of the lambda must be explicitly passed as a
default argument to the lambda.
from Tkinter import *
root = Tk()
Button(root, text="Click here",
command=lambda root=root: root.test.configure(text="..."))
This approach is cumbersome, particularly when there are several
names used in the body of the lambda. The long list of default
arguments obscures the purpose of the code. The proposed
solution, in crude terms, implements the default argument approach
automatically. The "root=root" argument can be omitted.
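Under the proposed rules, the callback can simply close over the enclosing name. The minimal sketch below uses invented names rather than the Tkinter widgets above; it runs as-is once nested scopes are in effect.

```python
# Sketch of the same pattern without the default-argument trick: the
# nested function closes over the enclosing name directly.

def make_callback(root):
    def callback():
        return root          # resolved in make_callback's scope
    return callback

cb = make_callback("window-1")
print(cb())                  # window-1
```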
The new name resolution semantics will cause some programs to
behave differently than they did under Python 2.0. In some cases,
programs will fail to compile. In other cases, names that were
previously resolved using the global namespace will be resolved
using the local namespace of an enclosing function. In Python
2.1, warnings will be issued for all statements that will behave
differently.
Specification
Python is a statically scoped language with block structure, in
the tradition of Algol. A code block or region, such as a
module, class definition, or function body, is the basic unit of a
program.
Names refer to objects. Names are introduced by name binding
operations. Each occurrence of a name in the program text refers
to the binding of that name established in the innermost function
block containing the use.
The name binding operations are argument declaration, assignment,
class and function definition, import statements, for statements,
and except clauses. Each name binding occurs within a block
defined by a class or function definition or at the module level
(the top-level code block).
If a name is bound anywhere within a code block, all uses of the
name within the block are treated as references to the current
block. (Note: This can lead to errors when a name is used within
a block before it is bound.)
If the global statement occurs within a block, all uses of the
name specified in the statement refer to the binding of that name
in the top-level namespace. Names are resolved in the top-level
namespace by searching the global namespace, i.e. the namespace of
the module containing the code block, and in the builtin
namespace, i.e. the namespace of the __builtin__ module. The
global namespace is searched first. If the name is not found
there, the builtin namespace is searched. The global statement
must precede all uses of the name.
If a name is used within a code block, but it is not bound there
and is not declared global, the use is treated as a reference to
the nearest enclosing function region. (Note: If a region is
contained within a class definition, the name bindings that occur
in the class block are not visible to enclosed functions.)
A class definition is an executable statement that may contain
uses and definitions of names. These references follow the normal
rules for name resolution. The namespace of the class definition
becomes the attribute dictionary of the class.
The following operations are name binding operations. If they
occur within a block, they introduce new local names in the
current block unless there is also a global declaration.
Function definition: def name ...
Argument declaration: def f(...name...), lambda ...name...
Class definition: class name ...
Assignment statement: name = ...
Import statement: import name, import module as name,
from module import name
Implicit assignment: names are bound by for statements and except
clauses
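A quick module-level illustration (not from the PEP) of these binding operations:

```python
# Each construct below performs one of the name-binding operations
# listed above; after it runs, the corresponding name is bound in the
# module (top-level) block.

def f():                     # function definition binds "f"
    return "f"

class C:                     # class definition binds "C"
    pass

x = 1                        # assignment binds "x"
import math as m             # "import ... as ..." binds "m"

for item in [1, 2]:          # the for statement binds "item"
    pass                     # item is 2 after the loop

try:
    raise ValueError("boom")
except ValueError as err:    # the except clause binds "err"
    kind = type(err).__name__  # (modern Python unbinds err when the
                               # block exits, hence the copy into kind)

print(f(), C.__name__, x, item, kind)   # f C 1 2 ValueError
```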
There are several cases where Python statements are illegal when
used in conjunction with nested scopes that contain free
variables.
If a variable is referenced in an enclosed scope, it is an error
to delete the name. The compiler will raise a SyntaxError for
'del name'.
If the wild card form of import (import *) is used in a function
and the function contains a nested block with free variables, the
compiler will raise a SyntaxError.
If exec is used in a function and the function contains a nested
block with free variables, the compiler will raise a SyntaxError
unless the exec explicitly specifies the local namespace for the
exec. (In other words, "exec obj" would be illegal, but
"exec obj in ns" would be legal.)
If a name bound in a function scope is also the name of a module
global name or a standard builtin name, and the function contains
a nested function scope that references the name, the compiler
will issue a warning. The name resolution rules will result in
different bindings under Python 2.0 than under Python 2.2. The
warning indicates that the program may not run correctly with all
versions of Python.
Discussion
The specified rules allow names defined in a function to be
referenced in any nested function defined with that function. The
name resolution rules are typical for statically scoped languages,
with three primary exceptions:
- Names in class scope are not accessible.
- The global statement short-circuits the normal rules.
- Variables are not declared.
Names in class scope are not accessible. Names are resolved in
the innermost enclosing function scope. If a class definition
occurs in a chain of nested scopes, the resolution process skips
class definitions. This rule prevents odd interactions between
class attributes and local variable access. If a name binding
operation occurs in a class definition, it creates an attribute on
the resulting class object. To access this variable in a method,
or in a function nested within a method, an attribute reference
must be used, either via self or via the class name.
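A small runnable illustration (invented example) of the class-scope rule:

```python
# Names bound in a class body are not visible to methods as bare names;
# they must be reached as attributes, via self or the class name.

class Counter:
    start = 10                  # bound in class scope

    def value(self):
        # "return start" would raise NameError here: class scope is
        # skipped during name resolution in the nested function.
        return self.start       # attribute reference is required

print(Counter().value())        # 10
```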
An alternative would have been to allow name binding in class
scope to behave exactly like name binding in function scope. This
rule would allow class attributes to be referenced either via
attribute reference or simple name. This option was ruled out
because it would have been inconsistent with all other forms of
class and instance attribute access, which always use attribute
references. Code that used simple names would have been obscure.
The global statement short-circuits the normal rules. Under the
proposal, the global statement has exactly the same effect that it
does for Python 2.0. It is also noteworthy because it allows name
binding operations performed in one block to change bindings in
another block (the module).
Variables are not declared. If a name binding operation occurs
anywhere in a function, then that name is treated as local to the
function and all references refer to the local binding. If a
reference occurs before the name is bound, a NameError is raised.
The only kind of declaration is the global statement, which allows
programs to be written using mutable global variables. As a
consequence, it is not possible to rebind a name defined in an
enclosing scope. An assignment operation can only bind a name in
the current scope or in the global scope. The lack of
declarations and the inability to rebind names in enclosing scopes
are unusual for lexically scoped languages; there is typically a
mechanism to create name bindings (e.g. lambda and let in Scheme)
and a mechanism to change the bindings (set! in Scheme).
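A short runnable illustration (invented example) of the no-rebinding consequence:

```python
# Assignment in the inner function binds a NEW local name; it cannot
# rebind the enclosing function's variable.  (Python 3 later added the
# "nonlocal" statement for exactly this purpose.)

def counter():
    count = 0
    def bump():
        count = 1               # a fresh local in bump, not counter's count
        return count
    bump()
    return count                # still 0: the enclosing binding survived

print(counter())                # 0
```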
XXX Alex Martelli suggests comparison with Java, which does not
allow name bindings to hide earlier bindings.
Examples
A few examples are included to illustrate the way the rules work.
XXX Explain the examples
>>> def make_adder(base):
... def adder(x):
... return base + x
... return adder
>>> add5 = make_adder(5)
>>> add5(6)
11
>>> def make_fact():
... def fact(n):
... if n == 1:
... return 1L
... else:
... return n * fact(n - 1)
... return fact
>>> fact = make_fact()
>>> fact(7)
5040L
>>> def make_wrapper(obj):
... class Wrapper:
... def __getattr__(self, attr):
... if attr[0] != '_':
... return getattr(obj, attr)
... else:
... raise AttributeError, attr
... return Wrapper()
>>> class Test:
... public = 2
... _private = 3
>>> w = make_wrapper(Test())
>>> w.public
2
>>> w._private
Traceback (most recent call last):
File "<stdin>", line 1, in ?
AttributeError: _private
An example from Tim Peters demonstrates the potential pitfalls of
nested scopes in the absence of declarations:
i = 6
def f(x):
def g():
print i
# ...
# skip to the next page
# ...
for i in x: # ah, i *is* local to f, so this is what g sees
pass
g()
The call to g() will refer to the variable i bound in f() by the for
loop. If g() is called before the loop is executed, a NameError will
be raised.
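The pitfall can be condensed into a runnable form (same names as the example above):

```python
# Under the nested-scope rules, i inside g is a free variable resolved
# in f's scope, because the for loop makes i local to f.

i = 6

def f(x):
    def g():
        return i        # sees f's local i, never the global i
    for i in x:
        pass
    return g()

print(f([1, 2, 3]))     # 3 -- the loop's final value, not the global 6
```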
XXX need some counterexamples
Backwards compatibility
There are two kinds of compatibility problems caused by nested
scopes. In one case, code that behaved one way in earlier
versions behaves differently because of nested scopes. In the
other cases, certain constructs interact badly with nested scopes
and will trigger SyntaxErrors at compile time.
The following example from Skip Montanaro illustrates the first
kind of problem:
x = 1
def f1():
x = 2
def inner():
print x
inner()
Under the Python 2.0 rules, the print statement inside inner()
refers to the global variable x and will print 1 if f1() is
called. Under the new rules, it refers to f1()'s namespace,
the nearest enclosing scope with a binding.
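The same example in runnable form, returning the value instead of printing it:

```python
# Under nested scopes, inner() resolves x in f1()'s namespace, so f1()
# yields 2; under the Python 2.0 rules it yielded the global value 1.

x = 1

def f1():
    x = 2
    def inner():
        return x        # nearest enclosing scope with a binding: f1
    return inner()

print(f1())             # 2
```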
The problem occurs only when a global variable and a local
variable share the same name and a nested function uses that name
to refer to the global variable. This is poor programming
practice, because readers will easily confuse the two different
variables. One example of this problem was found in the Python
standard library during the implementation of nested scopes.
To address this problem, which is unlikely to occur often, the
Python 2.1 compiler (when nested scopes are not enabled) issues a
warning.
The other compatibility problem is caused by the use of 'import *'
and 'exec' in a function body, when that function contains a
nested scope and the contained scope has free variables. For
example:
y = 1
def f():
exec "y = 'gotcha'" # or from module import *
def g():
return y
...
At compile-time, the compiler cannot tell whether an exec that
operates on the local namespace or an import * will introduce
name bindings that shadow the global y. Thus, it is not possible
to tell whether the reference to y in g() should refer to the
global or to a local name in f().
In discussion on python-list, people argued for both possible
interpretations. On the one hand, some thought that the reference
in g() should be bound to a local y if one exists. One problem
with this interpretation is that it is impossible for a human
reader of the code to determine the binding of y by local
inspection. It seems likely to introduce subtle bugs. The other
interpretation is to treat exec and import * as dynamic features
that do not affect static scoping. Under this interpretation, the
exec and import * would introduce local names, but those names
would never be visible to nested scopes. In the specific example
above, the code would behave exactly as it did in earlier versions
of Python.
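Python 3 ultimately adopted essentially this second interpretation: exec() became a function, and names it binds inside a function body are never visible to nested scopes. A sketch of the example above under those later semantics:

```python
y = 1

def f():
    # exec() writes y into a locals mapping, not a true local
    # variable of f, so it cannot shadow the global y here.
    exec("y = 'gotcha'")
    def g():
        return y  # still resolves to the global y
    return g()

print(f())  # 1
```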
Since each interpretation is problematic and the exact meaning is
ambiguous, the compiler raises an exception when nested scopes are
enabled. The Python 2.1 compiler issues only a warning when nested
scopes are not enabled.
A brief review of three Python projects (the standard library,
Zope, and a beta version of PyXPCOM) found four backwards
compatibility issues in approximately 200,000 lines of code.
There was one example of case #1 (subtle behavior change) and two
examples of import * problems in the standard library.
(The interpretation of the import * and exec restriction that was
implemented in Python 2.1a2 was much more restrictive, based on
language in the reference manual that had never been enforced.
These restrictions were relaxed following the release.)
Compatibility of C API
The implementation causes several Python C API functions to
change, including PyCode_New(). As a result, C extensions may
need to be updated to work correctly with Python 2.1.
locals() / vars()
These functions return a dictionary containing the current scope's
local variables. Modifications to the dictionary do not affect
the values of variables. Under the current rules, the use of
locals() and globals() allows the program to gain access to all
the namespaces in which names are resolved.
An analogous function will not be provided for nested scopes.
Under this proposal, it will not be possible to gain
dictionary-style access to all visible scopes.
Warnings and Errors
The compiler will issue warnings in Python 2.1 to help identify
programs that may not compile or run correctly under future
versions of Python. Under Python 2.2, or under Python 2.1 when the
nested_scopes future statement is used (the two are collectively
referred to as "future semantics" in this section), the compiler
will raise SyntaxErrors in some cases.
The warnings typically apply when a function contains a nested
function that has free variables. For example, if function
F contains a function G and G uses the builtin len(), then F is a
function that contains a nested function (G) with a free variable
(len). The label "free-in-nested" will be used to describe these
functions.
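The "free-in-nested" label can be made concrete with a small sketch (Python 3 syntax; note that in modern CPython a builtin such as len is resolved as a global rather than through the closure, but the PEP's label covers both kinds of non-local reference):

```python
# F is "free-in-nested": it contains a nested function G, and G
# refers to names bound outside G itself -- seq (from F's scope)
# and len (a builtin).
def F(seq):
    def G():
        return len(seq)
    return G()

print(F([1, 2, 3]))  # 3
```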
import * used in function scope
The language reference specifies that import * may only occur
at module scope (Sec. 6.11). The CPython implementation has
nevertheless supported import * at function scope.
If import * is used in the body of a free-in-nested function,
the compiler will issue a warning. Under future semantics,
the compiler will raise a SyntaxError.
bare exec in function scope
The exec statement allows two optional expressions following
the keyword "in" that specify the namespaces used for locals
and globals. An exec statement that omits both of these
namespaces is a bare exec.
If a bare exec is used in the body of a free-in-nested
function, the compiler will issue a warning. Under future
semantics, the compiler will raise a SyntaxError.
local shadows global
If a free-in-nested function has a binding for a local
variable that (1) is used in a nested function and (2) is the
same as a global variable, the compiler will issue a warning.
Rebinding names in enclosing scopes
There are technical issues that make it difficult to support
rebinding of names in enclosing scopes, but the primary reason
that it is not allowed in the current proposal is that Guido is
opposed to it. His motivation: it is difficult to support,
because it would require a new mechanism that would allow the
programmer to specify that an assignment in a block is supposed to
rebind the name in an enclosing block; presumably a keyword or
special syntax (x := 3) would make this possible. Given that this
would encourage the use of local variables to hold state that is
better stored in a class instance, it's not worth adding new
syntax to make this possible (in Guido's opinion).
The proposed rules allow programmers to achieve the effect of
rebinding, albeit awkwardly. The name that will be effectively
rebound by enclosed functions is bound to a container object. In
place of assignment, the program uses modification of the
container to achieve the desired effect:
    def bank_account(initial_balance):
        balance = [initial_balance]
        def deposit(amount):
            balance[0] = balance[0] + amount
            return balance[0]
        def withdraw(amount):
            balance[0] = balance[0] - amount
            return balance[0]
        return deposit, withdraw
Support for rebinding in nested scopes would make this code
clearer. A class that defines deposit() and withdraw() methods
and the balance as an instance variable would be clearer still.
Since classes seem to achieve the same effect in a more
straightforward manner, they are preferred.
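Python eventually did grow a keyword for this: PEP 3104 added the nonlocal statement in Python 3, which performs exactly this kind of rebinding without the container trick. A sketch of the same example under that later feature (Python 3 syntax):

```python
def bank_account(initial_balance):
    balance = initial_balance
    def deposit(amount):
        nonlocal balance  # rebind the name in the enclosing scope
        balance = balance + amount
        return balance
    def withdraw(amount):
        nonlocal balance
        balance = balance - amount
        return balance
    return deposit, withdraw

deposit, withdraw = bank_account(100)
print(deposit(50))   # 150
print(withdraw(30))  # 120
```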
Implementation
XXX Jeremy, is this still the case?
The implementation for CPython uses flat closures [1]. Each def
or lambda expression that is executed will create a closure if the
body of the function or any contained function has free
variables. Using flat closures, the creation of closures is
somewhat expensive but lookup is cheap.
The implementation adds several new opcodes and two new kinds of
names in code objects. A variable can be either a cell variable
or a free variable for a particular code object. A cell variable
is referenced by containing scopes; as a result, the function
where it is defined must allocate separate storage for it on each
invocation. A free variable is referenced via a function's
closure.
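These two kinds of names are still visible on code objects in modern CPython, which makes the scheme easy to inspect (Python 3 syntax):

```python
def outer():
    n = 1              # a cell variable of outer: referenced below
    def inner():
        return n       # a free variable of inner, read via its closure
    return inner

f = outer()
print(outer.__code__.co_cellvars)  # ('n',)
print(f.__code__.co_freevars)      # ('n',)
```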
The choice of flat closures was made based on three factors.
First, nested functions are presumed to be used infrequently,
deeply nested (several levels of nesting) still less frequently.
Second, lookup of names in a nested scope should be fast.
Third, the use of nested scopes, particularly where a function
that accesses an enclosing scope is returned, should not prevent
unreferenced objects from being reclaimed by the garbage
collector.
XXX Much more to say here
References
[1] Luca Cardelli. Compiling a functional language. In Proc. of
the 1984 ACM Conference on Lisp and Functional Programming,
pp. 208-217, Aug. 1984
http://citeseer.ist.psu.edu/cardelli84compiling.html
Copyright
XXX
pep-0228 Reworking Python's Numeric Model
| PEP: | 228 |
|---|---|
| Title: | Reworking Python's Numeric Model |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Moshe Zadka <moshez at zadka.site.co.il>, Guido van Rossum <guido at python.org> |
| Status: | Withdrawn |
| Type: | Standards Track |
| Created: | 4-Nov-2000 |
| Python-Version: | ?? |
| Post-History: |
Withdrawal
This PEP has been withdrawn in favor of PEP 3141.
Abstract
Today, Python's numerical model is similar to the C numeric model:
there are several unrelated numerical types, and when operations
between numerical types are requested, coercions happen. While
the C rationale for the numerical model is that it is very similar
to what happens at the hardware level, that rationale does not
apply to Python. So, while it is acceptable to C programmers that
2/3 == 0, it is surprising to many Python programmers.
NOTE: in the light of recent discussions in the newsgroup, the
motivation in this PEP (and details) need to be extended.
Rationale
In usability studies, one of the least usable aspects of Python was
the fact that integer division returns the floor of the division.
This makes it hard to program correctly, requiring casts to
float() in various parts of the code. Python's numerical model
stems from C, while a model that might be easier to work with can
be based on the mathematical understanding of numbers.
Other Numerical Models
Perl's numerical model is that there is one type of numbers --
floating point numbers. While it is consistent and superficially
non-surprising, it tends to have subtle gotchas. One of these is
that printing numbers is very tricky, and requires correct
rounding. In Perl, there is also a mode where all numbers are
integers. This mode also has its share of problems, which arise
from the fact that there is not even an approximate way of
dividing numbers and getting meaningful answers.
Suggested Interface For Python's Numerical Model
While coercion rules will remain for add-on types and classes, the
built in type system will have exactly one Python type -- a
number. There are several things which can be considered "number
methods":
1. isnatural()
2. isintegral()
3. isrational()
4. isreal()
5. iscomplex()
a. isexact()
Obviously, a number which answers true to a question from 1 to 5
will also answer true to any following question. If isexact() is
not true, then any answer might be wrong.
(But not horribly wrong: it's close to the truth.)
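This PEP was withdrawn in favor of PEP 3141, whose numbers ABCs answer roughly the same predicate questions through isinstance() rather than methods; an approximate modern rendering of the tower:

```python
from fractions import Fraction
from numbers import Integral, Rational, Real

# Each narrower level answering True implies every broader level
# also answers True, as in the PEP's numbered questions.
print(isinstance(3, Integral))               # True
print(isinstance(Fraction(1, 3), Rational))  # True
print(isinstance(Fraction(1, 3), Integral))  # False
print(isinstance(1.5, Real))                 # True
print(isinstance(1.5, Rational))             # False: float is only Real
```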
Now, there are two things the model promises for the field
operations (+, -, /, *):
- If both operands satisfy isexact(), the result satisfies
isexact().
- All field rules are true, except that for not-isexact() numbers,
they might be only approximately true.
One consequence of these two rules is that all exact calculations
are done as (complex) rationals: since the field laws must hold,
then
(a/b)*b == a
must hold.
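The field law can be demonstrated with Python's later Fraction type, which behaves like the exact rationals this PEP describes, in contrast to inexact floats (Python 3 sketch):

```python
from fractions import Fraction

# Exact (rational) arithmetic satisfies the field law exactly:
a, b = Fraction(2), Fraction(49)
print((a / b) * b == a)            # True

# Inexact (float) arithmetic may satisfy it only approximately:
print((1.0 / 49.0) * 49.0 == 1.0)  # False: rounding error
```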
There is a built-in function, inexact(), which takes a number
and returns an inexact number that is a good approximation.
Inexact numbers must be at least as accurate as if they were
using IEEE-754.
Several of the classical Python functions will return exact numbers
even when given inexact numbers: e.g., int().
Coercion
The number type does not define nb_coerce. Any numeric operation
slot, when receiving something other than PyNumber, refuses to
implement it.
Inexact Operations
The functions in the "math" module will be allowed to return
inexact results for exact values. However, they will never return
a non-real number. The functions in the "cmath" module are also
allowed to return an inexact result for an exact argument, and are
furthermore allowed to return a complex result for a real
argument.
Numerical Python Issues
People who use Numerical Python do so for high-performance vector
operations. Therefore, NumPy should keep its hardware based
numeric model.
Unresolved Issues
Which number literals will be exact, and which inexact?
How do we deal with IEEE 754 operations? (probably, isnan/isinf should
be methods)
On 64-bit machines, comparisons between ints and floats may be
broken when the comparison involves conversion to float. Ditto
for comparisons between longs and floats. This can be dealt with
by avoiding the conversion to float. (Due to Andrew Koenig.)
Copyright
This document has been placed in the public domain.
pep-0229 Using Distutils to Build Python
| PEP: | 229 |
|---|---|
| Title: | Using Distutils to Build Python |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | A.M. Kuchling <amk at amk.ca> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 16-Nov-2000 |
| Post-History: |
Introduction
The Modules/Setup mechanism has some flaws:
* People have to remember to uncomment bits of Modules/Setup in
order to get all the possible modules.
* Moving Setup to a new version of Python is tedious; new modules
have been added, so you can't just copy the older version, but
have to reconcile the two versions.
* Users have to figure out where the needed libraries, such as
zlib, are installed.
Proposal
Use the Distutils to build the modules that come with Python.
The changes can be broken up into several pieces:
1. The Distutils needs some Python modules to be able to build
modules. Currently I believe the minimal list is posix, _sre,
and string.
These modules will have to be built before the Distutils can be
used, so they'll simply be hardwired into Modules/Makefile and
be automatically built.
2. A top-level setup.py script will be written that checks the
libraries installed on the system and compiles as many modules
as possible.
3. Modules/Setup will be kept and settings in it will override
setup.py's usual behavior, so you can disable a module known
to be buggy, or specify particular compilation or linker flags.
However, in the common case where setup.py works correctly,
everything in Setup will remain commented out. The other
Setup.* become unnecessary, since nothing will be generating
Setup automatically.
The patch was checked in for Python 2.1, and has been subsequently
modified.
Implementation
Patch #102588 on SourceForge contains the proposed patch.
Currently the patch tries to be conservative and to change as few
files as possible, in order to simplify backing out the patch.
For example, no attempt is made to rip out the existing build
mechanisms. Such simplifications can wait for later in the beta
cycle, when we're certain the patch will be left in, or they can
wait for Python 2.2.
The patch makes the following changes:
* Makes some required changes to distutils/sysconfig (these will
be checked in separately)
* In the top-level Makefile.in, the "sharedmods" target simply
runs "./python setup.py build", and "sharedinstall" runs
"./python setup.py install". The "clobber" target also deletes
the build/ subdirectory where Distutils puts its output.
* Modules/Setup.config.in only contains entries for the gc and thread
modules; the readline, curses, and db modules are removed because
it's now setup.py's job to handle them.
* Modules/Setup.dist now contains entries for only 3 modules --
_sre, posix, and strop.
* The configure script builds setup.cfg from setup.cfg.in. This
is needed for two reasons: to make building in subdirectories
work, and to get the configured installation prefix.
* Adds setup.py to the top directory of the source tree. setup.py
is the largest piece of the puzzle, though not the most
complicated. setup.py contains a subclass of the BuildExt
class, and extends it with a detect_modules() method that does
the work of figuring out when modules can be compiled, and adding
them to the 'exts' list.
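The detection step can be sketched in outline. The function name and argument layout below are hypothetical, purely for illustration; the real setup.py subclasses the Distutils build_ext command and inspects the configured system:

```python
import os

def detect_modules(header_dirs, candidates):
    """Return the modules whose required header exists on this system.

    candidates maps a module name to the header file it requires,
    e.g. {"zlib": "zlib.h"}.  (Hypothetical sketch of the approach,
    not the actual setup.py code.)
    """
    exts = []
    for module, header in candidates.items():
        if any(os.path.exists(os.path.join(d, header))
               for d in header_dirs):
            exts.append(module)
    return exts
```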
Unresolved Issues
Do we need to make it possible to disable the 3 hard-wired modules
without manually hacking the Makefiles? [Answer: No.]
The Distutils always compile modules as shared libraries. How do
we support compiling them statically into the resulting Python
binary?
[Answer: building a Python binary with the Distutils should be
feasible, though no one has implemented it yet. This should be
done someday, but isn't a pressing priority as messing around with
the top-level Makefile.pre.in is good enough.]
Copyright
This document has been placed in the public domain.
pep-0230 Warning Framework
| PEP: | 230 |
|---|---|
| Title: | Warning Framework |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Guido van Rossum <guido at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Created: | |
| Python-Version: | 2.1 |
| Post-History: | 05-Nov-2000 |
Abstract
This PEP proposes a C and Python level API, as well as command
line flags, to issue warning messages and control what happens to
them. This is mostly based on GvR's proposal posted to python-dev
on 05-Nov-2000, with some ideas (such as using classes to
categorize warnings) merged in from Paul Prescod's
counter-proposal posted on the same date. Also, an attempt to
implement the proposal caused several small tweaks.
Motivation
With Python 3000 looming, it is necessary to start issuing
warnings about the use of obsolete or deprecated features, in
addition to errors. There are also lots of other reasons to be
able to issue warnings, both from C and from Python code, both at
compile time and at run time.
Warnings aren't fatal, and thus it's possible that a program
triggers the same warning many times during a single execution.
It would be annoying if a program emitted an endless stream of
identical warnings. Therefore, a mechanism is needed that
suppresses multiple identical warnings.
It is also desirable to have user control over which warnings are
printed. While in general it is useful to see all warnings all
the time, there may be times where it is impractical to fix the
code right away in a production program. In this case, there
should be a way to suppress warnings.
It is also useful to be able to suppress specific warnings during
program development, e.g. when a warning is generated by a piece
of 3rd party code that cannot be fixed right away, or when there
is no way to fix the code (possibly a warning message is generated
for a perfectly fine piece of code). It would be unwise to offer
to suppress all warnings in such cases: the developer would miss
warnings about the rest of the code.
On the other hand, there are also situations conceivable where
some or all warnings are better treated as errors. For example,
it may be a local coding standard that a particular deprecated
feature should not be used. In order to enforce this, it is
useful to be able to turn the warning about this particular
feature into an error, raising an exception (without necessarily
turning all warnings into errors).
Therefore, I propose to introduce a flexible "warning filter"
which can filter out warnings or change them into exceptions,
based on:
- Where in the code they are generated (per package, module, or
function)
- The warning category (warning categories are discussed below)
- A specific warning message
The warning filter must be controllable both from the command line
and from Python code.
APIs For Issuing Warnings
- To issue a warning from Python:
import warnings
warnings.warn(message[, category[, stacklevel]])
The category argument, if given, must be a warning category
class (see below); it defaults to warnings.UserWarning. This
may raise an exception if the particular warning issued is
changed into an error by the warnings filter. The stacklevel
can be used by wrapper functions written in Python, like this:
    def deprecation(message):
        warn(message, DeprecationWarning, stacklevel=2)
This makes the warning refer to deprecation()'s caller,
rather than to the source of deprecation() itself (since the
latter would defeat the purpose of the warning message).
- To issue a warning from C:
int PyErr_Warn(PyObject *category, char *message);
Return 0 normally, 1 if an exception is raised (either because
the warning was transformed into an exception, or because of a
malfunction in the implementation, such as running out of
memory). The category argument must be a warning category class
(see below) or NULL, in which case it defaults to
PyExc_RuntimeWarning. When the PyErr_Warn() function returns 1, the
caller should do normal exception handling.
The current C implementation of PyErr_Warn() imports the
warnings module (implemented in Python) and calls its warn()
function. This minimizes the amount of C code that needs to be
added to implement the warning feature.
[XXX Open Issue: what about issuing warnings during lexing or
parsing, which don't have the exception machinery available?]
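The Python-level API described above survived into the standard warnings module essentially unchanged, so the wrapper pattern can be exercised directly (Python 3 syntax; catch_warnings is used here only to observe the warning without printing it):

```python
import warnings

def deprecation(message):
    # stacklevel=2 makes the report point at deprecation()'s caller
    warnings.warn(message, DeprecationWarning, stacklevel=2)

def old_api():
    deprecation("old_api() is deprecated")
    return 42

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = old_api()

print(value)                        # 42
print(caught[0].category.__name__)  # DeprecationWarning
```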
Warnings Categories
There are a number of built-in exceptions that represent warning
categories. This categorization is useful to be able to filter
out groups of warnings. The following warnings category classes
are currently defined:
- Warning -- this is the base class of all warning category
classes and is itself a subclass of Exception
- UserWarning -- the default category for warnings.warn()
- DeprecationWarning -- base category for warnings about deprecated
features
- SyntaxWarning -- base category for warnings about dubious
syntactic features
- RuntimeWarning -- base category for warnings about dubious
runtime features
[XXX: Other warning categories may be proposed during the review
period for this PEP.]
These standard warning categories are available from C as
PyExc_Warning, PyExc_UserWarning, etc. From Python, they are
available in the __builtin__ module, so no import is necessary.
User code can define additional warning categories by subclassing
one of the standard warning categories. A warning category must
always be a subclass of the Warning class.
The Warnings Filter
The warnings filter controls whether warnings are ignored,
displayed, or turned into errors (raising an exception).
There are three sides to the warnings filter:
- The data structures used to efficiently determine the
disposition of a particular warnings.warn() or PyErr_Warn()
call.
- The API to control the filter from Python source code.
- The command line switches to control the filter.
The warnings filter works in several stages. It is optimized for
the (expected to be common) case where the same warning is issued
from the same place in the code over and over.
First, the warning filter collects the module and line number
where the warning is issued; this information is readily available
through sys._getframe().
Conceptually, the warnings filter maintains an ordered list of
filter specifications; any specific warning is matched against
each filter specification in the list in turn until a match is
found; the match determines the disposition of the warning. Each
entry is a tuple as follows:
(category, message, module, lineno, action)
- category is a class (a subclass of warnings.Warning) of which
the warning category must be a subclass in order to match
- message is a compiled regular expression that the warning
message must match (the match is case-insensitive)
- module is a compiled regular expression that the module name
must match
- lineno is an integer that the line number where the warning
occurred must match, or 0 to match all line numbers
- action is one of the following strings:
- "error" -- turn matching warnings into exceptions
- "ignore" -- never print matching warnings
- "always" -- always print matching warnings
- "default" -- print the first occurrence of matching warnings
for each location where the warning is issued
- "module" -- print the first occurrence of matching warnings
for each module where the warning is issued
- "once" -- print only the first occurrence of matching
warnings
Since the Warning class is derived from the built-in Exception
class, to turn a warning into an error we simply raise
category(message).
Warnings Output And Formatting Hooks
When the warnings filter decides to issue a warning (but not when
it decides to raise an exception), it passes the information to
the function warnings.showwarning(message, category, filename, lineno).
The default implementation of this function writes the warning text
to sys.stderr, and shows the source line of the filename. It has
an optional 5th argument which can be used to specify a different
file than sys.stderr.
The formatting of warnings is done by a separate function,
warnings.formatwarning(message, category, filename, lineno). This
returns a string (that may contain newlines and ends in a newline)
that can be printed to get the identical effect of the
showwarning() function.
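Both hooks exist in the standard warnings module today, so the override described above can be sketched directly (Python 3 syntax; catch_warnings restores the original showwarning on exit):

```python
import warnings

captured = []

# A replacement output hook: collect formatted warnings instead of
# writing them to sys.stderr.
def collect(message, category, filename, lineno, file=None, line=None):
    captured.append(
        warnings.formatwarning(message, category, filename, lineno))

with warnings.catch_warnings():
    warnings.simplefilter("always")
    warnings.showwarning = collect
    warnings.warn("look out")

print("UserWarning" in captured[0])  # True
```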
API For Manipulating Warning Filters
warnings.filterwarnings(message, category, module, lineno, action)
This checks the types of the arguments, compiles the message and
module regular expressions, and inserts them as a tuple in front
of the warnings filter.
warnings.resetwarnings()
Reset the warnings filter to empty.
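For example, inserting a filter with the "error" action makes a matching warn() call raise category(message), as the filter section describes (Python 3 syntax):

```python
import warnings

with warnings.catch_warnings():
    # message is a regular expression matched against the start of
    # the warning text; action "error" promotes matches to exceptions.
    warnings.filterwarnings("error", message="deprecated",
                            category=UserWarning)
    try:
        warnings.warn("deprecated API", UserWarning)
        raised = False
    except UserWarning:
        raised = True

print(raised)  # True
```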
Command Line Syntax
There should be command line options to specify the most common
filtering actions, which I expect to include at least:
- suppress all warnings
- suppress a particular warning message everywhere
- suppress all warnings in a particular module
- turn all warnings into exceptions
I propose the following command line option syntax:
-Waction[:message[:category[:module[:lineno]]]]
Where:
- 'action' is an abbreviation of one of the allowed actions
("error", "default", "ignore", "always", "once", or "module")
- 'message' is a message string; matches warnings whose message
text is an initial substring of 'message' (matching is
case-insensitive)
- 'category' is an abbreviation of a standard warning category
class name *or* a fully-qualified name for a user-defined
warning category class of the form [package.]module.classname
- 'module' is a module name (possibly package.module)
- 'lineno' is an integral line number
All parts except 'action' may be omitted, where an empty value
after stripping whitespace is the same as an omitted value.
The C code that parses the Python command line saves the body of
all -W options in a list of strings, which is made available to
the warnings module as sys.warnoptions. The warnings module
parses these when it is first imported. Errors detected during
the parsing of sys.warnoptions are not fatal; a message is written
to sys.stderr and processing continues with the next option.
Examples:
-Werror
Turn all warnings into errors
-Wall
Show all warnings
-Wignore
Ignore all warnings
-Wi:hello
Ignore warnings whose message text starts with "hello"
-We::Deprecation
Turn deprecation warnings into errors
-Wi:::spam:10
Ignore all warnings on line 10 of module spam
-Wi:::spam -Wd:::spam:10
Ignore all warnings in module spam except on line 10
-We::Deprecation -Wd::Deprecation:spam
Turn deprecation warnings into errors except in module spam
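The -W mechanism can be observed from a child interpreter: the filter is installed before user code runs, so an "error" action makes the warning fatal (sketch; assumes a standard CPython at sys.executable):

```python
import subprocess
import sys

proc = subprocess.run(
    [sys.executable, "-W", "error::DeprecationWarning", "-c",
     "import warnings; warnings.warn('old', DeprecationWarning)"],
    capture_output=True, text=True,
)
print(proc.returncode != 0)                 # True: warning became fatal
print("DeprecationWarning" in proc.stderr)  # True
```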
Open Issues
Some open issues off the top of my head:
- What about issuing warnings during lexing or parsing, which
don't have the exception machinery available?
- The proposed command line syntax is a bit ugly (although the
simple cases aren't so bad: -Werror, -Wignore, etc.). Anybody
got a better idea?
- I'm a bit worried that the filter specifications are too
complex. Perhaps filtering only on category and module (not on
message text and line number) would be enough?
- There's a bit of confusion between module names and file names.
The reporting uses file names, but the filter specification uses
module names. Maybe it should allow filenames as well?
- I'm not at all convinced that packages are handled right.
- Do we need more standard warning categories? Fewer?
- In order to minimize the start-up overhead, the warnings module
is imported by the first call to PyErr_Warn(). It does the
command line parsing for -W options upon import. Therefore, it
is possible that warning-free programs will not complain about
invalid -W options.
Rejected Concerns
Paul Prescod, Barry Warsaw and Fred Drake have brought up several
additional concerns that I feel aren't critical. I address them
here (the concerns are paraphrased, not exactly their words):
- Paul: warn() should be a built-in or a statement to make it easily
available.
Response: "from warnings import warn" is easy enough.
- Paul: What if I have a speed-critical module that triggers
warnings in an inner loop. It should be possible to disable the
overhead for detecting the warning (not just suppress the
warning).
Response: rewrite the inner loop to avoid triggering the
warning.
- Paul: What if I want to see the full context of a warning?
Response: use -Werror to turn it into an exception.
- Paul: I prefer ":*:*:" to ":::" for leaving parts of the warning
spec out.
Response: I don't.
- Barry: It would be nice if lineno can be a range specification.
Response: Too much complexity already.
- Barry: I'd like to add my own warning action. Maybe if `action'
could be a callable as well as a string. Then in my IDE, I
could set that to "mygui.popupWarningsDialog".
Response: For that purpose you would override
warnings.showwarning().
- Fred: why do the Warning category classes have to be in
__builtin__?
Response: that's the simplest implementation, given that the
warning categories must be available in C before the first
PyErr_Warn() call, which imports the warnings module. I see no
problem with making them available as built-ins.
Implementation
Here's a prototype implementation:
http://sourceforge.net/patch/?func=detailpatch&patch_id=102715&group_id=5470
pep-0231 __findattr__()
| PEP: | 231 |
|---|---|
| Title: | __findattr__() |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Barry Warsaw <barry at python.org> |
| Status: | Rejected |
| Type: | Standards Track |
| Created: | 30-Nov-2000 |
| Python-Version: | 2.1 |
| Post-History: |
Introduction
This PEP describes an extension to instance attribute lookup and
modification machinery, which allows pure-Python implementations
of many interesting programming models. This PEP tracks the
status and ownership of this feature. It contains a description
of the feature and outlines changes necessary to support the
feature. This PEP summarizes discussions held in mailing list
forums, and provides URLs for further information, where
appropriate. The CVS revision history of this file contains the
definitive historical record.
Background
The semantics for Python instances allow the programmer to
customize some aspects of attribute lookup and attribute
modification, through the special methods __getattr__() and
__setattr__() [1].
However, because of certain restrictions imposed by these methods,
there are useful programming techniques that can not be written in
Python alone, e.g. strict Java Bean-like[2] interfaces and Zope
style acquisitions[3]. In the latter case, Zope solves this by
including a C extension called ExtensionClass[5] which modifies
the standard class semantics, and uses a metaclass hook in
Python's class model called alternatively the "Don Beaudry Hook"
or "Don Beaudry Hack"[6].
While Zope's approach works, it has several disadvantages. First,
it requires a C extension. Second it employs a very arcane, but
truck-sized loophole in the Python machinery. Third, it can be
difficult for other programmers to use and understand (the
metaclass has well-known brain exploding properties). And fourth,
because ExtensionClass instances aren't "real" Python instances,
some aspects of the Python runtime system don't work with
ExtensionClass instances.
Proposals for fixing this problem have often been lumped under the
rubric of fixing the "class/type dichotomy"; that is, eliminating
the difference between built-in types and classes[7]. While a
laudable goal itself, repairing this rift is not necessary in
order to achieve the types of programming constructs described
above. This proposal provides an 80% solution with a minimum of
modification to Python's class and instance objects. It does
nothing to address the type/class dichotomy.
Proposal
This proposal adds a new special method called __findattr__() with
the following semantics:
* If defined in a class, it will be called on all instance
attribute resolutions instead of __getattr__() and
__setattr__().
* __findattr__() is never called recursively. That is, when a
specific instance's __findattr__() is on the call stack, further
attribute accesses for that instance will use the standard
__getattr__() and __setattr__() methods.
* __findattr__() is called for both attribute access (`getting')
and attribute modification (`setting'). It is not called for
attribute deletion.
* When called for getting, it is passed a single argument (not
counting `self'): the name of the attribute being accessed.
* When called for setting, it is called with a third argument, which
is the value to set the attribute to.
* __findattr__() methods have the same caching semantics as
__getattr__() and __setattr__(); i.e. if they are present in the
class at class definition time, they are used, but if they are
subsequently added to a class later they are not.
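Although __findattr__() itself was never added, the non-recursion rule can be approximated in modern Python with __getattribute__ and a per-instance guard flag; this is a hypothetical sketch of the idea, not the PEP's implementation:

```python
class Traced:
    def __init__(self):
        object.__setattr__(self, "_in_hook", False)
        object.__setattr__(self, "accesses", [])

    def __getattribute__(self, name):
        if (not object.__getattribute__(self, "_in_hook")
                and not name.startswith("_")):
            object.__setattr__(self, "_in_hook", True)
            try:
                # Hook body: record the access.  Lookups made here do
                # not re-enter the hook because the flag is set.
                object.__getattribute__(self, "accesses").append(name)
            finally:
                object.__setattr__(self, "_in_hook", False)
        return object.__getattribute__(self, name)

t = Traced()
t.x = 1
_ = t.x
print(t.accesses)
```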
Key Differences with the Existing Protocol
__findattr__()'s semantics are different from the existing
protocol in key ways:
First, __getattr__() is never called if the attribute is found in
the instance's __dict__. This is done for efficiency reasons, and
because otherwise, __setattr__() would have no way to get to the
instance's attributes.
Second, __setattr__() cannot use "normal" syntax for setting
instance attributes, e.g. "self.name = foo" because that would
cause recursive calls to __setattr__().
__findattr__() is always called regardless of whether the
attribute is in __dict__ or not, and a flag in the instance object
prevents recursive calls to __findattr__(). This gives the class
a chance to perform some action for every attribute access. And
because it is called for both gets and sets, it is easy to write
similar policy for all attribute access. Further, efficiency is
not a problem because it is only paid when the extended mechanism
is used.
Related Work
PEP 213 [9] describes a different approach to hooking into
attribute access and modification. The semantics proposed in PEP
213 can be implemented using the __findattr__() hook described
here, with one caveat. The current reference implementation of
__findattr__() does not support hooking on attribute deletion.
This could be added if it's found desirable. See example below.
Examples
One programming style that this proposal allows is a Java
Bean-like interface to objects, where unadorned attribute access
and modification is transparently mapped to a functional
interface. E.g.
class Bean:
def __init__(self, x):
self.__myfoo = x
def __findattr__(self, name, *args):
if name.startswith('_'):
# Private names
if args: setattr(self, name, args[0])
else: return getattr(self, name)
else:
# Public names
if args: name = '_set_' + name
else: name = '_get_' + name
return getattr(self, name)(*args)
def _set_foo(self, x):
self.__myfoo = x
def _get_foo(self):
return self.__myfoo
b = Bean(3)
print b.foo
b.foo = 9
print b.foo
A second, more elaborate example is the implementation of both
implicit and explicit acquisition in pure Python:
import types
class MethodWrapper:
def __init__(self, container, method):
self.__container = container
self.__method = method
def __call__(self, *args, **kws):
return self.__method.im_func(self.__container, *args, **kws)
class WrapperImplicit:
def __init__(self, contained, container):
self.__contained = contained
self.__container = container
def __repr__(self):
return '<Wrapper: [%s | %s]>' % (self.__container,
self.__contained)
def __findattr__(self, name, *args):
# Some things are our own
if name.startswith('_WrapperImplicit__'):
if args: return setattr(self, name, *args)
else: return getattr(self, name)
# setattr stores the name on the contained object directly
if args:
return setattr(self.__contained, name, args[0])
# Other special names
if name == 'aq_parent':
return self.__container
elif name == 'aq_self':
return self.__contained
elif name == 'aq_base':
base = self.__contained
try:
while 1:
base = base.aq_self
except AttributeError:
return base
# no acquisition for _ names
if name.startswith('_'):
return getattr(self.__contained, name)
# Everything else gets wrapped
missing = []
which = self.__contained
obj = getattr(which, name, missing)
if obj is missing:
which = self.__container
obj = getattr(which, name, missing)
if obj is missing:
raise AttributeError, name
of = getattr(obj, '__of__', missing)
if of is not missing:
return of(self)
elif type(obj) == types.MethodType:
return MethodWrapper(self, obj)
return obj
class WrapperExplicit:
def __init__(self, contained, container):
self.__contained = contained
self.__container = container
def __repr__(self):
return '<Wrapper: [%s | %s]>' % (self.__container,
self.__contained)
def __findattr__(self, name, *args):
# Some things are our own
if name.startswith('_WrapperExplicit__'):
if args: return setattr(self, name, *args)
else: return getattr(self, name)
# setattr stores the name on the contained object directly
if args:
return setattr(self.__contained, name, args[0])
# Other special names
if name == 'aq_parent':
return self.__container
elif name == 'aq_self':
return self.__contained
elif name == 'aq_base':
base = self.__contained
try:
while 1:
base = base.aq_self
except AttributeError:
return base
elif name == 'aq_acquire':
return self.aq_acquire
# explicit acquisition only
obj = getattr(self.__contained, name)
if type(obj) == types.MethodType:
return MethodWrapper(self, obj)
return obj
def aq_acquire(self, name):
# Everything else gets wrapped
missing = []
which = self.__contained
obj = getattr(which, name, missing)
if obj is missing:
which = self.__container
obj = getattr(which, name, missing)
if obj is missing:
raise AttributeError, name
of = getattr(obj, '__of__', missing)
if of is not missing:
return of(self)
elif type(obj) == types.MethodType:
return MethodWrapper(self, obj)
return obj
class Implicit:
def __of__(self, container):
return WrapperImplicit(self, container)
def __findattr__(self, name, *args):
# ignore setattrs
if args:
return setattr(self, name, args[0])
obj = getattr(self, name)
missing = []
of = getattr(obj, '__of__', missing)
if of is not missing:
return of(self)
return obj
class Explicit(Implicit):
def __of__(self, container):
return WrapperExplicit(self, container)
# tests
class C(Implicit):
color = 'red'
class A(Implicit):
def report(self):
return self.color
# simple implicit acquisition
c = C()
a = A()
c.a = a
assert c.a.report() == 'red'
d = C()
d.color = 'green'
d.a = a
assert d.a.report() == 'green'
try:
a.report()
except AttributeError:
pass
else:
assert 0, 'AttributeError expected'
# special names
assert c.a.aq_parent is c
assert c.a.aq_self is a
c.a.d = d
assert c.a.d.aq_base is d
assert c.a is not a
# no acquisition on _ names
class E(Implicit):
_color = 'purple'
class F(Implicit):
def report(self):
return self._color
e = E()
f = F()
e.f = f
try:
e.f.report()
except AttributeError:
pass
else:
assert 0, 'AttributeError expected'
# explicit
class G(Explicit):
color = 'pink'
class H(Explicit):
def report(self):
return self.aq_acquire('color')
def barf(self):
return self.color
g = G()
h = H()
g.h = h
assert g.h.report() == 'pink'
i = G()
i.color = 'cyan'
i.h = h
assert i.h.report() == 'cyan'
try:
g.i.barf()
except AttributeError:
pass
else:
assert 0, 'AttributeError expected'
C++-like access control can also be accomplished, although less
cleanly because of the difficulty of figuring out what method is
being called from the runtime call stack:
import sys
import types
PUBLIC = 0
PROTECTED = 1
PRIVATE = 2
try:
getframe = sys._getframe
except AttributeError:
def getframe(n):
try: raise Exception
except Exception:
frame = sys.exc_info()[2].tb_frame
while n > 0:
frame = frame.f_back
if frame is None:
raise ValueError, 'call stack is not deep enough'
return frame
class AccessViolation(Exception):
pass
class Access:
def __findattr__(self, name, *args):
methcache = self.__dict__.setdefault('__cache__', {})
missing = []
obj = getattr(self, name, missing)
# if obj is missing we better be doing a setattr for
# the first time
if obj is not missing and type(obj) == types.MethodType:
# Disgusting hack because there's no way to
# dynamically figure out what the method being
# called is from the stack frame.
methcache[obj.im_func.func_code] = obj.im_class
#
# What's the access permissions for this name?
access, klass = getattr(self, '__access__', {}).get(
name, (PUBLIC, 0))
if access is not PUBLIC:
# Now try to see which method is calling us
frame = getframe(0).f_back
if frame is None:
raise AccessViolation
# Get the class of the method that's accessing
# this attribute, by using the code object cache
if frame.f_code.co_name == '__init__':
# There aren't entries in the cache for ctors,
# because the calling mechanism doesn't go
# through __findattr__(). Are there other
# methods that might have the same behavior?
# Since we can't know who's __init__ we're in,
# for now we'll assume that only protected and
# public attrs can be accessed.
if access is PRIVATE:
raise AccessViolation
else:
methclass = self.__cache__.get(frame.f_code)
if not methclass:
raise AccessViolation
if access is PRIVATE and methclass is not klass:
raise AccessViolation
if access is PROTECTED and not issubclass(methclass,
klass):
raise AccessViolation
# If we got here, it must be okay to access the attribute
if args:
return setattr(self, name, *args)
return obj
# tests
class A(Access):
def __init__(self, foo=0, name='A'):
self._foo = foo
# can't set private names in __init__
self.__initprivate(name)
def __initprivate(self, name):
self._name = name
def getfoo(self):
return self._foo
def setfoo(self, newfoo):
self._foo = newfoo
def getname(self):
return self._name
A.__access__ = {'_foo' : (PROTECTED, A),
'_name' : (PRIVATE, A),
'__dict__' : (PRIVATE, A),
'__access__': (PRIVATE, A),
}
class B(A):
def setfoo(self, newfoo):
self._foo = newfoo + 3
def setname(self, name):
self._name = name
b = B(1)
b.getfoo()
a = A(1)
assert a.getfoo() == 1
a.setfoo(2)
assert a.getfoo() == 2
try:
a._foo
except AccessViolation:
pass
else:
assert 0, 'AccessViolation expected'
try:
a._foo = 3
except AccessViolation:
pass
else:
assert 0, 'AccessViolation expected'
try:
a.__dict__['_foo']
except AccessViolation:
pass
else:
assert 0, 'AccessViolation expected'
b = B()
assert b.getfoo() == 0
b.setfoo(2)
assert b.getfoo() == 5
try:
b.setname('B')
except AccessViolation:
pass
else:
assert 0, 'AccessViolation expected'
assert b.getname() == 'A'
Here's an implementation of the attribute hook described in PEP
213 (except that hooking on attribute deletion isn't supported by
the current reference implementation).
class Pep213:
def __findattr__(self, name, *args):
hookname = '__attr_%s__' % name
if args:
op = 'set'
else:
op = 'get'
# XXX: op = 'del' currently not supported
missing = []
meth = getattr(self, hookname, missing)
if meth is missing:
if op == 'set':
return setattr(self, name, *args)
else:
return getattr(self, name)
else:
return meth(op, *args)
def computation(i):
print 'doing computation:', i
return i + 3
def rev_computation(i):
print 'doing rev_computation:', i
return i - 3
class X(Pep213):
def __init__(self, foo=0):
self.__foo = foo
def __attr_foo__(self, op, val=None):
if op == 'get':
return computation(self.__foo)
elif op == 'set':
self.__foo = rev_computation(val)
# XXX: 'del' not yet supported
x = X()
fooval = x.foo
print fooval
x.foo = fooval + 5
print x.foo
# del x.foo
Reference Implementation
The reference implementation, as a patch to the Python core, can be found at this URL: http://sourceforge.net/patch/?func=detailpatch&patch_id=102613&group_id=5470
References
[1] http://docs.python.org/reference/datamodel.html#customizing-attribute-access
[2] http://www.javasoft.com/products/javabeans/
[3] http://www.digicool.com/releases/ExtensionClass/Acquisition.html
[5] http://www.digicool.com/releases/ExtensionClass
[6] http://www.python.org/doc/essays/metaclasses/
[7] http://www.foretec.com/python/workshops/1998-11/dd-ascher-sum.html
[8] http://docs.python.org/howto/regex.html
[9] PEP 213, Attribute Access Handlers, Prescod
http://www.python.org/dev/peps/pep-0213/
Rejection
There are serious problems with the recursion-protection feature.
As described here it's not thread-safe, and a thread-safe solution
has other problems. In general, it's not clear how helpful the
recursion-protection feature is; it makes it hard to write code
that needs to be callable inside __findattr__ as well as outside
it. But without the recursion-protection, it's hard to implement
__findattr__ at all (since __findattr__ would invoke itself
recursively for every attribute it tries to access). There seems
to be no good solution here.
It's also dubious how useful it is to support __findattr__ both
for getting and for setting attributes -- __setattr__ gets called
in all cases already.
The examples can all be implemented using __getattr__ if care is
taken not to store instance variables under their own names.
Copyright
This document has been placed in the Public Domain.
pep-0232 Function Attributes
| PEP: | 232 |
|---|---|
| Title: | Function Attributes |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Barry Warsaw <barry at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 02-Dec-2000 |
| Python-Version: | 2.1 |
| Post-History: | 20-Feb-2001 |
Introduction
This PEP describes an extension to Python, adding attribute
dictionaries to functions and methods. This PEP tracks the status
and ownership of this feature. It contains a description of the
feature and outlines changes necessary to support the feature.
This PEP summarizes discussions held in mailing list forums, and
provides URLs for further information, where appropriate. The CVS
revision history of this file contains the definitive historical
record.
Background
Functions already have a number of attributes, some of which are
writable, e.g. func_doc, a.k.a. func.__doc__. func_doc has the
interesting property that there is special syntax in function (and
method) definitions for implicitly setting the attribute. This
convenience has been exploited over and over again, overloading
docstrings with additional semantics.
For example, John Aycock has written a system where docstrings are
used to define parsing rules[1]. Zope's ZPublisher ORB[2] uses
docstrings to signal "publishable" methods, i.e. methods that can
be called through the web.
The problem with this approach is that the overloaded semantics
may conflict with each other. For example, a conflict would arise if
we wanted to add a doctest unit test to a Zope method that should not
be publishable through the web.
Proposal
This proposal adds a new dictionary to function objects, called
func_dict (a.k.a. __dict__). This dictionary can be set and get
using ordinary attribute set and get syntax.
Methods also gain `getter' syntax, and they currently access the
attribute through the dictionary of the underlying function
object. It is not possible to set attributes on bound or unbound
methods, except by doing so explicitly on the underlying function
object. See the `Future Directions' discussion below for
approaches in subsequent versions of Python.
A function object's __dict__ can also be set, but only to a
dictionary object. Deleting a function's __dict__, or setting it
to anything other than a concrete dictionary object results in a
TypeError. If no function attributes have ever been set, the
function's __dict__ will be empty.
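These rules survive unchanged in modern Python and can be checked directly
(a sketch; the attribute names are illustrative):

```python
def f():
    pass

assert f.__dict__ == {}          # no attributes set yet: empty dict
f.publish = 1                    # ordinary attribute-set syntax
assert f.__dict__ == {'publish': 1}

f.__dict__ = {'x': 2}            # __dict__ itself is assignable...
assert f.x == 2

try:
    f.__dict__ = [('x', 2)]      # ...but only to a concrete dict
except TypeError:
    pass
else:
    raise AssertionError("TypeError expected")

try:
    del f.__dict__               # and it may not be deleted
except TypeError:
    pass
else:
    raise AssertionError("TypeError expected")
```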
Examples
Here are some examples of what you can do with this feature.
def a():
pass
a.publish = 1
a.unittest = '''...'''
if a.publish:
print a()
if hasattr(a, 'unittest'):
testframework.execute(a.unittest)
class C:
def a(self):
'just a docstring'
a.publish = 1
c = C()
if c.a.publish:
publish(c.a())
Other Uses
Paul Prescod enumerated a bunch of other uses:
http://mail.python.org/pipermail/python-dev/2000-April/003364.html
Future Directions
Here are a number of future directions to consider. Any adoption
of these ideas would require a new PEP, which referenced this one,
and would have to be targeted at a Python version subsequent to
the 2.1 release.
- A previous version of this PEP allowed for both setter and
getter of attributes on unbound methods, and only getter on
bound methods. A number of problems were discovered with this
policy.
Because method attributes were stored in the underlying
function, this caused several potentially surprising results:
class C:
def a(self): pass
c1 = C()
c2 = C()
c1.a.publish = 1
# c2.a.publish would now be == 1 also!
Because a change to `a' bound c1 also caused a change to `a'
bound to c2, setting of attributes on bound methods was
disallowed. However, even allowing setting of attributes on
unbound methods has its ambiguities:
class D(C): pass
class E(C): pass
D.a.publish = 1
# E.a.publish would now be == 1 also!
For this reason, the current PEP disallows setting attributes on
either bound or unbound methods, but does allow for getting
attributes on either -- both return the attribute value on the
underlying function object.
A future PEP might propose to implement setting (bound or
unbound) method attributes by setting attributes on the instance
or class, using special naming conventions. I.e.
class C:
def a(self): pass
C.a.publish = 1
C.__a_publish__ == 1 # true
c = C()
c.a.publish = 2
c.__a_publish__ == 2 # true
d = C()
d.__a_publish__ == 1 # true
Here, a lookup on the instance would look to the instance's
dictionary first, followed by a lookup on the class's
dictionary, and finally a lookup on the function object's
dictionary.
- Currently, Python supports function attributes only on Python
functions (i.e. those that are written in Python, not those that
are built-in). Should it be worthwhile, a separate patch can be
crafted that will add function attributes to built-ins.
- __doc__ is the only function attribute that currently has
syntactic support for conveniently setting. It may be
worthwhile to eventually enhance the language for supporting
easy function attribute setting. Here are some syntaxes
suggested by PEP reviewers:
def a {
'publish' : 1,
'unittest': '''...''',
}
(args):
# ...
def a(args):
"""The usual docstring."""
{'publish' : 1,
'unittest': '''...''',
# etc.
}
def a(args) having (publish = 1):
# see reference [3]
pass
The BDFL is currently against any such special syntactic support
for setting arbitrary function attributes. Any syntax proposals
would have to be outlined in new PEPs.
Dissenting Opinion
When this was discussed on the python-dev mailing list in April
2000, a number of dissenting opinions were voiced. For
completeness, the discussion thread starts here:
http://mail.python.org/pipermail/python-dev/2000-April/003361.html
The dissenting arguments appear to fall under the following
categories:
- no clear purpose (what does it buy you?)
- other ways to do it (e.g. mappings as class attributes)
- useless until syntactic support is included
Countering some of these arguments is the observation that with
vanilla Python 2.0, __doc__ can in fact be set to any type of
object, so some semblance of writable function attributes is
already feasible. But that approach is yet another corruption of
__doc__.
And while it is of course possible to add mappings to class
objects (or in the case of function attributes, to the function's
module), it is more difficult and less obvious how to extract the
attribute values for inspection.
Finally, it may be desirable to add syntactic support, much the
same way that __doc__ syntactic support exists. This can be
considered separately from the ability to actually set and get
function attributes.
Reference Implementation
This PEP has been accepted and the implementation has been
integrated into Python 2.1.
References
[1] Aycock, "Compiling Little Languages in Python",
http://www.foretec.com/python/workshops/1998-11/proceedings/papers/aycock-little/aycock-little.html
[2] http://classic.zope.org:8080/Documentation/Reference/ORB
[3] Hudson, Michael, SourceForge patch implementing this syntax,
http://sourceforge.net/tracker/index.php?func=detail&aid=403441&group_id=5470&atid=305470
Copyright
This document has been placed in the Public Domain.
pep-0233 Python Online Help
| PEP: | 233 |
|---|---|
| Title: | Python Online Help |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Paul Prescod <paul at prescod.net> |
| Status: | Deferred |
| Type: | Standards Track |
| Created: | 11-Dec-2000 |
| Python-Version: | 2.1 |
| Post-History: | |
Abstract
This PEP describes a command-line driven online help facility for
Python. The facility should be able to build on existing
documentation facilities such as the Python documentation and
docstrings. It should also be extensible for new types and
modules.
Interactive use:
Simply typing "help" describes the help function (through repr()
overloading).
"help" can also be used as a function:
The function takes the following forms of input:
help( "string" ) -- built-in topic or global
help( <ob> ) -- docstring from object or type
help( "doc:filename" ) -- filename from Python documentation
If you ask for a global, it can be a fully-qualified name such as
help("xml.dom").
You can also use the facility from the command line:
python --help if
In either situation, the output does paging similar to the "more"
command.
Implementation
The help function is implemented in an onlinehelp module which is
demand-loaded.
There should be options for fetching help information from
environments other than the command line through the onlinehelp
module:
onlinehelp.gethelp(object_or_string) -> string
It should also be possible to override the help display function
by assigning to onlinehelp.displayhelp(object_or_string).
The module should be able to extract module information from
either the HTML or LaTeX versions of the Python documentation.
Links should be accommodated in a "lynx-like" manner.
Over time, it should also be able to recognize when docstrings are
in "special" syntaxes like structured text, HTML and LaTeX and
decode them appropriately.
A prototype implementation is available with the Python source
distribution as nondist/sandbox/doctools/onlinehelp.py.
Built-in Topics
help( "intro" ) - What is Python? Read this first!
help( "keywords" ) - What are the keywords?
help( "syntax" ) - What is the overall syntax?
help( "operators" ) - What operators are available?
help( "builtins" ) - What functions, types, etc. are built-in?
help( "modules" ) - What modules are in the standard library?
help( "copyright" ) - Who owns Python?
help( "moreinfo" ) - Where is there more information?
help( "changes" ) - What changed in Python 2.0?
help( "extensions" ) - What extensions are installed?
help( "faq" ) - What questions are frequently asked?
help( "ack" ) - Who has done work on Python lately?
Security Issues
This module will attempt to import modules with the same names as
requested topics. Don't use the modules if you are not confident
that everything in your PYTHONPATH is from a trusted source.
pep-0234 Iterators
| PEP: | 234 |
|---|---|
| Title: | Iterators |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Ka-Ping Yee <ping at zesty.ca>, Guido van Rossum <guido at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 30-Jan-2001 |
| Python-Version: | 2.1 |
| Post-History: | 30-Apr-2001 |
Abstract
This document proposes an iteration interface that objects can
provide to control the behaviour of 'for' loops. Looping is
customized by providing a method that produces an iterator object.
The iterator provides a 'get next value' operation that produces
the next item in the sequence each time it is called, raising an
exception when no more items are available.
In addition, specific iterators over the keys of a dictionary and
over the lines of a file are proposed, and a proposal is made to
allow spelling dict.has_key(key) as "key in dict".
Note: this is an almost complete rewrite of this PEP by the second
author, describing the actual implementation checked into the
trunk of the Python 2.2 CVS tree. It is still open for
discussion. Some of the more esoteric proposals in the original
version of this PEP have been withdrawn for now; these may be the
subject of a separate PEP in the future.
C API Specification
A new exception is defined, StopIteration, which can be used to
signal the end of an iteration.
A new slot named tp_iter for requesting an iterator is added to
the type object structure. This should be a function of one
PyObject * argument returning a PyObject *, or NULL. To use this
slot, a new C API function PyObject_GetIter() is added, with the
same signature as the tp_iter slot function.
Another new slot, named tp_iternext, is added to the type
structure, for obtaining the next value in the iteration. To use
this slot, a new C API function PyIter_Next() is added. The
signature for both the slot and the API function is as follows,
although the NULL return conditions differ: the argument is a
PyObject * and so is the return value. When the return value is
non-NULL, it is the next value in the iteration. When it is NULL,
then for the tp_iternext slot there are three possibilities:
- No exception is set; this implies the end of the iteration.
- The StopIteration exception (or a derived exception class) is
set; this implies the end of the iteration.
- Some other exception is set; this means that an error occurred
that should be propagated normally.
The higher-level PyIter_Next() function clears the StopIteration
exception (or derived exception) when it occurs, so its NULL return
conditions are simpler:
- No exception is set; this means iteration has ended.
- Some exception is set; this means an error occurred, and should
be propagated normally.
Iterators implemented in C should *not* implement a next() method
with similar semantics as the tp_iternext slot! When the type's
dictionary is initialized (by PyType_Ready()), the presence of a
tp_iternext slot causes a method next() wrapping that slot to be
added to the type's tp_dict. (Exception: if the type doesn't use
PyObject_GenericGetAttr() to access instance attributes, the
next() method in the type's tp_dict may not be seen.) (Due to a
misunderstanding in the original text of this PEP, in Python 2.2,
all iterator types implemented a next() method that was overridden
by the wrapper; this has been fixed in Python 2.3.)
To ensure binary backwards compatibility, a new flag
Py_TPFLAGS_HAVE_ITER is added to the set of flags in the tp_flags
field, and to the default flags macro. This flag must be tested
before accessing the tp_iter or tp_iternext slots. The macro
PyIter_Check() tests whether an object has the appropriate flag
set and has a non-NULL tp_iternext slot. There is no such macro
for the tp_iter slot (since the only place where this slot is
referenced should be PyObject_GetIter(), and this can check for
the Py_TPFLAGS_HAVE_ITER flag directly).
(Note: the tp_iter slot can be present on any object; the
tp_iternext slot should only be present on objects that act as
iterators.)
For backwards compatibility, the PyObject_GetIter() function
implements fallback semantics when its argument is a sequence that
does not implement a tp_iter function: a lightweight sequence
iterator object is constructed in that case which iterates over
the items of the sequence in the natural order.
The Python bytecode generated for 'for' loops is changed to use
new opcodes, GET_ITER and FOR_ITER, that use the iterator protocol
rather than the sequence protocol to get the next value for the
loop variable. This makes it possible to use a 'for' loop to loop
over non-sequence objects that support the tp_iter slot. Other
places where the interpreter loops over the values of a sequence
should also be changed to use iterators.
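The GET_ITER/FOR_ITER pairing corresponds to this desugaring of a 'for'
loop, sketched in modern Python (where the 'get next value' operation is
spelled via the next() builtin; the function name is hypothetical):

```python
# Manual desugaring of:  for x in obj: body(x)
def manual_for(obj, body):
    it = iter(obj)           # what the GET_ITER opcode does
    while True:
        try:
            x = next(it)     # what the FOR_ITER opcode does
        except StopIteration:
            break            # end of iteration, leave the loop
        body(x)

out = []
manual_for([1, 2, 3], out.append)
assert out == [1, 2, 3]
```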
Iterators ought to implement the tp_iter slot as returning a
reference to themselves; this is needed to make it possible to
use an iterator (as opposed to a sequence) in a for loop.
Iterator implementations (in C or in Python) should guarantee that
once the iterator has signalled its exhaustion, subsequent calls
to tp_iternext or to the next() method will continue to do so. It
is not specified whether an iterator should enter the exhausted
state when an exception (other than StopIteration) is raised.
Note that Python cannot guarantee that user-defined or 3rd party
iterators implement this requirement correctly.
Python API Specification
The StopIteration exception is made visible as one of the
standard exceptions. It is derived from Exception.
A new built-in function is defined, iter(), which can be called in
two ways:
- iter(obj) calls PyObject_GetIter(obj).
- iter(callable, sentinel) returns a special kind of iterator that
calls the callable to produce a new value, and compares the
return value to the sentinel value. If the return value equals
the sentinel, this signals the end of the iteration and
StopIteration is raised rather than returning normally; if the
return value does not equal the sentinel, it is returned as the
next value from the iterator. If the callable raises an
exception, this is propagated normally; in particular, the
function is allowed to raise StopIteration as an alternative way
to end the iteration. (This functionality is available from the
C API as PyCallIter_New(callable, sentinel).)
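The two-argument form still exists unchanged in modern Python; a short
sketch, using io.StringIO as a stand-in for a real file:

```python
import io
import itertools

# Reading lines until EOF ("" is the sentinel value):
f = io.StringIO("a\nb\n")
assert list(iter(f.readline, "")) == ["a\n", "b\n"]

# Any zero-argument callable works; iteration stops when its return
# value equals the sentinel (here, 4):
counter = itertools.count(1)
assert list(iter(lambda: next(counter), 4)) == [1, 2, 3]
```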
Iterator objects returned by either form of iter() have a next()
method. This method either returns the next value in the
iteration, or raises StopIteration (or a derived exception class)
to signal the end of the iteration. Any other exception should be
considered to signify an error and should be propagated normally,
not taken to mean the end of the iteration.
Classes can define how they are iterated over by defining an
__iter__() method; this should take no additional arguments and
return a valid iterator object. A class that wants to be an
iterator should implement two methods: a next() method that behaves
as described above, and an __iter__() method that returns self.
The two methods correspond to two distinct protocols:
1. An object can be iterated over with "for" if it implements
__iter__() or __getitem__().
2. An object can function as an iterator if it implements next().
Container-like objects usually support protocol 1. Iterators are
currently required to support both protocols. The semantics of
iteration come only from protocol 2; protocol 1 is present to make
iterators behave like sequences; in particular so that code
receiving an iterator can use a for-loop over the iterator.
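A minimal class satisfying both protocols, in modern spelling (the bare
next() method described here later became __next__ plus the next()
builtin; the class name is illustrative):

```python
class CountUpTo:
    """Iterator yielding 1, 2, ..., limit."""
    def __init__(self, limit):
        self.i = 0
        self.limit = limit
    def __iter__(self):        # protocol 1: usable in a for-loop
        return self            # iterators return themselves
    def __next__(self):        # protocol 2: produce the next value
        if self.i >= self.limit:
            raise StopIteration    # signals exhaustion, not an error
        self.i += 1
        return self.i

assert list(CountUpTo(3)) == [1, 2, 3]
assert list(CountUpTo(0)) == []      # stays exhausted once done
```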
Dictionary Iterators
- Dictionaries implement a sq_contains slot that implements the
same test as the has_key() method. This means that we can write
if k in dict: ...
which is equivalent to
if dict.has_key(k): ...
- Dictionaries implement a tp_iter slot that returns an efficient
iterator that iterates over the keys of the dictionary. During
such an iteration, the dictionary should not be modified, except
that setting the value for an existing key is allowed (deletions
or additions are not, nor is the update() method). This means
that we can write
for k in dict: ...
which is equivalent to, but much faster than
for k in dict.keys(): ...
as long as the restriction on modifications to the dictionary
(either by the loop or by another thread) are not violated.
- Add methods to dictionaries that return different kinds of
iterators explicitly:
for key in dict.iterkeys(): ...
for value in dict.itervalues(): ...
for key, value in dict.iteritems(): ...
This means that "for x in dict" is shorthand for "for x in
dict.iterkeys()".
Other mappings, if they support iterators at all, should also
iterate over the keys. However, this should not be taken as an
absolute rule; specific applications may have different
requirements.
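In modern Python the iterkeys()/itervalues()/iteritems() methods became
the keys()/values()/items() views, but the behaviour described above
(default iteration over keys, "in" as the membership test, mutation
detected mid-loop) is unchanged; a sketch:

```python
d = {'a': 1, 'b': 2}

# Default iteration is over keys, and "in" tests keys (sq_contains):
assert list(d) == list(d.keys()) == ['a', 'b']
assert 'a' in d and 1 not in d

# The explicit-iterator methods are now the view methods:
assert [(k, v) for k, v in d.items()] == [('a', 1), ('b', 2)]

# Adding a key while iterating is detected and rejected:
try:
    for k in d:
        d[k * 2] = 0
except RuntimeError:
    pass
else:
    raise AssertionError("RuntimeError expected")
```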
File Iterators
The following proposal is useful because it provides us with a
good answer to the complaint that the common idiom to iterate over
the lines of a file is ugly and slow.
- Files implement a tp_iter slot that is equivalent to
iter(f.readline, ""). This means that we can write
for line in file:
...
as a shorthand for
for line in iter(file.readline, ""):
...
which is equivalent to, but faster than
while 1:
line = file.readline()
if not line:
break
...
This also shows that some iterators are destructive: they consume
all the values and a second iterator cannot easily be created that
iterates independently over the same values. You could open the
file for a second time, or seek() to the beginning, but these
solutions don't work for all file types, e.g. they don't work when
the open file object really represents a pipe or a stream socket.
Because the file iterator uses an internal buffer, mixing this
with other file operations (e.g. file.readline()) doesn't work
right. Also, the following code:
for line in file:
if line == "\n":
break
for line in file:
print line,
doesn't work as you might expect, because the iterator created by
the second for-loop doesn't take the buffer read-ahead by the
first for-loop into account. A correct way to write this is:
it = iter(file)
for line in it:
if line == "\n":
break
for line in it:
print line,
(The rationale for these restrictions is that "for line in file"
ought to become the recommended, standard way to iterate over the
lines of a file, and this should be as fast as can be. The
iterator version is considerably faster than calling readline(),
due to the internal buffer in the iterator.)
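In modern Python the read-ahead pitfall was later engineered away: a file
object is its own iterator, so successive for-loops share one position
and the explicit it = iter(file) idiom from this PEP works as expected. A
sketch, with io.StringIO standing in for a real file:

```python
import io

f = io.StringIO("header\n\nbody1\nbody2\n")
it = iter(f)             # a file-like object is its own iterator...
assert it is f           # ...so both loops below share one position

for line in it:
    if line == "\n":
        break            # stop at the blank separator line

assert list(it) == ["body1\n", "body2\n"]   # continues where we left off
```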
Rationale
If all the parts of the proposal are included, this addresses many
concerns in a consistent and flexible fashion. Among its chief
virtues are the following four -- no, five -- no, six -- points:
1. It provides an extensible iterator interface.
2. It allows performance enhancements to list iteration.
3. It allows big performance enhancements to dictionary iteration.
4. It allows one to provide an interface for just iteration
without pretending to provide random access to elements.
5. It is backward-compatible with all existing user-defined
classes and extension objects that emulate sequences and
mappings, even mappings that only implement a subset of
{__getitem__, keys, values, items}.
6. It makes code iterating over non-sequence collections more
concise and readable.
Resolved Issues
The following topics have been decided by consensus or BDFL
pronouncement.
- Two alternative spellings for next() have been proposed but
rejected: __next__(), because it corresponds to a type object
slot (tp_iternext); and __call__(), because this is the only
operation.
Arguments against __next__(): while many iterators are used in
for loops, it is expected that user code will also call next()
directly, so having to write __next__() is ugly; also, a
possible extension of the protocol would be to allow for prev(),
current() and reset() operations; surely we don't want to use
__prev__(), __current__(), __reset__().
Arguments against __call__() (the original proposal): taken out
of context, x() is not very readable, while x.next() is clear;
there's a danger that every special-purpose object wants to use
__call__() for its most common operation, causing more confusion
than clarity.
(In retrospect, it might have been better to go for __next__()
and have a new built-in, next(it), which calls it.__next__().
But alas, it's too late; this has been deployed in Python 2.2
since December 2001.)
- Some folks have requested the ability to restart an iterator.
This should be dealt with by calling iter() on a sequence
repeatedly, not by the iterator protocol itself. (See also
requested extensions below.)
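A quick sketch of that distinction: each call to iter() on a sequence produces a fresh iterator, so "restarting" is simply a matter of asking for a new one.

```python
seq = ["a", "b", "c"]

first = list(iter(seq))    # a fresh iterator over seq
second = list(iter(seq))   # another, fully independent iterator
print(first == second)     # True: the sequence can be re-iterated at will
```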
- It has been questioned whether an exception to signal the end of
the iteration isn't too expensive. Several alternatives for the
StopIteration exception have been proposed: a special value End
to signal the end, a function end() to test whether the iterator
is finished, even reusing the IndexError exception.
- A special value has the problem that if a sequence ever
contains that special value, a loop over that sequence will
end prematurely without any warning. If the experience with
null-terminated C strings hasn't taught us the problems this
can cause, imagine the trouble a Python introspection tool
would have iterating over a list of all built-in names,
assuming that the special End value was a built-in name!
- Calling an end() function would require two calls per
iteration. Two calls is much more expensive than one call
plus a test for an exception. Especially the time-critical
for loop can test very cheaply for an exception.
- Reusing IndexError can cause confusion because it can be a
genuine error, which would be masked by ending the loop
prematurely.
- Some have asked for a standard iterator type. Presumably all
iterators would have to be derived from this type. But this is
not the Python way: dictionaries are mappings because they
support __getitem__() and a handful other operations, not
because they are derived from an abstract mapping type.
- Regarding "if key in dict": there is no doubt that the
dict.has_key(x) interpretation of "x in dict" is by far the
most useful interpretation, probably the only useful one. There
has been resistance against this because "x in list" checks
whether x is present among the values, while the proposal makes
"x in dict" check whether x is present among the keys. Given
that the symmetry between lists and dictionaries is very weak,
this argument does not have much weight.
- The name iter() is an abbreviation. Alternatives proposed
include iterate(), traverse(), but these appear too long.
Python has a history of using abbreviations for common builtins,
e.g. repr(), str(), len().
Resolution: iter() it is.
- Using the same name for two different operations (getting an
iterator from an object and making an iterator for a function
with a sentinel value) is somewhat ugly. I haven't seen a
better name for the second operation though, and since they both
return an iterator, it's easy to remember.
Resolution: the builtin iter() takes an optional argument, which
is the sentinel to look for.
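That two-argument form of iter() survives unchanged in modern Python; a small sketch:

```python
import io

f = io.StringIO("a\nb\n")
# iter(callable, sentinel): call f.readline repeatedly, stopping as soon
# as it returns the sentinel "" (which readline yields at end of file).
lines = list(iter(f.readline, ""))
print(lines)   # ['a\n', 'b\n']
```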
- Once a particular iterator object has raised StopIteration, will
it also raise StopIteration on all subsequent next() calls?
Some say that it would be useful to require this, others say
that it is useful to leave this open to individual iterators.
Note that this may require an additional state bit for some
iterator implementations (e.g. function-wrapping iterators).
Resolution: once StopIteration is raised, calling it.next()
continues to raise StopIteration.
Note: this was in fact not implemented in Python 2.2; there are
many cases where an iterator's next() method can raise
StopIteration on one call but not on the next. This has been
remedied in Python 2.3.
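The "sticky" behavior can be observed directly with the modern next() built-in, whose optional default argument turns exhaustion into a value rather than an exception:

```python
it = iter([1])

assert next(it) == 1                # the only item
assert next(it, "done") == "done"   # exhausted: StopIteration -> default
assert next(it, "done") == "done"   # and it stays exhausted ("sticky")
```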
- It has been proposed that a file object should be its own
iterator, with a next() method returning the next line. This
has certain advantages, and makes it even clearer that this
iterator is destructive. The disadvantage is that this would
make it even more painful to implement the "sticky
StopIteration" feature proposed in the previous bullet.
Resolution: tentatively rejected (though there are still people
arguing for this).
- Some folks have requested extensions of the iterator protocol,
e.g. prev() to get the previous item, current() to get the
current item again, finished() to test whether the iterator is
finished, and maybe even others, like rewind(), __len__(),
position().
While some of these are useful, many of these cannot easily be
implemented for all iterator types without adding arbitrary
buffering, and sometimes they can't be implemented at all (or
not reasonably). E.g. anything to do with reversing directions
can't be done when iterating over a file or function. Maybe a
separate PEP can be drafted to standardize the names for such
operations when they are implementable.
Resolution: rejected.
- There has been a long discussion about whether

    for x in dict: ...

should assign x the successive keys, values, or items of the
dictionary. The symmetry between "if x in y" and "for x in y"
suggests that it should iterate over keys. This symmetry has been
observed by many independently and has even been used to "explain"
one using the other. This is because for sequences, "if x in y"
iterates over y comparing the iterated values to x. If we adopt
both of the above proposals, this will also hold for
dictionaries.
The argument against making "for x in dict" iterate over the keys
comes mostly from a practicality point of view: scans of the
standard library show that there are about as many uses of "for x
in dict.items()" as there are of "for x in dict.keys()", with the
items() version having a small majority. Presumably many of the
loops using keys() use the corresponding value anyway, by writing
dict[x], so (the argument goes) by making both the key and value
available, we could support the largest number of cases. While
this is true, I (Guido) find the correspondence between "for x in
dict" and "if x in dict" too compelling to break, and there's not
much overhead in having to write dict[x] to explicitly get the
value.
For fast iteration over items, use "for key, value in
dict.iteritems()". I've timed the difference between

    for key in dict: dict[key]

and

    for key, value in dict.iteritems(): pass

and found that the latter is only about 7% faster.
Resolution: By BDFL pronouncement, "for x in dict" iterates over
the keys, and dictionaries have iteritems(), iterkeys(), and
itervalues() to return the different flavors of dictionary
iterators.
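In Python 3 the iter* methods were renamed, but the pronouncement itself stands; a brief sketch of the resolved behavior (items()/keys()/values() now return lazy views):

```python
d = {"a": 1, "b": 2}

assert [x for x in d] == list(d.keys())   # "for x in dict" yields the keys
assert "a" in d                           # "x in dict" tests key membership
assert 1 not in d                         # values are not consulted

# iteritems()/iterkeys()/itervalues() became items()/keys()/values():
assert [(k, d[k]) for k in d] == list(d.items())
```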
Mailing Lists
The iterator protocol has been discussed extensively in a mailing
list on SourceForge:
http://lists.sourceforge.net/lists/listinfo/python-iterators
Initially, some of the discussion was carried out at Yahoo;
archives are still accessible:
http://groups.yahoo.com/group/python-iter
Copyright
This document is in the public domain.
pep-0235 Import on Case-Insensitive Platforms
| PEP: | 235 |
|---|---|
| Title: | Import on Case-Insensitive Platforms |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Tim Peters <tim at zope.com> |
| Status: | Final |
| Type: | Standards Track |
| Created: | |
| Python-Version: | 2.1 |
| Post-History: | 16 February 2001 |
Note
This is essentially a retroactive PEP: the issue came up too late
in the 2.1 release process to solicit wide opinion before deciding
what to do, and can't be put off until 2.2 without also delaying
the Cygwin and MacOS X ports.
Motivation
File systems vary across platforms in whether or not they preserve
the case of filenames, and in whether or not the platform C
library file-opening functions do or don't insist on
case-sensitive matches:
                   case-preserving       case-destroying
                 +-------------------+------------------+
case-sensitive   | most Unix flavors | brrrrrrrrrr      |
                 +-------------------+------------------+
case-insensitive | Windows           | some unfortunate |
                 | MacOSX HFS+       | network schemes  |
                 | Cygwin            |                  |
                 |                   | OpenVMS          |
                 +-------------------+------------------+
In the upper left box, if you create "fiLe" it's stored as "fiLe",
and only open("fiLe") will open it (open("file") will not, nor
will the 14 other variations on that theme).
In the lower right box, if you create "fiLe", there's no telling
what it's stored as -- but most likely as "FILE" -- and any of the
16 obvious variations on open("FilE") will open it.
The lower left box is a mix: creating "fiLe" stores "fiLe" in the
platform directory, but you don't have to match case when opening
it; any of the 16 obvious variations on open("FILe") work.
NONE OF THAT IS CHANGING! Python will continue to follow platform
conventions w.r.t. whether case is preserved when creating a file,
and w.r.t. whether open() requires a case-sensitive match. In
practice, you should always code as if matches were
case-sensitive, else your program won't be portable.
What's proposed is to change the semantics of Python "import"
statements, and there *only* in the lower left box.
Current Lower-Left Semantics
Support for MacOSX HFS+, and for Cygwin, is new in 2.1, so nothing
is changing there. What's changing is Windows behavior. Here are
the current rules for import on Windows:
1. Despite that the filesystem is case-insensitive, Python insists
on a case-sensitive match. But not in the way the upper left
box works: if you have two files, FiLe.py and file.py on
sys.path, and do

    import file

then if Python finds FiLe.py first, it raises a NameError. It
does *not* go on to find file.py; indeed, it's impossible to
import any but the first case-insensitive match on sys.path,
and then only if case matches exactly in the first
case-insensitive match.
2. An ugly exception: if the first case-insensitive match on
sys.path is for a file whose name is entirely in upper case
(FILE.PY or FILE.PYC or FILE.PYO), then the import silently
grabs that, no matter what mixture of case was used in the
import statement. This is apparently to cater to miserable old
filesystems that really fit in the lower right box. But this
exception is unique to Windows, for reasons that may or may not
exist.
3. And another exception: if the environment variable PYTHONCASEOK
exists, Python silently grabs the first case-insensitive match
of any kind.
So these Windows rules are pretty complicated, and neither match
the Unix rules nor provide semantics natural for the native
filesystem. That makes them hard to explain to Unix *or* Windows
users. Nevertheless, they've worked fine for years, and in
isolation there's no compelling reason to change them.
However, that was before the MacOSX HFS+ and Cygwin ports arrived.
They also have case-preserving case-insensitive filesystems, but
the people doing the ports despised the Windows rules. Indeed, a
patch to make HFS+ act like Unix for imports got past a reviewer
and into the code base, which incidentally made Cygwin also act
like Unix (but this met the unbounded approval of the Cygwin
folks, so they sure didn't complain -- they had patches of their
own pending to do this, but the reviewer for those balked).
At a higher level, we want to keep Python consistent, by following
the same rules on *all* platforms with case-preserving
case-insensitive filesystems.
Proposed Semantics
The proposed new semantics for the lower left box:
A. If the PYTHONCASEOK environment variable exists, same as
before: silently accept the first case-insensitive match of any
kind; raise ImportError if none found.
B. Else search sys.path for the first case-sensitive match; raise
ImportError if none found.
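A minimal sketch of rule B, assuming a naive directory scan; the helper name and structure here are illustrative, not the actual import machinery:

```python
import os

def find_module_file(name, search_paths):
    """First exact-case match for name + '.py' on search_paths, else None.

    Loosely mimics rule B: directory entries are compared as strings,
    so the match is case-sensitive even on case-insensitive filesystems.
    """
    target = name + ".py"
    for directory in search_paths:
        try:
            entries = os.listdir(directory)
        except OSError:
            continue
        if target in entries:   # exact, case-sensitive string comparison
            return os.path.join(directory, target)
    return None
```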
#B is the same rule as is used on Unix, so this will improve cross-
platform portability. That's good. #B is also the rule the Mac
and Cygwin folks want (and wanted enough to implement themselves,
multiple times, which is a powerful argument in PythonLand). It
can't cause any existing non-exceptional Windows import to fail,
because any existing non-exceptional Windows import finds a
case-sensitive match first in the path -- and it still will. An
exceptional Windows import currently blows up with a NameError or
ImportError, in which latter case it still will, or in which
former case will continue searching, and either succeed or blow up
with an ImportError.
#A is needed to cater to case-destroying filesystems mounted on Windows,
and *may* also be used by people so enamored of "natural" Windows
behavior that they're willing to set an environment variable to
get it. I don't intend to implement #A for Unix too, but that's
just because I'm not clear on how I *could* do so efficiently (I'm
not going to slow imports under Unix just for theoretical purity).
The potential damage is here: #2 (matching on ALLCAPS.PY) is
proposed to be dropped. Case-destroying filesystems are a
vanishing breed, and support for them is ugly. We're already
supporting (and will continue to support) PYTHONCASEOK for their
benefit, but they don't deserve multiple hacks in 2001.
pep-0236 Back to the __future__
| PEP: | 236 |
|---|---|
| Title: | Back to the __future__ |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Tim Peters <tim at zope.com> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 26-Feb-2001 |
| Python-Version: | 2.1 |
| Post-History: | 26-Feb-2001 |
Motivation
From time to time, Python makes an incompatible change to the
advertised semantics of core language constructs, or changes their
accidental (implementation-dependent) behavior in some way. While this
is never done capriciously, and is always done with the aim of
improving the language over the long term, over the short term it's
contentious and disrupting.
PEP 5, Guidelines for Language Evolution[1] suggests ways to ease
the pain, and this PEP introduces some machinery in support of that.
PEP 227, Statically Nested Scopes[2] is the first application, and
will be used as an example here.
Intent
[Note: This is policy, and so should eventually move into PEP 5 [1]]
When an incompatible change to core language syntax or semantics is
being made:
1. The release C that introduces the change does not change the
syntax or semantics by default.
2. A future release R is identified in which the new syntax or semantics
will be enforced.
3. The mechanisms described in PEP 3, Warning Framework[3] are
used to generate warnings, whenever possible, about constructs
or operations whose meaning may[4] change in release R.
4. The new future_statement (see below) can be explicitly included in a
module M to request that the code in module M use the new syntax or
semantics in the current release C.
So old code continues to work by default, for at least one release,
although it may start to generate new warning messages. Migration to
the new syntax or semantics can proceed during that time, using the
future_statement to make modules containing it act as if the new syntax
or semantics were already being enforced.
Note that there is no need to involve the future_statement machinery
in new features unless they can break existing code; fully backward-
compatible additions can-- and should --be introduced without a
corresponding future_statement.
Syntax
A future_statement is simply a from/import statement using the reserved
module name __future__:
    future_statement: "from" "__future__" "import" feature ["as" name]
                      ("," feature ["as" name])*

    feature: identifier
    name: identifier
In addition, all future_statements must appear near the top of the
module. The only lines that can appear before a future_statement are:
+ The module docstring (if any).
+ Comments.
+ Blank lines.
+ Other future_statements.
Example:

    """This is a module docstring."""

    # This is a comment, preceded by a blank line and followed by
    # a future_statement.
    from __future__ import nested_scopes

    from math import sin
    from __future__ import alabaster_weenoblobs  # compile-time error!
    # That was an error because preceded by a non-future_statement.
Semantics
A future_statement is recognized and treated specially at compile time:
changes to the semantics of core constructs are often implemented by
generating different code. It may even be the case that a new feature
introduces new incompatible syntax (such as a new reserved word), in
which case the compiler may need to parse the module differently. Such
decisions cannot be pushed off until runtime.
For any given release, the compiler knows which feature names have been
defined, and raises a compile-time error if a future_statement contains
a feature not known to it[5].
The direct runtime semantics are the same as for any import statement:
there is a standard module __future__.py, described later, and it will
be imported in the usual way at the time the future_statement is
executed.
The *interesting* runtime semantics depend on the specific feature(s)
"imported" by the future_statement(s) appearing in the module.
Note that there is nothing special about the statement:

    import __future__ [as name]

That is not a future_statement; it's an ordinary import statement, with
no special semantics or syntax restrictions.
Example
Consider this code, in file scope.py:

    x = 42

    def f():
        x = 666
        def g():
            print "x is", x
        g()

    f()
Under 2.0, it prints:

    x is 42

Nested scopes[2] are being introduced in 2.1. But under 2.1, it still
prints

    x is 42

and also generates a warning.

In 2.2, and also in 2.1 *if* "from __future__ import nested_scopes" is
included at the top of scope.py, it prints

    x is 666
Standard Module __future__.py
Lib/__future__.py is a real module, and serves three purposes:
1. To avoid confusing existing tools that analyze import statements and
expect to find the modules they're importing.
2. To ensure that future_statements run under releases prior to 2.1
at least yield runtime exceptions (the import of __future__ will
fail, because there was no module of that name prior to 2.1).
3. To document when incompatible changes were introduced, and when they
will be-- or were --made mandatory. This is a form of executable
documentation, and can be inspected programmatically via importing
__future__ and examining its contents.
Each statement in __future__.py is of the form:

    FeatureName = "_Feature(" OptionalRelease "," MandatoryRelease ")"

where, normally, OptionalRelease < MandatoryRelease, and both are
5-tuples of the same form as sys.version_info:

    (PY_MAJOR_VERSION,   # the 2 in 2.1.0a3; an int
     PY_MINOR_VERSION,   # the 1; an int
     PY_MICRO_VERSION,   # the 0; an int
     PY_RELEASE_LEVEL,   # "alpha", "beta", "candidate" or "final"; string
     PY_RELEASE_SERIAL   # the 3; an int
    )
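The same five-field shape is still visible on sys.version_info in any modern interpreter, which additionally exposes named attributes:

```python
import sys

vi = sys.version_info
assert len(vi) == 5
assert vi.releaselevel in ("alpha", "beta", "candidate", "final")
print(tuple(vi))   # e.g. (3, 12, 1, 'final', 0)
```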
OptionalRelease records the first release in which

    from __future__ import FeatureName

was accepted.
In the case of MandatoryReleases that have not yet occurred,
MandatoryRelease predicts the release in which the feature will become
part of the language.
Else MandatoryRelease records when the feature became part of the
language; in releases at or after that, modules no longer need

    from __future__ import FeatureName

to use the feature in question, but may continue to use such imports.
MandatoryRelease may also be None, meaning that a planned feature got
dropped.
Instances of class _Feature have two corresponding methods,
.getOptionalRelease() and .getMandatoryRelease().
No feature line will ever be deleted from __future__.py.
Example line:

    nested_scopes = _Feature((2, 1, 0, "beta", 1), (2, 2, 0, "final", 0))
This means that

    from __future__ import nested_scopes

will work in all releases at or after 2.1b1, and that nested_scopes are
intended to be enforced starting in release 2.2.
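The __future__ module still ships with modern Python, so this very feature line can be inspected; note that the MandatoryRelease actually recorded in the module may differ in detail from the illustration above.

```python
import __future__

feat = __future__.nested_scopes
assert feat.getOptionalRelease() == (2, 1, 0, "beta", 1)
assert len(feat.getMandatoryRelease()) == 5   # same shape as sys.version_info
```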
Resolved Problem: Runtime Compilation
Several Python features can compile code during a module's runtime:
1. The exec statement.
2. The execfile() function.
3. The compile() function.
4. The eval() function.
5. The input() function.
Since a module M containing a future_statement naming feature F
explicitly requests that the current release act like a future release
with respect to F, any code compiled dynamically from text passed to
one of these from within M should probably also use the new syntax or
semantics associated with F. The 2.1 release does behave this way.
This isn't always desired, though. For example, doctest.testmod(M)
compiles examples taken from strings in M, and those examples should
use M's choices, not necessarily the doctest module's choices. In the
2.1 release, this isn't possible, and no scheme has yet been suggested
for working around this. NOTE: PEP 264 later addressed this in a
flexible way, by adding optional arguments to compile().
In any case, a future_statement appearing "near the top" (see Syntax
above) of text compiled dynamically by an exec, execfile() or compile()
applies to the code block generated, but has no further effect on the
module that executes such an exec, execfile() or compile(). This
can't be used to affect eval() or input(), however, because they only
allow expression input, and a future_statement is not an expression.
Resolved Problem: Native Interactive Shells
There are two ways to get an interactive shell:
1. By invoking Python from a command line without a script argument.
2. By invoking Python from a command line with the -i switch and with a
script argument.
An interactive shell can be seen as an extreme case of runtime
compilation (see above): in effect, each statement typed at an
interactive shell prompt runs a new instance of exec, compile() or
execfile(). A future_statement typed at an interactive shell applies to
the rest of the shell session's life, as if the future_statement had
appeared at the top of a module.
Resolved Problem: Simulated Interactive Shells
Interactive shells "built by hand" (by tools such as IDLE and the Emacs
Python-mode) should behave like native interactive shells (see above).
However, the machinery used internally by native interactive shells has
not been exposed, and there isn't a clear way for tools building their
own interactive shells to achieve the desired behavior.
NOTE: PEP 264 later addressed this, by adding intelligence to the
standard codeop.py. Simulated shells that don't use the standard
library shell helpers can get a similar effect by exploiting the
new optional arguments to compile() added by PEP 264.
Questions and Answers
Q: What about a "from __past__" version, to get back *old* behavior?
A: Outside the scope of this PEP. Seems unlikely to the author,
though. Write a PEP if you want to pursue it.
Q: What about incompatibilities due to changes in the Python virtual
machine?
A: Outside the scope of this PEP, although PEP 5 [1] suggests a grace
period there too, and the future_statement may also have a role to
play there.
Q: What about incompatibilities due to changes in Python's C API?
A: Outside the scope of this PEP.
Q: I want to wrap future_statements in try/except blocks, so I can
use different code depending on which version of Python I'm running.
Why can't I?
A: Sorry! try/except is a runtime feature; future_statements are
primarily compile-time gimmicks, and your try/except happens long
after the compiler is done. That is, by the time you do
try/except, the semantics in effect for the module are already a
done deal. Since the try/except wouldn't accomplish what it
*looks* like it should accomplish, it's simply not allowed. We
also want to keep these special statements very easy to find and to
recognize.
Note that you *can* import __future__ directly, and use the
information in it, along with sys.version_info, to figure out where
the release you're running under stands in relation to a given
feature's status.
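A sketch of that technique; the helper name here is illustrative:

```python
import __future__
import sys

def feature_is_mandatory(name):
    """True once the named __future__ feature is part of the language."""
    feature = getattr(__future__, name)
    mandatory = feature.getMandatoryRelease()
    # MandatoryRelease is None for planned features that were dropped.
    return mandatory is not None and sys.version_info >= mandatory

print(feature_is_mandatory("nested_scopes"))   # True on any modern Python
```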
Q: Going back to the nested_scopes example, what if release 2.2 comes
along and I still haven't changed my code? How can I keep the 2.1
behavior then?
A: By continuing to use 2.1, and not moving to 2.2 until you do change
your code. The purpose of future_statement is to make life easier
for people who keep current with the latest release in a timely
fashion. We don't hate you if you don't, but your problems are
much harder to solve, and somebody with those problems will need to
write a PEP addressing them. future_statement is aimed at a
different audience.
Q: Overloading "import" sucks. Why not introduce a new statement for
this?
A: Like maybe "lambda lambda nested_scopes"? That is, unless we
introduce a new keyword, we can't introduce an entirely new
statement. But if we introduce a new keyword, that in itself
would break old code. That would be too ironic to bear. Yes,
overloading "import" does suck, but not as energetically as the
alternatives -- as is, future_statements are 100% backward
compatible.
Copyright
This document has been placed in the public domain.
References and Footnotes
[1] PEP 5, Guidelines for Language Evolution, Prescod
http://www.python.org/dev/peps/pep-0005/
[2] PEP 227, Statically Nested Scopes, Hylton
http://www.python.org/dev/peps/pep-0227/
[3] PEP 230, Warning Framework, Van Rossum
http://www.python.org/dev/peps/pep-0230/
[4] Note that this is "may" and not "will": better safe than sorry. Of
course spurious warnings won't be generated when avoidable with
reasonable cost.
[5] This ensures that a future_statement run under a release prior to
the first one in which a given feature is known (but >= 2.1) will
raise a compile-time error rather than silently do a wrong thing.
If transported to a release prior to 2.1, a runtime error will be
raised because of the failure to import __future__ (no such module
existed in the standard distribution before the 2.1 release, and
the double underscores make it a reserved name).
pep-0237 Unifying Long Integers and Integers
| PEP: | 237 |
|---|---|
| Title: | Unifying Long Integers and Integers |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Moshe Zadka, Guido van Rossum |
| Status: | Final |
| Type: | Standards Track |
| Created: | 11-Mar-2001 |
| Python-Version: | 2.2 |
| Post-History: | 16-Mar-2001, 14-Aug-2001, 23-Aug-2001 |
Abstract
Python currently distinguishes between two kinds of integers
(ints): regular or short ints, limited by the size of a C long
(typically 32 or 64 bits), and long ints, which are limited only
by available memory. When operations on short ints yield results
that don't fit in a C long, they raise an error. There are some
other distinctions too. This PEP proposes to do away with most of
the differences in semantics, unifying the two types from the
perspective of the Python user.
Rationale
Many programs find a need to deal with larger numbers after the
fact, and changing the algorithms later is bothersome. It can
hinder performance in the normal case, when all arithmetic is
performed using long ints whether or not they are needed.
Having the machine word size exposed to the language hinders
portability. For example, Python source files and .pyc's are not
portable between 32-bit and 64-bit machines because of this.
There is also the general desire to hide unnecessary details from
the Python user when they are irrelevant for most applications.
An example is memory allocation, which is explicit in C but
automatic in Python, giving us the convenience of unlimited sizes
on strings, lists, etc. It makes sense to extend this convenience
to numbers.
It will give new Python programmers (whether they are new to
programming in general or not) one less thing to learn before they
can start using the language.
Implementation
Initially, two alternative implementations were proposed (one by
each author):
1. The PyInt type's slot for a C long will be turned into a

       union {
           long i;
           struct {
               unsigned long length;
               digit digits[1];
           } bignum;
       };

Only the n-1 lower bits of the long have any meaning; the top
bit is always set. This distinguishes the union. All PyInt
functions will check this bit before deciding which types of
operations to use.
2. The existing short and long int types remain, but operations
return a long int instead of raising OverflowError when a
result cannot be represented as a short int. A new type,
integer, may be introduced that is an abstract base type of
which both the int and long implementation types are
subclassed. This is useful so that programs can check
integer-ness with a single test:

       if isinstance(i, integer): ...

After some consideration, the second implementation plan was
selected, since it is far easier to implement, is backwards
compatible at the C API level, and in addition can be implemented
partially as a transitional measure.
Incompatibilities
The following operations have (usually subtly) different semantics
for short and for long integers, and one or the other will have to
be changed somehow. This is intended to be an exhaustive list.
If you know of any other operation that differ in outcome
depending on whether a short or a long int with the same value is
passed, please write the second author.
- Currently, all arithmetic operators on short ints except <<
raise OverflowError if the result cannot be represented as a
short int. This will be changed to return a long int instead.
The following operators can currently raise OverflowError: x+y,
x-y, x*y, x**y, divmod(x, y), x/y, x%y, and -x. (The last four
can only overflow when the value -sys.maxint-1 is involved.)
- Currently, x<<n can lose bits for short ints. This will be
changed to return a long int containing all the shifted-out
bits, if returning a short int would lose bits (where changing
sign is considered a special case of losing bits).
- Currently, hex and oct literals for short ints may specify
negative values; for example 0xffffffff == -1 on a 32-bit
machine. This will be changed to equal 0xffffffffL (2**32-1).
- Currently, the '%u', '%x', '%X' and '%o' string formatting
operators and the hex() and oct() built-in functions behave
differently for negative numbers: negative short ints are
formatted as unsigned C long, while negative long ints are
formatted with a minus sign. This will be changed to use the
long int semantics in all cases (but without the trailing 'L'
that currently distinguishes the output of hex() and oct() for
long ints). Note that this means that '%u' becomes an alias for
'%d'. It will eventually be removed.
- Currently, repr() of a long int returns a string ending in 'L'
while repr() of a short int doesn't. The 'L' will be dropped;
but not before Python 3.0.
- Currently, an operation with long operands will never return a
short int. This *may* change, since it allows some
optimization. (No changes have been made in this area yet, and
none are planned.)
- The expression type(x).__name__ depends on whether x is a short
or a long int. Since implementation alternative 2 is chosen,
this difference will remain. (In Python 3.0, we *may* be able
to deploy a trick to hide the difference, because it *is*
annoying to reveal the difference to user code, and more so as
the difference between the two types is less visible.)
- Long and short ints are handled different by the marshal module,
and by the pickle and cPickle modules. This difference will
remain (at least until Python 3.0).
- Short ints with small values (typically between -1 and 99
inclusive) are "interned" -- whenever a result has such a value,
an existing short int with the same value is returned. This is
not done for long ints with the same values. This difference
will remain. (Since there is no guarantee of this interning, it
is debatable whether this is a semantic difference -- but code
may exist that uses 'is' for comparisons of short ints and
happens to work because of this interning. Such code may fail
if used with long ints.)
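The interning caveat above can be sketched in modern CPython (the cached range is an implementation detail and has varied across versions; the PEP cites roughly -1..99 for the 2.x era):

```python
# CPython interns small ints as an implementation detail; identity
# tests that happen to work in the cached range fail for larger
# values -- always compare with '==', never 'is'.
a = int("100")
b = int("100")          # both drawn from the small-int cache
big_a = int("1" + "0" * 20)
big_b = int("1" + "0" * 20)
print(a is b)           # True in CPython, but never guaranteed
print(big_a is big_b)   # False: two distinct objects
print(big_a == big_b)   # True: value comparison is the reliable test
```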
Literals
A trailing 'L' at the end of an integer literal will stop having
any meaning, and will eventually become illegal. The compiler
will choose the appropriate type solely based on the value.
(Until Python 3.0, it will force the literal to be a long; but
literals without a trailing 'L' may also be long, if they are not
representable as short ints.)
Built-in Functions
The function int() will return a short or a long int depending on
the argument value. In Python 3.0, the function long() will call
the function int(); before then, it will continue to force the
result to be a long int, but otherwise work the same way as int().
The built-in name 'long' will remain in the language to represent
the long implementation type (unless it is completely eradicated
in Python 3.0), but using the int() function is still recommended,
since it will automatically return a long when needed.
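For illustration (not part of the original text): in today's Python 3 the unification this section anticipates is complete, so int() already behaves as prescribed:

```python
# Python 3 has a single int type of unbounded range; long() is gone
# and no trailing 'L' ever appears.
x = int(2) ** 100
print(x)                   # 1267650600228229401496703205376
print(type(x).__name__)    # 'int' regardless of magnitude
print(int("123456789012345678901234567890") + 1)
```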
C API
The C API remains unchanged; C code will still need to be aware of
the difference between short and long ints. (The Python 3.0 C API
will probably be completely incompatible.)
The PyArg_Parse*() APIs already accept long ints, as long as they
are within the range representable by C ints or longs, so that
functions taking C int or long argument won't have to worry about
dealing with Python longs.
Transition
There are three major phases to the transition:
A. Short int operations that currently raise OverflowError return
a long int value instead. This is the only change in this
phase. Literals will still distinguish between short and long
ints. The other semantic differences listed above (including
the behavior of <<) will remain. Because this phase only
changes situations that currently raise OverflowError, it is
assumed that this won't break existing code. (Code that
depends on this exception would have to be too convoluted to be
concerned about it.) For those concerned about extreme
backwards compatibility, a command line option (or a call to
the warnings module) will allow a warning or an error to be
issued at this point, but this is off by default.
B. The remaining semantic differences are addressed. In all cases
the long int semantics will prevail. Since this will introduce
backwards incompatibilities which will break some old code,
this phase may require a future statement and/or warnings, and
a prolonged transition phase. The trailing 'L' will continue
to be used for longs as input and by repr().
C. The trailing 'L' is dropped from repr(), and made illegal on
input. (If possible, the 'long' type completely disappears.)
The trailing 'L' is also dropped from hex() and oct().
Phase A will be implemented in Python 2.2.
Phase B will be implemented gradually in Python 2.3 and Python
2.4. Envisioned stages of phase B:
B0. Warnings are enabled about operations that will change their
numeric outcome in stage B1, in particular hex() and oct(),
'%u', '%x', '%X' and '%o', hex and oct literals in the
(inclusive) range [sys.maxint+1, sys.maxint*2+1], and left
shifts losing bits.
B1. The new semantics for these operations are implemented.
Operations that give different results than before will *not*
issue a warning.
We propose the following timeline:
B0. Python 2.3.
B1. Python 2.4.
Phase C will be implemented in Python 3.0 (at least two years
after Python 2.4 is released).
OverflowWarning
Here are the rules that guide warnings generated in situations
that currently raise OverflowError. This applies to transition
phase A. Historical note: although phase A was completed in
Python 2.2, and phase B0 in Python 2.3, nobody noticed that
OverflowWarning was still generated in Python 2.3. It was finally
disabled in Python 2.4. The Python builtin OverflowWarning, and
the corresponding C API PyExc_OverflowWarning, are no longer
generated or used in Python 2.4, but will remain for the (unlikely)
case of user code until Python 2.5.
- A new warning category is introduced, OverflowWarning. This is
a built-in name.
- If an int result overflows, an OverflowWarning warning is
issued, with a message argument indicating the operation,
e.g. "integer addition". This may or may not cause a warning
message to be displayed on sys.stderr, or may cause an exception
to be raised, all under control of the -W command line and the
warnings module.
- The OverflowWarning warning is ignored by default.
- The OverflowWarning warning can be controlled like all warnings,
via the -W command line option or via the
warnings.filterwarnings() call. For example:
python -Wdefault::OverflowWarning
causes the OverflowWarning to be displayed the first time it
occurs at a particular source line, and
python -Werror::OverflowWarning
causes the OverflowWarning to be turned into an exception
whenever it happens. The following code enables the warning
from inside the program:
import warnings
warnings.filterwarnings("default", "", OverflowWarning)
See the python man page for the -W option and the warnings
module documentation for filterwarnings().
- If the OverflowWarning warning is turned into an error,
OverflowError is substituted. This is needed for backwards
compatibility.
- Unless the warning is turned into an exception, the result of
the operation (e.g., x+y) is recomputed after converting the
arguments to long ints.
Example
If you pass a long int to a C function or built-in operation that
takes an integer, it will be treated the same as a short int as
long as the value fits (by virtue of how PyArg_ParseTuple() is
implemented). If the long value doesn't fit, it will still raise
an OverflowError. For example:
    def fact(n):
        if n <= 1:
            return 1
        return n*fact(n-1)

    A = "ABCDEFGHIJKLMNOPQ"
    n = input("Gimme an int: ")
    print A[fact(n)%17]
For n >= 13, this currently raises OverflowError (unless the user
enters a trailing 'L' as part of their input), even though the
calculated index would always be in range(17). With the new
approach this code will do the right thing: the index will be
calculated as a long int, but its value will be in range.
Resolved Issues
These issues, previously open, have been resolved.
- hex() and oct() applied to longs will continue to produce a
trailing 'L' until Python 3000. The original text above wasn't
clear about this, but since it didn't happen in Python 2.4 it
was thought better to leave it alone. BDFL pronouncement here:
http://mail.python.org/pipermail/python-dev/2006-June/065918.html
- What to do about sys.maxint? Leave it in, since it is still
relevant whenever the distinction between short and long ints is
still relevant (e.g. when inspecting the type of a value).
- Should we remove '%u' completely? Remove it.
- Should we warn about << not truncating integers? Yes.
- Should the overflow warning be on a portable maximum size? No.
Implementation
The implementation work for the Python 2.x line is completed;
phase A was released with Python 2.2, phase B0 with Python 2.3,
and phase B1 will be released with Python 2.4 (and is already in
CVS).
Copyright
This document has been placed in the public domain.
pep-0238 Changing the Division Operator
| PEP: | 238 |
|---|---|
| Title: | Changing the Division Operator |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Moshe Zadka <moshez at zadka.site.co.il>, Guido van Rossum <guido at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 11-Mar-2001 |
| Python-Version: | 2.2 |
| Post-History: | 16-Mar-2001, 26-Jul-2001, 27-Jul-2001 |
Abstract
The current division (/) operator has an ambiguous meaning for
numerical arguments: it returns the floor of the mathematical
result of division if the arguments are ints or longs, but it
returns a reasonable approximation of the division result if the
arguments are floats or complex. This makes expressions expecting
float or complex results error-prone when integers are not
expected but possible as inputs.
We propose to fix this by introducing different operators for
different operations: x/y to return a reasonable approximation of
the mathematical result of the division ("true division"), x//y to
return the floor ("floor division"). We call the current, mixed
meaning of x/y "classic division".
Because of severe backwards compatibility issues, not to mention a
major flamewar on c.l.py, we propose the following transitional
measures (starting with Python 2.2):
- Classic division will remain the default in the Python 2.x
series; true division will be standard in Python 3.0.
- The // operator will be available to request floor division
unambiguously.
- The future division statement, spelled "from __future__ import
division", will change the / operator to mean true division
throughout the module.
- A command line option will enable run-time warnings for classic
division applied to int or long arguments; another command line
option will make true division the default.
- The standard library will use the future division statement and
the // operator when appropriate, so as to completely avoid
classic division.
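The end state of these transitional measures is the behavior Python 3 ships today; a quick sketch:

```python
print(7 / 2)     # 3.5 -- '/' is true division
print(7 // 2)    # 3   -- '//' requests floor division unambiguously
print(7.0 // 2)  # 3.0 -- '//' also applies to floats
print(-7 // 2)   # -4  -- floor division rounds toward negative infinity
```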
Motivation
The classic division operator makes it hard to write numerical
expressions that are supposed to give correct results from
arbitrary numerical inputs. For all other operators, one can
write down a formula such as x*y**2 + z, and the calculated result
will be close to the mathematical result (within the limits of
numerical accuracy, of course) for any numerical input type (int,
long, float, or complex). But division poses a problem: if the
expressions for both arguments happen to have an integral type, it
implements floor division rather than true division.
The problem is unique to dynamically typed languages: in a
statically typed language like C, the inputs, typically function
arguments, would be declared as double or float, and when a call
passes an integer argument, it is converted to double or float at
the time of the call. Python doesn't have argument type
declarations, so integer arguments can easily find their way into
an expression.
The problem is particularly pernicious since ints are perfect
substitutes for floats in all other circumstances: math.sqrt(2)
returns the same value as math.sqrt(2.0), 3.14*100 and 3.14*100.0
return the same value, and so on. Thus, the author of a numerical
routine may only use floating point numbers to test his code, and
believe that it works correctly, and a user may accidentally pass
in an integer input value and get incorrect results.
Another way to look at this is that classic division makes it
difficult to write polymorphic functions that work well with
either float or int arguments; all other operators already do the
right thing. No algorithm that works for both ints and floats has
a need for truncating division in one case and true division in
the other.
The correct work-around is subtle: casting an argument to float()
is wrong if it could be a complex number; adding 0.0 to an
argument doesn't preserve the sign of the argument if it was minus
zero. The only solution without either downside is multiplying an
argument (typically the first) by 1.0. This leaves the value and
sign unchanged for float and complex, and turns int and long into
a float with the corresponding value.
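The 1.0-multiplication workaround can be checked directly (Python 3 shown; the same reasoning applied under classic division):

```python
import math

def to_inexact(x):
    # Multiplying by 1.0 leaves floats and complexes unchanged
    # (including the sign of -0.0) and promotes ints to float.
    return 1.0 * x

print(to_inexact(3))                          # 3.0
print(to_inexact(1 + 2j))                     # (1+2j)
print(math.copysign(1.0, to_inexact(-0.0)))   # -1.0: minus zero survives
print(math.copysign(1.0, -0.0 + 0.0))         # 1.0: adding 0.0 loses the sign
```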
It is the opinion of the authors that this is a real design bug in
Python, and that it should be fixed sooner rather than later.
Assuming Python usage will continue to grow, the cost of leaving
this bug in the language will eventually outweigh the cost of
fixing old code -- there is an upper bound to the amount of code
to be fixed, but the amount of code that might be affected by the
bug in the future is unbounded.
Another reason for this change is the desire to ultimately unify
Python's numeric model. This is the subject of PEP 228[0] (which
is currently incomplete). A unified numeric model removes most of
the user's need to be aware of different numerical types. This is
good for beginners, but also takes away concerns about different
numeric behavior for advanced programmers. (Of course, it won't
remove concerns about numerical stability and accuracy.)
In a unified numeric model, the different types (int, long, float,
complex, and possibly others, such as a new rational type) serve
mostly as storage optimizations, and to some extent to indicate
orthogonal properties such as inexactness or complexity. In a
unified model, the integer 1 should be indistinguishable from the
floating point number 1.0 (except for its inexactness), and both
should behave the same in all numeric contexts. Clearly, in a
unified numeric model, if a==b and c==d, a/c should equal b/d
(taking some liberties due to rounding for inexact numbers), and
since everybody agrees that 1.0/2.0 equals 0.5, 1/2 should also
equal 0.5. Likewise, since 1//2 equals zero, 1.0//2.0 should also
equal zero.
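In Python 3 these identities hold exactly as argued:

```python
a, b = 1, 1.0
c, d = 2, 2.0
print(a == b, c == d)     # True True
print(a / c, b / d)       # 0.5 0.5 -- equal inputs give equal quotients
print(a // c, b // d)     # 0 0.0   -- floor division agrees in value
```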
Variations
Aesthetically, x//y doesn't please everyone, and hence several
variations have been proposed. They are addressed here:
- x div y. This would introduce a new keyword. Since div is a
popular identifier, this would break a fair amount of existing
code, unless the new keyword was only recognized under a future
division statement. Since it is expected that the majority of
code that needs to be converted is dividing integers, this would
greatly increase the need for the future division statement.
Even with a future statement, the general sentiment against
adding new keywords unless absolutely necessary argues against
this.
- div(x, y). This makes the conversion of old code much harder.
Replacing x/y with x//y or x div y can be done with a simple
query replace; in most cases the programmer can easily verify
that a particular module only works with integers so all
occurrences of x/y can be replaced. (The query replace is still
needed to weed out slashes occurring in comments or string
literals.) Replacing x/y with div(x, y) would require a much
more intelligent tool, since the extent of the expressions to
the left and right of the / must be analyzed before the
placement of the "div(" and ")" part can be decided.
- x \ y. The backslash is already a token, meaning line
continuation, and in general it suggests an "escape" to Unix
eyes. In addition (this is due to Terry Reedy), this would make
things like eval("x\y") harder to get right.
Alternatives
In order to reduce the amount of old code that needs to be
converted, several alternative proposals have been put forth.
Here is a brief discussion of each proposal (or category of
proposals). If you know of an alternative that was discussed on
c.l.py that isn't mentioned here, please mail the second author.
- Let / keep its classic semantics; introduce // for true
division. This still leaves a broken operator in the language,
and invites use of the broken behavior. It also shuts off the
road to a unified numeric model a la PEP 228[0].
- Let int division return a special "portmanteau" type that
behaves as an integer in integer context, but like a float in a
float context. The problem with this is that after a few
operations, the int and the float value could be miles apart,
it's unclear which value should be used in comparisons, and of
course many contexts (like conversion to string) don't have a
clear integer or float preference.
- Use a directive to use specific division semantics in a module,
rather than a future statement. This retains classic division
as a permanent wart in the language, requiring future
generations of Python programmers to be aware of the problem and
the remedies.
- Use "from __past__ import division" to use classic division
semantics in a module. This also retains the classic division
as a permanent wart, or at least for a long time (eventually the
past division statement could raise an ImportError).
- Use a directive (or some other way) to specify the Python
version for which a specific piece of code was developed. This
requires future Python interpreters to be able to emulate
*exactly* several previous versions of Python, and moreover to
do so for multiple versions within the same interpreter. This
is way too much work. A much simpler solution is to keep
multiple interpreters installed. Another argument against this
is that the version directive is almost always overspecified:
most code written for Python X.Y works for Python X.(Y-1) and
X.(Y+1) as well, so specifying X.Y as a version is more
constraining than it needs to be. At the same time, there's no
way to know at which future or past version the code will break.
API Changes
During the transitional phase, we have to support *three* division
operators within the same program: classic division (for / in
modules without a future division statement), true division (for /
in modules with a future division statement), and floor division
(for //). Each operator comes in two flavors: regular, and as an
augmented assignment operator (/= or //=).
The names associated with these variations are:
- Overloaded operator methods:
__div__(), __floordiv__(), __truediv__();
__idiv__(), __ifloordiv__(), __itruediv__().
- Abstract API C functions:
PyNumber_Divide(), PyNumber_FloorDivide(),
PyNumber_TrueDivide();
PyNumber_InPlaceDivide(), PyNumber_InPlaceFloorDivide(),
PyNumber_InPlaceTrueDivide().
- Byte code opcodes:
BINARY_DIVIDE, BINARY_FLOOR_DIVIDE, BINARY_TRUE_DIVIDE;
INPLACE_DIVIDE, INPLACE_FLOOR_DIVIDE, INPLACE_TRUE_DIVIDE.
- PyNumberMethod slots:
nb_divide, nb_floor_divide, nb_true_divide,
nb_inplace_divide, nb_inplace_floor_divide,
nb_inplace_true_divide.
The added PyNumberMethod slots require an additional flag in
tp_flags; this flag will be named Py_TPFLAGS_HAVE_NEWDIVIDE and
will be included in Py_TPFLAGS_DEFAULT.
The true and floor division APIs will look for the corresponding
slots and call that; when that slot is NULL, they will raise an
exception. There is no fallback to the classic divide slot.
In Python 3.0, the classic division semantics will be removed; the
classic division APIs will become synonymous with true division.
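A Python-level sketch of the operator protocol (Python 3 shown, where __div__ has been removed and only the true/floor variants remain; the class name is illustrative):

```python
class Quantity:
    """Toy numeric wrapper overloading the division operators."""
    def __init__(self, value):
        self.value = value
    def __truediv__(self, other):    # invoked for q / x
        return Quantity(self.value / other)
    def __floordiv__(self, other):   # invoked for q // x
        return Quantity(self.value // other)
    def __itruediv__(self, other):   # invoked for q /= x
        self.value /= other
        return self
    def __repr__(self):
        return f"Quantity({self.value})"

q = Quantity(7)
print(q / 2)    # Quantity(3.5)
print(q // 2)   # Quantity(3)
```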
Command Line Option
The -Q command line option takes a string argument that can take
four values: "old", "warn", "warnall", or "new". The default is
"old" in Python 2.2 but will change to "warn" in later 2.x
versions. The "old" value means the classic division operator
acts as described. The "warn" value means the classic division
operator issues a warning (a DeprecationWarning using the standard
warning framework) when applied to ints or longs. The "warnall"
value also issues warnings for classic division when applied to
floats or complex; this is for use by the fixdiv.py conversion
script mentioned below. The "new" value changes the default
globally so that the / operator is always interpreted as true
division. The "new" option is only intended for use in certain
educational environments, where true division is required, but
asking the students to include the future division statement in
all their code would be a problem.
This option will not be supported in Python 3.0; Python 3.0 will
always interpret / as true division.
(This option was originally proposed as -D, but that turned out to
be an existing option for Jython, hence the Q -- mnemonic for
Quotient. Other names have been proposed, like -Qclassic,
-Qclassic-warn, -Qtrue, or -Qold_division etc.; these seem more
verbose to me without much advantage. After all the term classic
division is not used in the language at all (only in the PEP), and
the term true division is rarely used in the language -- only in
__truediv__.)
Semantics of Floor Division
Floor division will be implemented in all the Python numeric
types, and will have the semantics of
a // b == floor(a/b)
except that the result type will be the common type into which a
and b are coerced before the operation.
Specifically, if a and b are of the same type, a//b will be of
that type too. If the inputs are of different types, they are
first coerced to a common type using the same rules used for all
other arithmetic operators.
In particular, if a and b are both ints or longs, the result has
the same type and value as for classic division on these types
(including the case of mixed input types; int//long and long//int
will both return a long).
For floating point inputs, the result is a float. For example:
3.5//2.0 == 1.0
For complex numbers, // raises an exception, since floor() of a
complex number is not allowed.
For user-defined classes and extension types, all semantics are up
to the implementation of the class or type.
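These rules are observable directly (Python 3; // on complex numbers raises TypeError):

```python
import math

print(3.5 // 2.0)        # 1.0 -- float floor division stays float
print(7 // 2, -7 // 2)   # 3 -4 -- floor, not truncation toward zero
print(3.5 // 2.0 == math.floor(3.5 / 2.0))  # True
try:
    (1 + 2j) // 1
except TypeError as exc:
    print("complex //:", exc)
```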
Semantics of True Division
True division for ints and longs will convert the arguments to
float and then apply a float division. That is, even 2/1 will
return a float (2.0), not an int. For floats and complex, it will
be the same as classic division.
The 2.2 implementation of true division acts as if the float type
had unbounded range, so that overflow doesn't occur unless the
magnitude of the mathematical *result* is too large to represent
as a float. For example, after "x = 1L << 40000", float(x) raises
OverflowError (note that this is also new in 2.2: previously the
outcome was platform-dependent, most commonly a float infinity). But
x/x returns 1.0 without exception, while x/1 raises OverflowError.
Note that for int and long arguments, true division may lose
information; this is in the nature of true division (as long as
rationals are not in the language). Algorithms that consciously
use longs should consider using //, as true division of longs
retains no more than 53 bits of precision (on most platforms).
If and when a rational type is added to Python (see PEP 239[2]),
true division for ints and longs should probably return a
rational. This avoids the problem with true division of ints and
longs losing information. But until then, for consistency, float is
the only choice for true division.
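The unbounded-range behavior described here is still how CPython 3 implements int true division:

```python
x = 1 << 40000            # far outside float range
print(x / x)              # 1.0: quotient computed without overflow
print(2 / 1)              # 2.0: int/int always yields a float
try:
    float(x)              # the conversion itself does overflow
except OverflowError:
    print("float(x) raises OverflowError")
print((2 ** 60 + 1) / 2 ** 60)  # 1.0: inputs beyond 53 bits lose precision
```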
The Future Division Statement
If "from __future__ import division" is present in a module, or if
-Qnew is used, the / and /= operators are translated to true
division opcodes; otherwise they are translated to classic
division (until Python 3.0 comes along, where they are always
translated to true division).
The future division statement has no effect on the recognition or
translation of // and //=.
See PEP 236[4] for the general rules for future statements.
(It has been proposed to use a longer phrase, like "true_division"
or "modern_division". These don't seem to add much information.)
Open Issues
We expect that these issues will be resolved over time, as more
feedback is received or we gather more experience with the initial
implementation.
- It has been proposed to call // the quotient operator, and the /
operator the ratio operator. I'm not sure about this -- for
some people quotient is just a synonym for division, and ratio
suggests rational numbers, which is wrong. I prefer the
terminology to be slightly awkward if that avoids ambiguity.
Also, for some folks "quotient" suggests truncation towards
zero, not towards infinity as "floor division" says explicitly.
- It has been argued that a command line option to change the
default is evil. It can certainly be dangerous in the wrong
hands: for example, it would be impossible to combine a 3rd
party library package that requires -Qnew with another one that
requires -Qold. But I believe that the VPython folks need a way
to enable true division by default, and other educators might
need the same. These usually have enough control over the
library packages available in their environment.
- For classes to have to support all three of __div__(),
__floordiv__() and __truediv__() seems painful; and what to do
in 3.0? Maybe we only need __div__() and __floordiv__(), or
maybe at least true division should try __truediv__() first and
__div__() second.
Resolved Issues
- Issue: For very large long integers, the definition of true
division as returning a float causes problems, since the range of
Python longs is much larger than that of Python floats. This
problem will disappear if and when rational numbers are supported.
Resolution: For long true division, Python uses an internal
float type with native double precision but unbounded range, so
that OverflowError doesn't occur unless the quotient is too large
to represent as a native double.
- Issue: In the interim, maybe the long-to-float conversion could be
made to raise OverflowError if the long is out of range.
Resolution: This has been implemented, but, as above, the
magnitude of the inputs to long true division doesn't matter; only
the magnitude of the quotient matters.
- Issue: Tim Peters will make sure that whenever an in-range float
is returned, decent precision is guaranteed.
Resolution: Provided the quotient of long true division is
representable as a float, it suffers no more than 3 rounding
errors: one each for converting the inputs to an internal float
type with native double precision but unbounded range, and
one more for the division. However, note that if the magnitude
of the quotient is too *small* to represent as a native double,
0.0 is returned without exception ("silent underflow").
FAQ
Q. When will Python 3.0 be released?
A. We don't plan that long ahead, so we can't say for sure. We
want to allow at least two years for the transition. If Python
3.0 comes out sooner, we'll keep the 2.x line alive for
backwards compatibility until at least two years from the
release of Python 2.2. In practice, you will be able to
continue to use the Python 2.x line for several years after
Python 3.0 is released, so you can take your time with the
transition. Sites are expected to have both Python 2.x and
Python 3.x installed simultaneously.
Q. Why isn't true division called float division?
A. Because I want to keep the door open to *possibly* introducing
rationals and making 1/2 return a rational rather than a
float. See PEP 239[2].
Q. Why is there a need for __truediv__ and __itruediv__?
A. We don't want to make user-defined classes second-class
citizens. Certainly not with the type/class unification going
on.
Q. How do I write code that works under the classic rules as well
as under the new rules without using // or a future division
statement?
A. Use x*1.0/y for true division, divmod(x, y)[0] for int
division. Especially the latter is best hidden inside a
function. You may also write float(x)/y for true division if
you are sure that you don't expect complex numbers. If you
know your integers are never negative, you can use int(x/y) --
while the documentation of int() says that int() can round or
truncate depending on the C implementation, we know of no C
implementation that doesn't truncate, and we're going to change
the spec for int() to promise truncation. Note that classic
division (and floor division) round towards negative infinity,
while int() rounds towards zero, giving different answers for
negative numbers.
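These version-neutral idioms still run unchanged on Python 3:

```python
def true_div(x, y):
    return x * 1.0 / y      # forces float (or complex) under any rules

def floor_div(x, y):
    return divmod(x, y)[0]  # quotient from divmod, no // needed

print(true_div(1, 2))    # 0.5
print(floor_div(7, 2))   # 3
print(floor_div(-7, 2))  # -4 -- floor semantics, matching classic division
```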
Q. How do I specify the division semantics for input(), compile(),
execfile(), eval() and exec?
A. They inherit the choice from the invoking module. PEP 236[4]
now lists this as a resolved problem, referring to PEP 264[5].
Q. What about code compiled by the codeop module?
A. This is dealt with properly; see PEP 264[5].
Q. Will there be conversion tools or aids?
A. Certainly. While these are outside the scope of the PEP, I
should point out two simple tools that will be released with
Python 2.2a3: Tools/scripts/finddiv.py finds division operators
(slightly smarter than "grep /") and Tools/scripts/fixdiv.py
can produce patches based on run-time analysis.
Q. Why is my question not answered here?
A. Because we weren't aware of it. If it's been discussed on
c.l.py and you believe the answer is of general interest,
please notify the second author. (We don't have the time or
inclination to answer every question sent in private email,
hence the requirement that it be discussed on c.l.py first.)
Implementation
Essentially everything mentioned here is implemented in CVS and
will be released with Python 2.2a3; most of it was already
released with Python 2.2a2.
References
[0] PEP 228, Reworking Python's Numeric Model
http://www.python.org/dev/peps/pep-0228/
[1] PEP 237, Unifying Long Integers and Integers, Zadka,
http://www.python.org/dev/peps/pep-0237/
[2] PEP 239, Adding a Rational Type to Python, Zadka,
http://www.python.org/dev/peps/pep-0239/
[3] PEP 240, Adding a Rational Literal to Python, Zadka,
http://www.python.org/dev/peps/pep-0240/
[4] PEP 236, Back to the __future__, Peters,
http://www.python.org/dev/peps/pep-0236/
[5] PEP 264, Future statements in simulated shells
http://www.python.org/dev/peps/pep-0264/
Copyright
This document has been placed in the public domain.
pep-0239 Adding a Rational Type to Python
| PEP: | 239 |
|---|---|
| Title: | Adding a Rational Type to Python |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Christopher A. Craig <python-pep at ccraig.org>, Moshe Zadka <moshez at zadka.site.co.il> |
| Status: | Rejected |
| Type: | Standards Track |
| Created: | 11-Mar-2001 |
| Python-Version: | 2.2 |
| Post-History: | 16-Mar-2001 |
Abstract
Python has no numeric type with the semantics of an unboundedly
precise rational number. This proposal explains the semantics of
such a type, and suggests builtin functions and literals to
support such a type. This PEP suggests no literals for rational
numbers; that is left for another PEP[1].
BDFL Pronouncement
This PEP is rejected. The needs outlined in the rationale section
have been addressed to some extent by the acceptance of PEP 327
for decimal arithmetic. Guido also noted, "Rational arithmetic
was the default 'exact' arithmetic in ABC and it did not work out as
expected". See the python-dev discussion on 17 June 2005.
*Postscript:* With the acceptance of PEP 3141, "A Type Hierarchy
for Numbers", a 'Rational' numeric abstract base class was added
with a concrete implementation in the 'fractions' module.
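The fractions module mentioned in the postscript provides essentially the semantics proposed below; for example:

```python
from fractions import Fraction

r = Fraction(6, 4)
print(r)                           # 3/2: GCD reduced on construction
print(r.numerator, r.denominator)  # 3 2: denominator kept positive
print(Fraction(1, 3) + Fraction(1, 6))  # 1/2: exact arithmetic
print(Fraction(1, -2))             # -1/2: sign moves to the numerator
```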
Rationale
While sometimes slower and more memory intensive (in general,
unboundedly so) rational arithmetic captures more closely the
mathematical ideal of numbers, and tends to have behavior which is
less surprising to newbies. Though many Python implementations of
rational numbers have been written, none of these exist in the
core, or are documented in any way. This has made them much less
accessible to people who are less Python-savvy.
RationalType
There will be a new numeric type added called RationalType. Its
unary operators will do the obvious thing. Binary operators will
coerce integers and long integers to rationals, and rationals to
floats and complexes.
The following attributes will be supported: .numerator and
.denominator. The language definition will promise that
r.denominator * r == r.numerator
that the GCD of the numerator and the denominator is 1 and that
the denominator is positive.
The method r.trim(max_denominator) will return the closest
rational s to r such that abs(s.denominator) <= max_denominator.
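The proposed r.trim() survives in today's fractions module as Fraction.limit_denominator(), which returns the closest rational whose denominator does not exceed the bound:

```python
from fractions import Fraction

# Closest rationals to pi (here approximated to 16 digits) with a
# bounded denominator:
pi = Fraction(3141592653589793, 10 ** 15)
print(pi.limit_denominator(10))    # 22/7
print(pi.limit_denominator(1000))  # 355/113
```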
The rational() Builtin
This function will have the signature rational(n, d=1). n and d
must both be integers, long integers or rationals. A guarantee is
made that
rational(n, d) * d == n
Open Issues
- Maybe the type should be called rat instead of rational.
Somebody proposed that we have "abstract" pure mathematical
types named complex, real, rational, integer, and "concrete"
representation types with names like float, rat, long, int.
- Should a rational number with an integer value be allowed as a
sequence index? For example, should s[5/3 - 2/3] be equivalent
to s[1]?
- Should shift and mask operators be allowed for rational numbers?
For rational numbers with integer values?
- Marcin 'Qrczak' Kowalczyk summarized the arguments for and
against unifying ints with rationals nicely on c.l.py:
Arguments for unifying ints with rationals:
- Since 2 == 2/1 and maybe str(2/1) == '2', it reduces surprises
where objects seem equal but behave differently.
- / can be freely used for integer division when I *know* that
there is no remainder (if I am wrong and there is a remainder,
there will probably be some exception later).
Arguments against:
- When I use the result of / as a sequence index, it's usually
an error which should not be hidden by making the program
work for some data, since it will break for other data.
- (this assumes that after unification int and rational would be
different types:) Types should rarely depend on values. It's
easier to reason when the type of a variable is known: I know
how I can use it. I can determine that something is an int and
expect that other objects used in this place will be ints too.
- (this assumes the same type for them:) Int is a good type in
itself, not to be mixed with rationals. The fact that
something is an integer should be expressible as a statement
about its type. Many operations require ints and don't accept
rationals. It's natural to think about them as about different
types.
References
[1] PEP 240, Adding a Rational Literal to Python, Zadka,
http://www.python.org/dev/peps/pep-0240/
Copyright
This document has been placed in the public domain.
pep-0240 Adding a Rational Literal to Python
| PEP: | 240 |
|---|---|
| Title: | Adding a Rational Literal to Python |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Christopher A. Craig <python-pep at ccraig.org>, Moshe Zadka <moshez at zadka.site.co.il> |
| Status: | Rejected |
| Type: | Standards Track |
| Created: | 11-Mar-2001 |
| Python-Version: | 2.2 |
| Post-History: | 16-Mar-2001 |
Abstract
A different PEP[1] suggests adding a builtin rational type to
Python. This PEP suggests changing the ddd.ddd float literal to a
rational in Python, and modifying non-integer division to return
it.
BDFL Pronouncement
This PEP is rejected. The needs outlined in the rationale section
have been addressed to some extent by the acceptance of PEP 327
for decimal arithmetic. Guido also noted, "Rational arithmetic
was the default 'exact' arithmetic in ABC and it did not work out as
expected". See the python-dev discussion on 17 June 2005.
Rationale
Rational numbers are useful for exact and unsurprising arithmetic.
They give the correct results people have been taught in various
math classes. Making the "obvious" non-integer type one with more
predictable semantics will surprise new programmers less than
using floating point numbers. As quite a few posts on c.l.py and
on tutor@python.org have shown, people are often bitten by the
strange semantics of floating point numbers: for example,
round(0.98, 2) still gives 0.97999999999999998.
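The inexactness behind that surprise is easy to demonstrate. (Modern Pythons *display* round(0.98, 2) as 0.98 thanks to shortest-repr, but the stored binary value is still inexact; the sketch below uses fractions.Fraction to expose it.)

```python
from fractions import Fraction

# 0.98 has no exact binary floating point representation, so the float
# literal does not equal the true rational 98/100:
assert Fraction(0.98) != Fraction(98, 100)

# Exact rational arithmetic gives the "obvious" answers people expect:
assert Fraction(98, 100) == Fraction(49, 50)
assert Fraction(1, 3) + Fraction(1, 6) == Fraction(1, 2)
```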
Proposal
Literals conforming to the regular expression '\d*\.\d*' will be
rational numbers.
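A quick sketch exercising the literal pattern (written here with the decimal point escaped, \d*\.\d*, which is presumably the intent); note that '1e0' is excluded, so exponent-suffixed literals remain floats:

```python
import re

# \d*\.\d* : optional digits, a literal decimal point, optional digits.
literal = re.compile(r'\d*\.\d*$')

assert literal.match('3.14')
assert literal.match('.5') and literal.match('2.')
assert not literal.match('1e0')   # stays a float under the proposal
assert not literal.match('42')    # no decimal point: an integer literal
```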
Backwards Compatibility
The only backwards compatibility issue is the type of the literals
mentioned above. The following migration path is suggested:
1. The next Python after approval will allow
"from __future__ import rational_literals"
to cause all such literals to be treated as rational numbers.
2. Python 3.0 will have a warning, turned on by default, about
such literals in the absence of a __future__ statement. The
warning message will contain information about the __future__
statement, and indicate that to get floating point literals,
they should be suffixed with "e0".
3. Python 3.1 will have the warning turned off by default. This
warning will stay in place for 24 months, at which time the
literals will be rationals and the warning will be removed.
Common Objections
Rationals are slow and memory intensive!
(Relax, I'm not taking floats away, I'm just adding two more characters.
1e0 will still be a float)
Rationals must present themselves as a decimal float or they will be
horrible for users expecting decimals (i.e. str(.5) should return '.5' and
not '1/2'). This means that many rationals must be truncated at some
point, which gives us a new loss of precision.
References
[1] PEP 239, Adding a Rational Type to Python, Zadka,
http://www.python.org/dev/peps/pep-0239/
Copyright
This document has been placed in the public domain.
pep-0241 Metadata for Python Software Packages
| PEP: | 241 |
|---|---|
| Title: | Metadata for Python Software Packages |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | A.M. Kuchling <amk at amk.ca> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 12-Mar-2001 |
| Post-History: | 19-Mar-2001 |
Introduction
This PEP describes a mechanism for adding metadata to Python packages. It includes specifics of the field names, and their semantics and usage.
Including Metadata in Packages
The Distutils 'sdist' command will be modified to extract the
metadata fields from the arguments and write them to a file in the
generated zipfile or tarball. This file will be named PKG-INFO
and will be placed in the top directory of the source
distribution (where the README, INSTALL, and other files usually
go).
Developers may not provide their own PKG-INFO file. The "sdist"
command will, if it detects an existing PKG-INFO file, terminate
with an appropriate error message. This should prevent confusion
caused by the PKG-INFO and setup.py files being out of sync.
The PKG-INFO file format is a single set of RFC-822 headers
parseable by the rfc822.py module. The field names listed in the
following section are used as the header names. There's no
extension mechanism in this simple format; the Catalog and Distutils
SIGs will aim at getting a more flexible format ready for Python 2.2.
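The rfc822 module named above is Python 2 only; its modern counterpart, email.parser, reads the same header format. A sketch (the field values are the examples from the next section, not real package data):

```python
from email.parser import Parser

pkg_info = """\
Metadata-Version: 1.0
Name: BeagleVote
Version: 1.0a2
Summary: A module for collecting votes from beagles.
"""

# PKG-INFO is a single set of RFC-822 headers, so a plain header parse
# recovers every field by name.
msg = Parser().parsestr(pkg_info)
assert msg['Name'] == 'BeagleVote'
assert msg['Metadata-Version'] == '1.0'
```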
Fields
This section specifies the names and semantics of each of the
supported metadata fields.
Fields marked with "(Multiple use)" may be specified multiple
times in a single PKG-INFO file. Other fields may only occur
once in a PKG-INFO file. Fields marked with "(optional)" are
not required to appear in a valid PKG-INFO file, all other
fields must be present.
Metadata-Version
Version of the file format; currently "1.0" is the only
legal value here.
Example:
Metadata-Version: 1.0
Name
The name of the package.
Example:
Name: BeagleVote
Version
A string containing the package's version number. This
field should be parseable by one of the Version classes
(StrictVersion or LooseVersion) in the distutils.version
module.
Example:
Version: 1.0a2
Platform (multiple use)
A comma-separated list of platform specifications, summarizing
the operating systems supported by the package. The major
supported platforms are listed below, but this list is
necessarily incomplete.
POSIX, MacOS, Windows, BeOS, PalmOS.
Binary distributions will use the Supported-Platform field in
their metadata to specify the OS and CPU for which the binary
package was compiled. The semantics of the Supported-Platform
are not specified in this PEP.
Example:
Platform: POSIX, Windows
Summary
A one-line summary of what the package does.
Example:
Summary: A module for collecting votes from beagles.
Description (optional)
A longer description of the package that can run to several
paragraphs. (Software that deals with metadata should not
assume any maximum size for this field, though one hopes that
people won't include their instruction manual as the
long-description.)
Example:
Description: This module collects votes from beagles
in order to determine their electoral wishes.
Do NOT try to use this module with basset hounds;
it makes them grumpy.
Keywords (optional)
A list of additional keywords to be used to assist searching
for the package in a larger catalog.
Example:
Keywords: dog puppy voting election
Home-page (optional)
A string containing the URL for the package's home page.
Example:
Home-page: http://www.example.com/~cschultz/bvote/
Author (optional)
A string containing at a minimum the author's name. Contact
information can also be added, separating each line with
newlines.
Example:
Author: C. Schultz
Universal Features Syndicate
Los Angeles, CA
Author-email
A string containing the author's e-mail address. It can contain
a name and e-mail address in the legal forms for a RFC-822
'From:' header. It's not optional because cataloging systems
can use the e-mail portion of this field as a unique key
representing the author. A catalog might provide authors the
ability to store their GPG key, personal home page, and other
additional metadata *about the author*, and optionally the
ability to associate several e-mail addresses with the same
person. Author-related metadata fields are not covered by this
PEP.
Example:
Author-email: "C. Schultz" <cschultz@example.com>
License
A string selected from a short list of choices, specifying the
license covering the package. Some licenses result in the
software being freely redistributable, so packagers and
resellers can automatically know that they're free to
redistribute the software. Other licenses will require
a careful reading by a human to determine how the software can be
repackaged and resold.
The choices are:
Artistic, BSD, DFSG, GNU GPL, GNU LGPL, "MIT",
Mozilla PL, "public domain", Python, Qt PL, Zope PL, unknown,
nocommercial, nosell, nosource, shareware, other
Definitions of some of the licenses are:
DFSG The license conforms to the Debian Free Software
Guidelines, but does not use one of the other
DFSG conforming licenses listed here.
More information is available at:
http://www.debian.org/social_contract#guidelines
Python Python 1.6 or higher license. Version 1.5.2 and
earlier are under the MIT license.
public domain Software is public domain, not copyrighted.
unknown Status is not known
nocommercial Free private use but commercial use not permitted
nosell Free use but distribution for profit by arrangement
nosource Freely distributable but no source code
shareware Payment is requested if software is used
other General category for other non-DFSG licenses
Some of these licenses can be interpreted to mean the software is
freely redistributable. The list of redistributable licenses is:
Artistic, BSD, DFSG, GNU GPL, GNU LGPL, "MIT",
Mozilla PL, "public domain", Python, Qt PL, Zope PL,
nosource, shareware
Note that being redistributable does not mean a package
qualifies as free software, 'nosource' and 'shareware' being
examples.
Example:
License: MIT
Acknowledgements
Many changes and rewrites to this document were suggested by the
readers of the Distutils SIG. In particular, Sean Reifschneider
often contributed actual text for inclusion in this PEP.
The list of licenses was compiled using the SourceForge license
list and the CTAN license list compiled by Graham Williams; Carey
Evans also offered several useful suggestions on this list.
Copyright
This document has been placed in the public domain.
pep-0242 Numeric Kinds
| PEP: | 242 |
|---|---|
| Title: | Numeric Kinds |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Paul F. Dubois <paul at pfdubois.com> |
| Status: | Rejected |
| Type: | Standards Track |
| Created: | 17-Mar-2001 |
| Python-Version: | 2.2 |
| Post-History: | 17-Apr-2001 |
Abstract
This proposal gives the user optional control over the precision
and range of numeric computations so that a computation can be
written once and run anywhere with at least the desired precision
and range. It is backward compatible with existing code. The
meaning of decimal literals is clarified.
Rationale
Currently it is impossible in every language except Fortran 90 to
write a program in a portable way that uses floating point and
gets roughly the same answer regardless of platform -- or refuses
to compile if that is not possible. Python currently has only one
floating point type, equal to a C double in the C implementation.
No type exists corresponding to single or quad floats. It would
complicate the language to try to introduce such types directly
and their subsequent use would not be portable. This proposal is
similar to the Fortran 90 "kind" solution, adapted to the Python
environment. With this facility an entire calculation can be
switched from one level of precision to another by changing a
single line. If the desired precision does not exist on a
particular machine, the program will fail rather than get the
wrong answer. Since coding in this style would involve an early
call to the routine that will fail, this is the next best thing to
not compiling.
Supported Kinds of Ints and Floats
Complex numbers are treated separately below, since Python can be
built without them.
Each Python compiler may define as many "kinds" of integer and
floating point numbers as it likes, except that it must support at
least two kinds of integer corresponding to the existing int and
long, and must support at least one kind of floating point number,
equivalent to the present float.
The range and precision of these required kinds are processor
dependent, as at present, except for the "long integer" kind,
which can hold an arbitrary integer.
The built-in functions int(), long(), and float() convert inputs
to these default kinds as they do at present. (Note that a
Unicode string is actually a different "kind" of string and that a
sufficiently knowledgeable person might be able to expand this PEP
to cover that case.)
Within each type (integer, floating) the compiler supports a
linearly-ordered set of kinds, with the ordering determined by the
ability to hold numbers of an increased range and/or precision.
Kind Objects
Two new standard functions are defined in a module named "kinds".
They return callable objects called kind objects. Each int or
floating kind object f has the signature result = f(x), and each
complex kind object has the signature result = f(x, y=0.).
int_kind(n)
For an integer argument n >= 1, return a callable object whose
result is an integer kind that will hold an integer number in
the open interval (-10**n,10**n). The kind object accepts
arguments that are integers including longs. If n == 0,
returns the kind object corresponding to the Python literal 0.
float_kind(nd, n)
For nd >= 0 and n >= 1, return a callable object whose result
is a floating point kind that will hold a floating-point
number with at least nd digits of precision and a base-10
exponent in the closed interval [-n, n]. The kind object
accepts arguments that are integer or float.
If nd and n are both zero, returns the kind object
corresponding to the Python literal 0.0.
The compiler will return a kind object corresponding to the least
of its available set of kinds for that type that has the desired
properties. If no kind with the desired qualities exists in a
given implementation an OverflowError exception is thrown. A kind
function converts its argument to the target kind, but if the
result does not fit in the target kind's range, an OverflowError
exception is thrown.
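A hypothetical sketch of that selection rule (the kinds module was never added to the standard library; a stock CPython offers a single float kind, equivalent to a C double, whose traits sys.float_info reports):

```python
import sys

# Available float kinds, ordered lowest to highest:
# (name, decimal digits of precision, max base-10 exponent).
AVAILABLE_FLOAT_KINDS = [
    ('double', sys.float_info.dig, sys.float_info.max_10_exp),
]

def float_kind(nd, n):
    # Return the least kind holding nd digits and exponent range [-n, n],
    # or raise OverflowError as the specification requires.
    for name, dig, exp10 in AVAILABLE_FLOAT_KINDS:
        if dig >= nd and exp10 >= n:
            return name
    raise OverflowError('no float kind with %d digits, range 10**%d' % (nd, n))

assert float_kind(15, 300) == 'double'   # C double: 15 digits, exponent ~308
```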
Besides their callable behavior, kind objects have attributes
giving the traits of the kind in question.
1. name is the name of the kind. The standard kinds are called
int, long, double.
2. typecode is a single-letter string that would be appropriate
for use with Numeric or module array to form an array of this
kind. The standard types' typecodes are 'i', 'O', 'd'
respectively.
3. Integer kinds have these additional attributes: MAX, equal to
the maximum permissible integer of this kind, or None for the
long kind. MIN, equal to the most negative permissible integer
of this kind, or None for the long kind.
4. Float kinds have these additional attributes whose properties
are equal to the corresponding value for the corresponding C
type in the standard header file "float.h". MAX, MIN, DIG,
MANT_DIG, EPSILON, MAX_EXP, MAX_10_EXP, MIN_EXP, MIN_10_EXP,
RADIX, ROUNDS (== FLT_RADIX, FLT_ROUNDS in float.h). These
values are of type integer except for MAX, MIN, and EPSILON,
which are of the Python floating type to which the kind
corresponds.
Attributes of Module kinds
int_kinds is a list of the available integer kinds, sorted from lowest
to highest kind. By definition, int_kinds[-1] is the
long kind.
float_kinds is a list of the available floating point kinds, sorted
from lowest to highest kind.
default_int_kind is the kind object corresponding to the Python
literal 0
default_long_kind is the kind object corresponding to the Python
literal 0L
default_float_kind is the kind object corresponding to the Python
literal 0.0
Complex Numbers
If supported, complex numbers have real and imaginary parts that
are floating-point numbers with the same kind. A Python compiler
must support a complex analog of each floating point kind it
supports, if it supports complex numbers at all.
If complex numbers are supported, the following are available in
module kinds:
complex_kind(nd, n)
Return a callable object whose result is a complex kind that
will hold a complex number each of whose components (.real,
.imag) is of kind float_kind(nd, n). The kind object will
accept one argument that is of any integer, real, or complex
kind, or two arguments, each integer or real.
complex_kinds is a list of the available complex kinds, sorted
from lowest to highest kind.
default_complex_kind is the kind object corresponding to the
Python literal 0.0j. The name of this kind
is doublecomplex, and its typecode is 'D'.
Complex kind objects have these additional attributes:
floatkind is the kind object of the corresponding float type.
Examples
In module myprecision.py:
import kinds
tinyint = kinds.int_kind(1)
single = kinds.float_kind(6, 90)
double = kinds.float_kind(15, 300)
csingle = kinds.complex_kind(6, 90)
In the rest of my code:
from myprecision import tinyint, single, double, csingle
n = tinyint(3)
x = double(1.e20)
z = 1.2
# builtin float gets you the default float kind, properties unknown
w = x * float(x)
# but in the following case we know w has kind "double".
w = x * double(z)
u = csingle(x + z * 1.0j)
u2 = csingle(x+z, 1.0)
Note how the entire calculation can then be switched to a higher
precision by changing the arguments in myprecision.py.
Comment: note that you aren't promised that single != double; but
you are promised that double(1.e20) will hold a number with 15
decimal digits of precision and a range up to 10**300 or that the
float_kind call will fail.
Open Issues
No open issues have been raised at this time.
Rejection
This PEP has been closed by the author. The kinds module will not
be added to the standard library.
There was no opposition to the proposal but only mild interest in
using it, not enough to justify adding the module to the standard
library. Instead, it will be made available as a separate
distribution item at the Numerical Python site. At the next
release of Numerical Python, it will no longer be a part of the
Numeric distribution.
Copyright
This document has been placed in the public domain.
pep-0243 Module Repository Upload Mechanism
| PEP: | 243 |
|---|---|
| Title: | Module Repository Upload Mechanism |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Sean Reifschneider <jafo-pep at tummy.com> |
| Discussions-To: | <distutils-sig at python.org> |
| Status: | Withdrawn |
| Type: | Standards Track |
| Created: | 18-Mar-2001 |
| Python-Version: | 2.1 |
| Post-History: | 20-Mar-2001, 24-Mar-2001 |
Abstract
For a module repository system (such as Perl's CPAN) to be
successful, it must be as easy as possible for module authors to
submit their work. An obvious place for this submit to happen is
in the Distutils tools after the distribution archive has been
successfully created. For example, after a module author has
tested their software (verifying the results of "setup.py sdist"),
they might type "setup.py sdist --submit". This would flag
Distutils to submit the source distribution to the archive server
for inclusion and distribution to the mirrors.
This PEP only deals with the mechanism for submitting the software
distributions to the archive, and does not deal with the actual
archive/catalog server.
Upload Process
The upload will include the Distutils "PKG-INFO" meta-data
information (as specified in PEP-241 [1]), the actual software
distribution, and other optional information. This information
will be uploaded as a multi-part form encoded the same as a
regular HTML file upload request. This form is posted using
ENCTYPE="multipart/form-data" encoding [2].
The upload will be made to the host "www.python.org" on port
80/tcp (POST http://www.python.org:80/pypi). The form
will consist of the following fields:
distribution -- The file containing the module software (for
example, a .tar.gz or .zip file).
distmd5sum -- The MD5 hash of the uploaded distribution,
encoded in ASCII representing the hexadecimal representation
of the digest ("for byte in digest: s = s + ('%02x' %
ord(byte))").
pkginfo (optional) -- The file containing the distribution
meta-data (as specified in PEP-241 [1]). Note that if this is
not included, the distribution file is expected to be in .tar
format (gzip and bzip2 compression are allowed) or .zip
format, with a "PKG-INFO" file in the top-level directory it
extracts to ("package-1.00/PKG-INFO").
infomd5sum (required if pkginfo field is present) -- The MD5 hash
of the uploaded meta-data, encoded in ASCII representing the
hexadecimal representation of the digest ("for byte in digest:
s = s + ('%02x' % ord(byte))").
platform (optional) -- A string representing the target
platform for this distribution. This is only for binary
distributions. It is encoded as
"<os_name>-<os_version>-<platform architecture>-<python
version>".
signature (optional) -- An OpenPGP-compatible signature [3] of
the uploaded distribution as signed by the author. This may
be used by the cataloging system to automate acceptance of
uploads.
protocol_version -- A string indicating the protocol version that
the client supports. This document describes protocol version "1".
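The digest loop quoted in the field descriptions above is Python 2 idiom (iterating bytes yields characters there); with hashlib the same lowercase-hex ASCII encoding is one call, as this sketch shows:

```python
import hashlib

data = b'fake distribution contents'        # stand-in for the uploaded file

# The PEP's loop, in Python 3 form (bytes iterate as ints, not chars):
digest = hashlib.md5(data).digest()
hexed = ''.join('%02x' % byte for byte in digest)

# hexdigest() produces the identical ASCII encoding directly.
assert hexed == hashlib.md5(data).hexdigest()
assert len(hexed) == 32                     # MD5: 16 bytes -> 32 hex chars
```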
Return Data
The status of the upload will be reported using non-standard
HTTP ("X-*") headers. The "X-Swalow-Status" header may have the
following values:
SUCCESS -- Indicates that the upload has succeeded.
FAILURE -- The upload is, for some reason, unable to be
processed.
TRYAGAIN -- The server is unable to accept the upload at this
time, but the client should try again at a later time.
Potential causes of this are resource shortages on the server,
administrative down-time, etc...
Optionally, there may be a "X-Swalow-Reason" header which includes a
human-readable string which provides more detailed information about
the "X-Swalow-Status".
If there is no "X-Swalow-Status" header, or it does not contain one of
the three strings above, it should be treated as a temporary failure.
Example:
>>> f = urllib.urlopen('http://www.python.org:80/pypi')
>>> s = f.headers['x-swalow-status']
>>> s = s + ': ' + f.headers.get('x-swalow-reason', '<None>')
>>> print s
FAILURE: Required field "distribution" missing.
Sample Form
The upload client must submit the page in the same form as
Netscape Navigator version 4.76 for Linux produces when presented
with the following form:
<H1>Upload file</H1>
<FORM NAME="fileupload" METHOD="POST" ACTION="pypi"
ENCTYPE="multipart/form-data">
<INPUT TYPE="file" NAME="distribution"><BR>
<INPUT TYPE="text" NAME="distmd5sum"><BR>
<INPUT TYPE="file" NAME="pkginfo"><BR>
<INPUT TYPE="text" NAME="infomd5sum"><BR>
<INPUT TYPE="text" NAME="platform"><BR>
<INPUT TYPE="text" NAME="signature"><BR>
<INPUT TYPE="hidden" NAME="protocol_version" VALUE="1"><BR>
<INPUT TYPE="SUBMIT" VALUE="Upload">
</FORM>
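For illustration, the multipart/form-data body that such a form submission produces can be hand-built; the boundary string and field values below are arbitrary placeholders, not prescribed by the PEP:

```python
# Build a minimal multipart/form-data request body (RFC 1867 style).
boundary = 'XxXxBoundaryXxXx'
fields = [
    ('protocol_version', '1'),
    ('distmd5sum', 'd41d8cd98f00b204e9800998ecf8427e'),  # placeholder hash
]

lines = []
for name, value in fields:
    lines += ['--' + boundary,
              'Content-Disposition: form-data; name="%s"' % name,
              '',                      # blank line separates headers from value
              value]
lines += ['--' + boundary + '--', '']  # closing marker
body = '\r\n'.join(lines)

content_type = 'multipart/form-data; boundary=' + boundary
assert body.count('--' + boundary) == 3   # one per field plus the close
```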
Platforms
The following are valid os names:
aix beos debian dos freebsd hpux mac macos mandrake netbsd
openbsd qnx redhat solaris suse windows yellowdog
The above include a number of different types of distributions of
Linux. Because of versioning issues these must be split out, and
it is expected that when it makes sense for one system to use
distributions made on other similar systems, the download client
will make the distinction.
Version is the official version string specified by the vendor for
the particular release. For example, "2000" and "nt" (Windows),
"9.04" (HP-UX), "7.0" (RedHat, Mandrake).
The following are valid architectures:
alpha hppa ix86 powerpc sparc ultrasparc
Status
I currently have a proof-of-concept client and server implemented.
I plan to have the Distutils patches ready for the 2.1 release.
Combined with Andrew's PEP-241 [1] for specifying distribution
meta-data, I hope to have a platform which will allow us to gather
real-world data for finalizing the catalog system for the 2.2
release.
References
[1] Metadata for Python Software Package, Kuchling,
http://www.python.org/dev/peps/pep-0241/
[2] RFC 1867, Form-based File Upload in HTML
http://www.faqs.org/rfcs/rfc1867.html
[3] RFC 2440, OpenPGP Message Format
http://www.faqs.org/rfcs/rfc2440.html
Copyright
This document has been placed in the public domain.
pep-0244 The `directive' statement
| PEP: | 244 |
|---|---|
| Title: | The `directive' statement |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Martin von Löwis <martin at v.loewis.de> |
| Status: | Rejected |
| Type: | Standards Track |
| Created: | 20-Mar-2001 |
| Python-Version: | 2.1 |
| Post-History: |
Motivation
From time to time, Python makes an incompatible change to the
advertised semantics of core language constructs, or changes their
accidental (implementation-dependent) behavior in some way. While
this is never done capriciously, and is always done with the aim
of improving the language over the long term, over the short term
it's contentious and disruptive.
PEP 5, Guidelines for Language Evolution [1] suggests ways to
ease the pain, and this PEP introduces some machinery in support
of that.
PEP 227, Statically Nested Scopes [2] is the first application,
and will be used as an example here.
When a new, potentially incompatible language feature is added,
some modules and libraries may choose to use it, while others may
not. This specification introduces a syntax where a module author
can denote whether a certain language feature is used in the
module or not.
In discussion of this PEP, readers commented that there are two
kinds of "settable" language features:
- those that are designed to eventually become the only option, at
which time specifying use of them is no longer necessary. The
features for which the syntax of PEP 236, Back to the
__future__ [3], was proposed fall into this category. This PEP
supports declaring such features, and supports phasing out the
"old" meaning of constructs whose semantics have changed under
the new feature. However, it defines no policy as to what
features must be phased out eventually.
- those which are designed to stay optional forever, e.g. if they
change some default setting in the interpreter. An example for
such settings might be the request to always emit line-number
instructions for a certain module; no specific flags of that
kind are proposed in this specification.
Since a primary goal of this PEP is to support new language
constructs without immediately breaking old libraries, special
care was taken not to break old libraries by introducing the new
syntax.
Syntax
A directive_statement is a statement of the form
directive_statement: 'directive' NAME [atom] [';'] NEWLINE
The name in the directive indicates the kind of the directive; it
defines whether the optional atom can be present, and whether
there are further syntactical or semantical restrictions to the
atom. In addition, depending on the name of the directive,
certain additional syntactical or semantical restrictions may be
placed on the directive (e.g. placement of the directive in the
module may be restricted to the top of the module).
In the directive_statement, 'directive' is a new
keyword. According to [1], this keyword is initially considered as
a keyword only when used in a directive statement, see "Backwards
Compatibility" below.
Semantics
A directive statement instructs the Python interpreter to process
a source file in a different way; the specific details of that
processing depend on the directive name. The optional atom is
typically interpreted when the source code is processed; details
of that interpretation depend on the directive.
Specific Directives: transitional
If a syntactical or semantical change is added to Python which is
incompatible, [1] mandates a transitional evolution of the
language, where the new feature is initially available alongside
with the old one. Such a transition is possible by means of the
transitional directive.
In a transitional directive, the NAME is 'transitional'. The atom
MUST be present, and it MUST be a NAME. The possible values for
that name are defined when the language change is defined. One
example for such a directive is
directive transitional nested_scopes
The transitional directive MUST occur before any other
statement in a module, except for the documentation string
(i.e. it may appear as the second statement of a module only if
the first statement is a STRING+).
Backwards Compatibility
Introducing 'directive' as a new keyword might cause
incompatibilities with existing code. Following the guideline in
[1], in the initial implementation of this specification,
directive is a new keyword only if it was used in a valid
directive_statement (i.e. if it appeared as the first non-string
token in a module).
Unresolved Problems: directive as the first identifier
Using directive in a module as
directive = 1
(i.e. the name directive appears as the first thing in a module)
will treat it as a keyword, not as an identifier. It would be possible
to classify it as a NAME with an additional look-ahead token, but
such look-ahead is not available in the Python tokenizer.
Questions and Answers
Q: It looks like this PEP was written to allow definition of source
code character sets. Is that true?
A: No. Even though the directive facility can be extended to
allow source code encodings, no specific directive is proposed.
Q: Then why was this PEP written at all?
A: It acts as a counter-proposal to [3], which proposes to
overload the import statement with a new meaning. This PEP
allows the problem to be solved in a more general way.
Q: But isn't mixing source encodings and language changes like
mixing apples and oranges?
A: Perhaps. To address the difference, the predefined
"transitional" directive has been defined.
References and Footnotes
[1] PEP 5, Guidelines for Language Evolution, Prescod
http://www.python.org/dev/peps/pep-0005/
[2] PEP 227, Statically Nested Scopes, Hylton
http://www.python.org/dev/peps/pep-0227/
[3] PEP 236, Back to the __future__, Peters
http://www.python.org/dev/peps/pep-0236/
Copyright
This document has been placed in the public domain.
pep-0245 Python Interface Syntax
| PEP: | 245 |
|---|---|
| Title: | Python Interface Syntax |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Michel Pelletier <michel at users.sourceforge.net> |
| Discussions-To: | http://www.zope.org/Wikis/Interfaces |
| Status: | Rejected |
| Type: | Standards Track |
| Created: | 11-Jan-2001 |
| Python-Version: | 2.2 |
| Post-History: | 21-Mar-2001 |
Rejection Notice
I'm rejecting this PEP. It's been five years now. While at some
point I expect that Python will have interfaces, it would be naive
to expect it to resemble the syntax in this PEP. Also, PEP 246 is
being rejected in favor of something completely different; interfaces
won't play a role in adaptation or whatever will replace it. GvR.
Introduction
This PEP describes a proposed syntax for creating interface
objects in Python.
Overview
In addition to thinking about adding a static type system to
Python, the Types-SIG was also charged to devise an interface
system for Python. In December of 1998, Jim Fulton released a
prototype interfaces system based on discussions from the SIG.
Many of the issues and background information on this discussion
and prototype can be found in the SIG archives[1].
Around the end of 2000, Digital Creations began thinking about
better component model designs for Zope[2]. Zope's future
component model relies heavily on interface objects. This led to
further development of Jim's "Scarecrow" interfaces prototype.
Starting with version 2.3, Zope comes with an Interface package as
standard software. Zope's Interface package is used as the
reference implementation for this PEP.
The syntax proposed by this PEP relies on syntax enhancements
described in PEP 232 [3] and describes an underlying framework
which PEP 233 [4] could be based upon. There is some work being
done with regard to interface objects and Proxy objects, so for
those optional parts of this PEP you may want to see[5].
The Problem
Interfaces are important because they solve a number of problems
that arise while developing software:
- There are many implied interfaces in Python, commonly referred
to as "protocols". Currently determining those protocols is
based on implementation introspection, but often that also
fails. For example, defining __getitem__ implies both a
sequence and a mapping (the former with sequential, integer
keys). There is no way for the developer to be explicit about
which protocols the object intends to implement.
- Python is limited, from the developer's point of view, by the
split between types and classes. When types are expected, the
consumer uses code like 'type(foo) == type("")' to determine if
'foo' is a string. When instances of classes are expected, the
consumer uses 'isinstance(foo, MyString)' to determine if 'foo'
is an instance of the 'MyString' class. There is no unified
model for determining if an object can be used in a certain,
valid way.
- Python's dynamic typing is very flexible and powerful, but it
does not have the advantage of statically typed languages that
provide type checking. Statically typed languages provide you with
much more type safety, but are often overly verbose because
objects can only be generalized by common subclassing and used
specifically with casting (for example, in Java).
There are also a number of documentation problems that interfaces
try to solve.
- Developers waste a lot of time looking at the source code of
your system to figure out how objects work.
- Developers who are new to your system may misunderstand how your
objects work, causing, and possibly propagating, usage errors.
- Because a lack of interfaces means usage is inferred from the
source, developers may end up using methods and attributes that
are meant for "internal use only".
- Code inspection can be hard, and very discouraging to novice
programmers trying to properly understand code written by gurus.
- A lot of time is wasted when many people try very hard to
understand obscurity (like undocumented software). Effort spent
up front documenting interfaces will save much of this time in
the end.
Interfaces try to solve these problems by providing a way for you
to specify a contractual obligation for your object, documentation
on how to use an object, and a built-in mechanism for discovering
the contract and the documentation.
Python has very useful introspection features. It is well known
that this makes exploring concepts in the interactive interpreter
easier, because Python gives you the ability to look at all kinds
of information about the objects: the type, doc strings, instance
dictionaries, base classes, unbound methods and more.
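For illustration (using present-day Python 3 syntax; the Fish class is purely hypothetical), those introspection features look like this:

```python
class Fish:
    """A fish with a color."""
    def __init__(self, color):
        self.color = color

f = Fish('red')

# Each of these inspects a different facet of the object:
print(type(f))         # the instance's type
print(Fish.__doc__)    # the class doc string
print(f.__dict__)      # the instance dictionary
print(Fish.__bases__)  # the base classes
```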
Many of these features are oriented toward introspecting, using
and changing the implementation of software, and one of them ("doc
strings") is oriented toward providing documentation. This
proposal describes an extension to this natural introspection
framework that describes an object's interface.
Overview of the Interface Syntax
For the most part, the syntax of interfaces is very much like the
syntax of classes, but future needs, or needs brought up in
discussion, may define new possibilities for interface syntax.
A formal BNF description of the syntax is given later in the PEP;
for the purposes of illustration, here is an example of two
different interfaces created with the proposed syntax:
interface CountFishInterface:
    "Fish counting interface"

    def oneFish():
        "Increments the fish count by one"

    def twoFish():
        "Increments the fish count by two"

    def getFishCount():
        "Returns the fish count"

interface ColorFishInterface:
    "Fish coloring interface"

    def redFish():
        "Sets the current fish color to red"

    def blueFish():
        "Sets the current fish color to blue"

    def getFishColor():
        "This returns the current fish color"
This code, when evaluated, will create two interfaces called
`CountFishInterface' and `ColorFishInterface'. These interfaces
are defined by the `interface' statement.
The prose documentation for the interfaces and their methods comes
from doc strings. The method signature information comes from the
signatures of the `def' statements. Notice how there is no body
for the def statements. The interface does not implement a
service to anything; it merely describes one. Documentation
strings on interfaces and interface methods are mandatory; a
'pass' statement cannot be provided. The interface equivalent of
a pass statement is an empty doc string.
You can also create interfaces that "extend" other interfaces.
Here, you can see a new type of Interface that extends the
CountFishInterface and ColorFishInterface:
interface FishMarketInterface(CountFishInterface, ColorFishInterface):
    "This is the documentation for the FishMarketInterface"

    def getFishMonger():
        "Returns the fish monger you can interact with"

    def hireNewFishMonger(name):
        "Hire a new fish monger"

    def buySomeFish(quantity=1):
        "Buy some fish at the market"
The FishMarketInterface extends the CountFishInterface and
ColorFishInterface.
Interface Assertion
The next step is to put classes and interfaces together by
creating a concrete Python class that asserts that it implements
an interface. Here is an example FishMarket component that might
do this:
class FishError(Error):
    pass

class FishMarket implements FishMarketInterface:
    number = 0
    color = None
    monger_name = 'Crusty Barnacles'

    def __init__(self, number, color):
        self.number = number
        self.color = color

    def oneFish(self):
        self.number += 1

    def twoFish(self):
        self.number += 2

    def redFish(self):
        self.color = 'red'

    def blueFish(self):
        self.color = 'blue'

    def getFishCount(self):
        return self.number

    def getFishColor(self):
        return self.color

    def getFishMonger(self):
        return self.monger_name

    def hireNewFishMonger(self, name):
        self.monger_name = name

    def buySomeFish(self, quantity=1):
        if quantity > self.number:
            raise FishError("There's not enough fish")
        self.number -= quantity
        return quantity
This new class, FishMarket, defines a concrete class which
implements the FishMarketInterface. The object following the
`implements' statement is called an "interface assertion". An
interface assertion can be either an interface object, or tuple of
interface assertions.
The interface assertion provided in a `class' statement like this
is stored in the class's `__implements__' class attribute. After
interpreting the above example, you would have a class that can
be examined like this with an 'implements' built-in function:
>>> FishMarket
<class FishMarket at 8140f50>
>>> FishMarket.__implements__
(<Interface FishMarketInterface at 81006f0>,)
>>> f = FishMarket(6, 'red')
>>> implements(f, FishMarketInterface)
1
>>>
A class can realize more than one interface. For example, say you
had an interface called `ItemInterface' that described how an
object worked as an item in a container object. If you wanted to
assert that FishMarket instances realized the ItemInterface
interface as well as the FishMarketInterface, you can provide an
interface assertion containing a tuple of interface objects to
the FishMarket class:
class FishMarket implements FishMarketInterface, ItemInterface:
    # ...
Interface assertions can also be used if you want to assert that
one class implements an interface, and all of the interfaces that
another class implements:
class MyFishMarket implements FishMarketInterface, ItemInterface:
    # ...

class YourFishMarket implements FooInterface, MyFishMarket.__implements__:
    # ...
This new class, YourFishMarket, asserts that it implements the
FooInterface, as well as the interfaces implemented by the
MyFishMarket class.
It's worth going into a little bit more detail about interface
assertions. An interface assertion is either an interface object,
or a tuple of interface assertions. For example:
FooInterface
FooInterface, (BarInterface, BobInterface)
FooInterface, (BarInterface, (BobInterface, MyClass.__implements__))
are all valid interface assertions. When two interfaces define
the same attributes, the order in which information is preferred
in the assertion is from top-to-bottom, left-to-right.
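That precedence can be pictured as a simple flattening of the nested assertion. The following helper is hypothetical (not part of the reference implementation), and plain classes stand in for interface objects:

```python
def flatten_assertion(assertion):
    """Flatten an interface assertion (an interface or a nested
    tuple of assertions) into a list of interfaces, preserving
    the top-to-bottom, left-to-right order of preference."""
    if isinstance(assertion, tuple):
        result = []
        for item in assertion:
            result.extend(flatten_assertion(item))
        return result
    return [assertion]

# Plain classes stand in for interface objects:
class FooInterface: pass
class BarInterface: pass
class BobInterface: pass

order = flatten_assertion((FooInterface, (BarInterface, BobInterface)))
# order lists FooInterface first, then BarInterface, then BobInterface
```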
There are other interface proposals that, in the interest of
simplicity, have combined the notion of class and interface to
provide simple interface enforcement. Interface objects have a
`deferred' method that returns a deferred class that implements
this behavior:
>>> FM = FishMarketInterface.deferred()
>>> class MyFM(FM): pass
>>> f = MyFM()
>>> f.getFishMonger()
Traceback (innermost last):
File "<stdin>", line 1, in ?
Interface.Exceptions.BrokenImplementation:
An object has failed to implement interface FishMarketInterface
The getFishMonger attribute was not provided.
>>>
This provides for a bit of passive interface enforcement by
telling you what you forgot to do to implement that interface.
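The effect of such a deferred class can be sketched in ordinary Python by generating method stubs that raise an exception; make_deferred and BrokenImplementation here are illustrative names, not the reference implementation's API:

```python
class BrokenImplementation(Exception):
    pass

def make_deferred(name, method_names):
    """Build a class whose methods all raise BrokenImplementation,
    so a subclass is told exactly what it forgot to implement."""
    def make_stub(method_name):
        def stub(self, *args, **kwargs):
            raise BrokenImplementation(
                "The %s attribute was not provided." % method_name)
        return stub
    attrs = {m: make_stub(m) for m in method_names}
    return type(name, (object,), attrs)

FM = make_deferred('FishMarketDeferred', ['getFishMonger'])

class MyFM(FM):
    pass  # forgot to implement getFishMonger

f = MyFM()
# f.getFishMonger() now raises BrokenImplementation
```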
Formal Interface Syntax
Python syntax is defined in a modified BNF grammar notation
described in the Python Reference Manual [8]. This section
describes the proposed interface syntax using this grammar:
interfacedef: "interface" interfacename [extends] ":" suite
extends: "(" [expression_list] ")"
interfacename: identifier
An interface definition is an executable statement. It first
evaluates the extends list, if present. Each item in the extends
list should evaluate to an interface object.
The interface's suite is then executed in a new execution frame
(see the Python Reference Manual, section 4.1), using a newly
created local namespace and the original global namespace. When
the interface's suite finishes execution, its execution frame is
discarded but its local namespace is saved as interface elements.
An interface object is then created using the extends list for the
base interfaces and the saved interface elements. The interface
name is bound to this interface object in the original local
namespace.
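Without the new keyword, these semantics can be approximated in today's Python by building an interface object from an extends list and a namespace of saved elements. This Interface class is a rough illustrative sketch, not the Zope reference implementation:

```python
class Interface:
    """Toy interface object: a name, base interfaces, and a
    namespace of element names mapped to their doc strings."""
    def __init__(self, name, bases, elements):
        self.__name__ = name
        self.__bases__ = tuple(bases)
        self.elements = dict(elements)

    def names(self):
        """All element names, including extended interfaces'."""
        seen = {}
        for base in self.__bases__:
            seen.update(dict.fromkeys(base.names()))
        seen.update(dict.fromkeys(self.elements))
        return list(seen)

CountFishInterface = Interface('CountFishInterface', (), {
    'oneFish': "Increments the fish count by one",
    'getFishCount': "Returns the fish count",
})
FishMarketInterface = Interface(
    'FishMarketInterface', (CountFishInterface,), {
        'getFishMonger': "Returns the fish monger you can interact with",
    })
```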
This PEP also proposes an extension to Python's 'class' statement:
classdef: "class" classname [inheritance] [implements] ":" suite
implements: "implements" implist
implist: expression-list
classname, inheritance, suite, expression-list: see the Python Reference Manual
Before a class's suite is executed, the 'inheritance' and
'implements' statements are evaluated, if present. The
'inheritance' behavior is unchanged as defined in Section 7.6 of
the Language Reference.
The 'implements' clause, if present, is evaluated after inheritance.
This must evaluate to an interface specification, which is either
an interface, or a tuple of interface specifications. If a valid
interface specification is present, the assertion is assigned to
the class object's '__implements__' attribute, as a tuple.
This PEP does not propose any changes to the syntax of function
definitions or assignments.
Classes and Interfaces
The example interfaces above do not describe any kind of behavior
for their methods; they just describe an interface that a typical
FishMarket object would realize.
You may notice a similarity between interfaces extending from
other interfaces and classes sub-classing from other classes.
This is a similar concept. However it is important to note that
interfaces extend interfaces and classes subclass classes. You
cannot extend a class or subclass an interface. Classes and
interfaces are separate.
The purpose of a class is to share the implementation of how an
object works. The purpose of an interface is to document how to
work with an object, not how the object is implemented. It is
possible to have several different classes with very different
implementations realize the same interface.
It's also possible to implement one interface with many classes
that mix in pieces of the interface's functionality or,
conversely, it's possible to have one class implement many
interfaces. Because of this, interfaces and classes should not be
confused or intermingled.
Interface-aware built-ins
A useful extension to Python's list of built-in functions in the
light of interface objects would be `implements()'. This builtin
would expect two arguments, an object and an interface, and return
a true value if the object implements the interface, false
otherwise. For example:
>>> interface FooInterface: pass
>>> class Foo implements FooInterface: pass
>>> f = Foo()
>>> implements(f, FooInterface)
1
Currently, this functionality exists in the reference
implementation as functions in the `Interface' package, requiring
an "import Interface" to use it. Its existence as a built-in
would be purely for a convenience, and not necessary for using
interfaces, and analogous to `isinstance()' for classes.
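A minimal implements() along those lines might walk the (possibly nested) __implements__ assertion. This is a sketch of the idea, not the Interface package's actual code, and uses plain classes as stand-in interface objects:

```python
def implements(obj, interface):
    """Return True if obj's class asserts, directly or inside a
    nested assertion tuple, that it implements `interface`."""
    def walk(assertion):
        if assertion is interface:
            return True
        if isinstance(assertion, tuple):
            return any(walk(item) for item in assertion)
        return False
    return walk(getattr(obj.__class__, '__implements__', ()))

class FooInterface: pass       # stand-in interface object

class Foo:
    __implements__ = (FooInterface,)

print(implements(Foo(), FooInterface))     # True
print(implements(object(), FooInterface))  # False
```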
Backward Compatibility
The proposed interface model does not introduce any backward
compatibility issues in Python. The proposed syntax, however,
does.
Any existing code that uses `interface' as an identifier will
break, and there may be other kinds of backwards incompatibility
that defining `interface' as a new keyword will introduce. Apart
from the new keyword, this extension to Python's syntax does not
change any existing syntax in any backward incompatible way.
The new `from __future__' Python syntax [6] and the new warning
framework [7] are ideal for resolving this backward
incompatibility. To use interface syntax now, a developer could
use the statement:
    from __future__ import interfaces
In addition, any code that uses the keyword `interface' as an
identifier will be issued a warning from Python. After the
appropriate period of time, the interface syntax would become
standard, the above import statement would do nothing, and any
identifiers named `interface' would raise an exception. This
period of time is proposed to be 24 months.
Summary of Proposed Changes to Python
- Add a new `interface' keyword and extend the class syntax with
`implements'.
- Extend the class interface to include `__implements__'.
- Add an `implements(obj, interface)' built-in.
Risks
This PEP proposes adding one new keyword to the Python language,
`interface'. This will break code.
Open Issues
Goals
Syntax
Architecture
Dissenting Opinion
This PEP has not yet been discussed on python-dev.
References
[1] http://mail.python.org/pipermail/types-sig/1998-December/date.html
[2] http://www.zope.org
[3] PEP 232, Function Attributes, Warsaw
http://www.python.org/dev/peps/pep-0232/
[4] PEP 233, Python Online Help, Prescod
http://www.python.org/dev/peps/pep-0233/
[5] http://www.lemburg.com/files/python/mxProxy.html
[6] PEP 236, Back to the __future__, Peters
http://www.python.org/dev/peps/pep-0236/
[7] PEP 230, Warning Framework, van Rossum
http://www.python.org/dev/peps/pep-0230/
Copyright
This document has been placed in the public domain.
pep-0246 Object Adaptation
| PEP: | 246 |
|---|---|
| Title: | Object Adaptation |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Alex Martelli <aleaxit at gmail.com>, Clark C. Evans <cce at clarkevans.com> |
| Status: | Rejected |
| Type: | Standards Track |
| Created: | 21-Mar-2001 |
| Python-Version: | 2.5 |
| Post-History: | 29-Mar-2001, 10-Jan-2005 |
Rejection Notice
I'm rejecting this PEP. Something much better is about to happen;
it's too early to say exactly what, but it's not going to resemble
the proposal in this PEP too closely so it's better to start a new
PEP. GvR.
Abstract
This proposal puts forth an extensible cooperative mechanism for
the adaptation of an incoming object to a context which expects an
object supporting a specific protocol (say a specific type, class,
or interface).
This proposal provides a built-in "adapt" function that, for any
object X and any protocol Y, can be used to ask the Python
environment for a version of X compliant with Y. Behind the
scenes, the mechanism asks object X: "Are you now, or do you know
how to wrap yourself to provide, a supporter of protocol Y?".
And, if this request fails, the function then asks protocol Y:
"Does object X support you, or do you know how to wrap it to
obtain such a supporter?" This duality is important, because
protocols can be developed after objects are, or vice-versa, and
this PEP lets either case be supported non-invasively with regard
to the pre-existing component[s].
Lastly, if neither the object nor the protocol know about each
other, the mechanism may check a registry of adapter factories,
where callables able to adapt certain objects to certain protocols
can be registered dynamically. This part of the proposal is
optional: the same effect could be obtained by ensuring that
certain kinds of protocols and/or objects can accept dynamic
registration of adapter factories, for example via suitable custom
metaclasses. However, this optional part allows adaptation to be
made more flexible and powerful in a way that is not invasive to
either protocols or other objects, thereby gaining for adaptation
much the same kind of advantage that Python standard library's
"copy_reg" module offers for serialization and persistence.
This proposal does not specifically constrain what a protocol
_is_, what "compliance to a protocol" exactly _means_, nor what
precisely a wrapper is supposed to do. These omissions are
intended to leave this proposal compatible with both existing
categories of protocols, such as the existing system of type and
classes, as well as the many concepts for "interfaces" as such
which have been proposed or implemented for Python, such as the
one in PEP 245 [1], the one in Zope3 [2], or the ones discussed in
the BDFL's Artima blog in late 2004 and early 2005 [3]. However,
some reflections on these subjects, intended to be suggestive and
not normative, are also included.
Motivation
Currently there is no standardized mechanism in Python for
checking if an object supports a particular protocol. Typically,
existence of certain methods, particularly special methods such as
__getitem__, is used as an indicator of support for a particular
protocol. This technique works well for a few specific protocols
blessed by the BDFL (Benevolent Dictator for Life). The same can
be said for the alternative technique based on checking
'isinstance' (the built-in class "basestring" exists specifically
to let you use 'isinstance' to check if an object "is a [built-in]
string"). Neither approach is easily and generally extensible to
other protocols, defined by applications and third party
frameworks, outside of the standard Python core.
Even more important than checking if an object already supports a
given protocol can be the task of obtaining a suitable adapter
(wrapper or proxy) for the object, if the support is not already
there. For example, a string does not support the file protocol,
but you can wrap it into a StringIO instance to obtain an object
which does support that protocol and gets its data from the string
it wraps; that way, you can pass the string (suitably wrapped) to
subsystems which require as their arguments objects that are
readable as files. Unfortunately, there is currently no general,
standardized way to automate this extremely important kind of
"adaptation by wrapping" operation.
Typically, today, when you pass objects to a context expecting a
particular protocol, either the object knows about the context and
provides its own wrapper or the context knows about the object and
wraps it appropriately. The difficulty with these approaches is
that such adaptations are one-offs: they are not centralized in a
single place in the user's code, and are not executed with a
common technique. This lack of standardization increases code
duplication, with the same adapter occurring in more than one
place, or it encourages classes to be re-written instead of
adapted. In either case, maintainability suffers.
It would be very nice to have a standard function that can be
called upon to verify an object's compliance with a particular
protocol and provide for a wrapper if one is readily available --
all without having to hunt through each library's documentation
for the incantation appropriate to that particular, specific case.
Requirements
When considering an object's compliance with a protocol, there are
several cases to be examined:
a) When the protocol is a type or class, and the object has
exactly that type or is an instance of exactly that class (not
a subclass). In this case, compliance is automatic.
b) When the object knows about the protocol, and either considers
itself compliant, or knows how to wrap itself suitably.
c) When the protocol knows about the object, and either the object
already complies or the protocol knows how to suitably wrap the
object.
d) When the protocol is a type or class, and the object is a
member of a subclass. This is distinct from the first case (a)
above, since inheritance (unfortunately) does not necessarily
imply substitutability, and thus must be handled carefully.
e) When the context knows about the object and the protocol and
knows how to adapt the object so that the required protocol is
satisfied. This could use an adapter registry or similar
approaches.
The fourth case above is subtle. A break of substitutability can
occur when a subclass changes a method's signature, or restricts
the domains accepted for a method's arguments ("co-variance" on
argument types), or extends the co-domain to include return
values which the base class may never produce ("contra-variance"
on return types). While compliance based on class inheritance
_should_ be automatic, this proposal allows an object to signal
that it is not compliant with a base class protocol.
If Python gains some standard "official" mechanism for interfaces,
however, then the "fast-path" case (a) can and should be extended
to the protocol being an interface, and the object an instance of
a type or class claiming compliance with that interface. For
example, if the "interface" keyword discussed in [3] is adopted
into Python, the "fast path" of case (a) could be used, since
instantiable classes implementing an interface would not be
allowed to break substitutability.
Specification
This proposal introduces a new built-in function, adapt(), which
is the basis for supporting these requirements.
The adapt() function has three parameters:
- `obj', the object to be adapted
- `protocol', the protocol requested of the object
- `alternate', an optional object to return if the object could
not be adapted
A successful result of the adapt() function returns either the
object passed `obj', if the object is already compliant with the
protocol, or a secondary object `wrapper', which provides a view
of the object compliant with the protocol. The definition of
wrapper is deliberately vague, and a wrapper is allowed to be a
full object with its own state if necessary. However, the design
intention is that an adaptation wrapper should hold a reference to
the original object it wraps, plus (if needed) a minimum of extra
state which it cannot delegate to the wrapped object.
An excellent example of an adaptation wrapper is an instance of
StringIO which adapts an incoming string to be read as if it were
a text file: the wrapper holds a reference to the string, but
deals by itself with the "current point of reading" (from _where_
in the wrapped string the characters for the next, e.g.,
"readline" call will come), because it cannot delegate it to the
wrapped object (a string has no concept of "current point of
reading" nor anything else even remotely related to that concept).
A failure to adapt the object to the protocol raises an
AdaptationError (which is a subclass of TypeError), unless the
alternate parameter is supplied, in which case the alternate
argument is returned instead.
To enable the first case listed in the requirements, the adapt()
function first checks to see if the object's type or the object's
class are identical to the protocol. If so, then the adapt()
function returns the object directly without further ado.
To enable the second case, when the object knows about the
protocol, the object must have a __conform__() method. This
optional method takes two arguments:
- `self', the object being adapted
- `protocol', the protocol requested
Just like any other special method in today's Python, __conform__
is meant to be taken from the object's class, not from the object
itself (for all objects, except instances of "classic classes" as
long as we must still support the latter). This enables a
possible 'tp_conform' slot to be added to Python's type objects in
the future, if desired.
The object may return itself as the result of __conform__ to
indicate compliance. Alternatively, the object also has the
option of returning a wrapper object compliant with the protocol.
If the object knows it is not compliant although it belongs to a
type which is a subclass of the protocol, then __conform__ should
raise a LiskovViolation exception (a subclass of AdaptationError).
Finally, if the object cannot determine its compliance, it should
return None to enable the remaining mechanisms. If __conform__
raises any other exception, "adapt" just propagates it.
To enable the third case, when the protocol knows about the
object, the protocol must have an __adapt__() method. This
optional method takes two arguments:
- `self', the protocol requested
- `obj', the object being adapted
If the protocol finds the object to be compliant, it can return
obj directly. Alternatively, the method may return a wrapper
compliant with the protocol. If the protocol knows the object is
not compliant although it belongs to a type which is a subclass of
the protocol, then __adapt__ should raise a LiskovViolation
exception (a subclass of AdaptationError). Finally, when
compliance cannot be determined, this method should return None to
enable the remaining mechanisms. If __adapt__ raises any other
exception, "adapt" just propagates it.
The fourth case, when the object's class is a sub-class of the
protocol, is handled by the built-in adapt() function. Under
normal circumstances, if "isinstance(object, protocol)" then
adapt() returns the object directly. However, if the object is
not substitutable, either the __conform__() or __adapt__()
method, as mentioned above, may raise a LiskovViolation (a
subclass of AdaptationError) to prevent this default behavior.
If none of the first four mechanisms worked, as a last-ditch
attempt, 'adapt' falls back to checking a registry of adapter
factories, indexed by the protocol and the type of `obj', to meet
the fifth case. Adapter factories may be dynamically registered
and removed from that registry to provide "third party adaptation"
of objects and protocols that have no knowledge of each other, in
a way that is not invasive to either the object or the protocols.
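A registry-based third-party adaptation might be used roughly as follows. This is a self-contained sketch of the mechanism (FileProtocol and StringReader are invented for illustration), with the full reference implementation given later in the PEP:

```python
# Registry keyed by (type of object, protocol), mirroring the
# reference implementation's _adapter_factory_registry.
_registry = {}

def registerAdapterFactory(objtype, protocol, factory):
    _registry[objtype, protocol] = factory

class FileProtocol:
    """Stand-in protocol object."""

class StringReader:
    """Adapter presenting a string as a minimal readable object."""
    def __init__(self, s):
        self._s = s
    def read(self):
        return self._s

# Neither str nor FileProtocol knows about the other; a third
# party registers the adaptation non-invasively:
registerAdapterFactory(str, FileProtocol,
                       lambda obj, protocol, alt: StringReader(obj))

factory = _registry[type("some text"), FileProtocol]
reader = factory("some text", FileProtocol, None)
print(reader.read())  # some text
```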
Intended Use
The typical intended use of adapt is in code which has received
some object X "from the outside", either as an argument or as the
result of calling some function, and needs to use that object
according to a certain protocol Y. A "protocol" such as Y is
meant to indicate an interface, usually enriched with some
semantic constraints (such as are typically used in the "design
by contract" approach), and often also some pragmatic
expectations (such as "the running time of a certain operation
should be no worse than O(N)", or the like); this proposal does
not specify how protocols are designed as such, nor how or whether
compliance to a protocol is checked, nor what the consequences may
be of claiming compliance but not actually delivering it (lack of
"syntactic" compliance -- names and signatures of methods -- will
often lead to exceptions being raised; lack of "semantic"
compliance may lead to subtle and perhaps occasional errors
[imagine a method claiming to be threadsafe but being in fact
subject to some subtle race condition, for example]; lack of
"pragmatic" compliance will generally lead to code that runs
``correctly'', but too slowly for practical use, or sometimes to
exhaustion of resources such as memory or disk space).
When protocol Y is a concrete type or class, compliance to it is
intended to mean that an object allows all of the operations that
could be performed on instances of Y, with "comparable" semantics
and pragmatics. For example, a hypothetical object X that is a
singly-linked list should not claim compliance with protocol
'list', even if it implements all of list's methods: the fact that
indexing X[n] takes time O(n), while the same operation would be
O(1) on a list, makes a difference. On the other hand, an
instance of StringIO.StringIO does comply with protocol 'file',
even though some operations (such as those of module 'marshal')
may not allow substituting one for the other because they perform
explicit type-checks: such type-checks are "beyond the pale" from
the point of view of protocol compliance.
While this convention makes it feasible to use a concrete type or
class as a protocol for purposes of this proposal, such use will
often not be optimal. Rarely will the code calling 'adapt' need
ALL of the features of a certain concrete type, particularly for
such rich types as file, list, dict; rarely can all those features
be provided by a wrapper with good pragmatics, as well as syntax
and semantics that are really the same as a concrete type's.
Rather, once this proposal is accepted, a design effort needs to
start to identify the essential characteristics of those protocols
which are currently used in Python, particularly within the
standard library, and to formalize them using some kind of
"interface" construct (not necessarily requiring any new syntax: a
simple custom metaclass would let us get started, and the results
of the effort could later be migrated to whatever "interface"
construct is eventually accepted into the Python language). With
such a palette of more formally designed protocols, the code using
'adapt' will be able to ask for, say, adaptation into "a filelike
object that is readable and seekable", or whatever else it
specifically needs with some decent level of "granularity", rather
than too-generically asking for compliance to the 'file' protocol.
Adaptation is NOT "casting". When object X itself does not
conform to protocol Y, adapting X to Y means using some kind of
wrapper object Z, which holds a reference to X, and implements
whatever operation Y requires, mostly by delegating to X in
appropriate ways. For example, if X is a string and Y is 'file',
the proper way to adapt X to Y is to make a StringIO(X), *NOT* to
call file(X) [which would try to open a file named by X].
Numeric types and protocols may need to be an exception to this
"adaptation is not casting" mantra, however.
Guido's "Optional Static Typing: Stop the Flames" Blog Entry
A typical simple use case of adaptation would be:
def f(X):
    X = adapt(X, Y)
    # continue by using X according to protocol Y
In [4], the BDFL has proposed introducing the syntax:
def f(X: Y):
    # continue by using X according to protocol Y
to be a handy shortcut for exactly this typical use of adapt, and,
as a basis for experimentation until the parser has been modified
to accept this new syntax, a semantically equivalent decorator:
@arguments(Y)
def f(X):
    # continue by using X according to protocol Y
These BDFL ideas are fully compatible with this proposal, as are
Guido's other suggestions in the same blog.
Reference Implementation and Test Cases
The following reference implementation does not deal with classic
classes: it considers only new-style classes. If classic classes
need to be supported, the additions should be pretty clear, though
a bit messy (x.__class__ vs type(x), getting bound methods directly
from the object rather than from the type, and so on).
-----------------------------------------------------------------
adapt.py
-----------------------------------------------------------------
class AdaptationError(TypeError):
    pass

class LiskovViolation(AdaptationError):
    pass

_adapter_factory_registry = {}

def registerAdapterFactory(objtype, protocol, factory):
    _adapter_factory_registry[objtype, protocol] = factory

def unregisterAdapterFactory(objtype, protocol):
    del _adapter_factory_registry[objtype, protocol]

def _adapt_by_registry(obj, protocol, alternate):
    factory = _adapter_factory_registry.get((type(obj), protocol))
    if factory is None:
        adapter = alternate
    else:
        adapter = factory(obj, protocol, alternate)
    if adapter is AdaptationError:
        raise AdaptationError
    else:
        return adapter

def adapt(obj, protocol, alternate=AdaptationError):
    t = type(obj)
    # (a) first check to see if object has the exact protocol
    if t is protocol:
        return obj
    try:
        # (b) next check if t.__conform__ exists & likes protocol
        conform = getattr(t, '__conform__', None)
        if conform is not None:
            result = conform(obj, protocol)
            if result is not None:
                return result
        # (c) then check if protocol.__adapt__ exists & likes obj
        adapt = getattr(type(protocol), '__adapt__', None)
        if adapt is not None:
            result = adapt(protocol, obj)
            if result is not None:
                return result
    except LiskovViolation:
        pass
    else:
        # (d) check if object is instance of protocol
        if isinstance(obj, protocol):
            return obj
    # (e) last chance: try the registry
    return _adapt_by_registry(obj, protocol, alternate)
-----------------------------------------------------------------
test.py
-----------------------------------------------------------------
from adapt import AdaptationError, LiskovViolation, adapt
from adapt import registerAdapterFactory, unregisterAdapterFactory
import doctest
class A(object):
    '''
    >>> a = A()
    >>> a is adapt(a, A)   # case (a)
    True
    '''

class B(A):
    '''
    >>> b = B()
    >>> b is adapt(b, A)   # case (d)
    True
    '''

class C(object):
    '''
    >>> c = C()
    >>> c is adapt(c, B)   # case (b)
    True
    >>> c is adapt(c, A)   # a failure case
    Traceback (most recent call last):
        ...
    AdaptationError
    '''
    def __conform__(self, protocol):
        if protocol is B:
            return self

class D(C):
    '''
    >>> d = D()
    >>> d is adapt(d, D)   # case (a)
    True
    >>> d is adapt(d, C)   # case (d) explicitly blocked
    Traceback (most recent call last):
        ...
    AdaptationError
    '''
    def __conform__(self, protocol):
        if protocol is C:
            raise LiskovViolation

class MetaAdaptingProtocol(type):
    def __adapt__(cls, obj):
        return cls.adapt(obj)

class AdaptingProtocol:
    __metaclass__ = MetaAdaptingProtocol

    @classmethod
    def adapt(cls, obj):
        pass

class E(AdaptingProtocol):
    '''
    >>> a = A()
    >>> a is adapt(a, E)   # case (c)
    True
    >>> b = A()
    >>> b is adapt(b, E)   # case (c)
    True
    >>> c = C()
    >>> c is adapt(c, E)   # a failure case
    Traceback (most recent call last):
        ...
    AdaptationError
    '''
    @classmethod
    def adapt(cls, obj):
        if isinstance(obj, A):
            return obj

class F(object):
    pass

def adapt_F_to_A(obj, protocol, alternate):
    if isinstance(obj, F) and issubclass(protocol, A):
        return obj
    else:
        return alternate

def test_registry():
    '''
    >>> f = F()
    >>> f is adapt(f, A)   # a failure case
    Traceback (most recent call last):
        ...
    AdaptationError
    >>> registerAdapterFactory(F, A, adapt_F_to_A)
    >>> f is adapt(f, A)   # case (e)
    True
    >>> unregisterAdapterFactory(F, A)
    >>> f is adapt(f, A)   # a failure case again
    Traceback (most recent call last):
        ...
    AdaptationError
    >>> registerAdapterFactory(F, A, adapt_F_to_A)
    '''

doctest.testmod()
Relationship To Microsoft's QueryInterface
Although this proposal has some similarities to Microsoft's (COM)
QueryInterface, it differs in a number of respects.
First, adaptation in this proposal is bi-directional, allowing the
interface (protocol) to be queried as well, which gives it a more
dynamic, Pythonic character. Second, there is no special
"IUnknown" interface which can be used to check or obtain the
original unwrapped object identity, although this could be
proposed as one of those "special" blessed interface protocol
identifiers. Third, with QueryInterface, once an object supports
a particular interface it must always thereafter support this
interface; this proposal makes no such guarantee since, in
particular, adapter factories can be dynamically added to the
registry and removed again later.
Fourth, implementations of Microsoft's QueryInterface must support
a kind of equivalence relation -- they must be reflexive,
symmetrical, and transitive, in specific senses. The equivalent
conditions for protocol adaptation according to this proposal
would also represent desirable properties:
# given, to start with, a successful adaptation:
X_as_Y = adapt(X, Y)
# reflexive:
assert adapt(X_as_Y, Y) is X_as_Y
# transitive:
X_as_Z = adapt(X, Z, None)
X_as_Y_as_Z = adapt(X_as_Y, Z, None)
assert (X_as_Y_as_Z is None) == (X_as_Z is None)
# symmetrical:
X_as_Z_as_Y = adapt(X_as_Z, Y, None)
assert (X_as_Y_as_Z is None) == (X_as_Z_as_Y is None)
However, while these properties are desirable, it may not be
possible to guarantee them in all cases. QueryInterface can
impose their equivalents because it dictates, to some extent, how
objects, interfaces, and adapters are to be coded; this proposal
is meant to be non-invasive, usable to "retrofit" adaptation
between two frameworks coded in mutual ignorance of each other,
without having to modify either framework.
Transitivity of adaptation is in fact somewhat controversial, as
is the relationship (if any) between adaptation and inheritance.
The latter would not be controversial if we knew that inheritance
always implies Liskov substitutability, which, unfortunately, we
don't. If some special form, such as the interfaces proposed in
[4], could indeed ensure Liskov substitutability, then for that
kind of inheritance, only, we could perhaps assert that if X
conforms to Y and Y inherits from Z then X conforms to Z... but
only if substitutability was taken in a very strong sense to
include semantics and pragmatics, which seems doubtful. (For what
it's worth: in QueryInterface, inheritance neither requires nor
implies conformance.) This proposal does not include any "strong"
effects of inheritance, beyond the small ones specifically
detailed above.
Similarly, transitivity might imply multiple "internal" adaptation
passes to get the result of adapt(X, Z) via some intermediate Y,
intrinsically like adapt(adapt(X, Y), Z), for some suitable and
automatically chosen Y. Again, this may perhaps be feasible under
suitably strong constraints, but the practical implications of
such a scheme are still unclear to this proposal's authors. Thus,
this proposal does not include any automatic or implicit
transitivity of adaptation, under any circumstances.
For an implementation of the original version of this proposal
which performs more advanced processing in terms of transitivity,
and of the effects of inheritance, see Phillip J. Eby's
PyProtocols [5]. The documentation accompanying PyProtocols is
well worth studying for its considerations on how adapters should
be coded and used, and on how adaptation can remove any need for
typechecking in application code.
Questions and Answers
Q: What benefit does this proposal provide?
A: The typical Python programmer is an integrator, someone who is
connecting components from various suppliers. Often, to
interface between these components, one needs intermediate
adapters. Usually the burden falls upon the programmer to
study the interface exposed by one component and required by
another, determine if they are directly compatible, or develop
an adapter. Sometimes a supplier may even include the
appropriate adapter, but even then searching for the adapter
and figuring out how to deploy the adapter takes time.
This technique enables suppliers to work with each other
directly, by implementing __conform__ or __adapt__ as
necessary. This frees the integrator from making their own
adapters. In essence, this allows the components to have a
simple dialogue among themselves. The integrator simply
connects one component to another, and if the types don't
automatically match, an adapting mechanism is built in.
Moreover, thanks to the adapter registry, a "fourth party" may
supply adapters to allow interoperation of frameworks which
are totally unaware of each other, non-invasively, and without
requiring the integrator to do anything more than install the
appropriate adapter factories in the registry at start-up.
As long as libraries and frameworks cooperate with the
adaptation infrastructure proposed here (essentially by
defining and using protocols appropriately, and calling
'adapt' as needed on arguments received and results of
call-back factory functions), the integrator's work thereby
becomes much simpler.
For example, consider SAX1 and SAX2 interfaces: there is an
adapter required to switch between them. Normally, the
programmer must be aware of this; however, with this
adaptation proposal in place, this is no longer the case --
indeed, thanks to the adapter registry, this need may be
removed even if the framework supplying SAX1 and the one
requiring SAX2 are unaware of each other.
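A toy sketch of this supplier-side "dialogue" follows. All class names here are hypothetical, and the adapt() shown is a pared-down stand-in for the full reference implementation, kept only to make the sketch self-contained:

```python
# Pared-down adapt(): checks the supplier's __conform__, then falls
# back to plain isinstance; see the reference implementation for the
# full protocol.
def adapt(obj, protocol):
    conform = getattr(type(obj), '__conform__', None)
    if conform is not None:
        result = conform(obj, protocol)
        if result is not None:
            return result
    if isinstance(obj, protocol):
        return obj
    raise TypeError("adaptation failed")

class Sax2Reader(object):       # protocol the consuming framework requires
    pass

class Sax2Adapter(Sax2Reader):  # adapter wrapping the legacy component
    def __init__(self, wrapped):
        self.wrapped = wrapped

class LegacyParser(object):     # component from another supplier
    def __conform__(self, protocol):
        if protocol is Sax2Reader:
            return Sax2Adapter(self)

parser = LegacyParser()
reader = adapt(parser, Sax2Reader)  # the integrator just calls adapt()
```

The integrator never writes an adapter: LegacyParser volunteers one through __conform__, and adapt() hands back a Sax2Reader that wraps the original object.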
Q: Why does this have to be built-in, can't it be standalone?
A: Yes, it does work standalone. However, if it is built-in, it
has a greater chance of usage. The value of this proposal is
primarily in standardization: having libraries and frameworks
coming from different suppliers, including the Python standard
library, use a single approach to adaptation. Furthermore:
1. The mechanism is by its very nature a singleton.
2. If used frequently, it will be much faster as a built-in.
3. It is extensible and unassuming.
4. Once 'adapt' is built-in, it can support syntax extensions
   and even be of some help to a type inference system.
Q: Why the verbs __conform__ and __adapt__?
A: conform, verb intransitive
1. To correspond in form or character; be similar.
2. To act or be in accord or agreement; comply.
3. To act in accordance with current customs or modes.
adapt, verb transitive
1. To make suitable to or fit for a specific use or
situation.
Source: The American Heritage Dictionary of the English
Language, Third Edition
Backwards Compatibility
There should be no problem with backwards compatibility unless
someone had used the special names __conform__ or __adapt__ in
other ways, but this seems unlikely, and, in any case, user code
should never use special names for non-standard purposes.
This proposal could be implemented and tested without changes to
the interpreter.
Credits
This proposal was created in large part by the feedback of the
talented individuals on the main Python mailing lists and the
type-sig list. To name specific contributors (with apologies if
we missed anyone!), besides the proposal's authors: the main
suggestions for the proposal's first versions came from Paul
Prescod, with significant feedback from Robin Thomas, and we also
borrowed ideas from Marcin 'Qrczak' Kowalczyk and Carlos Ribeiro.
Other contributors (via comments) include Michel Pelletier, Jeremy
Hylton, Aahz Maruch, Fredrik Lundh, Rainer Deyke, Timothy Delaney,
and Huaiyu Zhu. The current version owes a lot to discussions
with (among others) Phillip J. Eby, Guido van Rossum, Bruce Eckel,
Jim Fulton, and Ka-Ping Yee, and to study and reflection of their
proposals, implementations, and documentation about use and
adaptation of interfaces and protocols in Python.
References and Footnotes
[1] PEP 245, Python Interface Syntax, Pelletier
http://www.python.org/dev/peps/pep-0245/
[2] http://www.zope.org/Wikis/Interfaces/FrontPage
[3] http://www.artima.com/weblogs/index.jsp?blogger=guido
[4] http://www.artima.com/weblogs/viewpost.jsp?thread=87182
[5] http://peak.telecommunity.com/PyProtocols.html
Copyright
This document has been placed in the public domain.
pep-0247 API for Cryptographic Hash Functions
| PEP: | 247 |
|---|---|
| Title: | API for Cryptographic Hash Functions |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | A.M. Kuchling <amk at amk.ca> |
| Status: | Final |
| Type: | Informational |
| Created: | 23-Mar-2001 |
| Post-History: | 20-Sep-2001 |
Abstract
There are several different modules available that implement
cryptographic hashing algorithms such as MD5 or SHA. This
document specifies a standard API for such algorithms, to make it
easier to switch between different implementations.
Specification
All hashing modules should present the same interface. Additional
methods or variables can be added, but those described in this
document should always be present.
Hash function modules define one function:
new([string]) (unkeyed hashes)
new([key] , [string]) (keyed hashes)
Create a new hashing object and return it. The first form is
for hashes that are unkeyed, such as MD5 or SHA. For keyed
hashes such as HMAC, 'key' is a required parameter containing
a string giving the key to use. In both cases, the optional
'string' parameter, if supplied, will be immediately hashed
into the object's starting state, as if obj.update(string) was
called.
After creating a hashing object, arbitrary strings can be fed
into the object using its update() method, and the hash value
can be obtained at any time by calling the object's digest()
method.
Arbitrary additional keyword arguments can be added to this
function, but if they're not supplied, sensible default values
should be used. For example, 'rounds' and 'digest_size'
keywords could be added for a hash function which supports a
variable number of rounds and several different output sizes,
and they should default to values believed to be secure.
Hash function modules define one variable:
digest_size
An integer value; the size of the digest produced by the
hashing objects created by this module, measured in bytes.
You could also obtain this value by creating a sample object
and accessing its 'digest_size' attribute, but it can be
convenient to have this value available from the module.
Hashes with a variable output size will set this variable to
None.
Hashing objects require a single attribute:
digest_size
This attribute is identical to the module-level digest_size
variable, measuring the size, in bytes, of the digest produced
by the hashing object. If the hash has a variable
output size, this output size must be chosen when the hashing
object is created, and this attribute must contain the
selected size. Therefore None is *not* a legal value for this
attribute.
Hashing objects require the following methods:
copy()
Return a separate copy of this hashing object. An update to
this copy won't affect the original object.
digest()
Return the hash value of this hashing object as a string
containing 8-bit data. The object is not altered in any way
by this function; you can continue updating the object after
calling this function.
hexdigest()
Return the hash value of this hashing object as a string
containing hexadecimal digits. Lowercase letters should be used
for the digits 'a' through 'f'. Like the .digest() method, this
method mustn't alter the object.
update(string)
Hash 'string' into the current state of the hashing object.
update() can be called any number of times during a hashing
object's lifetime.
Hashing modules can define additional module-level functions or
object methods and still be compliant with this specification.
Here's an example, using a module named 'MD5':
>>> from Crypto.Hash import MD5
>>> m = MD5.new()
>>> m.digest_size
16
>>> m.update('abc')
>>> m.digest()
'\x90\x01P\x98<\xd2O\xb0\xd6\x96?}(\xe1\x7fr'
>>> m.hexdigest()
'900150983cd24fb0d6963f7d28e17f72'
>>> MD5.new('abc').digest()
'\x90\x01P\x98<\xd2O\xb0\xd6\x96?}(\xe1\x7fr'
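For comparison, an object with this interface can be layered over the standard library's hashlib. This is an illustrative sketch only: the class and helper names are assumptions for the example, not part of the specification.

```python
import hashlib

class _Sha1Hash(object):
    """Minimal PEP 247-style hashing object wrapping hashlib's SHA-1."""
    digest_size = 20  # SHA-1 digests are 20 bytes

    def __init__(self, data=b''):
        self._h = hashlib.sha1(data)

    def update(self, data):
        # hash 'data' into the current state
        self._h.update(data)

    def digest(self):
        # return the digest without altering the object
        return self._h.digest()

    def hexdigest(self):
        return self._h.hexdigest()

    def copy(self):
        # a separate copy: updates to it won't affect the original
        clone = _Sha1Hash()
        clone._h = self._h.copy()
        return clone

def new(string=b''):
    """Module-level constructor for an unkeyed hash."""
    return _Sha1Hash(string)

digest_size = _Sha1Hash.digest_size
```

Note how copy() lets a caller hash a common prefix once and then branch into several continuations without re-hashing the prefix.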
Rationale
The digest size is measured in bytes, not bits, even though hash
algorithm sizes are usually quoted in bits; MD5 is a 128-bit
algorithm and not a 16-byte one, for example. This is because, in
the sample code I looked at, the length in bytes is often needed
(to seek ahead or behind in a file; to compute the length of an
output string) while the length in bits is rarely used.
Therefore, the burden will fall on the few people actually needing
the size in bits, who will have to multiply digest_size by 8.
It's been suggested that the update() method would be better named
append(). However, that method is really causing the current
state of the hashing object to be updated, and update() is already
used by the md5 and sha modules included with Python, so it seems
simplest to leave the name update() alone.
The order of the constructor's arguments for keyed hashes was a
sticky issue. It wasn't clear whether the key should come first
or second. It's a required parameter, and the usual convention is
to place required parameters first, but that also means that the
'string' parameter moves from the first position to the second.
It would be possible to get confused and pass a single argument to
a keyed hash, thinking that you're passing an initial string to an
unkeyed hash, but it doesn't seem worth making the interface
for keyed hashes more obscure to avoid this potential error.
Changes
2001-09-17: Renamed clear() to reset(); added digest_size attribute
to objects; added .hexdigest() method.
2001-09-20: Removed reset() method completely.
2001-09-28: Set digest_size to None for variable-size hashes.
Acknowledgements
Thanks to Aahz, Andrew Archibald, Rich Salz, Itamar
Shtull-Trauring, and the readers of the python-crypto list for
their comments on this PEP.
Copyright
This document has been placed in the public domain.
pep-0248 Python Database API Specification v1.0
| PEP: | 248 |
|---|---|
| Title: | Python Database API Specification v1.0 |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Marc-André Lemburg <mal at lemburg.com> |
| Discussions-To: | <db-sig at python.org> |
| Status: | Final |
| Type: | Informational |
| Created: | |
| Post-History: | |
| Superseded-By: | 249 |
Introduction
This API has been defined to encourage similarity between the
Python modules that are used to access databases. By doing this,
we hope to achieve a consistency leading to more easily understood
modules, code that is generally more portable across databases,
and a broader reach of database connectivity from Python.
This interface specification consists of several items:
* Module Interface
* Connection Objects
* Cursor Objects
* DBI Helper Objects
Comments and questions about this specification may be directed to
the SIG on Tabular Databases in Python
(http://www.python.org/sigs/db-sig).
This specification document was last updated on: April 9, 1996.
It will be known as Version 1.0 of this specification.
Module Interface
The database interface modules should typically be named with
something terminated by 'db'. Existing examples are: 'oracledb',
'informixdb', and 'pg95db'. These modules should export several
names:
modulename(connection_string)
Constructor for creating a connection to the database.
Returns a Connection Object.
error
Exception raised for errors from the database module.
Connection Objects
Connection Objects should respond to the following methods:
close()
Close the connection now (rather than whenever __del__ is
called). The connection will be unusable from this point
forward; an exception will be raised if any operation is
attempted with the connection.
commit()
Commit any pending transaction to the database.
rollback()
Roll the database back to the start of any pending
transaction.
cursor()
Return a new Cursor Object. An exception may be thrown if
the database does not support a cursor concept.
callproc([params])
(Note: this method is not well-defined yet.) Call a
stored database procedure with the given (optional)
parameters. Returns the result of the stored procedure.
(all Cursor Object attributes and methods)
For databases that do not have cursors and for simple
applications that do not require the complexity of a
cursor, a Connection Object should respond to each of the
attributes and methods of the Cursor Object. Databases
that have cursors can implement this by using an implicit,
internal cursor.
Cursor Objects
These objects represent a database cursor, which is used to manage
the context of a fetch operation.
Cursor Objects should respond to the following methods and
attributes:
arraysize
This read/write attribute specifies the number of rows to
fetch at a time with fetchmany(). This value is also used
when inserting multiple rows at a time (passing a
tuple/list of tuples/lists as the params value to
execute()). This attribute will default to a single row.
Note that the arraysize is optional and is merely provided
for higher performance database interactions.
Implementations should observe it with respect to the
fetchmany() method, but are free to interact with the
database a single row at a time.
description
This read-only attribute is a tuple of 7-tuples. Each
7-tuple contains information describing each result
column: (name, type_code, display_size, internal_size,
precision, scale, null_ok). This attribute will be None
for operations that do not return rows or if the cursor
has not had an operation invoked via the execute() method
yet.
The 'type_code' is one of the 'dbi' values specified in
the section below.
Note: this is a bit in flux. Generally, the first two
items of the 7-tuple will always be present; the others
may be database specific.
close()
Close the cursor now (rather than whenever __del__ is
called). The cursor will be unusable from this point
forward; an exception will be raised if any operation is
attempted with the cursor.
execute(operation [,params])
Execute (prepare) a database operation (query or command).
Parameters may be provided (as a sequence
(e.g. tuple/list)) and will be bound to variables in the
operation. Variables are specified in a database-specific
notation that is based on the index in the parameter tuple
(position-based rather than name-based).
The parameters may also be specified as a sequence of
sequences (e.g. a list of tuples) to insert multiple rows
in a single operation.
A reference to the operation will be retained by the
cursor. If the same operation object is passed in again,
then the cursor can optimize its behavior. This is most
effective for algorithms where the same operation is used,
but different parameters are bound to it (many times).
For maximum efficiency when reusing an operation, it is
best to use the setinputsizes() method to specify the
parameter types and sizes ahead of time. It is legal for
a parameter to not match the predefined information; the
implementation should compensate, possibly with a loss of
efficiency.
Using SQL terminology, these are the possible result
values from the execute() method:
If the statement is DDL (e.g. CREATE TABLE), then 1 is
returned.
If the statement is DML (e.g. UPDATE or INSERT), then the
number of rows affected is returned (0 or a positive
integer).
If the statement is DQL (e.g. SELECT), None is returned,
indicating that the statement is not really complete until
you use one of the 'fetch' methods.
fetchone()
Fetch the next row of a query result, returning a single
tuple.
fetchmany([size])
Fetch the next set of rows of a query result, returning as
a list of tuples. An empty list is returned when no more
rows are available. The number of rows to fetch is
specified by the parameter. If it is None, then the
cursor's arraysize determines the number of rows to be
fetched.
Note there are performance considerations involved with
the size parameter. For optimal performance, it is
usually best to use the arraysize attribute. If the size
parameter is used, then it is best for it to retain the
same value from one fetchmany() call to the next.
fetchall()
Fetch all rows of a query result, returning as a list of
tuples. Note that the cursor's arraysize attribute can
affect the performance of this operation.
setinputsizes(sizes)
(Note: this method is not well-defined yet.) This can be
used before a call to 'execute()' to predefine memory
areas for the operation's parameters. sizes is specified
as a tuple -- one item for each input parameter. The item
should be a Type object that corresponds to the input that
will be used, or it should be an integer specifying the
maximum length of a string parameter. If the item is
'None', then no predefined memory area will be reserved
for that column (this is useful to avoid predefined areas
for large inputs).
This method would be used before the execute() method is
invoked.
Note that this method is optional and is merely provided
for higher performance database interaction.
Implementations are free to do nothing and users are free
to not use it.
setoutputsize(size [,col])
(Note: this method is not well-defined yet.)
Set a column buffer size for fetches of large columns
(e.g. LONG). The column is specified as an index into the
result tuple. Using a column of None will set the default
size for all large columns in the cursor.
This method would be used before the 'execute()' method is
invoked.
Note that this method is optional and is merely provided
for higher performance database interaction.
Implementations are free to do nothing and users are free
to not use it.
DBI Helper Objects
Many databases need to have the input in a particular format for
binding to an operation's input parameters. For example, if an
input is destined for a DATE column, then it must be bound to the
database in a particular string format. Similar problems exist
for "Row ID" columns or large binary items (e.g. blobs or RAW
columns). This presents problems for Python since the parameters
to the 'execute()' method are untyped. When the database module
sees a Python string object, it doesn't know if it should be bound
as a simple CHAR column, as a raw binary item, or as a DATE.
To overcome this problem, the 'dbi' module was created. This
module specifies some basic database interface types for working
with databases. There are two classes: 'dbiDate' and 'dbiRaw'.
These are simple container classes that wrap up a value. When
passed to the database modules, the module can then detect that
the input parameter is intended as a DATE or a RAW. For symmetry,
the database modules will return DATE and RAW columns as instances
of these classes.
A Cursor Object's 'description' attribute returns information
about each of the result columns of a query. The 'type_code' is
defined to be one of five types exported by this module: 'STRING',
'RAW', 'NUMBER', 'DATE', or 'ROWID'.
The module exports the following names:
dbiDate(value)
This function constructs a 'dbiDate' instance that holds a
date value. The value should be specified as an integer
number of seconds since the "epoch" (e.g. time.time()).
dbiRaw(value)
This function constructs a 'dbiRaw' instance that holds a
raw (binary) value. The value should be specified as a
Python string.
STRING
This object is used to describe columns in a database that
are string-based (e.g. CHAR).
RAW
This object is used to describe (large) binary columns in
a database (e.g. LONG RAW, blobs).
NUMBER
This object is used to describe numeric columns in a
database.
DATE
This object is used to describe date columns in a
database.
ROWID
This object is used to describe the "Row ID" column in a
database.
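The two container classes described above are deliberately trivial wrappers. A sketch follows; these definitions are illustrative only, not the actual 'dbi' module shipped with any particular database interface:

```python
import time

class dbiDate(object):
    """Container marking a value as destined for a DATE column."""
    def __init__(self, value):
        self.value = value  # seconds since the epoch, e.g. time.time()

class dbiRaw(object):
    """Container marking a value as raw binary data (e.g. for RAW/blob)."""
    def __init__(self, value):
        self.value = value

stamp = dbiDate(time.time())  # wrap "now" for a DATE column
blob = dbiRaw(b'\x00\x01')    # wrap binary data for a RAW column
```

The classes carry no behavior of their own; their type is the signal that tells the database module how to bind the wrapped value.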
Acknowledgements
Many thanks go to Andrew Kuchling who converted the Python
Database API Specification 1.0 from the original HTML format into
the PEP format.
Copyright
This document has been placed in the Public Domain.
pep-0249 Python Database API Specification v2.0
| PEP: | 249 |
|---|---|
| Title: | Python Database API Specification v2.0 |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | mal at lemburg.com (Marc-André Lemburg) |
| Discussions-To: | db-sig at python.org |
| Status: | Final |
| Type: | Informational |
| Content-Type: | text/x-rst |
| Created: | |
| Post-History: | |
| Replaces: | 248 |
Contents
- Introduction
- Module Interface
- Connection Objects
- Cursor Objects
- Type Objects and Constructors
- Implementation Hints for Module Authors
- Optional DB API Extensions
- Optional Error Handling Extensions
- Optional Two-Phase Commit Extensions
- Frequently Asked Questions
- Major Changes from Version 1.0 to Version 2.0
- Open Issues
- Footnotes
- Acknowledgements
- Copyright
Introduction
This API has been defined to encourage similarity between the Python modules that are used to access databases. By doing this, we hope to achieve a consistency leading to more easily understood modules, code that is generally more portable across databases, and a broader reach of database connectivity from Python.
Comments and questions about this specification may be directed to the SIG for Database Interfacing with Python.
For more information on database interfacing with Python and available packages see the Database Topic Guide.
This document describes the Python Database API Specification 2.0 and a set of common optional extensions. The previous version, 1.0, is still available as a reference in PEP 248. Package writers are encouraged to use this version of the specification as the basis for new interfaces.
Module Interface
Constructors
Access to the database is made available through connection objects. The module must provide the following constructor for these:
- connect( parameters... )
Constructor for creating a connection to the database.
Returns a Connection Object. It takes a number of parameters which are database dependent. [1]
Globals
These module globals must be defined:
- apilevel
String constant stating the supported DB API level.
Currently only the strings "1.0" and "2.0" are allowed. If not given, a DB-API 1.0 level interface should be assumed.
- threadsafety
Integer constant stating the level of thread safety the interface supports. Possible values are:
| threadsafety | Meaning |
|---|---|
| 0 | Threads may not share the module. |
| 1 | Threads may share the module, but not connections. |
| 2 | Threads may share the module and connections. |
| 3 | Threads may share the module, connections and cursors. |

Sharing in the above context means that two threads may use a resource without wrapping it using a mutex semaphore to implement resource locking. Note that you cannot always make external resources thread safe by managing access using a mutex: the resource may rely on global variables or other external sources that are beyond your control.
- paramstyle
String constant stating the type of parameter marker formatting expected by the interface. Possible values are [2]:
| paramstyle | Meaning |
|---|---|
| qmark | Question mark style, e.g. ...WHERE name=? |
| numeric | Numeric, positional style, e.g. ...WHERE name=:1 |
| named | Named style, e.g. ...WHERE name=:name |
| format | ANSI C printf format codes, e.g. ...WHERE name=%s |
| pyformat | Python extended format codes, e.g. ...WHERE name=%(name)s |
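The standard library's sqlite3 package is a DB-API 2.0 module, so its globals can illustrate the three constants above (the table name and data here are made up for the example):

```python
import sqlite3

# The three required module globals.
print(sqlite3.apilevel)      # "2.0"
print(sqlite3.paramstyle)    # "qmark"
print(sqlite3.threadsafety)  # an integer from the threadsafety table

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (name TEXT)")
# qmark style: a '?' marker bound positionally from the parameter sequence
cur.execute("INSERT INTO users VALUES (?)", ("alice",))
row = cur.execute("SELECT name FROM users").fetchone()
```

Binding parameters through the marker, rather than interpolating them into the SQL string, is what lets the module quote values safely and reuse prepared operations.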
Exceptions
The module should make all error information available through these exceptions or subclasses thereof:
- Warning
- Exception raised for important warnings like data truncations while inserting, etc. It must be a subclass of the Python StandardError (defined in the module exceptions).
- Error
- Exception that is the base class of all other error exceptions. You can use this to catch all errors with one single except statement. Warnings are not considered errors and thus should not use this class as base. It must be a subclass of the Python StandardError (defined in the module exceptions).
- InterfaceError
- Exception raised for errors that are related to the database interface rather than the database itself. It must be a subclass of Error.
- DatabaseError
- Exception raised for errors that are related to the database. It must be a subclass of Error.
- DataError
- Exception raised for errors that are due to problems with the processed data like division by zero, numeric value out of range, etc. It must be a subclass of DatabaseError.
- OperationalError
- Exception raised for errors that are related to the database's operation and not necessarily under the control of the programmer, e.g. an unexpected disconnect occurs, the data source name is not found, a transaction could not be processed, a memory allocation error occurred during processing, etc. It must be a subclass of DatabaseError.
- IntegrityError
- Exception raised when the relational integrity of the database is affected, e.g. a foreign key check fails. It must be a subclass of DatabaseError.
- InternalError
- Exception raised when the database encounters an internal error, e.g. the cursor is not valid anymore, the transaction is out of sync, etc. It must be a subclass of DatabaseError.
- ProgrammingError
- Exception raised for programming errors, e.g. table not found or already exists, syntax error in the SQL statement, wrong number of parameters specified, etc. It must be a subclass of DatabaseError.
- NotSupportedError
- Exception raised in case a method or database API was used which is not supported by the database, e.g. requesting a .rollback() on a connection that does not support transaction or has transactions turned off. It must be a subclass of DatabaseError.
This is the exception inheritance layout:
StandardError
|__Warning
|__Error
|__InterfaceError
|__DatabaseError
|__DataError
|__OperationalError
|__IntegrityError
|__InternalError
|__ProgrammingError
|__NotSupportedError
Note
The values of these exceptions are not defined. They should give the user a fairly good idea of what went wrong, though.
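The practical consequence of this hierarchy is that a single except clause on Error catches every database error while letting warnings pass. A minimal sketch (class names are from this specification; Exception is used as the root since StandardError exists only in Python 2, and insert_duplicate_key is a hypothetical operation):

```python
# Minimal slice of the DB API exception hierarchy.
class Error(Exception): pass
class DatabaseError(Error): pass
class IntegrityError(DatabaseError): pass

def insert_duplicate_key():
    # Hypothetical operation that violates a unique constraint.
    raise IntegrityError("duplicate key")

try:
    insert_duplicate_key()
except Error as exc:
    # The base class catches IntegrityError via DatabaseError.
    caught = type(exc).__name__
```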
Connection Objects
Connection objects should respond to the following methods.
Connection methods
- .close()
Close the connection now (rather than whenever .__del__() is called).
The connection will be unusable from this point forward; an Error (or subclass) exception will be raised if any operation is attempted with the connection. The same applies to all cursor objects trying to use the connection. Note that closing a connection without committing the changes first will cause an implicit rollback to be performed.
- .commit()
Commit any pending transaction to the database.
Note that if the database supports an auto-commit feature, this must be initially off. An interface method may be provided to turn it back on.
Database modules that do not support transactions should implement this method with void functionality.
- .rollback()
This method is optional since not all databases provide transaction support. [3]
In case a database does provide transactions this method causes the database to roll back to the start of any pending transaction. Closing a connection without committing the changes first will cause an implicit rollback to be performed.
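The usual pattern built on these methods is sketched below using the standard library's sqlite3 module, one DB API 2.0 implementation (it postdates this specification; table and column names are illustrative). As the specification requires, auto-commit is off, so changes only become permanent on .commit():

```python
import sqlite3  # a DB API 2.0 implementation shipped with Python

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t (x INTEGER)")
try:
    cur.execute("INSERT INTO t VALUES (1)")
    con.commit()                 # make the pending transaction permanent
except sqlite3.Error:
    con.rollback()               # undo the pending transaction on error
cur.execute("SELECT x FROM t")
rows = cur.fetchall()
con.close()
```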
Cursor Objects
These objects represent a database cursor, which is used to manage the context of a fetch operation. Cursors created from the same connection are not isolated, i.e., any changes done to the database by a cursor are immediately visible to the other cursors. Cursors created from different connections may or may not be isolated, depending on how the transaction support is implemented (see also the connection's .rollback() and .commit() methods).
Cursor Objects should respond to the following methods and attributes.
Cursor attributes
- .description
This read-only attribute is a sequence of 7-item sequences.
Each of these sequences contains information describing one result column:
- name
- type_code
- display_size
- internal_size
- precision
- scale
- null_ok
The first two items (name and type_code) are mandatory, the other five are optional and are set to None if no meaningful values can be provided.
This attribute will be None for operations that do not return rows or if the cursor has not had an operation invoked via the .execute*() method yet.
The type_code can be interpreted by comparing it to the Type Objects specified in the section below.
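As an illustration, the sqlite3 module (a DB API 2.0 implementation, used here for convenience) returns the mandated 7-item sequences but fills in only the column name, leaving the remaining items as None (including, in its case, type_code):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE people (name TEXT, age INTEGER)")
cur.execute("SELECT name, age FROM people")
# One 7-item sequence per result column; item 0 is the column name.
names = [d[0] for d in cur.description]
lengths = {len(d) for d in cur.description}
con.close()
```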
- .rowcount
This read-only attribute specifies the number of rows that the last .execute*() produced (for DQL statements like SELECT) or affected (for DML statements like UPDATE or INSERT). [9]
The attribute is -1 in case no .execute*() has been performed on the cursor or the rowcount of the last operation cannot be determined by the interface. [7]
Note
Future versions of the DB API specification could redefine the latter case to have the object return None instead of -1.
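Both cases can be observed with the sqlite3 module (a DB API 2.0 implementation, used here for convenience): the rowcount of an UPDATE is known, while that of a SELECT cannot be determined before fetching and is reported as -1:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t (x INTEGER)")
for x in (1, 2, 3):
    cur.execute("INSERT INTO t VALUES (?)", (x,))
cur.execute("UPDATE t SET x = 0 WHERE x > 1")
updated = cur.rowcount     # rows affected by the UPDATE
cur.execute("SELECT x FROM t")
selected = cur.rowcount    # not determinable for SELECT: -1
con.close()
```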
Cursor methods
- .callproc( procname [, parameters ] )
(This method is optional since not all databases provide stored procedures. [3])
Call a stored database procedure with the given name. The sequence of parameters must contain one entry for each argument that the procedure expects. The result of the call is returned as a modified copy of the input sequence. Input parameters are left untouched; output and input/output parameters are replaced with possibly new values.
The procedure may also provide a result set as output. This must then be made available through the standard .fetch*() methods.
- .close()
Close the cursor now (rather than whenever __del__ is called).
The cursor will be unusable from this point forward; an Error (or subclass) exception will be raised if any operation is attempted with the cursor.
- .execute(operation [, parameters])
Prepare and execute a database operation (query or command).
Parameters may be provided as sequence or mapping and will be bound to variables in the operation. Variables are specified in a database-specific notation (see the module's paramstyle attribute for details). [5]
A reference to the operation will be retained by the cursor. If the same operation object is passed in again, then the cursor can optimize its behavior. This is most effective for algorithms where the same operation is used, but different parameters are bound to it (many times).
For maximum efficiency when reusing an operation, it is best to use the .setinputsizes() method to specify the parameter types and sizes ahead of time. It is legal for a parameter to not match the predefined information; the implementation should compensate, possibly with a loss of efficiency.
The parameters may also be specified as a list of tuples to e.g. insert multiple rows in a single operation, but this kind of usage is deprecated: .executemany() should be used instead.
Return values are not defined.
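A short sketch of both parameter forms, using the sqlite3 module (a DB API 2.0 implementation supporting the qmark and named paramstyles; table and column names are illustrative). Note that the driver binds the values, so the embedded apostrophe needs no escaping:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE people (name TEXT, age INTEGER)")
# Sequence parameters bound to qmark placeholders.
cur.execute("INSERT INTO people VALUES (?, ?)", ("O'Brien", 42))
# Mapping parameters bound to named placeholders.
cur.execute("SELECT age FROM people WHERE name = :name", {"name": "O'Brien"})
age = cur.fetchone()[0]
con.close()
```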
- .executemany( operation, seq_of_parameters )
Prepare a database operation (query or command) and then execute it against all parameter sequences or mappings found in the sequence seq_of_parameters.
Modules are free to implement this method using multiple calls to the .execute() method or by using array operations to have the database process the sequence as a whole in one call.
Use of this method for an operation which produces one or more result sets constitutes undefined behavior, and the implementation is permitted (but not required) to raise an exception when it detects that a result set has been created by an invocation of the operation.
The same comments as for .execute() also apply accordingly to this method.
Return values are not defined.
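A minimal sketch using the sqlite3 module (a DB API 2.0 implementation, used here for convenience): the operation is prepared once and executed against each parameter sequence in turn.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t (x INTEGER)")
# One operation, three parameter sequences -> three inserted rows.
cur.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
cur.execute("SELECT COUNT(*) FROM t")
count = cur.fetchone()[0]
con.close()
```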
- .fetchone()
Fetch the next row of a query result set, returning a single sequence, or None when no more data is available. [6]
An Error (or subclass) exception is raised if the previous call to .execute*() did not produce any result set or no call was issued yet.
- .fetchmany([size=cursor.arraysize])
Fetch the next set of rows of a query result, returning a sequence of sequences (e.g. a list of tuples). An empty sequence is returned when no more rows are available.
The number of rows to fetch per call is specified by the parameter. If it is not given, the cursor's arraysize determines the number of rows to be fetched. The method should try to fetch as many rows as indicated by the size parameter. If this is not possible due to the specified number of rows not being available, fewer rows may be returned.
An Error (or subclass) exception is raised if the previous call to .execute*() did not produce any result set or no call was issued yet.
Note there are performance considerations involved with the size parameter. For optimal performance, it is usually best to use the .arraysize attribute. If the size parameter is used, then it is best for it to retain the same value from one .fetchmany() call to the next.
- .fetchall()
Fetch all (remaining) rows of a query result, returning them as a sequence of sequences (e.g. a list of tuples). Note that the cursor's arraysize attribute can affect the performance of this operation.
An Error (or subclass) exception is raised if the previous call to .execute*() did not produce any result set or no call was issued yet.
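The three fetch methods can be combined freely on one result set; a sketch with the sqlite3 module (a DB API 2.0 implementation, used here for convenience):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t (x INTEGER)")
cur.executemany("INSERT INTO t VALUES (?)", [(x,) for x in range(5)])
cur.execute("SELECT x FROM t ORDER BY x")
first = cur.fetchone()     # a single row sequence
pair = cur.fetchmany(2)    # up to 2 further rows
rest = cur.fetchall()      # all remaining rows
done = cur.fetchone()      # None once the result set is exhausted
con.close()
```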
- .nextset()
(This method is optional since not all databases support multiple result sets. [3])
This method will make the cursor skip to the next available set, discarding any remaining rows from the current set.
If there are no more sets, the method returns None. Otherwise, it returns a true value and subsequent calls to the .fetch*() methods will return rows from the next result set.
An Error (or subclass) exception is raised if the previous call to .execute*() did not produce any result set or no call was issued yet.
- .arraysize
This read/write attribute specifies the number of rows to fetch at a time with .fetchmany(). It defaults to 1 meaning to fetch a single row at a time.
Implementations must observe this value with respect to the .fetchmany() method, but are free to interact with the database a single row at a time. It may also be used in the implementation of .executemany().
- .setinputsizes(sizes)
This can be used before a call to .execute*() to predefine memory areas for the operation's parameters.
sizes is specified as a sequence — one item for each input parameter. The item should be a Type Object that corresponds to the input that will be used, or it should be an integer specifying the maximum length of a string parameter. If the item is None, then no predefined memory area will be reserved for that column (this is useful to avoid predefined areas for large inputs).
This method would be used before the .execute*() method is invoked.
Implementations are free to have this method do nothing and users are free to not use it.
- .setoutputsize(size [, column])
Set a column buffer size for fetches of large columns (e.g. LONGs, BLOBs, etc.). The column is specified as an index into the result sequence. Not specifying the column will set the default size for all large columns in the cursor.
This method would be used before the .execute*() method is invoked.
Implementations are free to have this method do nothing and users are free to not use it.
Type Objects and Constructors
Many databases need to have the input in a particular format for binding to an operation's input parameters. For example, if an input is destined for a DATE column, then it must be bound to the database in a particular string format. Similar problems exist for "Row ID" columns or large binary items (e.g. blobs or RAW columns). This presents problems for Python since the parameters to the .execute*() method are untyped. When the database module sees a Python string object, it doesn't know if it should be bound as a simple CHAR column, as a raw BINARY item, or as a DATE.
To overcome this problem, a module must provide the constructors defined below to create objects that can hold special values. When passed to the cursor methods, the module can then detect the proper type of the input parameter and bind it accordingly.
A Cursor Object's description attribute returns information about each of the result columns of a query. The type_code must compare equal to one of the Type Objects defined below. Type Objects may be equal to more than one type code (e.g. DATETIME could be equal to the type codes for date, time and timestamp columns; see the Implementation Hints below for details).
The module exports the following constructors and singletons:
- Date(year, month, day)
- This function constructs an object holding a date value.
- Time(hour, minute, second)
- This function constructs an object holding a time value.
- Timestamp(year, month, day, hour, minute, second)
- This function constructs an object holding a time stamp value.
- DateFromTicks(ticks)
- This function constructs an object holding a date value from the given ticks value (number of seconds since the epoch; see the documentation of the standard Python time module for details).
- TimeFromTicks(ticks)
- This function constructs an object holding a time value from the given ticks value (number of seconds since the epoch; see the documentation of the standard Python time module for details).
- TimestampFromTicks(ticks)
- This function constructs an object holding a time stamp value from the given ticks value (number of seconds since the epoch; see the documentation of the standard Python time module for details).
- Binary(string)
- This function constructs an object capable of holding a binary (long) string value.
- STRING type
- This type object is used to describe columns in a database that are string-based (e.g. CHAR).
- BINARY type
- This type object is used to describe (long) binary columns in a database (e.g. LONG, RAW, BLOBs).
- NUMBER type
- This type object is used to describe numeric columns in a database.
- DATETIME type
- This type object is used to describe date/time columns in a database.
- ROWID type
- This type object is used to describe the "Row ID" column in a database.
SQL NULL values are represented by the Python None singleton on input and output.
Note
Usage of Unix ticks for database interfacing can cause troubles because of the limited date range they cover.
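A short sketch of the constructors and of None-as-NULL, using the sqlite3 module (a DB API 2.0 implementation; its Date objects are stored as text here to sidestep the module's own date adapters, and the column layout is illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t (d TEXT, b BLOB, n INTEGER)")
cur.execute("INSERT INTO t VALUES (?, ?, ?)",
            (str(sqlite3.Date(2001, 6, 13)),   # Date constructor
             sqlite3.Binary(b"\x00\x01"),      # Binary constructor
             None))                            # SQL NULL on input
cur.execute("SELECT d, b, n FROM t")
row = cur.fetchone()                           # NULL comes back as None
con.close()
```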
Implementation Hints for Module Authors
Date/time objects can be implemented as Python datetime module objects (available since Python 2.3, with a C API since 2.4) or using the mxDateTime package (available for all Python versions since 1.5.2). They both provide all necessary constructors and methods at Python and C level.
Here is a sample implementation of the Unix ticks based constructors for date/time delegating work to the generic constructors:
import time

def DateFromTicks(ticks):
    return Date(*time.localtime(ticks)[:3])

def TimeFromTicks(ticks):
    return Time(*time.localtime(ticks)[3:6])

def TimestampFromTicks(ticks):
    return Timestamp(*time.localtime(ticks)[:6])

The preferred object types for Binary objects are the buffer types available in standard Python starting with version 1.5.2. Please see the Python documentation for details. For information about the C interface have a look at Include/bufferobject.h and Objects/bufferobject.c in the Python source distribution.
This Python class allows implementing the above type objects even though the description type code field yields multiple values for one type object:
class DBAPITypeObject:
    def __init__(self, *values):
        self.values = values
    def __cmp__(self, other):
        if other in self.values:
            return 0
        if other < self.values:
            return 1
        else:
            return -1

The resulting type object compares equal to all values passed to the constructor.
Here is a snippet of Python code that implements the exception hierarchy defined above:
import exceptions

class Error(exceptions.StandardError):
    pass

class Warning(exceptions.StandardError):
    pass

class InterfaceError(Error):
    pass

class DatabaseError(Error):
    pass

class InternalError(DatabaseError):
    pass

class OperationalError(DatabaseError):
    pass

class ProgrammingError(DatabaseError):
    pass

class IntegrityError(DatabaseError):
    pass

class DataError(DatabaseError):
    pass

class NotSupportedError(DatabaseError):
    pass

In C you can use the PyErr_NewException(fullname, base, NULL) API to create the exception objects.
Optional DB API Extensions
During the lifetime of DB API 2.0, module authors have often extended their implementations beyond what is required by this DB API specification. To enhance compatibility and to provide a clean upgrade path to possible future versions of the specification, this section defines a set of common extensions to the core DB API 2.0 specification.
As with all DB API optional features, the database module authors are free to not implement these additional attributes and methods (using them will then result in an AttributeError) or to raise a NotSupportedError in case the availability can only be checked at run-time.
It has been proposed to make usage of these extensions optionally visible to the programmer by issuing Python warnings through the Python warning framework. To make this feature useful, the warning messages must be standardized in order to be able to mask them. These standard messages are referred to below as Warning Message.
- Cursor.rownumber
This read-only attribute should provide the current 0-based index of the cursor in the result set or None if the index cannot be determined.
The index can be seen as the index of the cursor in a sequence (the result set). The next fetch operation will fetch the row indexed by .rownumber in that sequence.
Warning Message: "DB-API extension cursor.rownumber used"
- Connection.Error, Connection.ProgrammingError, etc.
All exception classes defined by the DB API standard should be exposed on the Connection objects as attributes (in addition to being available at module scope).
These attributes simplify error handling in multi-connection environments.
Warning Message: "DB-API extension connection.<exception> used"
- Cursor.connection
This read-only attribute returns a reference to the Connection object on which the cursor was created.
The attribute simplifies writing polymorphic code in multi-connection environments.
Warning Message: "DB-API extension cursor.connection used"
- Cursor.scroll(value [, mode='relative' ])
Scroll the cursor in the result set to a new position according to mode.
If mode is relative (default), value is taken as an offset to the current position in the result set; if set to absolute, value states an absolute target position.
An IndexError should be raised in case a scroll operation would leave the result set. In this case, the cursor position is left undefined (ideal would be to not move the cursor at all).
Note
This method should use native scrollable cursors, if available, or revert to an emulation for forward-only scrollable cursors. The method may raise NotSupportedError to signal that a specific operation is not supported by the database (e.g. backward scrolling).
Warning Message: "DB-API extension cursor.scroll() used"
- Cursor.messages
This is a Python list object to which the interface appends tuples (exception class, exception value) for all messages which the interface receives from the underlying database for this cursor.
The list is cleared automatically by all standard cursor method calls (prior to executing the call), except for the .fetch*() calls, to avoid excessive memory usage. It can also be cleared by executing del cursor.messages[:].
All error and warning messages generated by the database are placed into this list, so checking the list allows the user to verify correct operation of the method calls.
The aim of this attribute is to eliminate the need for a Warning exception which often causes problems (some warnings really only have informational character).
Warning Message: "DB-API extension cursor.messages used"
- Connection.messages
Same as Cursor.messages except that the messages in the list are connection oriented.
The list is cleared automatically by all standard connection method calls (prior to executing the call) to avoid excessive memory usage and can also be cleared by executing del connection.messages[:].
Warning Message: "DB-API extension connection.messages used"
- Cursor.next()
Return the next row from the currently executing SQL statement using the same semantics as .fetchone(). A StopIteration exception is raised when the result set is exhausted for Python versions 2.2 and later. Previous versions don't have the StopIteration exception and so the method should raise an IndexError instead.
Warning Message: "DB-API extension cursor.next() used"
- Cursor.__iter__()
Return self to make cursors compatible to the iteration protocol [8].
Warning Message: "DB-API extension cursor.__iter__() used"
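With these two extensions a cursor behaves like any Python iterable; a sketch using the sqlite3 module (a DB API 2.0 implementation that provides the iteration extension):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t (x INTEGER)")
cur.executemany("INSERT INTO t VALUES (?)", [(1,), (2,)])
cur.execute("SELECT x FROM t ORDER BY x")
# Iterating the cursor has the same semantics as repeated .fetchone() calls.
total = sum(x for (x,) in cur)
con.close()
```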
- Cursor.lastrowid
This read-only attribute provides the rowid of the last modified row (most databases return a rowid only when a single INSERT operation is performed). If the operation does not set a rowid or if the database does not support rowids, this attribute should be set to None.
The semantics of .lastrowid are undefined in case the last executed statement modified more than one row, e.g. when using INSERT with .executemany().
Warning Message: "DB-API extension cursor.lastrowid used"
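A minimal sketch with the sqlite3 module (a DB API 2.0 implementation providing this extension; the table layout is illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, x INTEGER)")
cur.execute("INSERT INTO t (x) VALUES (10)")
rowid = cur.lastrowid      # rowid assigned to the single inserted row
con.close()
```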
Optional Error Handling Extensions
The core DB API specification only introduces a set of exceptions which can be raised to report errors to the user. In some cases, exceptions may be too disruptive for the flow of a program or even render execution impossible.
For these cases, and in order to simplify error handling when dealing with databases, database module authors may choose to implement user-definable error handlers. This section describes a standard way of defining these error handlers.
- Connection.errorhandler, Cursor.errorhandler
Read/write attribute which references an error handler to call in case an error condition is met.
The handler must be a Python callable taking the following arguments:
errorhandler(connection, cursor, errorclass, errorvalue)
where connection is a reference to the connection on which the cursor operates, cursor is a reference to the cursor (or None in case the error does not apply to a cursor), and errorclass is an error class to instantiate using errorvalue as the construction argument.
The standard error handler should add the error information to the appropriate .messages attribute (Connection.messages or Cursor.messages) and raise the exception defined by the given errorclass and errorvalue parameters.
If no .errorhandler is set (the attribute is None), the standard error handling scheme as outlined above should be applied.
Warning Message: "DB-API extension .errorhandler used"
Cursors should inherit the .errorhandler setting from their connection objects at cursor creation time.
Optional Two-Phase Commit Extensions
Many databases have support for two-phase commit (TPC) which allows managing transactions across multiple database connections and other resources.
If a database backend provides support for two-phase commit and the database module author wishes to expose this support, the following API should be implemented. NotSupportedError should be raised if the database backend support for two-phase commit can only be checked at run-time.
TPC Transaction IDs
As many databases follow the XA specification, transaction IDs are formed from three components:
- a format ID
- a global transaction ID
- a branch qualifier
For a particular global transaction, the first two components should be the same for all resources. Each resource in the global transaction should be assigned a different branch qualifier.
The various components must satisfy the following criteria:
- format ID: a non-negative 32-bit integer.
- global transaction ID and branch qualifier: byte strings no longer than 64 characters.
Transaction IDs are created with the .xid() Connection method:
- .xid(format_id, global_transaction_id, branch_qualifier)
Returns a transaction ID object suitable for passing to the .tpc_*() methods of this connection.
If the database connection does not support TPC, a NotSupportedError is raised.
The type of the object returned by .xid() is not defined, but it must provide sequence behaviour, allowing access to the three components. A conforming database module could choose to represent transaction IDs with tuples rather than a custom object.
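A hypothetical sketch of such a representation: a named tuple satisfies the required sequence behaviour over the three XA components while keeping them addressable by name (Xid and xid are illustrative names, not part of this specification):

```python
from collections import namedtuple

# Hypothetical transaction ID type with the three XA components.
Xid = namedtuple("Xid", "format_id global_transaction_id branch_qualifier")

def xid(format_id, global_transaction_id, branch_qualifier):
    if not 0 <= format_id < 2 ** 32:
        raise ValueError("format ID must be a non-negative 32-bit integer")
    if len(global_transaction_id) > 64 or len(branch_qualifier) > 64:
        raise ValueError("transaction ID components are limited to 64 bytes")
    return Xid(format_id, global_transaction_id, branch_qualifier)

tid = xid(0, "my-global-txn", "branch-1")
fmt, gtrid, bqual = tid    # unpacks like any sequence
```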
TPC Connection Methods
- .tpc_begin(xid)
Begins a TPC transaction with the given transaction ID xid.
This method should be called outside of a transaction (i.e. nothing may have executed since the last .commit() or .rollback()).
Furthermore, it is an error to call .commit() or .rollback() within the TPC transaction. A ProgrammingError is raised if the application calls .commit() or .rollback() during an active TPC transaction.
If the database connection does not support TPC, a NotSupportedError is raised.
- .tpc_prepare()
Performs the first phase of a transaction started with .tpc_begin(). A ProgrammingError should be raised if this method is called outside of a TPC transaction.
After calling .tpc_prepare(), no statements can be executed until .tpc_commit() or .tpc_rollback() have been called.
- .tpc_commit([ xid ])
When called with no arguments, .tpc_commit() commits a TPC transaction previously prepared with .tpc_prepare().
If .tpc_commit() is called prior to .tpc_prepare(), a single phase commit is performed. A transaction manager may choose to do this if only a single resource is participating in the global transaction.
When called with a transaction ID xid, the database commits the given transaction. If an invalid transaction ID is provided, a ProgrammingError will be raised. This form should be called outside of a transaction, and is intended for use in recovery.
On return, the TPC transaction is ended.
- .tpc_rollback([ xid ])
When called with no arguments, .tpc_rollback() rolls back a TPC transaction. It may be called before or after .tpc_prepare().
When called with a transaction ID xid, it rolls back the given transaction. If an invalid transaction ID is provided, a ProgrammingError is raised. This form should be called outside of a transaction, and is intended for use in recovery.
On return, the TPC transaction is ended.
- .tpc_recover()
Returns a list of pending transaction IDs suitable for use with .tpc_commit(xid) or .tpc_rollback(xid).
If the database does not support transaction recovery, it may return an empty list or raise NotSupportedError.
Frequently Asked Questions
The database SIG often sees recurring questions about the DB API specification. This section covers some of the issues people sometimes have with the specification.
Question:
How can I construct a dictionary out of the tuples returned by .fetch*()?
Answer:
There are several existing tools available which provide helpers for this task. Most of them use the approach of using the column names defined in the cursor attribute .description as the basis for the keys in the row dictionary.
Note that the reason for not extending the DB API specification to also support dictionary return values for the .fetch*() methods is that this approach has several drawbacks:
- Some databases don't support case-sensitive column names or auto-convert them to all lowercase or all uppercase characters.
- Columns in the result set which are generated by the query (e.g. using SQL functions) don't map to table column names and databases usually generate names for these columns in a very database specific way.
As a result, accessing the columns through dictionary keys varies between databases and makes writing portable code impossible.
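The description-based approach mentioned above can nevertheless be sketched in a few lines, here against the sqlite3 module (a DB API 2.0 implementation; rows_as_dicts is an illustrative helper, not part of any standard):

```python
import sqlite3

def rows_as_dicts(cursor):
    # Use the column names from cursor.description as the dictionary keys,
    # the approach most existing helper tools take.
    names = [d[0] for d in cursor.description]
    return [dict(zip(names, row)) for row in cursor.fetchall()]

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE people (name TEXT, age INTEGER)")
cur.execute("INSERT INTO people VALUES (?, ?)", ("guido", 45))
cur.execute("SELECT name, age FROM people")
records = rows_as_dicts(cur)
con.close()
```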
Major Changes from Version 1.0 to Version 2.0
The Python Database API 2.0 introduces a few major changes compared to the 1.0 version. Because some of these changes will cause existing DB API 1.0 based scripts to break, the major version number was adjusted to reflect this change.
These are the most important changes from 1.0 to 2.0:
- The need for a separate dbi module was dropped and the functionality merged into the module interface itself.
- New constructors and Type Objects were added for date/time values, the RAW Type Object was renamed to BINARY. The resulting set should cover all basic data types commonly found in modern SQL databases.
- New constants (apilevel, threadsafety, paramstyle) and methods (.executemany(), .nextset()) were added to provide better database bindings.
- The semantics of .callproc() needed to call stored procedures are now clearly defined.
- The definition of the .execute() return value changed. Previously, the return value was based on the SQL statement type (which was hard to implement right) — it is undefined now; use the more flexible .rowcount attribute instead. Modules are free to return the old style return values, but these are no longer mandated by the specification and should be considered database interface dependent.
- Class based exceptions were incorporated into the specification. Module implementors are free to extend the exception layout defined in this specification by subclassing the defined exception classes.
Post-publishing additions to the DB API 2.0 specification:
- Additional optional DB API extensions to the set of core functionality were specified.
Open Issues
Although the version 2.0 specification clarifies a lot of questions that were left open in the 1.0 version, there are still some remaining issues which should be addressed in future versions:
- Define a useful return value for .nextset() for the case where a new result set is available.
- Integrate the decimal module Decimal object for use as loss-less monetary and decimal interchange format.
Footnotes
| [1] | As a guideline the connection constructor parameters should be implemented as keyword parameters for more intuitive use and follow this order of parameters: dsn (data source name as a string), user (user name as a string, optional), password (password as a string, optional), host (host name, optional), database (database name, optional).
E.g. a connect could look like this: connect(dsn='myhost:MYDB', user='guido', password='234$') |
| [2] | Module implementors should prefer numeric, named or pyformat over the other formats because these offer more clarity and flexibility. |
| [3] | (1, 2, 3) If the database does not support the functionality required by the method, the interface should raise an exception in case the method is used. The preferred approach is to not implement the method and thus have Python generate an AttributeError in case the method is requested. This allows the programmer to check for database capabilities using the standard hasattr() function. For some dynamically configured interfaces it may not be appropriate to require dynamically making the method available. These interfaces should then raise a NotSupportedError to indicate the inability to perform the roll back when the method is invoked. |
| [4] | A database interface may choose to support named cursors by allowing a string argument to the method. This feature is not part of the specification, since it complicates semantics of the .fetch*() methods. |
| [5] | The module will use the __getitem__ method of the parameters object to map either positions (integers) or names (strings) to parameter values. This allows for both sequences and mappings to be used as input. The term bound refers to the process of binding an input value to a database execution buffer. In practical terms, this means that the input value is directly used as a value in the operation. The client should not be required to "escape" the value so that it can be used — the value should be equal to the actual database value. |
| [6] | Note that the interface may implement row fetching using arrays and other optimizations. It is not guaranteed that a call to this method will only move the associated cursor forward by one row. |
| [7] | The rowcount attribute may be coded in a way that updates its value dynamically. This can be useful for databases that return usable rowcount values only after the first call to a .fetch*() method. |
| [8] | Implementation Note: Python C extensions will have to implement the tp_iter slot on the cursor object instead of the .__iter__() method. |
| [9] | The term number of affected rows generally refers to the number of rows deleted, updated or inserted by the last statement run on the database cursor. Most databases will return the total number of rows that were found by the corresponding WHERE clause of the statement. Some databases use a different interpretation for UPDATEs and only return the number of rows that were changed by the UPDATE, even though the WHERE clause of the statement may have found more matching rows. Database module authors should try to implement the more common interpretation of returning the total number of rows found by the WHERE clause, or clearly document a different interpretation of the .rowcount attribute. |
Acknowledgements
Many thanks go to Andrew Kuchling who converted the Python Database API Specification 2.0 from the original HTML format into the PEP format.
Many thanks to James Henstridge for leading the discussion which led to the standardization of the two-phase commit API extensions.
Many thanks to Daniele Varrazzo for converting the specification from text PEP format to ReST PEP format, which allows linking to various parts.
Copyright
This document has been placed in the Public Domain.
pep-0250 Using site-packages on Windows
| PEP: | 250 |
|---|---|
| Title: | Using site-packages on Windows |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Paul Moore <p.f.moore at gmail.com> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 30-Mar-2001 |
| Python-Version: | 2.2 |
| Post-History: | 30-Mar-2001 |
Abstract
The standard Python distribution includes a directory
Lib/site-packages, which is used on Unix platforms to hold
locally-installed modules and packages. The site.py module
distributed with Python includes support for locating other
modules in the site-packages directory.
This PEP proposes that the site-packages directory should be used
on the Windows platform in a similar manner.
Motivation
On Windows platforms, the default setting for sys.path does not
include a directory suitable for users to install locally
developed modules. The "expected" location appears to be the
directory containing the Python executable itself. This is also
the location where distutils (and distutils-generated installers)
installs packages. Including locally developed code in the same
directory as installed executables is not good practice.
Clearly, users can manipulate sys.path, either in a locally
modified site.py, or in a suitable sitecustomize.py, or even via
.pth files. However, there should be a standard location for such
files, rather than relying on every individual site having to set
their own policy.
In addition, with distutils becoming more prevalent as a means of
distributing modules, the need for a standard install location for
distributed modules will become more common. It would be better
to define such a standard now, rather than later when more
distutils-based packages exist which will need rebuilding.
It is relevant to note that prior to Python 2.1, the site-packages
directory was not included in sys.path for Macintosh platforms.
This was changed in 2.1, and the Macintosh now includes
site-packages in sys.path, leaving Windows as the only major
platform with no site-specific modules directory.
Implementation
The implementation of this feature is fairly trivial. All that
would be required is a change to site.py, to change the section
setting sitedirs. The Python 2.1 version has
if os.sep == '/':
    sitedirs = [makepath(prefix,
                         "lib",
                         "python" + sys.version[:3],
                         "site-packages"),
                makepath(prefix, "lib", "site-python")]
elif os.sep == ':':
    sitedirs = [makepath(prefix, "lib", "site-packages")]
else:
    sitedirs = [prefix]
A suitable change would be to simply replace the last 4 lines with
else:
    sitedirs = [prefix, makepath(prefix, "lib", "site-packages")]
Changes would also be required to distutils, to reflect this change
in policy. A patch is available on Sourceforge, patch ID 445744,
which implements this change. Note that the patch checks the Python
version and only invokes the new behaviour for Python versions from
2.2 onwards. This is to ensure that distutils remains compatible
with earlier versions of Python.
Finally, the executable code which implements the Windows installer
used by the bdist_wininst command will need changing to use the new
location. A separate patch is available for this, currently
maintained by Thomas Heller.
Notes
- This change does not preclude packages using the current
location -- the change only adds a directory to sys.path, it
does not remove anything.
- Both the current location (sys.prefix) and the new directory
(site-packages) are included in sitedirs, so that .pth files
will be recognised in either location.
- This proposal adds a single additional site-packages directory
to sitedirs. On Unix platforms, two directories are added, one
for version-independent files (Python code) and one for
version-dependent code (C extensions). This is necessary on
Unix, as the sitedirs include a common (across Python versions)
package location, in /usr/local by default. As there is no such
common location available on Windows, there is also no need for
having two separate package directories.
- If users want to keep DLLs in a single location on Windows, rather
than keeping them in the package directory, the DLLs subdirectory
of the Python install directory is already available for that
purpose. Adding an extra directory solely for DLLs should not be
necessary.
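The configuration that results from the change above can be sketched as
follows; this is a simplified illustration, not site.py itself, and
makepath here is a stand-in for the helper of the same name in site.py:

```python
# A simplified sketch of the sitedirs computed under this proposal on
# Windows. Both the executable directory (sys.prefix) and the new
# site-packages directory are searched, so .pth files are recognised
# in either location.
import os
import sys

def makepath(*paths):
    # Stand-in for site.makepath: join, absolutize, and normalize case.
    return os.path.normcase(os.path.abspath(os.path.join(*paths)))

prefix = sys.prefix
sitedirs = [prefix, makepath(prefix, "lib", "site-packages")]
```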
Open Issues
- Comments from Unix users indicate that there may be issues with
the current setup on the Unix platform. Rather than become
involved in cross-platform issues, this PEP specifically limits
itself to the Windows platform, leaving changes for other platforms
to be covered in other PEPs.
- There could be issues with applications which embed Python. To the
author's knowledge, there should be no problem as a result of this
change. There have been no comments (supportive or otherwise) from
users who embed Python.
Copyright
This document has been placed in the public domain.
pep-0251 Python 2.2 Release Schedule
| PEP: | 251 |
|---|---|
| Title: | Python 2.2 Release Schedule |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Barry Warsaw <barry at python.org>, Guido van Rossum <guido at python.org> |
| Status: | Final |
| Type: | Informational |
| Created: | 17-Apr-2001 |
| Python-Version: | 2.2 |
| Post-History: | 14-Aug-2001 |
Abstract
This document describes the Python 2.2 development and release
schedule. The schedule primarily concerns itself with PEP-sized
items. Small bug fixes and changes will occur up until the first
beta release.
The schedule below represents the actual release dates of Python
2.2. Note that any subsequent maintenance releases of Python 2.2
should be covered by separate PEPs.
Release Schedule
Tentative future release dates. Note that we've slipped this
compared to the schedule posted around the release of 2.2a1.
21-Dec-2001: 2.2 [Released] (final release)
14-Dec-2001: 2.2c1 [Released]
14-Nov-2001: 2.2b2 [Released]
19-Oct-2001: 2.2b1 [Released]
28-Sep-2001: 2.2a4 [Released]
7-Sep-2001: 2.2a3 [Released]
22-Aug-2001: 2.2a2 [Released]
18-Jul-2001: 2.2a1 [Released]
Release Manager
Barry Warsaw was the Python 2.2 release manager.
Release Mechanics
We experimented with a new mechanism for releases: a week before
every alpha, beta or other release, we forked off a branch which
became the release. Changes to the branch are limited to the
release manager and his designated 'bots. This experiment was
deemed a success and should be observed for future releases. See
PEP 101 for the actual release mechanics[1].
New features for Python 2.2
The following new features are introduced in Python 2.2. For a
more detailed account, see Misc/NEWS[2] in the Python
distribution, or Andrew Kuchling's "What's New in Python 2.2"
document[3].
- iterators (PEP 234)
- generators (PEP 255)
- unifying long ints and plain ints (PEP 237)
- division (PEP 238)
- unification of types and classes (PEP 252, PEP 253)
References
[1] PEP 101, Doing Python Releases 101
http://www.python.org/dev/peps/pep-0101/
[2] Misc/NEWS file from CVS
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/dist/src/Misc/NEWS?rev=1.337.2.4&content-type=text/vnd.viewcvs-markup
[3] Andrew Kuchling, What's New in Python 2.2
http://www.python.org/doc/2.2.1/whatsnew/whatsnew22.html
Copyright
This document has been placed in the public domain.
pep-0252 Making Types Look More Like Classes
| PEP: | 252 |
|---|---|
| Title: | Making Types Look More Like Classes |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Guido van Rossum <guido at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 19-Apr-2001 |
| Python-Version: | 2.2 |
| Post-History: |
Abstract
This PEP proposes changes to the introspection API for types that
makes them look more like classes, and their instances more like
class instances. For example, type(x) will be equivalent to
x.__class__ for most built-in types. When C is x.__class__,
x.meth(a) will generally be equivalent to C.meth(x, a), and
C.__dict__ contains x's methods and other attributes.
This PEP also introduces a new approach to specifying attributes,
using attribute descriptors, or descriptors for short.
Descriptors unify and generalize several different common
mechanisms used for describing attributes: a descriptor can
describe a method, a typed field in the object structure, or a
generalized attribute represented by getter and setter functions.
Based on the generalized descriptor API, this PEP also introduces
a way to declare class methods and static methods.
[Editor's note: the ideas described in this PEP have been incorporated
into Python. The PEP no longer accurately describes the implementation.]
Introduction
One of Python's oldest language warts is the difference between
classes and types. For example, you can't directly subclass the
dictionary type, and the introspection interface for finding out
what methods and instance variables an object has is different for
types and for classes.
Healing the class/type split is a big effort, because it affects
many aspects of how Python is implemented. This PEP concerns
itself with making the introspection API for types look the same
as that for classes. Other PEPs will propose making classes look
more like types, and subclassing from built-in types; these topics
are not on the table for this PEP.
Introspection APIs
Introspection concerns itself with finding out what attributes an
object has. Python's very general getattr/setattr API makes it
impossible to guarantee that there always is a way to get a list
of all attributes supported by a specific object, but in practice
two conventions have appeared that together work for almost all
objects. I'll call them the class-based introspection API and the
type-based introspection API; class API and type API for short.
The class-based introspection API is used primarily for class
instances; it is also used by Jim Fulton's ExtensionClasses. It
assumes that all data attributes of an object x are stored in the
dictionary x.__dict__, and that all methods and class variables
can be found by inspection of x's class, written as x.__class__.
Classes have a __dict__ attribute, which yields a dictionary
containing methods and class variables defined by the class
itself, and a __bases__ attribute, which is a tuple of base
classes that must be inspected recursively. Some assumptions here
are:
- attributes defined in the instance dict override attributes
defined by the object's class;
- attributes defined in a derived class override attributes
defined in a base class;
- attributes in an earlier base class (meaning occurring earlier
in __bases__) override attributes in a later base class.
(The last two rules together are often summarized as the
left-to-right, depth-first rule for attribute search. This is the
classic Python attribute lookup rule. Note that PEP 253 will
propose to change the attribute lookup order, and if accepted,
this PEP will follow suit.)
The type-based introspection API is supported in one form or
another by most built-in objects. It uses two special attributes,
__members__ and __methods__. The __methods__ attribute, if
present, is a list of method names supported by the object. The
__members__ attribute, if present, is a list of data attribute
names supported by the object.
The type API is sometimes combined with a __dict__ that works the
same as for instances (for example for function objects in
Python 2.1, f.__dict__ contains f's dynamic attributes, while
f.__members__ lists the names of f's statically defined
attributes).
Some caution must be exercised: some objects don't list their
"intrinsic" attributes (like __dict__ and __doc__) in __members__,
while others do; sometimes attribute names occur both in
__members__ or __methods__ and as keys in __dict__, in which case
it's anybody's guess whether the value found in __dict__ is used
or not.
The type API has never been carefully specified. It is part of
Python folklore, and most third party extensions support it
because they follow examples that support it. Also, any type that
uses Py_FindMethod() and/or PyMember_Get() in its tp_getattr
handler supports it, because these two functions special-case the
attribute names __methods__ and __members__, respectively.
Jim Fulton's ExtensionClasses ignore the type API, and instead
emulate the class API, which is more powerful. In this PEP, I
propose to phase out the type API in favor of supporting the class
API for all types.
One argument in favor of the class API is that it doesn't require
you to create an instance in order to find out which attributes a
type supports; this in turn is useful for documentation
processors. For example, the socket module exports the SocketType
object, but this currently doesn't tell us what methods are
defined on socket objects. Using the class API, SocketType would
show exactly what the methods for socket objects are, and we can
even extract their docstrings, without creating a socket. (Since
this is a C extension module, the source-scanning approach to
docstring extraction isn't feasible in this case.)
Specification of the class-based introspection API
Objects may have two kinds of attributes: static and dynamic. The
names and sometimes other properties of static attributes are
knowable by inspection of the object's type or class, which is
accessible through obj.__class__ or type(obj). (I'm using type
and class interchangeably; a clumsy but descriptive term that fits
both is "meta-object".)
(XXX static and dynamic are not great terms to use here, because
"static" attributes may actually behave quite dynamically, and
because they have nothing to do with static class members in C++
or Java. Barry suggests using immutable and mutable instead, but
those words already have precise and different meanings in
slightly different contexts, so I think that would still be
confusing.)
Examples of dynamic attributes are instance variables of class
instances, module attributes, etc. Examples of static attributes
are the methods of built-in objects like lists and dictionaries,
and the attributes of frame and code objects (f.f_code,
c.co_filename, etc.). When an object with dynamic attributes
exposes these through its __dict__ attribute, __dict__ is a static
attribute.
The names and values of dynamic properties are typically stored in
a dictionary, and this dictionary is typically accessible as
obj.__dict__. The rest of this specification is more concerned
with discovering the names and properties of static attributes
than with dynamic attributes; the latter are easily discovered by
inspection of obj.__dict__.
In the discussion below, I distinguish two kinds of objects:
regular objects (like lists, ints, functions) and meta-objects.
Types and classes are meta-objects. Meta-objects are also regular
objects, but we're mostly interested in them because they are
referenced by the __class__ attribute of regular objects (or by
the __bases__ attribute of other meta-objects).
The class introspection API consists of the following elements:
- the __class__ and __dict__ attributes on regular objects;
- the __bases__ and __dict__ attributes on meta-objects;
- precedence rules;
- attribute descriptors.
Together, these not only tell us about *all* attributes defined by
a meta-object, but they also help us calculate the value of a
specific attribute of a given object.
1. The __dict__ attribute on regular objects
A regular object may have a __dict__ attribute. If it does,
this should be a mapping (not necessarily a dictionary)
supporting at least __getitem__(), keys(), and has_key(). This
gives the dynamic attributes of the object. The keys in the
mapping give attribute names, and the corresponding values give
their values.
Typically, the value of an attribute with a given name is the
same object as the value corresponding to that name as a key in
the __dict__. In other words, obj.__dict__['spam'] is obj.spam.
(But see the precedence rules below; a static attribute with
the same name *may* override the dictionary item.)
2. The __class__ attribute on regular objects
A regular object usually has a __class__ attribute. If it
does, this references a meta-object. A meta-object can define
static attributes for the regular object whose __class__ it
is. This is normally done through the following mechanism:
3. The __dict__ attribute on meta-objects
A meta-object may have a __dict__ attribute, of the same form
as the __dict__ attribute for regular objects (a mapping but
not necessarily a dictionary). If it does, the keys of the
meta-object's __dict__ are names of static attributes for the
corresponding regular object. The values are attribute
descriptors; we'll explain these later. An unbound method is a
special case of an attribute descriptor.
Because a meta-object is also a regular object, the items in a
meta-object's __dict__ correspond to attributes of the
meta-object; however, some transformation may be applied, and
bases (see below) may define additional dynamic attributes. In
other words, mobj.spam is not always mobj.__dict__['spam'].
(This rule contains a loophole because for classes, if
C.__dict__['spam'] is a function, C.spam is an unbound method
object.)
4. The __bases__ attribute on meta-objects
A meta-object may have a __bases__ attribute. If it does, this
should be a sequence (not necessarily a tuple) of other
meta-objects, the bases. An absent __bases__ is equivalent to
an empty sequence of bases. There must never be a cycle in the
relationship between meta-objects defined by __bases__
attributes; in other words, the __bases__ attributes define a
directed acyclic graph, with arcs pointing from derived
meta-objects to their base meta-objects. (It is not
necessarily a tree, since multiple classes can have the same
base class.) The __dict__ attributes of a meta-object in the
inheritance graph supply attribute descriptors for the regular
object whose __class__ attribute points to the root of the
inheritance tree (which is not the same as the root of the
inheritance hierarchy -- rather more the opposite, at the
bottom given how inheritance trees are typically drawn).
Descriptors are first searched in the dictionary of the root
meta-object, then in its bases, according to a precedence rule
(see the next paragraph).
5. Precedence rules
When two meta-objects in the inheritance graph for a given
regular object both define an attribute descriptor with the
same name, the search order is up to the meta-object. This
allows different meta-objects to define different search
orders. In particular, classic classes use the old
left-to-right depth-first rule, while new-style classes use a
more advanced rule (see the section on method resolution order
in PEP 253).
When a dynamic attribute (one defined in a regular object's
__dict__) has the same name as a static attribute (one defined
by a meta-object in the inheritance graph rooted at the regular
object's __class__), the static attribute has precedence if it
is a descriptor that defines a __set__ method (see below);
otherwise (if there is no __set__ method) the dynamic attribute
has precedence. In other words, for data attributes (those
with a __set__ method), the static definition overrides the
dynamic definition, but for other attributes, dynamic overrides
static.
Rationale: we can't have a simple rule like "static overrides
dynamic" or "dynamic overrides static", because some static
attributes indeed override dynamic attributes; for example, a
key '__class__' in an instance's __dict__ is ignored in favor
of the statically defined __class__ pointer, but on the other
hand most keys in inst.__dict__ override attributes defined in
inst.__class__. Presence of a __set__ method on a descriptor
indicates that this is a data descriptor. (Even read-only data
descriptors have a __set__ method: it always raises an
exception.) Absence of a __set__ method on a descriptor
indicates that the descriptor isn't interested in intercepting
assignment, and then the classic rule applies: an instance
variable with the same name as a method hides the method until
it is deleted.
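The precedence rule above can be demonstrated directly. The following
is a minimal sketch in modern descriptor syntax (not 2.2-era code);
all class names are illustrative:

```python
# A descriptor that defines __set__ (a "data" descriptor) overrides the
# instance dict; one without __set__ (non-data) is shadowed by it.

class DataDescr:
    def __get__(self, obj, objtype=None):
        return "from data descriptor"
    def __set__(self, obj, value):
        raise AttributeError("read-only")

class NonDataDescr:
    def __get__(self, obj, objtype=None):
        return "from non-data descriptor"

class C:
    d = DataDescr()
    n = NonDataDescr()

c = C()
# Writing to __dict__ directly bypasses __set__:
c.__dict__["d"] = "from instance dict"  # ignored: data descriptor wins
c.__dict__["n"] = "from instance dict"  # shadows the non-data descriptor
```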
6. Attribute descriptors
This is where it gets interesting -- and messy. Attribute
descriptors (descriptors for short) are stored in the
meta-object's __dict__ (or in the __dict__ of one of its
ancestors), and have two uses: a descriptor can be used to get
or set the corresponding attribute value on the (regular,
non-meta) object, and it has an additional interface that
describes the attribute for documentation and introspection
purposes.
There is little prior art in Python for designing the
descriptor's interface, neither for getting/setting the value
nor for describing the attribute otherwise, except some trivial
properties (it's reasonable to assume that __name__ and __doc__
should be the attribute's name and docstring). I will propose
such an API below.
If an object found in the meta-object's __dict__ is not an
attribute descriptor, backward compatibility dictates certain
minimal semantics. This basically means that if it is a Python
function or an unbound method, the attribute is a method;
otherwise, it is the default value for a dynamic data
attribute. Backwards compatibility also dictates that (in the
absence of a __setattr__ method) it is legal to assign to an
attribute corresponding to a method, and that this creates a
data attribute shadowing the method for this particular
instance. However, these semantics are only required for
backwards compatibility with regular classes.
The introspection API is a read-only API. We don't define the
effect of assignment to any of the special attributes (__dict__,
__class__ and __bases__), nor the effect of assignment to the
items of a __dict__. Generally, such assignments should be
considered off-limits. A future PEP may define some semantics for
some such assignments. (Especially because currently instances
support assignment to __class__ and __dict__, and classes support
assignment to __bases__ and __dict__.)
Specification of the attribute descriptor API
Attribute descriptors may have the following attributes. In the
examples, x is an object, C is x.__class__, x.meth() is a method,
and x.ivar is a data attribute or instance variable. All
attributes are optional -- a specific attribute may or may not be
present on a given descriptor. An absent attribute means that the
corresponding information is not available or the corresponding
functionality is not implemented.
- __name__: the attribute name. Because of aliasing and renaming,
the attribute may (additionally or exclusively) be known under a
different name, but this is the name under which it was born.
Example: C.meth.__name__ == 'meth'.
- __doc__: the attribute's documentation string. This may be
None.
- __objclass__: the class that declared this attribute. The
descriptor only applies to objects that are instances of this
class (this includes instances of its subclasses). Example:
C.meth.__objclass__ is C.
- __get__(): a function callable with one or two arguments that
retrieves the attribute value from an object. This is also
referred to as a "binding" operation, because it may return a
"bound method" object in the case of method descriptors. The
first argument, X, is the object from which the attribute must
be retrieved or to which it must be bound. When X is None, the
optional second argument, T, should be a meta-object and the
binding operation may return an *unbound* method restricted to
instances of T. When both X and T are specified, X should be an
instance of T. Exactly what is returned by the binding
operation depends on the semantics of the descriptor; for
example, static methods and class methods (see below) ignore the
instance and bind to the type instead.
- __set__(): a function of two arguments that sets the attribute
value on the object. If the attribute is read-only, this method
may raise a TypeError or AttributeError exception (both are
allowed, because both are historically found for undefined or
unsettable attributes). Example:
C.ivar.__set__(x, y) ~~ x.ivar = y.
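A descriptor carrying the attributes above can be sketched as follows.
This uses modern Python descriptor syntax (including __set_name__,
which postdates this PEP), and the Typed class is a hypothetical
example, not part of the specification:

```python
# A sketch of a descriptor exposing __name__, __doc__, __objclass__,
# __get__ and __set__ as described above.

class Typed:
    """Attribute that enforces a type on assignment."""
    def __init__(self, name, kind):
        self.__name__ = name      # the name under which it was "born"
        self.__doc__ = "a %s-typed attribute" % kind.__name__
        self.kind = kind
    def __set_name__(self, owner, name):
        self.__objclass__ = owner  # the class that declared the attribute
    def __get__(self, obj, objtype=None):
        if obj is None:            # class access returns the descriptor
            return self
        return obj.__dict__.get(self.__name__)
    def __set__(self, obj, value):
        if not isinstance(value, self.kind):
            raise TypeError("expected %s" % self.kind.__name__)
        obj.__dict__[self.__name__] = value

class Point:
    x = Typed("x", int)
```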
Static methods and class methods
The descriptor API makes it possible to add static methods and
class methods. Static methods are easy to describe: they behave
pretty much like static methods in C++ or Java. Here's an
example:
class C:
    def foo(x, y):
        print "staticmethod", x, y
    foo = staticmethod(foo)

C.foo(1, 2)
c = C()
c.foo(1, 2)
Both the call C.foo(1, 2) and the call c.foo(1, 2) call foo() with
two arguments, and print "staticmethod 1 2". No "self" is declared in
the definition of foo(), and no instance is required in the call.
The line "foo = staticmethod(foo)" in the class statement is the
crucial element: this makes foo() a static method. The built-in
staticmethod() wraps its function argument in a special kind of
descriptor whose __get__() method returns the original function
unchanged. Without this, the __get__() method of standard
function objects would have created a bound method object for
'c.foo' and an unbound method object for 'C.foo'.
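A wrapper with this behavior can be written in a few lines; the
following sketch (named my_staticmethod to avoid shadowing the
builtin) is illustrative, not the actual implementation:

```python
# A descriptor whose __get__ returns the wrapped function unchanged,
# so neither class access nor instance access creates a bound method.

class my_staticmethod:
    def __init__(self, func):
        self.func = func
    def __get__(self, obj, objtype=None):
        # No binding: hand back the plain function.
        return self.func

class C:
    def foo(x, y):
        return ("staticmethod", x, y)
    foo = my_staticmethod(foo)
```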
(XXX Barry suggests using "sharedmethod" instead of
"staticmethod", because the word static is being overloaded in so
many ways already. But I'm not sure if shared conveys the right
meaning.)
Class methods use a similar pattern to declare methods that
receive an implicit first argument that is the *class* for which
they are invoked. This has no C++ or Java equivalent, and is not
quite the same as what class methods are in Smalltalk, but may
serve a similar purpose. According to Armin Rigo, they are
similar to "virtual class methods" in Borland Pascal dialect
Delphi. (Python also has real metaclasses, and perhaps methods
defined in a metaclass have more right to the name "class method";
but I expect that most programmers won't be using metaclasses.)
Here's an example:
class C:
    def foo(cls, y):
        print "classmethod", cls, y
    foo = classmethod(foo)

C.foo(1)
c = C()
c.foo(1)
Both the call C.foo(1) and the call c.foo(1) end up calling foo()
with *two* arguments, and print "classmethod __main__.C 1". The
first argument of foo() is implied, and it is the class, even if
the method was invoked via an instance. Now let's continue the
example:
class D(C):
    pass

D.foo(1)
d = D()
d.foo(1)
This prints "classmethod __main__.D 1" both times; in other words,
the class passed as the first argument of foo() is the class
involved in the call, not the class involved in the definition of
foo().
But notice this:
class E(C):
    def foo(cls, y): # override C.foo
        print "E.foo() called"
        C.foo(y)
    foo = classmethod(foo)

E.foo(1)
e = E()
e.foo(1)
In this example, the call to C.foo() from E.foo() will see class C
as its first argument, not class E. This is to be expected, since
the call specifies the class C. But it stresses the difference
between these class methods and methods defined in metaclasses,
where an upcall to a metamethod would pass the target class as an
explicit first argument. (If you don't understand this, don't
worry, you're not alone.) Note that calling cls.foo(y) would be a
mistake -- it would cause infinite recursion. Also note that you
can't specify an explicit 'cls' argument to a class method. If
you want this (e.g. the __new__ method in PEP 253 requires this),
use a static method with a class as its explicit first argument
instead.
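The binding behavior described above can be sketched as a descriptor;
this is an illustrative model, not the built-in implementation:

```python
# classmethod as a descriptor: __get__ binds the wrapped function to
# the class through which the lookup happened, whether the access goes
# via the class itself or via an instance.

class my_classmethod:
    def __init__(self, func):
        self.func = func
    def __get__(self, obj, objtype=None):
        if objtype is None:
            objtype = type(obj)
        def bound(*args, **kwargs):
            # Pass the class, not the instance, as the first argument.
            return self.func(objtype, *args, **kwargs)
        return bound

class C:
    def foo(cls, y):
        return (cls.__name__, y)
    foo = my_classmethod(foo)

class D(C):
    pass
```

Note how the class involved in the call, not the class involved in the
definition, is passed as the first argument, matching the E/C example.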
C API
XXX The following is VERY rough text that I wrote with a different
audience in mind; I'll have to go through this to edit it more.
XXX It also doesn't go into enough detail for the C API.
A built-in type can declare special data attributes in two ways:
using a struct memberlist (defined in structmember.h) or a struct
getsetlist (defined in descrobject.h). The struct memberlist is
an old mechanism put to new use: each attribute has a descriptor
record including its name, an enum giving its type (various C
types are supported as well as PyObject *), an offset from the
start of the instance, and a read-only flag.
The struct getsetlist mechanism is new, and intended for cases
that don't fit in that mold, because they either require
additional checking, or are plain calculated attributes. Each
attribute here has a name, a getter C function pointer, a setter C
function pointer, and a context pointer. The function pointers
are optional, so that for example setting the setter function
pointer to NULL makes a read-only attribute. The context pointer
is intended to pass auxiliary information to generic getter/setter
functions, but I haven't found a need for this yet.
Note that there is also a similar mechanism to declare built-in
methods: these are PyMethodDef structures, which contain a name
and a C function pointer (and some flags for the calling
convention).
Traditionally, built-in types have had to define their own
tp_getattro and tp_setattro slot functions to make these attribute
definitions work (PyMethodDef and struct memberlist are quite
old). There are convenience functions that take an array of
PyMethodDef or memberlist structures, an object, and an attribute
name, and return or set the attribute if found in the list, or
raise an exception if not found. But these convenience functions
had to be explicitly called by the tp_getattro or tp_setattro
method of the specific type, and they did a linear search of the
array using strcmp() to find the array element describing the
requested attribute.
I now have a brand spanking new generic mechanism that improves
this situation substantially.
- Pointers to arrays of PyMethodDef, memberlist, getsetlist
structures are part of the new type object (tp_methods,
tp_members, tp_getset).
- At type initialization time (in PyType_InitDict()), for each
entry in those three arrays, a descriptor object is created and
placed in a dictionary that belongs to the type (tp_dict).
- Descriptors are very lean objects that mostly point to the
corresponding structure. An implementation detail is that all
descriptors share the same object type, and a discriminator
field tells what kind of descriptor it is (method, member, or
getset).
- As explained in PEP 252, descriptors have a get() method that
takes an object argument and returns that object's attribute;
descriptors for writable attributes also have a set() method
that takes an object and a value and sets that object's
attribute. Note that the get() method also serves as a bind()
operation for methods, binding the unbound method implementation
to the object.
- Instead of providing their own tp_getattro and tp_setattro
implementation, almost all built-in objects now place
PyObject_GenericGetAttr and (if they have any writable
attributes) PyObject_GenericSetAttr in their tp_getattro and
tp_setattro slots. (Or, they can leave these NULL, and inherit
them from the default base object, if they arrange for an
explicit call to PyType_InitDict() for the type before the first
instance is created.)
- In the simplest case, PyObject_GenericGetAttr() does exactly one
dictionary lookup: it looks up the attribute name in the type's
dictionary (obj->ob_type->tp_dict). Upon success, there are two
possibilities: the descriptor has a get method, or it doesn't.
For speed, the get and set methods are type slots: tp_descr_get
and tp_descr_set. If the tp_descr_get slot is non-NULL, it is
called, passing the object as its only argument, and the return
value from this call is the result of the getattr operation. If
the tp_descr_get slot is NULL, as a fallback the descriptor
itself is returned (compare class attributes that are not
methods but simple values).
- PyObject_GenericSetAttr() works very similarly but uses the
tp_descr_set slot and calls it with the object and the new
attribute value; if the tp_descr_set slot is NULL, an
AttributeError is raised.
- But now for a more complicated case. The approach described
above is suitable for most built-in objects such as lists,
strings, numbers. However, some object types have a dictionary
in each instance that can store arbitrary attributes. In fact,
when you use a class statement to subtype an existing built-in
type, you automatically get such a dictionary (unless you
explicitly turn it off, using another advanced feature,
__slots__). Let's call this the instance dict, to distinguish
it from the type dict.
- In the more complicated case, there's a conflict between names
stored in the instance dict and names stored in the type dict.
If both dicts have an entry with the same key, which one should
we return? Looking at classic Python for guidance, I find
conflicting rules: for class instances, the instance dict
overrides the class dict, *except* for the special attributes
(like __dict__ and __class__), which have priority over the
instance dict.
- I resolved this with the following set of rules, implemented in
PyObject_GenericGetAttr():
1. Look in the type dict. If you find a *data* descriptor, use
its get() method to produce the result. This takes care of
special attributes like __dict__ and __class__.
2. Look in the instance dict. If you find anything, that's it.
(This takes care of the requirement that normally the
instance dict overrides the class dict.)
3. Look in the type dict again (in reality this uses the saved
result from step 1, of course). If you find a descriptor,
use its get() method; if you find something else, that's it;
if it's not there, raise AttributeError.
This requires a classification of descriptors as data and
nondata descriptors. The current implementation quite sensibly
classifies member and getset descriptors as data (even if they
are read-only!) and method descriptors as nondata.
Non-descriptors (like function pointers or plain values) are
also classified as non-data (!).
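The three-rule lookup, and the data/nondata distinction it depends on, can be observed directly from Python. A sketch (a descriptor counts as "data" when it defines __set__):

```python
class DataDescr:
    def __get__(self, obj, objtype=None):
        return 'from data descriptor'
    def __set__(self, obj, value):       # having __set__ makes it a data descriptor
        raise AttributeError("read-only")

class NonDataDescr:
    def __get__(self, obj, objtype=None):
        return 'from non-data descriptor'

class C:
    d = DataDescr()
    n = NonDataDescr()

c = C()
# plant shadowing entries directly in the instance dict
c.__dict__['d'] = 'from instance dict'
c.__dict__['n'] = 'from instance dict'

print(c.d)  # step 1 wins: the data descriptor beats the instance dict
print(c.n)  # step 2 wins: the instance dict beats a non-data descriptor
```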
- This scheme has one drawback: in what I assume to be the most
common case, referencing an instance variable stored in the
instance dict, it does *two* dictionary lookups, whereas the
classic scheme did a quick test for attributes starting with two
underscores plus a single dictionary lookup. (Although the
implementation is sadly structured as instance_getattr() calling
instance_getattr1() calling instance_getattr2() which finally
calls PyDict_GetItem(), and the underscore test calls
PyString_AsString() rather than inlining this. I wonder if
optimizing the snot out of this might not be a good idea to
speed up Python 2.2, if we weren't going to rip it all out. :-)
- A benchmark verifies that in fact this is as fast as classic
instance variable lookup, so I'm no longer worried.
- Modification for dynamic types: steps 1 and 3 look in the
dictionaries of the type and all its base classes (in MRO
sequence, of course).
Discussion
XXX
Examples
Let's look at lists. In classic Python, the method names of
lists were available as the __methods__ attribute of list objects:
>>> [].__methods__
['append', 'count', 'extend', 'index', 'insert', 'pop',
'remove', 'reverse', 'sort']
>>>
Under the new proposal, the __methods__ attribute no longer exists:
>>> [].__methods__
Traceback (most recent call last):
File "<stdin>", line 1, in ?
AttributeError: 'list' object has no attribute '__methods__'
>>>
Instead, you can get the same information from the list type:
>>> T = [].__class__
>>> T
<type 'list'>
>>> dir(T) # like T.__dict__.keys(), but sorted
['__add__', '__class__', '__contains__', '__eq__', '__ge__',
'__getattr__', '__getitem__', '__getslice__', '__gt__',
'__iadd__', '__imul__', '__init__', '__le__', '__len__',
'__lt__', '__mul__', '__ne__', '__new__', '__radd__',
'__repr__', '__rmul__', '__setitem__', '__setslice__', 'append',
'count', 'extend', 'index', 'insert', 'pop', 'remove',
'reverse', 'sort']
>>>
The new introspection API gives more information than the old one:
in addition to the regular methods, it also shows the methods that
are normally invoked through special notations, e.g. __iadd__
(+=), __len__ (len), __ne__ (!=). You can invoke any method from
this list directly:
>>> a = ['tic', 'tac']
>>> T.__len__(a) # same as len(a)
2
>>> T.append(a, 'toe') # same as a.append('toe')
>>> a
['tic', 'tac', 'toe']
>>>
This is just like it is for user-defined classes.
Notice a familiar yet surprising name in the list: __init__. This
is the domain of PEP 253.
Backwards compatibility
XXX
Warnings and Errors
XXX
Implementation
A partial implementation of this PEP is available from CVS as a
branch named "descr-branch". To experiment with this
implementation, proceed to check out Python from CVS according to
the instructions at http://sourceforge.net/cvs/?group_id=5470 but
add the arguments "-r descr-branch" to the cvs checkout command.
(You can also start with an existing checkout and do "cvs update
-r descr-branch".) For some examples of the features described
here, see the file Lib/test/test_descr.py.
Note: the code in this branch goes way beyond this PEP; it is also
the experimentation area for PEP 253 (Subtyping Built-in Types).
References
XXX
Copyright
This document has been placed in the public domain.
pep-0253 Subtyping Built-in Types
| PEP: | 253 |
|---|---|
| Title: | Subtyping Built-in Types |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Guido van Rossum <guido at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 14-May-2001 |
| Python-Version: | 2.2 |
| Post-History: |
Abstract
This PEP proposes additions to the type object API that will allow
the creation of subtypes of built-in types, in C and in Python.
[Editor's note: the ideas described in this PEP have been incorporated
into Python. The PEP no longer accurately describes the implementation.]
Introduction
Traditionally, types in Python have been created statically, by
declaring a global variable of type PyTypeObject and initializing
it with a static initializer. The slots in the type object
describe all aspects of a Python type that are relevant to the
Python interpreter. A few slots contain dimensional information
(like the basic allocation size of instances), others contain
various flags, but most slots are pointers to functions to
implement various kinds of behaviors. A NULL pointer means that
the type does not implement the specific behavior; in that case
the system may provide a default behavior or raise an exception
when the behavior is invoked for an instance of the type. Some
collections of function pointers that are usually defined together
are obtained indirectly via a pointer to an additional structure
containing more function pointers.
While the details of initializing a PyTypeObject structure haven't
been documented as such, they are easily gleaned from the examples
in the source code, and I am assuming that the reader is
sufficiently familiar with the traditional way of creating new
Python types in C.
This PEP will introduce the following features:
- a type can be a factory function for its instances
- types can be subtyped in C
- types can be subtyped in Python with the class statement
- multiple inheritance from types is supported (insofar as
practical -- you still can't multiply inherit from list and
dictionary)
- the standard coercion functions (int, tuple, str etc.) will
be redefined to be the corresponding type objects, which serve
as their own factory functions
- a class statement can contain a __metaclass__ declaration,
specifying the metaclass to be used to create the new class
- a class statement can contain a __slots__ declaration,
specifying the specific names of the instance variables
supported
This PEP builds on PEP 252, which adds standard introspection to
types; for example, when a particular type object initializes the
tp_hash slot, that type object has a __hash__ method when
introspected. PEP 252 also adds a dictionary to type objects
which contains all methods. At the Python level, this dictionary
is read-only for built-in types; at the C level, it is accessible
directly (but it should not be modified except as part of
initialization).
For binary compatibility, a flag bit in the tp_flags slot
indicates the existence of the various new slots in the type
object introduced below. Types that don't have the
Py_TPFLAGS_HAVE_CLASS bit set in their tp_flags slot are assumed
to have NULL values for all the subtyping slots. (Warning: the
current implementation prototype is not yet consistent in its
checking of this flag bit. This should be fixed before the final
release.)
In current Python, a distinction is made between types and
classes. This PEP together with PEP 254 will remove that
distinction. However, for backwards compatibility the distinction
will probably remain for years to come, and without PEP 254, the
distinction is still large: types ultimately have a built-in type
as a base class, while classes ultimately derive from a
user-defined class. Therefore, in the rest of this PEP, I will
use the word type whenever I can -- including base type or
supertype, derived type or subtype, and metatype. However,
sometimes the terminology necessarily blends, for example an
object's type is given by its __class__ attribute, and subtyping
in Python is spelled with a class statement. If further
distinction is necessary, user-defined classes can be referred to
as "classic" classes.
About metatypes
Inevitably the discussion comes to metatypes (or metaclasses).
Metatypes are nothing new in Python: Python has always been able
to talk about the type of a type:
>>> a = 0
>>> type(a)
<type 'int'>
>>> type(type(a))
<type 'type'>
>>> type(type(type(a)))
<type 'type'>
>>>
In this example, type(a) is a "regular" type, and type(type(a)) is
a metatype. While as distributed all types have the same metatype
(PyType_Type, which is also its own metatype), this is not a
requirement, and in fact a useful and relevant 3rd party extension
(ExtensionClasses by Jim Fulton) creates an additional metatype.
The type of classic classes, known as types.ClassType, can also be
considered a distinct metatype.
A feature closely connected to metatypes is the "Don Beaudry
hook", which says that if a metatype is callable, its instances
(which are regular types) can be subclassed (really subtyped)
using a Python class statement. I will use this rule to support
subtyping of built-in types, and in fact it greatly simplifies the
logic of class creation to always simply call the metatype. When
no base class is specified, a default metatype is called -- the
default metatype is the "ClassType" object, so the class statement
will behave as before in the normal case. (This default can be
changed per module by setting the global variable __metaclass__.)
Python uses the concept of metatypes or metaclasses in a different
way than Smalltalk. In Smalltalk-80, there is a hierarchy of
metaclasses that mirrors the hierarchy of regular classes,
metaclasses map 1-1 to classes (except for some funny business at
the root of the hierarchy), and each class statement creates both
a regular class and its metaclass, putting class methods in the
metaclass and instance methods in the regular class.
Nice though this may be in the context of Smalltalk, it's not
compatible with the traditional use of metatypes in Python, and I
prefer to continue in the Python way. This means that Python
metatypes are typically written in C, and may be shared between
many regular types. (It will be possible to subtype metatypes in
Python, so it won't be absolutely necessary to write C to use
metatypes; but the power of Python metatypes will be limited. For
example, Python code will never be allowed to allocate raw memory
and initialize it at will.)
Metatypes determine various *policies* for types, such as what
happens when a type is called, how dynamic types are (whether a
type's __dict__ can be modified after it is created), what the
method resolution order is, how instance attributes are looked
up, and so on.
I'll argue that left-to-right depth-first is not the best
solution when you want to get the most use from multiple
inheritance.
I'll argue that with multiple inheritance, the metatype of the
subtype must be a descendant of the metatypes of all base types.
I'll come back to metatypes later.
Making a type a factory for its instances
Traditionally, for each type there is at least one C factory
function that creates instances of the type (PyTuple_New(),
PyInt_FromLong() and so on). These factory functions take care of
both allocating memory for the object and initializing that
memory. As of Python 2.0, they also have to interface with the
garbage collection subsystem, if the type chooses to participate
in garbage collection (which is optional, but strongly recommended
for so-called "container" types: types that may contain references
to other objects, and hence may participate in reference cycles).
In this proposal, type objects can be factory functions for their
instances, making the types directly callable from Python. This
mimics the way classes are instantiated. The C APIs for creating
instances of various built-in types will remain valid and in some
cases more efficient. Not all types will become their own factory
functions.
The type object has a new slot, tp_new, which can act as a factory
for instances of the type. Types are now callable, because the
tp_call slot is set in PyType_Type (the metatype); the function
looks for the tp_new slot of the type that is being called.
Explanation: the tp_call slot of a regular type object (such as
PyInt_Type or PyList_Type) defines what happens when *instances*
of that type are called; in particular, the tp_call slot in the
function type, PyFunction_Type, is the key to making functions
callable. As another example, PyInt_Type.tp_call is NULL, because
integers are not callable. The new paradigm makes *type objects*
callable. Since type objects are instances of their metatype
(PyType_Type), the metatype's tp_call slot (PyType_Type.tp_call)
points to a function that is invoked when any type object is
called. Now, since each type has to do something different to
create an instance of itself, PyType_Type.tp_call immediately
defers to the tp_new slot of the type that is being called.
PyType_Type itself is also callable: its tp_new slot creates a new
type. This is used by the class statement (formalizing the Don
Beaudry hook, see above). And what makes PyType_Type callable?
The tp_call slot of *its* metatype -- but since it is its own
metatype, that is its own tp_call slot!
If the type's tp_new slot is NULL, an exception is raised.
Otherwise, the tp_new slot is called. The signature for the
tp_new slot is
PyObject *tp_new(PyTypeObject *type,
PyObject *args,
PyObject *kwds)
where 'type' is the type whose tp_new slot is called, and 'args'
and 'kwds' are the sequential and keyword arguments to the call,
passed unchanged from tp_call. (The 'type' argument is used in
combination with inheritance, see below.)
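The tp_new slot is exposed at the Python level as __new__, which receives the type being instantiated followed by the call's arguments, exactly as described. A brief sketch (Point and its conversion are illustrative, not from the PEP):

```python
class Point(tuple):
    # __new__ is the Python-level spelling of the tp_new slot: it
    # receives the type as its first argument, plus the arguments
    # from the call, and returns the (conventionally new) instance.
    def __new__(cls, x, y):
        return super().__new__(cls, (x, y))

p = Point(3, 4)
print(p)  # (3, 4)
```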
There are no constraints on the object type that is returned,
although by convention it should be an instance of the given
type. It is not necessary that a new object is returned; a
reference to an existing object is fine too. The return value
should always be a new reference, owned by the caller.
Once the tp_new slot has returned an object, further initialization
is attempted by calling the tp_init() slot of the resulting
object's type, if not NULL. This has the following signature:
int tp_init(PyObject *self,
PyObject *args,
PyObject *kwds)
It corresponds more closely to the __init__() method of classic
classes, and in fact is mapped to that by the slot/special-method
correspondence rules. The difference in responsibilities between
the tp_new() slot and the tp_init() slot lies in the invariants
they ensure. The tp_new() slot should ensure only the most
essential invariants, without which the C code that implements the
objects would break. The tp_init() slot should be used for
overridable user-specific initializations. Take for example the
dictionary type. The implementation has an internal pointer to a
hash table which should never be NULL. This invariant is taken
care of by the tp_new() slot for dictionaries. The dictionary
tp_init() slot, on the other hand, could be used to give the
dictionary an initial set of keys and values based on the
arguments passed in.
Note that for immutable object types, the initialization cannot be
done by the tp_init() slot: this would provide the Python user
with a way to change the initialization. Therefore, immutable
objects typically have an empty tp_init() implementation and do
all their initialization in their tp_new() slot.
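This immutability rule is visible when subtyping an immutable built-in from Python: the value must be fixed in __new__, because by the time __init__ runs the object already exists and cannot change. A sketch (the Celsius conversion is a made-up example):

```python
class Celsius(float):
    # float is immutable, so all "initialization" -- fixing the
    # numeric value -- has to happen in __new__ (tp_new), not
    # in __init__ (tp_init).
    def __new__(cls, fahrenheit):
        return super().__new__(cls, (fahrenheit - 32) * 5 / 9)

c = Celsius(212)
print(c)  # 100.0
```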
You may wonder why the tp_new() slot shouldn't call the tp_init()
slot itself. The reason is that in certain circumstances (like
support for persistent objects), it is important to be able to
create an object of a particular type without initializing it any
further than necessary. This may conveniently be done by calling
the tp_new() slot without calling tp_init(). It is also possible
that tp_init() is not called, or called more than once -- its
operation should be robust even in these anomalous cases.
For some objects, tp_new() may return an existing object. For
example, the factory function for integers caches the integers -1
through 99. This is permissible only when the type argument to
tp_new() is the type that defined the tp_new() function (in the
example, if type == &PyInt_Type), and when the tp_init() slot for
this type does nothing. If the type argument differs, the
tp_new() call is initiated by a derived type's tp_new() to
create the object and initialize the base type portion of the
object; in this case tp_new() should always return a new object
(or raise an exception).
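The same caching pattern can be written at the Python level: __new__ may hand back an existing instance rather than allocating a fresh one. A hypothetical sketch (Flag is not a real API):

```python
class Flag:
    _cache = {}
    # Like the small-integer cache: __new__ (tp_new) may return an
    # existing object.  This is safe here because Flag defines no
    # __init__ of its own, so nothing re-initializes the cached
    # instance on subsequent calls.
    def __new__(cls, name):
        if name not in cls._cache:
            cls._cache[name] = super().__new__(cls)
        return cls._cache[name]

a = Flag('on')
b = Flag('on')
print(a is b)  # True
```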
Both tp_new() and tp_init() should receive exactly the same 'args'
and 'kwds' arguments, and both should check that the arguments are
acceptable, because they may be called independently.
There's a third slot related to object creation: tp_alloc(). Its
responsibility is to allocate the memory for the object,
initialize the reference count (ob_refcnt) and the type pointer
(ob_type), and initialize the rest of the object to all zeros. It
should also register the object with the garbage collection
subsystem if the type supports garbage collection. This slot
exists so that derived types can override the memory allocation
policy (like which heap is being used) separately from the
initialization code. The signature is:
PyObject *tp_alloc(PyTypeObject *type, int nitems)
The type argument is the type of the new object. The nitems
argument is normally zero, except for objects with a variable
allocation size (basically strings, tuples, and longs). The
allocation size is given by the following expression:
type->tp_basicsize + nitems * type->tp_itemsize
The tp_alloc slot is only used for subclassable types. The tp_new()
function of the base class must call the tp_alloc() slot of the
type passed in as its first argument. It is the tp_new()
function's responsibility to calculate the number of items. The
tp_alloc() slot will set the ob_size member of the new object if
the type->tp_itemsize member is nonzero.
(Note: in certain debugging compilation modes, the type structure
already had members named tp_alloc and tp_free, which were used as
counters for the number of allocations and deallocations. These
are renamed to tp_allocs and tp_deallocs.)
Standard implementations for tp_alloc() and tp_new() are
available. PyType_GenericAlloc() allocates an object from the
standard heap and initializes it properly. It uses the above
formula to determine the amount of memory to allocate, and takes
care of GC registration. The only reason not to use this
implementation would be to allocate objects from a different heap
(as is done by some very small frequently used objects like ints
and tuples). PyType_GenericNew() adds very little: it just calls
the type's tp_alloc() slot with zero for nitems. But for mutable
types that do all their initialization in their tp_init() slot,
this may be just the ticket.
Preparing a type for subtyping
The idea behind subtyping is very similar to that of single
inheritance in C++. A base type is described by a structure
declaration (similar to the C++ class declaration) plus a type
object (similar to the C++ vtable). A derived type can extend the
structure (but must leave the names, order and type of the members
of the base structure unchanged) and can override certain slots in
the type object, leaving others the same. (Unlike C++ vtables,
all Python type objects have the same memory layout.)
The base type must do the following:
- Add the flag value Py_TPFLAGS_BASETYPE to tp_flags.
- Declare and use tp_new(), tp_alloc() and optional tp_init()
slots.
- Declare and use tp_dealloc() and tp_free().
- Export its object structure declaration.
- Export a subtyping-aware type-checking macro.
The requirements and signatures for tp_new(), tp_alloc() and
tp_init() have already been discussed above: tp_alloc() should
allocate the memory and initialize it to mostly zeros; tp_new()
should call the tp_alloc() slot and then proceed to do the
minimally required initialization; tp_init() should be used for
more extensive initialization of mutable objects.
It should come as no surprise that there are similar conventions
at the end of an object's lifetime. The slots involved are
tp_dealloc() (familiar to all who have ever implemented a Python
extension type) and tp_free(), the new kid on the block. (The
names aren't quite symmetric; tp_free() corresponds to tp_alloc(),
which is fine, but tp_dealloc() corresponds to tp_new(). Maybe
the tp_dealloc slot should be renamed?)
The tp_free() slot should be used to free the memory and
unregister the object with the garbage collection subsystem, and
can be overridden by a derived class; tp_dealloc() should
deinitialize the object (usually by calling Py_XDECREF() for
various sub-objects) and then call tp_free() to deallocate the
memory. The signature for tp_dealloc() is the same as it always
was:
void tp_dealloc(PyObject *object)
The signature for tp_free() is the same:
void tp_free(PyObject *object)
(In a previous version of this PEP, there was also a role reserved
for the tp_clear() slot. This turned out to be a bad idea.)
To be usefully subtyped in C, a type must export the structure
declaration for its instances through a header file, as it is
needed to derive a subtype. The type object for the base type
must also be exported.
If the base type has a type-checking macro (like PyDict_Check()),
this macro should be made to recognize subtypes. This can be done
by using the new PyObject_TypeCheck(object, type) macro, which
calls a function that follows the base class links.
The PyObject_TypeCheck() macro contains a slight optimization: it
first compares object->ob_type directly to the type argument, and
if this is a match, bypasses the function call. This should make
it fast enough for most situations.
Note that this change in the type-checking macro means that C
functions that require an instance of the base type may be invoked
with instances of the derived type. Before enabling subtyping of
a particular type, its code should be checked to make sure that
this won't break anything. It has proved useful in the prototype
to add another type-checking macro for the built-in Python object
types, to check for exact type match too (for example,
PyDict_Check(x) is true if x is an instance of dictionary or of a
dictionary subclass, while PyDict_CheckExact(x) is true only if x
is a dictionary).
Creating a subtype of a built-in type in C
The simplest form of subtyping is subtyping in C. It is the
simplest form because we can require the C code to be aware of
some of the problems, and it's acceptable for C code that doesn't
follow the rules to dump core. For added simplicity, it is
limited to single inheritance.
Let's assume we're deriving from a mutable base type whose
tp_itemsize is zero. The subtype code is not GC-aware, although
it may inherit GC-awareness from the base type (this is
automatic). The base type's allocation uses the standard heap.
The derived type begins by declaring a type structure which
contains the base type's structure. For example, here's the type
structure for a subtype of the built-in list type:
typedef struct {
PyListObject list;
int state;
} spamlistobject;
Note that the base type structure member (here PyListObject) must
be the first member of the structure; any following members are
additions. Also note that the base type is not referenced via a
pointer; the actual contents of its structure must be included!
(The goal is for the memory layout of the beginning of the
subtype instance to be the same as that of the base type
instance.)
Next, the derived type must declare a type object and initialize
it. Most of the slots in the type object may be initialized to
zero, which is a signal that the base type slot must be copied
into it. Some slots that must be initialized properly:
- The object header must be filled in as usual; the type should
be &PyType_Type.
- The tp_basicsize slot must be set to the size of the subtype
instance struct (in the above example:
sizeof(spamlistobject)).
- The tp_base slot must be set to the address of the base type's
type object.
- If the derived type defines any pointer members, the
tp_dealloc slot function requires special attention (see
below); otherwise, it can be set to zero, to inherit the base
type's deallocation function.
- The tp_flags slot must be set to the usual Py_TPFLAGS_DEFAULT
value.
- The tp_name slot must be set; it is recommended to set tp_doc
as well (these are not inherited).
If the subtype defines no additional structure members (it only
defines new behavior, no new data), the tp_basicsize and the
tp_dealloc slots may be left set to zero.
The subtype's tp_dealloc slot deserves special attention. If the
derived type defines no additional pointer members that need to be
DECREF'ed or freed when the object is deallocated, it can be set
to zero. Otherwise, the subtype's tp_dealloc() function must call
Py_XDECREF() for any PyObject * members and the correct memory
freeing function for any other pointers it owns, and then call the
base class's tp_dealloc() slot. This call has to be made via the
base type's type structure, for example, when deriving from the
standard list type:
PyList_Type.tp_dealloc(self);
If the subtype wants to use a different allocation heap than the
base type, the subtype must override both the tp_alloc() and the
tp_free() slots. These will be called by the base class's
tp_new() and tp_dealloc() slots, respectively.
To complete the initialization of the type, PyType_InitDict() must
be called. This replaces slots initialized to zero in the subtype
with the value of the corresponding base type slots. (It also
fills in tp_dict, the type's dictionary, and does various other
initializations necessary for type objects.)
A subtype is not usable until PyType_InitDict() is called for it;
this is best done during module initialization, assuming the
subtype belongs to a module. An alternative for subtypes added to
the Python core (which don't live in a particular module) would be
to initialize the subtype in their constructor function. It is
allowed to call PyType_InitDict() more than once; the second and
further calls have no effect. To avoid unnecessary calls, a test
for tp_dict==NULL can be made.
(During initialization of the Python interpreter, some types are
actually used before they are initialized. As long as the slots
that are actually needed are initialized, especially tp_dealloc,
this works, but it is fragile and not recommended as a general
practice.)
To create a subtype instance, the subtype's tp_new() slot is
called. This should first call the base type's tp_new() slot and
then initialize the subtype's additional data members. To further
initialize the instance, the tp_init() slot is typically called.
Note that the tp_new() slot should *not* call the tp_init() slot;
this is up to tp_new()'s caller (typically a factory function).
There are circumstances where it is appropriate not to call
tp_init().
If a subtype defines a tp_init() slot, the tp_init() slot should
normally first call the base type's tp_init() slot.
(XXX There should be a paragraph or two about argument passing
here.)
Subtyping in Python
The next step is to allow subtyping of selected built-in types
through a class statement in Python. Limiting ourselves to single
inheritance for now, here is what happens for a simple class
statement:
class C(B):
var1 = 1
def method1(self): pass
# etc.
The body of the class statement is executed in a fresh environment
(basically, a new dictionary used as local namespace), and then C
is created. The following explains how C is created.
Assume B is a type object. Since type objects are objects, and
every object has a type, B has a type. Since B is itself a type,
we also call its type its metatype. B's metatype is accessible
via type(B) or B.__class__ (the latter notation is new for types;
it is introduced in PEP 252). Let's say this metatype is M (for
Metatype). The class statement will create a new type, C. Since
C will be a type object just like B, we view the creation of C as
an instantiation of the metatype, M. The information that needs
to be provided for the creation of a subclass is:
- its name (in this example the string "C");
- its bases (a singleton tuple containing B);
- the results of executing the class body, in the form of a
dictionary (for example {"var1": 1, "method1": <function
method1 at ...>, ...}).
The class statement will result in the following call:
C = M("C", (B,), dict)
where dict is the dictionary resulting from execution of the
class body. In other words, the metatype (M) is called.
Note that even though the example has only one base, we still pass
in a (singleton) sequence of bases; this makes the interface
uniform with the multiple-inheritance case.
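This metatype call can be performed by hand in Python; calling type (the default metatype) with the name, bases, and namespace dictionary builds the same class the class statement would. A sketch:

```python
def method1(self):
    return self.var1

# the class statement...
class C(object):
    var1 = 1
    m = method1

# ...is equivalent to calling the metatype directly:
#     C = M("C", (B,), dict)
C2 = type("C", (object,), {"var1": 1, "m": method1})

print(C().m(), C2().m())  # 1 1
```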
In current Python, this is called the "Don Beaudry hook" after its
inventor; it is an exceptional case that is only invoked when a
base class is not a regular class. For a regular base class (or
when no base class is specified), current Python calls
PyClass_New(), the C level factory function for classes, directly.
Under the new system this is changed so that Python *always*
determines a metatype and calls it as given above. When one or
more bases are given, the type of the first base is used as the
metatype; when no base is given, a default metatype is chosen. By
setting the default metatype to PyClass_Type, the metatype of
"classic" classes, the classic behavior of the class statement is
retained. This default can be changed per module by setting the
global variable __metaclass__.
There are two further refinements here. First, a useful feature
is to be able to specify a metatype directly. If the class
suite defines a variable __metaclass__, that is the metatype
to call. (Note that setting __metaclass__ at the module level
only affects class statements without a base class and without an
explicit __metaclass__ declaration; but setting __metaclass__ in a
class suite overrides the default metatype unconditionally.)
Second, with multiple bases, not all bases need to have the same
metatype. This is called a metaclass conflict [1]. Some
metaclass conflicts can be resolved by searching through the set
of bases for a metatype that derives from all other given
metatypes. If such a metatype cannot be found, an exception is
raised and the class statement fails.
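Both outcomes can be demonstrated from Python (using today's metaclass= spelling rather than the __metaclass__ variable of this PEP's era):

```python
class MetaA(type): pass
class MetaB(type): pass

class A(metaclass=MetaA): pass
class B(metaclass=MetaB): pass

# Unresolvable: no candidate metatype derives from both MetaA
# and MetaB, so the class statement fails.
try:
    class C(A, B): pass
except TypeError as e:
    print("conflict:", e)

# Resolvable: MetaAB derives from all the given metatypes,
# so it is chosen as the metatype of the new class.
class MetaAB(MetaA, MetaB): pass
class D(metaclass=MetaAB): pass
class E(A, D): pass
print(type(E).__name__)  # MetaAB
```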
This conflict resolution can be implemented by the metatype
constructors: the class statement just calls the metatype of the first
base (or that specified by the __metaclass__ variable), and this
metatype's constructor looks for the most derived metatype. If
that is itself, it proceeds; otherwise, it calls that metatype's
constructor. (Ultimate flexibility: another metatype might choose
to require that all bases have the same metatype, or that there's
only one base class, or whatever.)
(In [1], a new metaclass is automatically derived that is a
subclass of all given metaclasses. But since it is questionable
in Python how conflicting method definitions of the various
metaclasses should be merged, I don't think this is feasible.
Should the need arise, the user can derive such a metaclass
manually and specify it using the __metaclass__ variable. It is
also possible to have a new metaclass that does this.)
Note that calling M requires that M itself has a type: the
meta-metatype. And the meta-metatype has a type, the
meta-meta-metatype. And so on. This is normally cut short at
some level by making a metatype be its own metatype. This is
indeed what happens in Python: the ob_type reference in
PyType_Type is set to &PyType_Type. In the absence of third party
metatypes, PyType_Type is the only metatype in the Python
interpreter.
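The cut-off can be verified from Python itself; a minimal check:

```python
# The metatype regress stops because type is its own metatype,
# just as PyType_Type's ob_type reference points to &PyType_Type.
assert type(int) is type    # the metatype of a built-in type is type...
assert type(type) is type   # ...and the metatype of type is type itself
```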
(In a previous version of this PEP, there was one additional
meta-level, and there was a meta-metatype called "turtle". This
turned out to be unnecessary.)
In any case, the work for creating C is done by M's tp_new() slot.
It allocates space for an "extended" type structure, containing:
the type object; the auxiliary structures (as_sequence etc.); the
string object containing the type name (to ensure that this object
isn't deallocated while the type object is still referencing it); and
some auxiliary storage (to be described later). It initializes this
storage to zeros except for a few crucial slots (for example, tp_name
is set to point to the type name) and then sets the tp_base slot to
point to B. Then PyType_InitDict() is called to inherit B's slots.
Finally, C's tp_dict slot is updated with the contents of the
namespace dictionary (the third argument to the call to M).
Multiple inheritance
The Python class statement supports multiple inheritance, and we
will also support multiple inheritance involving built-in types.
However, there are some restrictions. The C runtime architecture
doesn't make it feasible to have a meaningful subtype of two
different built-in types except in a few degenerate cases.
Changing the C runtime to support fully general multiple
inheritance would be too much of an upheaval of the code base.
The main problem with multiple inheritance from different built-in
types stems from the fact that the C implementation of built-in
types accesses structure members directly; the C compiler
generates an offset relative to the object pointer and that's
that. For example, the list and dictionary type structures each
declare a number of different but overlapping structure members.
A C function accessing an object expecting a list won't work when
passed a dictionary, and vice versa, and there's not much we could
do about this without rewriting all code that accesses lists and
dictionaries. This would be too much work, so we won't do this.
The problem with multiple inheritance is caused by conflicting
structure member allocations. Classes defined in Python normally
don't store their instance variables in structure members: they
are stored in an instance dictionary. This is the key to a
partial solution. Suppose we have the following two classes:
    class A(dictionary):
        def foo(self): pass

    class B(dictionary):
        def bar(self): pass

    class C(A, B): pass
(Here, 'dictionary' is the type of built-in dictionary objects,
a.k.a. type({}) or {}.__class__ or types.DictType.) If we look at
the structure layout, we find that an A instance has the layout
of a dictionary followed by the __dict__ pointer, and a B instance
has the same layout; since there are no structure member layout
conflicts, this is okay.
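In released Pythons the built-in is spelled dict, and the claim can be checked directly; a minimal sketch (the method names are illustrative):

```python
class A(dict):
    def foo(self): return "foo"

class B(dict):
    def bar(self): return "bar"

# A and B have identical instance layouts, so this is accepted.
class C(A, B): pass

c = C()
c["x"] = 1        # dict behaviour comes from the shared built-in layout
```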
Here's another example:
    class X(object):
        def foo(self): pass

    class Y(dictionary):
        def bar(self): pass

    class Z(X, Y): pass
(Here, 'object' is the base for all built-in types; its structure
layout only contains the ob_refcnt and ob_type members.) This
example is more complicated, because the __dict__ pointer for X
instances has a different offset than that for Y instances. Where
is the __dict__ pointer for Z instances? The answer is that the
offset for the __dict__ pointer is not hardcoded, it is stored in
the type object.
Suppose on a particular machine an 'object' structure is 8 bytes
long, and a 'dictionary' struct is 60 bytes, and an object pointer
is 4 bytes. Then an X structure is 12 bytes (an object structure
followed by a __dict__ pointer), and a Y structure is 64 bytes (a
dictionary structure followed by a __dict__ pointer). The Z
structure has the same layout as the Y structure in this example.
Each type object (X, Y and Z) has a "__dict__ offset" which is
used to find the __dict__ pointer. Thus, the recipe for looking
up an instance variable is:
1. get the type of the instance
2. get the __dict__ offset from the type object
3. add the __dict__ offset to the instance pointer
4. look in the resulting address to find a dictionary reference
5. look up the instance variable name in that dictionary
Of course, this recipe can only be implemented in C, and I have
left out some details. But this allows us to use multiple
inheritance patterns similar to the ones we can use with classic
classes.
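The effect of storing the offset in the type object is observable from Python: instance attributes and the inherited built-in behaviour coexist in one instance, whichever base contributed the layout. A small sketch in current spelling (class names illustrative, mirroring the X/Y/Z example above):

```python
class X: pass          # 'object' layout plus a __dict__ pointer
class Y(dict): pass    # larger dict layout plus a __dict__ pointer
class Z(X, Y): pass    # adopts Y's layout; the offset lives in the type

z = Z()
z.answer = 42          # found via the type's __dict__ offset
z["key"] = "value"     # dict slots from Y's layout still work
```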
XXX I should write up the complete algorithm here to determine
base class compatibility, but I can't be bothered right now. Look
at best_base() in typeobject.c in the implementation mentioned
below.
MRO: Method resolution order (the lookup rule)
With multiple inheritance comes the question of method resolution
order: the order in which a class or type and its bases are
searched looking for a method of a given name.
In classic Python, the rule is given by the following recursive
function, also known as the left-to-right depth-first rule:
    def classic_lookup(cls, name):
        if cls.__dict__.has_key(name):
            return cls.__dict__[name]
        for base in cls.__bases__:
            try:
                return classic_lookup(base, name)
            except AttributeError:
                pass
        raise AttributeError, name
The problem with this becomes apparent when we consider a "diamond
diagram":
          class A:
            ^ ^  def save(self): ...
           /   \
          /     \
         /       \
        /         \
    class B     class C:
        ^         ^  def save(self): ...
         \       /
          \     /
           \   /
            \ /
          class D
Arrows point from a subtype to its base type(s). This particular
diagram means B and C derive from A, and D derives from B and C
(and hence also, indirectly, from A).
Assume that C overrides the method save(), which is defined in the
base A. (C.save() probably calls A.save() and then saves some of
its own state.) B and D don't override save(). When we invoke
save() on a D instance, which method is called? According to the
classic lookup rule, A.save() is called, ignoring C.save()!
This is not good. It probably breaks C (its state doesn't get
saved), defeating the whole purpose of inheriting from C in the
first place.
Why was this not a problem in classic Python? Diamond diagrams
are rarely found in classic Python class hierarchies. Most class
hierarchies use single inheritance, and multiple inheritance is
usually confined to mix-in classes. In fact, the problem shown
here is probably the reason why multiple inheritance is unpopular
in classic Python.
Why will this be a problem in the new system? The 'object' type
at the top of the type hierarchy defines a number of methods that
can usefully be extended by subtypes, for example __getattr__().
(Aside: in classic Python, the __getattr__() method is not really
the implementation for the get-attribute operation; it is a hook
that only gets invoked when an attribute cannot be found by normal
means. This has often been cited as a shortcoming -- some class
designs have a legitimate need for a __getattr__() method that
gets called for *all* attribute references. But then of course
this method has to be able to invoke the default implementation
directly. The most natural way is to make the default
implementation available as object.__getattr__(self, name).)
Thus, a classic class hierarchy like this:
    class B     class C:
        ^         ^  def __getattr__(self, name): ...
         \       /
          \     /
           \   /
            \ /
          class D
will change into a diamond diagram under the new system:
          object:
            ^ ^  __getattr__()
           /   \
          /     \
         /       \
        /         \
    class B     class C:
        ^         ^  def __getattr__(self, name): ...
         \       /
          \     /
           \   /
            \ /
          class D
and while in the original diagram C.__getattr__() is invoked,
under the new system with the classic lookup rule,
object.__getattr__() would be invoked!
Fortunately, there's a lookup rule that's better. It's a bit
difficult to explain, but it does the right thing in the diamond
diagram, and it is the same as the classic lookup rule when there
are no diamonds in the inheritance graph (when it is a tree).
The new lookup rule constructs a list of all classes in the
inheritance diagram in the order in which they will be searched.
This construction is done at class definition time to save time.
To explain the new lookup rule, let's first consider what such a
list would look like for the classic lookup rule. Note that in
the presence of diamonds the classic lookup visits some classes
multiple times. For example, in the ABCD diamond diagram above,
the classic lookup rule visits the classes in this order:
D, B, A, C, A
Note how A occurs twice in the list. The second occurrence is
redundant, since anything that could be found there would already
have been found when searching the first occurrence.
We use this observation to explain our new lookup rule. Using the
classic lookup rule, construct the list of classes that would be
searched, including duplicates. Now for each class that occurs in
the list multiple times, remove all occurrences except for the
last. The resulting list contains each ancestor class exactly
once (including the most derived class, D in the example).
Searching for methods in this order will do the right thing for
the diamond diagram. Because of the way the list is constructed,
it does not change the search order in situations where no diamond
is involved.
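The last-occurrence rule can be implemented directly in Python. (Since Python 2.3 the MRO actually uses the C3 linearization, which agrees with this rule for the simple diamond; the sketch below checks that agreement.)

```python
def classic_order(cls):
    # the classic left-to-right depth-first order, duplicates included
    order = [cls]
    for base in cls.__bases__:
        order.extend(classic_order(base))
    return order

def new_order(cls):
    seq = classic_order(cls)
    # keep only the LAST occurrence of each class
    return [c for i, c in enumerate(seq) if c not in seq[i + 1:]]

# The ABCD diamond from the diagram above.
class A:
    def save(self): return "A.save"
class B(A): pass
class C(A):
    def save(self): return "C.save"
class D(B, C): pass
```

For the diamond, classic_order(D) visits A twice, while new_order(D) matches D.__mro__, so C.save() is found before A.save().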
Isn't this backwards incompatible? Won't it break existing code?
It would, if we changed the method resolution order for all
classes. However, in Python 2.2, the new lookup rule will only be
applied to types derived from built-in types, which is a new
feature. Class statements without a base class create "classic
classes", and so do class statements whose base classes are
themselves classic classes. For classic classes the classic
lookup rule will be used. (To experiment with the new lookup rule
for classic classes, you will be able to specify a different
metaclass explicitly.) We'll also provide a tool that analyzes a
class hierarchy looking for methods that would be affected by a
change in method resolution order.
XXX Another way to explain the motivation for the new MRO, due to
Damian Conway: you never use the method defined in a base class if
it is defined in a derived class that you haven't explored yet
(using the old search order).
XXX To be done
Additional topics to be discussed in this PEP:
- backwards compatibility issues!!!
- class methods and static methods
- cooperative methods and super()
- mapping between type object slots (tp_foo) and special methods
(__foo__) (actually, this may belong in PEP 252)
- built-in names for built-in types (object, int, str, list etc.)
- __dict__ and __dictoffset__
- __slots__
- the HEAPTYPE flag bit
- GC support
- API docs for all the new functions
- how to use __new__
- writing metaclasses (using mro() etc.)
- high level user overview
- open issues:
- do we need __del__?
- assignment to __dict__, __bases__
- inconsistent naming
(e.g. tp_dealloc/tp_new/tp_init/tp_alloc/tp_free)
- add builtin alias 'dict' for 'dictionary'?
- when subclasses of dict/list etc. are passed to system
functions, the __getitem__ overrides (etc.) aren't always
used
Implementation
A prototype implementation of this PEP (and for PEP 252) is
available from CVS, and in the series of Python 2.2 alpha and beta
releases. For some examples of the features described here, see
the file Lib/test/test_descr.py and the extension module
Modules/xxsubtype.c.
References
[1] "Putting Metaclasses to Work", by Ira R. Forman and Scott
H. Danforth, Addison-Wesley 1999.
(http://www.aw.com/product/0,2627,0201433052,00.html)
Copyright
This document has been placed in the public domain.
pep-0254 Making Classes Look More Like Types
| PEP: | 254 |
|---|---|
| Title: | Making Classes Look More Like Types |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Guido van Rossum <guido at python.org> |
| Status: | Rejected |
| Type: | Standards Track |
| Created: | 18-June-2001 |
| Python-Version: | 2.2 |
| Post-History: |
Abstract
This PEP has not been written yet. Watch this space!
Status
This PEP was a stub entry and was eventually abandoned without having
been filled out. Substantially all of the intended functionality was
implemented in Python 2.2 with new-style types and classes.
Copyright
This document has been placed in the public domain.
pep-0255 Simple Generators
| PEP: | 255 |
|---|---|
| Title: | Simple Generators |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Neil Schemenauer <nas at arctrix.com>, Tim Peters <tim at zope.com>, Magnus Lie Hetland <magnus at hetland.org> |
| Discussions-To: | <python-iterators at lists.sourceforge.net> |
| Status: | Final |
| Type: | Standards Track |
| Requires: | 234 |
| Created: | 18-May-2001 |
| Python-Version: | 2.2 |
| Post-History: | 14-Jun-2001, 23-Jun-2001 |
Abstract
This PEP introduces the concept of generators to Python, as well
as a new statement used in conjunction with them, the "yield"
statement.
Motivation
When a producer function has a hard enough job that it requires
maintaining state between values produced, most programming languages
offer no pleasant and efficient solution beyond adding a callback
function to the producer's argument list, to be called with each value
produced.
For example, tokenize.py in the standard library takes this approach:
the caller must pass a "tokeneater" function to tokenize(), called
whenever tokenize() finds the next token. This allows tokenize to be
coded in a natural way, but programs calling tokenize are typically
convoluted by the need to remember between callbacks which token(s)
were seen last. The tokeneater function in tabnanny.py is a good
example of that, maintaining a state machine in global variables, to
remember across callbacks what it has already seen and what it hopes to
see next. This was difficult to get working correctly, and is still
difficult for people to understand. Unfortunately, that's typical of
this approach.
An alternative would have been for tokenize to produce an entire parse
of the Python program at once, in a large list. Then tokenize clients
could be written in a natural way, using local variables and local
control flow (such as loops and nested if statements) to keep track of
their state. But this isn't practical: programs can be very large, so
no a priori bound can be placed on the memory needed to materialize the
whole parse; and some tokenize clients only want to see whether
something specific appears early in the program (e.g., a future
statement, or, as is done in IDLE, just the first indented statement),
and then parsing the whole program first is a severe waste of time.
Another alternative would be to make tokenize an iterator[1],
delivering the next token whenever its .next() method is invoked. This
is pleasant for the caller in the same way a large list of results
would be, but without the memory and "what if I want to get out early?"
drawbacks. However, this shifts the burden on tokenize to remember
*its* state between .next() invocations, and the reader need only
glance at tokenize.tokenize_loop() to realize what a horrid chore that
would be. Or picture a recursive algorithm for producing the nodes of
a general tree structure: to cast that into an iterator framework
requires removing the recursion manually and maintaining the state of
the traversal by hand.
A fourth option is to run the producer and consumer in separate
threads. This allows both to maintain their states in natural ways,
and so is pleasant for both. Indeed, Demo/threads/Generator.py in the
Python source distribution provides a usable synchronized-communication
class for doing that in a general way. This doesn't work on platforms
without threads, though, and is very slow on platforms that do
(compared to what is achievable without threads).
A final option is to use the Stackless[2][3] variant implementation of
Python instead, which supports lightweight coroutines. This has much
the same programmatic benefits as the thread option, but is much more
efficient. However, Stackless is a controversial rethinking of the
Python core, and it may not be possible for Jython to implement the
same semantics. This PEP isn't the place to debate that, so suffice it
to say here that generators provide a useful subset of Stackless
functionality in a way that fits easily into the current CPython
implementation, and is believed to be relatively straightforward for
other Python implementations.
That exhausts the current alternatives. Some other high-level
languages provide pleasant solutions, notably iterators in Sather[4],
which were inspired by iterators in CLU; and generators in Icon[5], a
novel language where every expression "is a generator". There are
differences among these, but the basic idea is the same: provide a
kind of function that can return an intermediate result ("the next
value") to its caller, but maintaining the function's local state so
that the function can be resumed again right where it left off. A
very simple example:
    def fib():
        a, b = 0, 1
        while 1:
            yield b
            a, b = b, a+b
When fib() is first invoked, it sets a to 0 and b to 1, then yields b
back to its caller. The caller sees 1. When fib is resumed, from its
point of view the yield statement is really the same as, say, a print
statement: fib continues after the yield with all local state intact.
a and b then become 1 and 1, and fib loops back to the yield, yielding
1 to its invoker. And so on. From fib's point of view it's just
delivering a sequence of results, as if via callback. But from its
caller's point of view, the fib invocation is an iterable object that
can be resumed at will. As in the thread approach, this allows both
sides to be coded in the most natural ways; but unlike the thread
approach, this can be done efficiently and on all platforms. Indeed,
resuming a generator should be no more expensive than a function call.
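In released Pythons the generator-iterator is advanced with the built-in next() rather than the PEP-era gen.next() spelling; a minimal driver for fib():

```python
def fib():
    a, b = 0, 1
    while True:        # "while 1" in the PEP-era spelling
        yield b
        a, b = b, a + b

gen = fib()
first_six = [next(gen) for _ in range(6)]  # each call resumes fib at the yield
```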
The same kind of approach applies to many producer/consumer functions.
For example, tokenize.py could yield the next token instead of invoking
a callback function with it as argument, and tokenize clients could
iterate over the tokens in a natural way: a Python generator is a kind
of Python iterator[1], but of an especially powerful kind.
Specification: Yield
A new statement is introduced:
    yield_stmt:    "yield" expression_list
"yield" is a new keyword, so a future statement[8] is needed to phase
this in: in the initial release, a module desiring to use generators
must include the line
    from __future__ import generators

near the top (see PEP 236 [8] for details). Modules using the
identifier "yield" without a future statement will trigger warnings.
In the following release, yield will be a language keyword and the
future statement will no longer be needed.
The yield statement may only be used inside functions. A function that
contains a yield statement is called a generator function. A generator
function is an ordinary function object in all respects, but has the
new CO_GENERATOR flag set in the code object's co_flags member.
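The flag can be inspected from Python; a small check (the squares function below is illustrative):

```python
import inspect

def squares(n):
    for i in range(n):
        yield i * i

# An ordinary function object in all respects, but the compiler has
# set the generator flag on its code object.
is_gen = inspect.isgeneratorfunction(squares)
has_flag = bool(squares.__code__.co_flags & inspect.CO_GENERATOR)
```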
When a generator function is called, the actual arguments are bound to
function-local formal argument names in the usual way, but no code in
the body of the function is executed. Instead a generator-iterator
object is returned; this conforms to the iterator protocol[6], so in
particular can be used in for-loops in a natural way. Note that when
the intent is clear from context, the unqualified name "generator" may
be used to refer either to a generator-function or a generator-
iterator.
Each time the .next() method of a generator-iterator is invoked, the
code in the body of the generator-function is executed until a yield
or return statement (see below) is encountered, or until the end of
the body is reached.
If a yield statement is encountered, the state of the function is
frozen, and the value of expression_list is returned to .next()'s
caller. By "frozen" we mean that all local state is retained,
including the current bindings of local variables, the instruction
pointer, and the internal evaluation stack: enough information is
saved so that the next time .next() is invoked, the function can
proceed exactly as if the yield statement were just another external
call.
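Both the deferred start and the preserved local state are easy to observe; a minimal sketch using the modern next() spelling:

```python
trace = []

def g():
    trace.append("entered")   # runs on the first next(), not at call time
    yield 1
    trace.append("resumed")   # runs when the frozen frame is resumed
    yield 2

it = g()                      # binds arguments; executes no body code
ran_at_call_time = bool(trace)
first = next(it)              # now the body runs, up to the first yield
```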
Restriction: A yield statement is not allowed in the try clause of a
try/finally construct. The difficulty is that there's no guarantee
the generator will ever be resumed, hence no guarantee that the finally
block will ever get executed; that's too much a violation of finally's
purpose to bear.
Restriction: A generator cannot be resumed while it is actively
running:
    >>> def g():
    ...     i = me.next()
    ...     yield i
    >>> me = g()
    >>> me.next()
    Traceback (most recent call last):
     ...
     File "<string>", line 2, in g
    ValueError: generator already executing
Specification: Return
A generator function can also contain return statements of the form:
    "return"
Note that an expression_list is not allowed on return statements
in the body of a generator (although, of course, they may appear in
the bodies of non-generator functions nested within the generator).
When a return statement is encountered, control proceeds as in any
function return, executing the appropriate finally clauses (if any
exist). Then a StopIteration exception is raised, signalling that the
iterator is exhausted. A StopIteration exception is also raised if
control flows off the end of the generator without an explicit return.
Note that return means "I'm done, and have nothing interesting to
return", for both generator functions and non-generator functions.
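A bare return (or falling off the end) simply exhausts the generator-iterator; a sketch using the modern next() spelling (later Pythons also allow an expression on return in generators, but the bare form behaves as specified here):

```python
def countdown(n):
    while n > 0:
        yield n
        n -= 1
    return                    # "I'm done": the caller sees StopIteration

values = list(countdown(3))   # list() consumes until StopIteration

it = countdown(1)
next(it)                      # yields the single value, 1
try:
    next(it)
    exhausted = False
except StopIteration:
    exhausted = True
```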
Note that return isn't always equivalent to raising StopIteration: the
difference lies in how enclosing try/except constructs are treated.
For example,
    >>> def f1():
    ...     try:
    ...         return
    ...     except:
    ...         yield 1
    >>> print list(f1())
    []
because, as in any function, return simply exits, but
    >>> def f2():
    ...     try:
    ...         raise StopIteration
    ...     except:
    ...         yield 42
    >>> print list(f2())
    [42]
because StopIteration is captured by a bare "except", as is any
exception.
Specification: Generators and Exception Propagation
If an unhandled exception-- including, but not limited to,
StopIteration --is raised by, or passes through, a generator function,
then the exception is passed on to the caller in the usual way, and
subsequent attempts to resume the generator function raise
StopIteration. In other words, an unhandled exception terminates a
generator's useful life.
Example (not idiomatic but to illustrate the point):
    >>> def f():
    ...     return 1/0
    >>> def g():
    ...     yield f()  # the zero division exception propagates
    ...     yield 42   # and we'll never get here
    >>> k = g()
    >>> k.next()
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "<stdin>", line 2, in g
      File "<stdin>", line 2, in f
    ZeroDivisionError: integer division or modulo by zero
    >>> k.next()  # and the generator cannot be resumed
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    StopIteration
    >>>
Specification: Try/Except/Finally
As noted earlier, yield is not allowed in the try clause of a try/
finally construct. A consequence is that generators should allocate
critical resources with great care. There is no restriction on yield
otherwise appearing in finally clauses, except clauses, or in the try
clause of a try/except construct:
    >>> def f():
    ...     try:
    ...         yield 1
    ...         try:
    ...             yield 2
    ...             1/0
    ...             yield 3  # never get here
    ...         except ZeroDivisionError:
    ...             yield 4
    ...             yield 5
    ...             raise
    ...         except:
    ...             yield 6
    ...         yield 7  # the "raise" above stops this
    ...     except:
    ...         yield 8
    ...     yield 9
    ...     try:
    ...         x = 12
    ...     finally:
    ...         yield 10
    ...     yield 11
    >>> print list(f())
    [1, 2, 4, 5, 8, 9, 10, 11]
    >>>
Example
    # A binary tree class.
    class Tree:

        def __init__(self, label, left=None, right=None):
            self.label = label
            self.left = left
            self.right = right

        def __repr__(self, level=0, indent="    "):
            s = level*indent + `self.label`
            if self.left:
                s = s + "\n" + self.left.__repr__(level+1, indent)
            if self.right:
                s = s + "\n" + self.right.__repr__(level+1, indent)
            return s

        def __iter__(self):
            return inorder(self)

    # Create a Tree from a list.
    def tree(list):
        n = len(list)
        if n == 0:
            return []
        i = n / 2
        return Tree(list[i], tree(list[:i]), tree(list[i+1:]))

    # A recursive generator that generates Tree labels in in-order.
    def inorder(t):
        if t:
            for x in inorder(t.left):
                yield x
            yield t.label
            for x in inorder(t.right):
                yield x

    # Show it off: create a tree.
    t = tree("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
    # Print the nodes of the tree in in-order.
    for x in t:
        print x,
    print

    # A non-recursive generator.
    def inorder(node):
        stack = []
        while node:
            while node.left:
                stack.append(node)
                node = node.left
            yield node.label
            while not node.right:
                try:
                    node = stack.pop()
                except IndexError:
                    return
                yield node.label
            node = node.right

    # Exercise the non-recursive generator.
    for x in t:
        print x,
    print
Both output blocks display:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Q & A
Q. Why not a new keyword instead of reusing "def"?
A. See BDFL Pronouncements section below.
Q. Why a new keyword for "yield"? Why not a builtin function instead?
A. Control flow is much better expressed via keyword in Python, and
yield is a control construct. It's also believed that efficient
implementation in Jython requires that the compiler be able to
determine potential suspension points at compile-time, and a new
keyword makes that easy. The CPython reference implementation also
exploits it heavily, to detect which functions *are* generator-
functions (although a new keyword in place of "def" would solve that
for CPython -- but people asking the "why a new keyword?" question
don't want any new keyword).
Q: Then why not some other special syntax without a new keyword? For
example, one of these instead of "yield 3":
        return 3 and continue
        return and continue 3
        return generating 3
        continue return 3
        return >> , 3
        from generator return 3
        return >> 3
        return << 3
        >> 3
        << 3
        * 3
A: Did I miss one <wink>? Out of hundreds of messages, I counted three
suggesting such an alternative, and extracted the above from them.
It would be nice not to need a new keyword, but nicer to make yield
very clear -- I don't want to have to *deduce* that a yield is
occurring from making sense of a previously senseless sequence of
keywords or operators. Still, if this attracts enough interest,
proponents should settle on a single consensus suggestion, and Guido
will Pronounce on it.
Q. Why allow "return" at all? Why not force termination to be spelled
"raise StopIteration"?
A. The mechanics of StopIteration are low-level details, much like the
mechanics of IndexError in Python 2.1: the implementation needs to
do *something* well-defined under the covers, and Python exposes
these mechanisms for advanced users. That's not an argument for
forcing everyone to work at that level, though. "return" means "I'm
done" in any kind of function, and that's easy to explain and to use.
Note that "return" isn't always equivalent to "raise StopIteration"
in a try/except construct, either (see the "Specification: Return"
section).
Q. Then why not allow an expression on "return" too?
A. Perhaps we will someday. In Icon, "return expr" means both "I'm
done", and "but I have one final useful value to return too, and
this is it". At the start, and in the absence of compelling uses
for "return expr", it's simply cleaner to use "yield" exclusively
for delivering values.
BDFL Pronouncements
Issue: Introduce another new keyword (say, "gen" or "generator") in
place of "def", or otherwise alter the syntax, to distinguish
generator-functions from non-generator functions.
Con: In practice (how you think about them), generators *are*
functions, but with the twist that they're resumable. The mechanics of
how they're set up is a comparatively minor technical issue, and
introducing a new keyword would unhelpfully overemphasize the
mechanics of how generators get started (a vital but tiny part of a
generator's life).
Pro: In reality (how you think about them), generator-functions are
actually factory functions that produce generator-iterators as if by
magic. In this respect they're radically different from non-generator
functions, acting more like a constructor than a function, so reusing
"def" is at best confusing. A "yield" statement buried in the body is
not enough warning that the semantics are so different.
BDFL: "def" it stays. No argument on either side is totally
convincing, so I have consulted my language designer's intuition. It
tells me that the syntax proposed in the PEP is exactly right - not too
hot, not too cold. But, like the Oracle at Delphi in Greek mythology,
it doesn't tell me why, so I don't have a rebuttal for the arguments
against the PEP syntax. The best I can come up with (apart from
agreeing with the rebuttals ... already made) is "FUD". If this had
been part of the language from day one, I very much doubt it would have
made Andrew Kuchling's "Python Warts" page.
Reference Implementation
The current implementation, in a preliminary state (no docs, but well
tested and solid), is part of Python's CVS development tree[9]. Using
this requires that you build Python from source.
This was derived from an earlier patch by Neil Schemenauer[7].
Footnotes and References
[1] PEP 234, Iterators, Yee, Van Rossum
http://www.python.org/dev/peps/pep-0234/
[2] http://www.stackless.com/
[3] PEP 219, Stackless Python, McMillan
http://www.python.org/dev/peps/pep-0219/
[4] "Iteration Abstraction in Sather"
Murer, Omohundro, Stoutamire and Szyperski
http://www.icsi.berkeley.edu/~sather/Publications/toplas.html
[5] http://www.cs.arizona.edu/icon/
[6] The concept of iterators is described in PEP 234. See [1] above.
[7] http://python.ca/nas/python/generator.diff
[8] PEP 236, Back to the __future__, Peters
http://www.python.org/dev/peps/pep-0236/
[9] To experiment with this implementation, check out Python from CVS
according to the instructions at
http://sf.net/cvs/?group_id=5470
Note that the std test Lib/test/test_generators.py contains many
examples, including all those in this PEP.
Copyright
This document has been placed in the public domain.
pep-0256 Docstring Processing System Framework
| PEP: | 256 |
|---|---|
| Title: | Docstring Processing System Framework |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | David Goodger <goodger at python.org> |
| Discussions-To: | <doc-sig at python.org> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 01-Jun-2001 |
| Post-History: | 13-Jun-2001 |
Contents
Rejection Notice
This proposal seems to have run out of steam.
Abstract
Python lends itself to inline documentation. With its built-in docstring syntax, a limited form of Literate Programming [4] is easy to do in Python. However, there are no satisfactory standard tools for extracting and processing Python docstrings. The lack of a standard toolset is a significant gap in Python's infrastructure; this PEP aims to fill the gap.
The issues surrounding docstring processing have been contentious and difficult to resolve. This PEP proposes a generic Docstring Processing System (DPS) framework, which separates out the components (program and conceptual), enabling the resolution of individual issues either through consensus (one solution) or through divergence (many). It promotes standard interfaces which will allow a variety of plug-in components (input context readers, markup parsers, and output format writers) to be used.
The concepts of a DPS framework are presented independently of implementation details.
Road Map to the Docstring PEPs
There are many aspects to docstring processing. The "Docstring PEPs" have broken up the issues in order to deal with each of them in isolation, or as close as possible. The individual aspects and associated PEPs are as follows:
- Docstring syntax. PEP 287, "reStructuredText Docstring Format" [1], proposes a syntax for Python docstrings, PEPs, and other uses.
- Docstring semantics consist of at least two aspects:
- Processing mechanisms. This PEP (PEP 256) outlines the high-level issues and specification of an abstract docstring processing system (DPS). PEP 258, "Docutils Design Specification" [3], is an overview of the design and implementation of one DPS under development.
- Output styles: developers want the documentation generated from their source code to look good, and there are many different ideas about what that means. PEP 258 touches on "Stylist Transforms". This aspect of docstring processing has yet to be fully explored.
By separating out the issues, we can form consensus more easily (smaller fights ;-), and accept divergence more readily.
Rationale
There are standard inline documentation systems for some other languages. For example, Perl has POD [5] ("Plain Old Documentation") and Java has Javadoc [6], but neither of these mesh with the Pythonic way. POD syntax is very explicit, but takes after Perl in terms of readability. Javadoc is HTML-centric; except for "@field" tags, raw HTML is used for markup. There are also general tools such as Autoduck [7] and Web [8] (Tangle & Weave), useful for multiple languages.
There have been many attempts to write auto-documentation systems for Python (not an exhaustive list):
- Marc-Andre Lemburg's doc.py [9]
- Daniel Larsson's pythondoc [10] & gendoc [10]
- Doug Hellmann's HappyDoc [11]
- Laurence Tratt's Crystal (no longer available on the web)
- Ka-Ping Yee's pydoc [12] (pydoc.py is now part of the Python standard library; see below)
- Tony Ibbs' docutils [13] (Tony has donated this name to the Docutils project [14])
- Edward Loper's STminus [15] formalization and related efforts
These systems, each with different goals, have had varying degrees of success. A problem with many of the above systems was over-ambition combined with inflexibility. They provided a self-contained set of components: a docstring extraction system, a markup parser, an internal processing system and one or more output format writers with a fixed style. Inevitably, one or more aspects of each system had serious shortcomings, and they were not easily extended or modified, preventing them from being adopted as standard tools.
It has become clear (to this author, at least) that the "all or nothing" approach cannot succeed, since no monolithic self-contained system could possibly be agreed upon by all interested parties. A modular component approach designed for extension, where components may be multiply implemented, may be the only chance for success. Standard inter-component APIs will make the DPS components comprehensible without requiring detailed knowledge of the whole, lowering the barrier for contributions, and ultimately resulting in a rich and varied system.
Each of the components of a docstring processing system should be developed independently. A "best of breed" system should then be chosen, merged from existing systems and/or developed anew. This system should be included in Python's standard library.
PyDoc & Other Existing Systems
PyDoc became part of the Python standard library as of release 2.1. It extracts and displays docstrings from within the Python interactive interpreter, from the shell command line, and from a GUI window into a web browser (HTML). Although a very useful tool, PyDoc has several deficiencies, including:
- In the case of the GUI/HTML, except for some heuristic hyperlinking of identifier names, no formatting of the docstrings is done. They are presented within <p><small><tt> tags to avoid unwanted line wrapping. Unfortunately, the result is not attractive.
- PyDoc extracts docstrings and structural information (class identifiers, method signatures, etc.) from imported module objects. There are security issues involved with importing untrusted code. Also, information from the source is lost when importing, such as comments, "additional docstrings" (string literals in non-docstring contexts; see PEP 258 [3]), and the order of definitions.
The functionality proposed in this PEP could be added to or used by PyDoc when serving HTML pages. The proposed docstring processing system's functionality is much more than PyDoc needs in its current form. Either an independent tool will be developed (which PyDoc may or may not use), or PyDoc could be expanded to encompass this functionality and become the docstring processing system (or one such system). That decision is beyond the scope of this PEP.
Similarly for other existing docstring processing systems, their authors may or may not choose compatibility with this framework. However, if this framework is accepted and adopted as the Python standard, compatibility will become an important consideration in these systems' future.
Specification
The docstring processing system framework is broken up as follows:
1. Docstring conventions. Documents issues such as:
   - What should be documented where.
   - First line is a one-line synopsis.
   PEP 257 [2] documents some of these issues.
2. Docstring processing system design specification. Documents issues such as:
   - High-level spec: what a DPS does.
   - Command-line interface for executable script.
   - System Python API.
   - Docstring extraction rules.
   - Readers, which encapsulate the input context.
   - Parsers.
   - Document tree: the intermediate internal data structure. The output of the Parser and Reader, and the input to the Writer, all share the same data structure.
   - Transforms, which modify the document tree.
   - Writers for output formats.
   - Distributors, which handle output management (one file, many files, or objects in memory).
   These issues are applicable to any docstring processing system implementation. PEP 258 [3] documents these issues.
3. Docstring processing system implementation.
4. Input markup specifications: docstring syntax. PEP 287 [1] proposes a standard syntax.
5. Input parser implementations.
6. Input context readers ("modes": Python source code, PEP, standalone text file, email, etc.) and implementations.
7. Stylists: certain input context readers may have associated stylists which allow for a variety of output document styles.
8. Output formats (HTML, XML, TeX, DocBook, info, etc.) and writer implementations.
Components 1, 2/3/5, and 4 are the subject of individual companion PEPs. If there is another implementation of the framework or syntax/parser, additional PEPs may be required. Multiple implementations of each of components 6 and 7 will be required; the PEP mechanism may be overkill for these components.
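The component separation above can be sketched in miniature: a reader, a parser, and a writer that communicate only through a shared intermediate document tree. Everything here is illustrative; the class names and the list-of-tuples "tree" are placeholders for this sketch, not part of any proposed API.

```python
class PlainTextReader:
    """Reader: knows the input context and hands text to a parser."""
    def read(self, text, parser):
        return parser.parse(text)

class ParagraphParser:
    """Parser: turns markup into an intermediate document tree."""
    def parse(self, text):
        # The "tree" is just a list of paragraph nodes in this sketch.
        return [("paragraph", chunk.strip())
                for chunk in text.split("\n\n") if chunk.strip()]

class HTMLWriter:
    """Writer: renders the intermediate tree into one output format."""
    def write(self, tree):
        return "\n".join("<p>%s</p>" % body for kind, body in tree)

def publish(text, reader, parser, writer):
    """Plug any compatible reader/parser/writer combination together."""
    return writer.write(reader.read(text, parser))

html = publish("First para.\n\nSecond para.",
               PlainTextReader(), ParagraphParser(), HTMLWriter())
# html == "<p>First para.</p>\n<p>Second para.</p>"
```

Swapping in a different writer (say, a LaTeX one) requires no change to the reader or parser, which is the point of the standard inter-component APIs.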
Project Web Site
A SourceForge project has been set up for this work at http://docutils.sourceforge.net/.
References and Footnotes
| [1] | (1, 2) PEP 287, reStructuredText Docstring Format, Goodger (http://www.python.org/dev/peps/pep-0287/) |
| [2] | (1, 2) PEP 257, Docstring Conventions, Goodger, Van Rossum (http://www.python.org/dev/peps/pep-0257/) |
| [3] | (1, 2, 3) PEP 258, Docutils Design Specification, Goodger (http://www.python.org/dev/peps/pep-0258/) |
| [4] | http://www.literateprogramming.com/ |
| [5] | http://www.perldoc.com/perl5.6/pod/perlpod.html |
| [6] | http://java.sun.com/j2se/javadoc/ |
| [7] | http://www.helpmaster.com/hlp-developmentaids-autoduck.htm |
| [8] | http://www-cs-faculty.stanford.edu/~knuth/cweb.html |
| [9] | http://www.egenix.com/files/python/SoftwareDescriptions.html#doc.py |
| [10] | (1, 2) http://starship.python.net/crew/danilo/pythondoc/ |
| [11] | http://happydoc.sourceforge.net/ |
| [12] | http://docs.python.org/library/pydoc.html |
| [13] | http://www.tibsnjoan.co.uk/docutils.html |
| [14] | http://docutils.sourceforge.net/ |
| [15] | http://www.cis.upenn.edu/~edloper/pydoc/ |
| [16] | http://www.python.org/sigs/doc-sig/ |
Copyright
This document has been placed in the public domain.
Acknowledgements
This document borrows ideas from the archives of the Python Doc-SIG [16]. Thanks to all members past & present.
pep-0257 Docstring Conventions
| PEP: | 257 |
|---|---|
| Title: | Docstring Conventions |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | David Goodger <goodger at python.org>, Guido van Rossum <guido at python.org> |
| Discussions-To: | doc-sig at python.org |
| Status: | Active |
| Type: | Informational |
| Content-Type: | text/x-rst |
| Created: | 29-May-2001 |
| Post-History: | 13-Jun-2001 |
Abstract
This PEP documents the semantics and conventions associated with Python docstrings.
Rationale
The aim of this PEP is to standardize the high-level structure of docstrings: what they should contain, and how to say it (without touching on any markup syntax within docstrings). The PEP contains conventions, not laws or syntax.
"A universal convention supplies all of maintainability, clarity, consistency, and a foundation for good programming habits too. What it doesn't do is insist that you follow it against your will. That's Python!"
—Tim Peters on comp.lang.python, 2001-06-16
If you violate these conventions, the worst you'll get is some dirty looks. But some software (such as the Docutils [3] docstring processing system [1] [2]) will be aware of the conventions, so following them will get you the best results.
Specification
What is a Docstring?
A docstring is a string literal that occurs as the first statement in a module, function, class, or method definition. Such a docstring becomes the __doc__ special attribute of that object.
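For example, the first string literal in each definition below is stored on the object's __doc__ attribute (the class and method names are arbitrary):

```python
class Greeter:
    """Greet people politely."""

    def greet(self, name):
        """Return a greeting for *name*."""
        return "Hello, %s!" % name

# The docstrings are ordinary runtime attributes:
assert Greeter.__doc__ == "Greet people politely."
assert Greeter.greet.__doc__ == "Return a greeting for *name*."
```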
All modules should normally have docstrings, and all functions and classes exported by a module should also have docstrings. Public methods (including the __init__ constructor) should also have docstrings. A package may be documented in the module docstring of the __init__.py file in the package directory.
String literals occurring elsewhere in Python code may also act as documentation. They are not recognized by the Python bytecode compiler and are not accessible as runtime object attributes (i.e. not assigned to __doc__), but two types of extra docstrings may be extracted by software tools:
- String literals occurring immediately after a simple assignment at the top level of a module, class, or __init__ method are called "attribute docstrings".
- String literals occurring immediately after another docstring are called "additional docstrings".
Please see PEP 258, "Docutils Design Specification" [2], for a detailed description of attribute and additional docstrings.
XXX Mention docstrings of 2.2 properties.
For consistency, always use """triple double quotes""" around docstrings. Use r"""raw triple double quotes""" if you use any backslashes in your docstrings. For Unicode docstrings, use u"""Unicode triple-quoted strings""".
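To illustrate the raw-string case (a made-up function name, used only to show why the r prefix matters):

```python
def windows_help():
    r"""Return help text mentioning C:\temp\log.txt."""

# With the r prefix the backslashes survive literally; without it,
# "\t" would be interpreted as a tab escape inside the docstring.
assert "C:\\temp\\log.txt" in windows_help.__doc__
```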
There are two forms of docstrings: one-liners and multi-line docstrings.
One-line Docstrings
One-liners are for really obvious cases. They should really fit on one line. For example:
def kos_root():
"""Return the pathname of the KOS root directory."""
global _kos_root
if _kos_root: return _kos_root
...
Notes:
Triple quotes are used even though the string fits on one line. This makes it easy to later expand it.
The closing quotes are on the same line as the opening quotes. This looks better for one-liners.
There's no blank line either before or after the docstring.
The docstring is a phrase ending in a period. It prescribes the function or method's effect as a command ("Do this", "Return that"), not as a description; e.g. don't write "Returns the pathname ...".
The one-line docstring should NOT be a "signature" reiterating the function/method parameters (which can be obtained by introspection). Don't do:
def function(a, b):
    """function(a, b) -> list"""

This type of docstring is only appropriate for C functions (such as built-ins), where introspection is not possible. However, the nature of the return value cannot be determined by introspection, so it should be mentioned. The preferred form for such a docstring would be something like:

def function(a, b):
    """Do X and return a list."""

(Of course "Do X" should be replaced by a useful description!)
Multi-line Docstrings
Multi-line docstrings consist of a summary line just like a one-line docstring, followed by a blank line, followed by a more elaborate description. The summary line may be used by automatic indexing tools; it is important that it fits on one line and is separated from the rest of the docstring by a blank line. The summary line may be on the same line as the opening quotes or on the next line. The entire docstring is indented the same as the quotes at its first line (see example below).
Insert a blank line after all docstrings (one-line or multi-line) that document a class -- generally speaking, the class's methods are separated from each other by a single blank line, and the docstring needs to be offset from the first method by a blank line.
The docstring of a script (a stand-alone program) should be usable as its "usage" message, printed when the script is invoked with incorrect or missing arguments (or perhaps with a "-h" option, for "help"). Such a docstring should document the script's function and command line syntax, environment variables, and files. Usage messages can be fairly elaborate (several screens full) and should be sufficient for a new user to use the command properly, as well as a complete quick reference to all options and arguments for the sophisticated user.
The docstring for a module should generally list the classes, exceptions and functions (and any other objects) that are exported by the module, with a one-line summary of each. (These summaries generally give less detail than the summary line in the object's docstring.) The docstring for a package (i.e., the docstring of the package's __init__.py module) should also list the modules and subpackages exported by the package.
The docstring for a function or method should summarize its behavior and document its arguments, return value(s), side effects, exceptions raised, and restrictions on when it can be called (all if applicable). Optional arguments should be indicated. It should be documented whether keyword arguments are part of the interface.
The docstring for a class should summarize its behavior and list the public methods and instance variables. If the class is intended to be subclassed, and has an additional interface for subclasses, this interface should be listed separately (in the docstring). The class constructor should be documented in the docstring for its __init__ method. Individual methods should be documented by their own docstring.
If a class subclasses another class and its behavior is mostly inherited from that class, its docstring should mention this and summarize the differences. Use the verb "override" to indicate that a subclass method replaces a superclass method and does not call the superclass method; use the verb "extend" to indicate that a subclass method calls the superclass method (in addition to its own behavior).
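The override/extend distinction can be made concrete. This is a hypothetical sketch; the list-returning save methods exist only to make the call pattern visible:

```python
class Base:
    def save(self):
        """Write the record to disk."""
        return ["base-save"]

class Overrider(Base):
    def save(self):
        """Override Base.save; the superclass method is not called."""
        return ["custom-save"]

class Extender(Base):
    def save(self):
        """Extend Base.save with an audit log entry."""
        return Base.save(self) + ["log"]

assert Overrider().save() == ["custom-save"]          # replaces
assert Extender().save() == ["base-save", "log"]      # calls, then adds
```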
Do not use the Emacs convention of mentioning the arguments of functions or methods in upper case in running text. Python is case sensitive and the argument names can be used for keyword arguments, so the docstring should document the correct argument names. It is best to list each argument on a separate line. For example:
def complex(real=0.0, imag=0.0):
"""Form a complex number.
Keyword arguments:
real -- the real part (default 0.0)
imag -- the imaginary part (default 0.0)
"""
if imag == 0.0 and real == 0.0:
return complex_zero
...
Unless the entire docstring fits on a line, place the closing quotes on a line by themselves. This way, Emacs' fill-paragraph command can be used on it.
Handling Docstring Indentation
Docstring processing tools will strip a uniform amount of indentation from the second and further lines of the docstring, equal to the minimum indentation of all non-blank lines after the first line. Any indentation in the first line of the docstring (i.e., up to the first newline) is insignificant and removed. Relative indentation of later lines in the docstring is retained. Blank lines should be removed from the beginning and end of the docstring.
Since code is much more precise than words, here is an implementation of the algorithm:
import sys

def trim(docstring):
if not docstring:
return ''
# Convert tabs to spaces (following the normal Python rules)
# and split into a list of lines:
lines = docstring.expandtabs().splitlines()
# Determine minimum indentation (first line doesn't count):
indent = sys.maxint
for line in lines[1:]:
stripped = line.lstrip()
if stripped:
indent = min(indent, len(line) - len(stripped))
# Remove indentation (first line is special):
trimmed = [lines[0].strip()]
if indent < sys.maxint:
for line in lines[1:]:
trimmed.append(line[indent:].rstrip())
# Strip off trailing and leading blank lines:
while trimmed and not trimmed[-1]:
trimmed.pop()
while trimmed and not trimmed[0]:
trimmed.pop(0)
# Return a single string:
return '\n'.join(trimmed)
The docstring in this example contains two newline characters and is therefore 3 lines long. The first and last lines are blank:
def foo():
"""
This is the second line of the docstring.
"""
To illustrate:
>>> print repr(foo.__doc__)
'\n    This is the second line of the docstring.\n    '
>>> foo.__doc__.splitlines()
['', '    This is the second line of the docstring.', '    ']
>>> trim(foo.__doc__)
'This is the second line of the docstring.'
Once trimmed, these docstrings are equivalent:
def foo():
"""A multi-line
docstring.
"""
def bar():
"""
A multi-line
docstring.
"""
References and Footnotes
| [1] | PEP 256, Docstring Processing System Framework, Goodger (http://www.python.org/dev/peps/pep-0256/) |
| [2] | (1, 2) PEP 258, Docutils Design Specification, Goodger (http://www.python.org/dev/peps/pep-0258/) |
| [3] | http://docutils.sourceforge.net/ |
| [4] | http://www.python.org/dev/peps/pep-0008/ |
| [5] | http://www.python.org/sigs/doc-sig/ |
Copyright
This document has been placed in the public domain.
Acknowledgements
The "Specification" text comes mostly verbatim from the Python Style Guide [4] essay by Guido van Rossum.
This document borrows ideas from the archives of the Python Doc-SIG [5]. Thanks to all members past and present.
pep-0258 Docutils Design Specification
| PEP: | 258 |
|---|---|
| Title: | Docutils Design Specification |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | David Goodger <goodger at python.org> |
| Discussions-To: | <doc-sig at python.org> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Requires: | 256 257 |
| Created: | 31-May-2001 |
| Post-History: | 13-Jun-2001 |
Rejection Notice
While this may serve as an interesting design document for the now-independent docutils, it is no longer slated for inclusion in the standard library.
Abstract
This PEP documents design issues and implementation details for Docutils, a Python Docstring Processing System (DPS). The rationale and high-level concepts of a DPS are documented in PEP 256, "Docstring Processing System Framework" [1]. Also see PEP 256 for a "Road Map to the Docstring PEPs".
Docutils is being designed modularly so that any of its components can be replaced easily. In addition, Docutils is not limited to the processing of Python docstrings; it processes standalone documents as well, in several contexts.
No changes to the core Python language are required by this PEP. Its deliverables consist of a package for the standard library and its documentation.
Specification
Docutils Project Model
Project components and data flow:
+---------------------------+
| Docutils: |
| docutils.core.Publisher, |
| docutils.core.publish_*() |
+---------------------------+
/ | \
/ | \
1,3,5 / 6 | \ 7
+--------+ +-------------+ +--------+
| READER | ----> | TRANSFORMER | ====> | WRITER |
+--------+ +-------------+ +--------+
/ \\ |
/ \\ |
2 / 4 \\ 8 |
+-------+ +--------+ +--------+
| INPUT | | PARSER | | OUTPUT |
+-------+ +--------+ +--------+
The numbers above each component indicate the path a document's data takes. Double-width lines between Reader & Parser and between Transformer & Writer indicate that data sent along these paths should be standard (pure & unextended) Docutils doc trees. Single-width lines signify that internal tree extensions or completely unrelated representations are possible, but they must be supported at both ends.
Publisher
The docutils.core module contains a "Publisher" facade class and several convenience functions: "publish_cmdline()" (for command-line front ends), "publish_file()" (for programmatic use with file-like I/O), and "publish_string()" (for programmatic use with string I/O). The Publisher class encapsulates the high-level logic of a Docutils system. The Publisher class has overall responsibility for processing, controlled by the Publisher.publish() method:
- Set up internal settings (may include config files & command-line options) and I/O objects.
- Call the Reader object to read data from the source Input object and parse the data with the Parser object. A document object is returned.
- Set up and apply transforms via the Transformer object attached to the document.
- Call the Writer object which translates the document to the final output format and writes the formatted data to the destination Output object. Depending on the Output object, the output may be returned from the Writer, and then from the publish() method.
Calling the "publish" function (or instantiating a "Publisher" object) with component names will result in default behavior. For custom behavior (customizing component settings), create custom component objects first, and pass them to the Publisher or publish_* convenience functions.
Readers
Readers understand the input context (where the data is coming from), send the whole input or discrete "chunks" to the parser, and provide the context to bind the chunks together back into a cohesive whole.
Each reader is a module or package exporting a "Reader" class with a "read" method. The base "Reader" class can be found in the docutils/readers/__init__.py module.
Most Readers will have to be told what parser to use. So far (see the list of examples below), only the Python Source Reader ("PySource"; still incomplete) will be able to determine the parser on its own.
Responsibilities:
- Get input text from the source I/O.
- Pass the input text to the parser, along with a fresh document tree root.
Examples:
Standalone (Raw/Plain): Just read a text file and process it. The reader needs to be told which parser to use.
The "Standalone Reader" has been implemented in module docutils.readers.standalone.
Python Source: See Python Source Reader below. This Reader is currently in development in the Docutils sandbox.
Email: RFC-822 headers, quoted excerpts, signatures, MIME parts.
PEP: RFC-822 headers, "PEP xxxx" and "RFC xxxx" conversion to URIs. The "PEP Reader" has been implemented in module docutils.readers.pep; see PEP 287 and PEP 12.
Wiki: Global reference lookups of "wiki links" incorporated into transforms. (CamelCase only or unrestricted?) Lazy indentation?
Web Page: As standalone, but recognize meta fields as meta tags. Support for templates of some sort? (After <body>, before </body>?)
FAQ: Structured "question & answer(s)" constructs.
Compound document: Merge chapters into a book. Master manifest file?
Parsers
Parsers analyze their input and produce a Docutils document tree. They don't know or care anything about the source or destination of the data.
Each input parser is a module or package exporting a "Parser" class with a "parse" method. The base "Parser" class can be found in the docutils/parsers/__init__.py module.
Responsibilities: Given raw input text and a doctree root node, populate the doctree by parsing the input text.
Example: The only parser implemented so far is for the reStructuredText markup. It is implemented in the docutils/parsers/rst/ package.
The development and integration of other parsers is possible and encouraged.
Transformer
The Transformer class, in docutils/transforms/__init__.py, stores transforms and applies them to documents. A transformer object is attached to every new document tree. The Publisher calls Transformer.apply_transforms() to apply all stored transforms to the document tree. Transforms change the document tree from one form to another, add to the tree, or prune it. Transforms resolve references and footnote numbers, process interpreted text, and do other context-sensitive processing.
Some transforms are specific to components (Readers, Parsers, Writers, Input, Output). Standard component-specific transforms are specified in the default_transforms attribute of component classes. After the Reader has finished processing, the Publisher calls Transformer.populate_from_components() with a list of components, and all default transforms are stored.
Each transform is a class in a module in the docutils/transforms/ package, a subclass of docutils.transforms.Transform. Transform classes each have a default_priority attribute which is used by the Transformer to apply transforms in order (low to high). The default priority can be overridden when adding transforms to the Transformer object.
Transformer responsibilities:
- Apply transforms to the document tree, in priority order.
- Store a mapping of component type name ('reader', 'writer', etc.) to component objects. These are used by certain transforms (such as "components.Filter") to determine suitability.
Transform responsibilities:
- Modify a doctree in-place, either purely transforming one structure into another, or adding new structures based on the doctree and/or external data.
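The priority mechanism can be sketched as follows. This is an illustrative miniature, not the real Docutils classes; the document "tree" is just a list, and the transform names are invented:

```python
class Transform:
    """Base class: subclasses set default_priority and implement apply()."""
    default_priority = 500
    def apply(self, document):
        raise NotImplementedError

class ResolveReferences(Transform):
    default_priority = 650          # lower priority value runs earlier
    def apply(self, document):
        document.append("references-resolved")

class NumberSections(Transform):
    default_priority = 710
    def apply(self, document):
        document.append("sections-numbered")

class Transformer:
    """Stores (priority, transform) pairs and applies them in order."""
    def __init__(self, document):
        self.document = document
        self.transforms = []
    def add_transform(self, transform_class, priority=None):
        if priority is None:
            priority = transform_class.default_priority
        self.transforms.append((priority, transform_class()))
    def apply_transforms(self):
        for _, transform in sorted(self.transforms, key=lambda t: t[0]):
            transform.apply(self.document)

doc = []
transformer = Transformer(doc)
transformer.add_transform(NumberSections)       # priority 710
transformer.add_transform(ResolveReferences)    # priority 650, runs first
transformer.apply_transforms()
assert doc == ["references-resolved", "sections-numbered"]
```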
Examples of transforms (in the docutils/transforms/ package):
- frontmatter.DocInfo: Conversion of document metadata (bibliographic information).
- references.AnonymousHyperlinks: Resolution of anonymous references to corresponding targets.
- parts.Contents: Generates a table of contents for a document.
- document.Merger: Combining multiple populated doctrees into one. (Not yet implemented or fully understood.)
- document.Splitter: Splits a document into a tree-structure of subdocuments, perhaps by section. It will have to transform references appropriately. (Neither implemented nor remotely understood.)
- components.Filter: Includes or excludes elements which depend on a specific Docutils component.
Writers
Writers produce the final output (HTML, XML, TeX, etc.). Writers translate the internal document tree structure into the final data format, possibly running Writer-specific transforms first.
By the time the document gets to the Writer, it should be in final form. The Writer's job is simply (and only) to translate from the Docutils doctree structure to the target format. Some small transforms may be required, but they should be local and format-specific.
Each writer is a module or package exporting a "Writer" class with a "write" method. The base "Writer" class can be found in the docutils/writers/__init__.py module.
Responsibilities:
- Translate doctree(s) into specific output formats.
- Transform references into format-native forms.
- Write the translated output to the destination I/O.
Examples:
- XML: Various forms, such as:
- Docutils XML (an expression of the internal document tree, implemented as docutils.writers.docutils_xml).
- DocBook (being implemented in the Docutils sandbox).
- HTML (XHTML implemented as docutils.writers.html4css1).
- PDF (a ReportLabs interface is being developed in the Docutils sandbox).
- TeX (a LaTeX Writer is being implemented in the sandbox).
- Docutils-native pseudo-XML (implemented as docutils.writers.pseudoxml, used for testing).
- Plain text
- reStructuredText?
Input/Output
I/O classes provide a uniform API for low-level input and output. Subclasses will exist for a variety of input/output mechanisms. However, they can be considered an implementation detail. Most applications should be satisfied using one of the convenience functions associated with the Publisher.
I/O classes are currently in the preliminary stages; there's a lot of work yet to be done. Issues:
- How to represent multi-file input (files & directories) in the API?
- How to represent multi-file output? Perhaps "Writer" variants, one for each output distribution type? Or Output objects with associated transforms?
Responsibilities:
- Read data from the input source (Input objects) or write data to the output destination (Output objects).
Examples of input sources:
- A single file on disk or a stream (implemented as docutils.io.FileInput).
- Multiple files on disk (MultiFileInput?).
- Python source files: modules and packages.
- Python strings, as received from a client application (implemented as docutils.io.StringInput).
Examples of output destinations:
- A single file on disk or a stream (implemented as docutils.io.FileOutput).
- A tree of directories and files on disk.
- A Python string, returned to a client application (implemented as docutils.io.StringOutput).
- No output; useful for programmatic applications where only a portion of the normal output is to be used (implemented as docutils.io.NullOutput).
- A single tree-shaped data structure in memory.
- Some other set of data structures in memory.
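A sketch of the uniform I/O API described above. The class names echo the docutils.io examples, but the code is illustrative only, not the real module:

```python
class StringInput:
    """Input: a Python string received from a client application."""
    def __init__(self, source):
        self.source = source
    def read(self):
        return self.source

class StringOutput:
    """Output: a Python string returned to a client application."""
    def write(self, data):
        self.destination = data
        return data

class NullOutput:
    """Output: discard everything (for programmatic use)."""
    def write(self, data):
        return None

def publish(source, destination):
    # A real system would run reader/parser/transforms/writer between
    # these two calls; this sketch passes the data straight through.
    return destination.write(source.read())

assert publish(StringInput("doc text"), StringOutput()) == "doc text"
assert publish(StringInput("doc text"), NullOutput()) is None
```

Because every Input exposes read() and every Output exposes write(), the processing pipeline never needs to know whether it is talking to a file, a string, or nothing at all.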
Docutils Package Structure
Package "docutils".
Module "__init__.py" contains: class "Component", a base class for Docutils components; class "SettingsSpec", a base class for specifying runtime settings (used by docutils.frontend); and class "TransformSpec", a base class for specifying transforms.
Module "docutils.core" contains facade class "Publisher" and convenience functions. See Publisher above.
Module "docutils.frontend" provides runtime settings support, for programmatic use and front-end tools (including configuration file support, and command-line argument and option processing).
Module "docutils.io" provides a uniform API for low-level input and output. See Input/Output above.
Module "docutils.nodes" contains the Docutils document tree element class library plus tree-traversal Visitor pattern base classes. See Document Tree below.
Module "docutils.statemachine" contains a finite state machine specialized for regular-expression-based text filters and parsers. The reStructuredText parser implementation is based on this module.
Module "docutils.urischemes" contains a mapping of known URI schemes ("http", "ftp", "mail", etc.).
Module "docutils.utils" contains utility functions and classes, including a logger class ("Reporter"; see Error Handling below).
Package "docutils.parsers": markup parsers.
- Function "get_parser_class(parser_name)" returns a parser module by name. Class "Parser" is the base class of specific parsers. (docutils/parsers/__init__.py)
- Package "docutils.parsers.rst": the reStructuredText parser.
- Alternate markup parsers may be added.
See Parsers above.
Package "docutils.readers": context-aware input readers.
- Function "get_reader_class(reader_name)" returns a reader module by name or alias. Class "Reader" is the base class of specific readers. (docutils/readers/__init__.py)
- Module "docutils.readers.standalone" reads independent document files.
- Module "docutils.readers.pep" reads PEPs (Python Enhancement Proposals).
- Readers to be added for: Python source code (structure & docstrings), email, FAQ, and perhaps Wiki and others.
See Readers above.
Package "docutils.writers": output format writers.
- Function "get_writer_class(writer_name)" returns a writer module by name. Class "Writer" is the base class of specific writers. (docutils/writers/__init__.py)
- Module "docutils.writers.html4css1" is a simple HyperText Markup Language document tree writer for HTML 4.01 and CSS1.
- Module "docutils.writers.docutils_xml" writes the internal document tree in XML form.
- Module "docutils.writers.pseudoxml" is a simple internal document tree writer; it writes indented pseudo-XML.
- Writers to be added: HTML 3.2 or 4.01-loose, XML (various forms, such as DocBook), PDF, TeX, plaintext, reStructuredText, and perhaps others.
See Writers above.
Package "docutils.transforms": tree transform classes.
- Class "Transformer" stores transforms and applies them to document trees. (docutils/transforms/__init__.py)
- Class "Transform" is the base class of specific transforms. (docutils/transforms/__init__.py)
- Each module contains related transform classes.
See Transforms above.
Package "docutils.languages": Language modules contain language-dependent strings and mappings. They are named for their language identifier (as defined in Choice of Docstring Format below), converting dashes to underscores.
- Function "get_language(language_code)", returns matching language module. (docutils/languages/__init__.py)
- Modules: en.py (English), de.py (German), fr.py (French), it.py (Italian), sk.py (Slovak), sv.py (Swedish).
- Other languages to be added.
Third-party modules: "extras" directory. These modules are installed only if they're not already present in the Python installation.
- extras/optparse.py and extras/textwrap.py provide option parsing and command-line help; from Greg Ward's http://optik.sf.net/ project, included for convenience.
- extras/roman.py contains Roman numeral conversion routines.
Front-End Tools
The tools/ directory contains several front ends for common Docutils processing. See Docutils Front-End Tools [4] for details.
Document Tree
A single intermediate data structure is used internally by Docutils, in the interfaces between components; it is defined in the docutils.nodes module. It is not required that this data structure be used internally by any of the components, just between components as outlined in the diagram in the Docutils Project Model above.
Custom node types are allowed, provided that either (a) a transform converts them to standard Docutils nodes before they reach the Writer proper, or (b) the custom node is explicitly supported by certain Writers, and is wrapped in a filtered "pending" node. An example of condition (a) is the Python Source Reader (see below), where a "stylist" transform converts custom nodes. The HTML <meta> tag is an example of condition (b); it is supported by the HTML Writer but not by others. The reStructuredText "meta" directive creates a "pending" node, which contains knowledge that the embedded "meta" node can only be handled by HTML-compatible writers. The "pending" node is resolved by the docutils.transforms.components.Filter transform, which checks that the calling writer supports HTML; if it doesn't, the "pending" node (and enclosed "meta" node) is removed from the document.
The document tree data structure is similar to a DOM tree, but with specific node names (classes) instead of DOM's generic nodes. The schema is documented in an XML DTD (eXtensible Markup Language Document Type Definition), which comes in two parts:
- the Docutils Generic DTD, docutils.dtd [5], and
- the OASIS Exchange Table Model, soextbl.dtd [6].
The DTD defines a rich set of elements, suitable for many input and output formats. The DTD retains all information necessary to reconstruct the original input text, or a reasonable facsimile thereof.
See The Docutils Document Tree [7] for details (incomplete).
Error Handling
When the parser encounters an error in markup, it inserts a system message (DTD element "system_message"). There are five levels of system messages:
- Level-0, "DEBUG": an internal reporting issue. There is no effect on the processing. Level-0 system messages are handled separately from the others.
- Level-1, "INFO": a minor issue that can be ignored. There is little or no effect on the processing. Typically level-1 system messages are not reported.
- Level-2, "WARNING": an issue that should be addressed. If ignored, there may be minor problems with the output. Typically level-2 system messages are reported but do not halt processing.
- Level-3, "ERROR": a major issue that should be addressed. If ignored, the output will contain unpredictable errors. Typically level-3 system messages are reported but do not halt processing.
- Level-4, "SEVERE": a critical error that must be addressed. Typically level-4 system messages are turned into exceptions which halt processing. If ignored, the output will contain severe errors.
Although the initial message levels were devised independently, they have a strong correspondence to VMS error condition severity levels [8]; the names in quotes for levels 1 through 4 were borrowed from VMS. Error handling has since been influenced by the log4j project [9].
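The reporting policy described above can be modeled by a small helper. This is only an illustrative sketch; the function name, parameters, and thresholds are hypothetical and do not reflect the actual Docutils Reporter API:

```python
LEVELS = {0: "DEBUG", 1: "INFO", 2: "WARNING", 3: "ERROR", 4: "SEVERE"}

def report(level, message, report_level=2, halt_level=4):
    # Messages at or above halt_level halt processing (modeled here
    # as an exception); messages below report_level are dropped;
    # everything in between is returned for display.
    if level >= halt_level:
        raise RuntimeError("%s: %s" % (LEVELS[level], message))
    if level < report_level:
        return None
    return "%s: %s" % (LEVELS[level], message)
```

With the default settings, INFO messages are silently dropped, WARNING and ERROR messages are reported, and SEVERE messages halt processing.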
Python Source Reader
The Python Source Reader ("PySource") is the Docutils component that reads Python source files, extracts docstrings in context, then parses, links, and assembles the docstrings into a cohesive whole. It is a major and non-trivial component, currently under experimental development in the Docutils sandbox. High-level design issues are presented here.
Processing Model
This model will evolve over time, incorporating experience and discoveries.
1. The PySource Reader uses an Input class to read in Python packages and modules, into a tree of strings.
2. The Python modules are parsed, converting the tree of strings into a tree of abstract syntax trees with docstring nodes.
3. The abstract syntax trees are converted into an internal representation of the packages/modules. Docstrings are extracted, as well as code structure details. See AST Mining below. Namespaces are constructed for lookup in step 6.
4. One at a time, the docstrings are parsed, producing standard Docutils doctrees.
5. PySource assembles all the individual docstrings' doctrees into a Python-specific custom Docutils tree paralleling the package/module/class structure; this is a custom Reader-specific internal representation (see the Docutils Python Source DTD [10]). Namespaces must be merged: Python identifiers, hyperlink targets.
6. Cross-references from docstrings (interpreted text) to Python identifiers are resolved according to the Python namespace lookup rules. See Identifier Cross-References below.
7. A "Stylist" transform is applied to the custom doctree (by the Transformer), custom nodes are rendered using standard nodes as primitives, and a standard document tree is emitted. See Stylist Transforms below.
8. Other transforms are applied to the standard doctree by the Transformer.
9. The standard doctree is sent to a Writer, which translates the document into a concrete format (HTML, PDF, etc.).
10. The Writer uses an Output class to write the resulting data to its destination (disk file, directories and files, etc.).
AST Mining
Abstract Syntax Tree mining code will be written (or adapted) that scans a parsed Python module, and returns an ordered tree containing the names, docstrings (including attribute and additional docstrings; see below), and additional info (in parentheses below) of all of the following objects:
- packages
- modules
- module attributes (+ initial values)
- classes (+ inheritance)
- class attributes (+ initial values)
- instance attributes (+ initial values)
- methods (+ parameters & defaults)
- functions (+ parameters & defaults)
(Extract comments too? For example, comments at the start of a module would be a good place for bibliographic field lists.)
In order to evaluate interpreted text cross-references, namespaces for each of the above will also be required.
See the python-dev/docstring-develop thread "AST mining", started on 2001-08-14.
Docstring Extraction Rules
What to examine:
- 1a. If the "__all__" variable is present in the module being documented, only identifiers listed in "__all__" are examined for docstrings.
- 1b. In the absence of "__all__", all identifiers are examined, except those whose names are private (names begin with "_" but don't begin and end with "__").
- 1c. Rules 1a and 1b can be overridden by runtime settings.
Where:
Docstrings are string literal expressions, and are recognized in the following places within Python modules:
- 2a. At the beginning of a module, function definition, class definition, or method definition, after any comments. This is the standard for Python __doc__ attributes.
- 2b. Immediately following a simple assignment at the top level of a module, class definition, or __init__ method definition, after any comments. See Attribute Docstrings below.
- 2c. Additional string literals found immediately after the docstrings in 2a and 2b will be recognized, extracted, and concatenated. See Additional Docstrings below.
- 2d. @@@ 2.2-style "properties" with attribute docstrings? Wait for syntax?
How:
Whenever possible, Python modules should be parsed by Docutils, not imported. There are several reasons:
- Importing untrusted code is inherently insecure.
- Information from the source is lost when using introspection to examine an imported module, such as comments and the order of definitions.
- Docstrings are to be recognized in places where the byte-code compiler ignores string literal expressions (2b and 2c above), meaning importing the module will lose these docstrings.
Of course, standard Python parsing tools such as the "parser" library module should be used.
When the Python source code for a module is not available (i.e. only the .pyc file exists) or for C extension modules, to access docstrings the module can only be imported, and any limitations must be lived with.
Since attribute docstrings and additional docstrings are ignored by the Python byte-code compiler, no namespace pollution or runtime bloat will result from their use. They are not assigned to __doc__ or to any other attribute. The initial parsing of a module may take a slight performance hit.
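As an illustration of parsing rather than importing, here is a minimal sketch using the `ast` module from the modern standard library (the "parser" module suggested above was eventually removed from Python; this sketch is not the actual Docutils code):

```python
import ast

def extract_docstrings(source):
    # Parse the source text; nothing is imported or executed.
    tree = ast.parse(source)
    found = {"<module>": ast.get_docstring(tree)}
    # Collect conventional (2a-style) docstrings from definitions.
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                             ast.ClassDef)):
            found[node.name] = ast.get_docstring(node)
    return found
```

Because the module is never imported, this works on untrusted code and preserves access to string literals the byte-code compiler would discard.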
Attribute Docstrings
(This is a simplified version of PEP 224 [2].)
A string literal immediately following an assignment statement is interpreted by the docstring extraction machinery as the docstring of the target of the assignment statement, under the following conditions:
1. The assignment must be in one of the following contexts:
- 1a. At the top level of a module (i.e., not nested inside a compound statement such as a loop or conditional): a module attribute.
- 1b. At the top level of a class definition: a class attribute.
- 1c. At the top level of the "__init__" method definition of a class: an instance attribute. Instance attributes assigned in other methods are assumed to be implementation details. (@@@ __new__ methods?)
- 1d. A function attribute assignment at the top level of a module or class definition.
Since each of the above contexts is at the top level (i.e., in the outermost suite of a definition), it may be necessary to place dummy assignments for attributes assigned conditionally or in a loop.
2. The assignment must be to a single target, not to a list or a tuple of targets.
3. The form of the target:
- 3a. For contexts 1a and 1b above, the target must be a simple identifier (not a dotted identifier, a subscripted expression, or a sliced expression).
- 3b. For context 1c above, the target must be of the form "self.attrib", where "self" matches the "__init__" method's first parameter (the instance parameter) and "attrib" is a simple identifier as in 3a.
- 3c. For context 1d above, the target must be of the form "name.attrib", where "name" matches an already-defined function or method name and "attrib" is a simple identifier as in 3a.
Blank lines may be used after attribute docstrings to emphasize the connection between the assignment and the docstring.
Examples:
    g = 'module attribute (module-global variable)'
    """This is g's docstring."""

    class AClass:

        c = 'class attribute'
        """This is AClass.c's docstring."""

        def __init__(self):
            """Method __init__'s docstring."""

            self.i = 'instance attribute'
            """This is self.i's docstring."""

    def f(x):
        """Function f's docstring."""
        return x**2

    f.a = 1
    """Function attribute f.a's docstring."""
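A rough sketch of how the conditions above might be detected, restricted to class attributes (contexts 1b/3a), using the modern `ast` module. This is a hypothetical helper for illustration, not the actual Docutils extraction machinery:

```python
import ast

def class_attribute_docstrings(source):
    # Find "assignment followed by a string literal" pairs at the
    # top level of each class body.
    docs = {}
    tree = ast.parse(source)
    for cls in [n for n in tree.body if isinstance(n, ast.ClassDef)]:
        for prev, node in zip(cls.body, cls.body[1:]):
            if (isinstance(prev, ast.Assign)
                    and len(prev.targets) == 1                 # single target (rule 2)
                    and isinstance(prev.targets[0], ast.Name)  # simple identifier (3a)
                    and isinstance(node, ast.Expr)
                    and isinstance(node.value, ast.Constant)
                    and isinstance(node.value.value, str)):
                docs[prev.targets[0].id] = node.value.value
    return docs
```

Extending this to module attributes, instance attributes ("self.attrib" in __init__), and function attributes would follow the same pattern with different target checks.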
Additional Docstrings
(This idea was adapted from PEP 216 [3].)
Many programmers would like to make extensive use of docstrings for API documentation. However, docstrings do take up space in the running program, so some programmers are reluctant to "bloat up" their code. Also, not all API documentation is applicable to interactive environments, where __doc__ would be displayed.
Docutils' docstring extraction tools will concatenate all string literal expressions which appear at the beginning of a definition or after a simple assignment. Only the first string in each definition will be available as __doc__, and can be used for brief usage text suitable for interactive sessions; subsequent string literals and all attribute docstrings are ignored by the Python byte-code compiler and may contain more extensive API information.
Example:
    def function(arg):
        """This is __doc__, function's docstring."""
        """
        This is an additional docstring, ignored by the byte-code
        compiler, but extracted by Docutils.
        """
        pass
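The effect is easy to verify at runtime: only the first string literal is kept as __doc__, and the additional string never appears in the compiled function:

```python
def function(arg):
    """This is __doc__, function's docstring."""
    """Additional docstring, ignored by the byte-code compiler."""
    return arg

# Only the first string literal survives as __doc__.
assert function.__doc__ == "This is __doc__, function's docstring."
```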
Issue: from __future__ import
Multiple module docstrings (the main docstring plus additional docstring(s)) would break "from __future__ import" statements, introduced in Python 2.1. The Python Reference Manual specifies:
A future statement must appear near the top of the module. The only lines that can appear before a future statement are:
- the module docstring (if any),
- comments,
- blank lines, and
- other future statements.
Resolution?
- Should we search for docstrings after a __future__ statement? Very ugly.
- Redefine __future__ statements to allow multiple preceding string literals?
- Or should we not even worry about this? There probably shouldn't be __future__ statements in production code, after all. Perhaps modules with __future__ statements will simply have to put up with the single-docstring limitation.
Choice of Docstring Format
Rather than force everyone to use a single docstring format, multiple input formats are allowed by the processing system. A special variable, __docformat__, may appear at the top level of a module before any function or class definitions. Over time or through decree, a standard format or set of formats should emerge.
A module's __docformat__ variable only applies to the objects defined in the module's file. In particular, the __docformat__ variable in a package's __init__.py file does not apply to objects defined in subpackages and submodules.
The __docformat__ variable is a string containing the name of the format being used, a case-insensitive string matching the input parser's module or package name (i.e., the same name as required to "import" the module or package), or a registered alias. If no __docformat__ is specified, the default format is "plaintext" for now; this may be changed to the standard format if one is ever established.
The __docformat__ string may contain an optional second field, separated from the format name (first field) by a single space: a case-insensitive language identifier as defined in RFC 1766. A typical language identifier consists of a 2-letter language code from ISO 639 [11] (3-letter codes used only if no 2-letter code exists; RFC 1766 is currently being revised to allow 3-letter codes). If no language identifier is specified, the default is "en" for English. The language identifier is passed to the parser and can be used for language-dependent markup features.
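The two-field rule above can be sketched as a small parser. This is a hypothetical helper for illustration; Docutils' actual handling of __docformat__ may differ:

```python
def parse_docformat(value):
    # First field: format name; optional second field: an RFC 1766
    # language tag, separated by a single space. Both fields are
    # case-insensitive; defaults are "plaintext" and "en".
    if not value:
        return "plaintext", "en"
    fields = value.lower().split(" ", 1)
    fmt = fields[0]
    lang = fields[1] if len(fields) > 1 else "en"
    return fmt, lang
```

For example, a module containing `__docformat__ = "reStructuredText de"` would be parsed with the reStructuredText parser using German language mappings.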
Identifier Cross-References
In Python docstrings, interpreted text is used to classify and mark up program identifiers, such as the names of variables, functions, classes, and modules. If the identifier alone is given, its role is inferred implicitly according to the Python namespace lookup rules. For functions and methods (even when dynamically assigned), parentheses ('()') may be included:
This function uses `another()` to do its work.
For class, instance and module attributes, dotted identifiers are used when necessary. For example (using reStructuredText markup):
    class Keeper(Storer):

        """
        Extend `Storer`.  Class attribute `instances` keeps track
        of the number of `Keeper` objects instantiated.
        """

        instances = 0
        """How many `Keeper` objects are there?"""

        def __init__(self):
            """
            Extend `Storer.__init__()` to keep track of instances.
            Keep count in `Keeper.instances`, data in `self.data`.
            """
            Storer.__init__(self)
            Keeper.instances += 1
            self.data = []
            """Store data in a list, most recent last."""

        def store_data(self, data):
            """
            Extend `Storer.store_data()`; append new `data` to a
            list (in `self.data`).
            """
            self.data = data
Each of the identifiers quoted with backquotes ("`") will become references to the definitions of the identifiers themselves.
Stylist Transforms
Stylist transforms are specialized transforms specific to the PySource Reader. The PySource Reader doesn't have to make any decisions as to style; it just produces a logically constructed document tree, parsed and linked, including custom node types. Stylist transforms understand the custom nodes created by the Reader and convert them into standard Docutils nodes.
Multiple Stylist transforms may be implemented and one can be chosen at runtime (through a "--style" or "--stylist" command-line option). Each Stylist transform implements a different layout or style; thus the name. They decouple the context-understanding part of the Reader from the layout-generating part of processing, resulting in a more flexible and robust system. This also serves to "separate style from content", the SGML/XML ideal.
By keeping the piece of code that does the styling small and modular, it becomes much easier for people to roll their own styles. The "barrier to entry" is too high with existing tools; extracting the stylist code will lower the barrier considerably.
References and Footnotes
| [1] | PEP 256, Docstring Processing System Framework, Goodger (http://www.python.org/dev/peps/pep-0256/) |
| [2] | PEP 224, Attribute Docstrings, Lemburg (http://www.python.org/dev/peps/pep-0224/) |
| [3] | PEP 216, Docstring Format, Zadka (http://www.python.org/dev/peps/pep-0216/) |
| [4] | http://docutils.sourceforge.net/docs/user/tools.html |
| [5] | http://docutils.sourceforge.net/docs/ref/docutils.dtd |
| [6] | http://docutils.sourceforge.net/docs/ref/soextblx.dtd |
| [7] | http://docutils.sourceforge.net/docs/ref/doctree.html |
| [8] | http://www.openvms.compaq.com:8000/73final/5841/841pro_027.html#error_cond_severity |
| [9] | http://logging.apache.org/log4j/docs/index.html |
| [10] | http://docutils.sourceforge.net/docs/dev/pysource.dtd |
| [11] | http://lcweb.loc.gov/standards/iso639-2/englangn.html |
| [12] | http://www.python.org/sigs/doc-sig/ |
Project Web Site
A SourceForge project has been set up for this work at http://docutils.sourceforge.net/.
Copyright
This document has been placed in the public domain.
Acknowledgements
This document borrows ideas from the archives of the Python Doc-SIG [12]. Thanks to all members past & present.
pep-0259 Omit printing newline after newline
| PEP: | 259 |
|---|---|
| Title: | Omit printing newline after newline |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Guido van Rossum <guido at python.org> |
| Status: | Rejected |
| Type: | Standards Track |
| Created: | 11-Jun-2001 |
| Python-Version: | 2.2 |
| Post-History: | 11-Jun-2001 |
Abstract
Currently, the print statement always appends a newline, unless a
trailing comma is used. This means that if we want to print data
that already ends in a newline, we get two newlines, unless
special precautions are taken.
I propose to skip printing the newline when it follows a newline
that came from data.
In order to avoid having to add yet another magic variable to file
objects, I propose to give the existing 'softspace' variable an
extra meaning: a negative value will mean "the last data written
ended in a newline so no space *or* newline is required."
Problem
When printing data that resembles the lines read from a file using
a simple loop, double-spacing occurs unless special care is taken:
    >>> for line in open("/etc/passwd").readlines():
    ...     print line
    ...
    root:x:0:0:root:/root:/bin/bash

    bin:x:1:1:bin:/bin:

    daemon:x:2:2:daemon:/sbin:

    (etc.)

    >>>
While there are easy work-arounds, this is often noticed only
during testing and requires an extra edit-test roundtrip; the
fixed code is uglier and harder to maintain.
Proposed Solution
In the PRINT_ITEM opcode in ceval.c, when a string object is
printed, a check is already made that looks at the last character
of that string. Currently, if that last character is a whitespace
character other than space, the softspace flag is reset to zero;
this suppresses the space between two items if the first item is a
string ending in newline, tab, etc. (but not when it ends in a
space). Otherwise the softspace flag is set to one.
The proposal changes this test slightly so that softspace is set
to:
-1 -- if the last object written is a string ending in a
newline
0 -- if the last object written is a string ending in a
whitespace character that's neither space nor newline
1 -- in all other cases (including the case when the last
object written is an empty string or not a string)
Then, in the PRINT_NEWLINE opcode, printing of the newline is
suppressed if the value of softspace is negative; in any case the
softspace flag is reset to zero.
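For strings, the proposed rules can be modeled in pure Python. This is an illustrative sketch of the proposal, not the ceval.c implementation; non-string objects, which always set softspace to 1, are left out:

```python
def soft_print(strings, state):
    # Emulate one 'print a, b, c' statement under the proposal,
    # returning the text it would emit. state["softspace"] carries
    # over between calls, as file.softspace would.
    out = []
    for text in strings:
        if state["softspace"] > 0:
            out.append(" ")          # space between items
        out.append(text)
        if text.endswith("\n"):
            state["softspace"] = -1  # proposed new value
        elif text and text[-1].isspace() and text[-1] != " ":
            state["softspace"] = 0   # ends in tab, etc.: no space
        else:
            state["softspace"] = 1
    if state["softspace"] >= 0:      # PRINT_NEWLINE, possibly suppressed
        out.append("\n")
    state["softspace"] = 0           # reset in any case
    return "".join(out)
```

Under this model, printing a string that already ends in a newline emits no second newline, which is exactly the behavior the PEP proposes (and the community rejected).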
Scope
This only affects printing of 8-bit strings. It doesn't affect
Unicode, although that could be considered a bug in the Unicode
implementation. It doesn't affect other objects whose string
representation happens to end in a newline character.
Risks
This change breaks some existing code. For example:
    print "Subject: PEP 259\n"
    print message_body
In current Python, this produces a blank line separating the
subject from the message body; with the proposed change, the body
begins immediately below the subject. This is not very robust
code anyway; it is better written as
    print "Subject: PEP 259"
    print
    print message_body
In the test suite, only test_StringIO (which explicitly tests for
this feature) breaks.
Implementation
A patch relative to current CVS is here:
http://sourceforge.net/tracker/index.php?func=detail&aid=432183&group_id=5470&atid=305470
Rejected
The user community unanimously rejected this, so I won't pursue
this idea any further. Frequently heard arguments against
included:
- It is likely to break thousands of CGI scripts.
- Enough magic already (also: no more tinkering with 'print'
please).
Copyright
This document has been placed in the public domain.
pep-0260 Simplify xrange()
| PEP: | 260 |
|---|---|
| Title: | Simplify xrange() |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Guido van Rossum <guido at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 26-Jun-2001 |
| Python-Version: | 2.2 |
| Post-History: | 26-Jun-2001 |
Abstract
This PEP proposes to strip the xrange() object from some rarely
used behavior like x[i:j] and x*n.
Problem
The xrange() function has one idiomatic use:
for i in xrange(...): ...
However, the xrange() object has a bunch of rarely used behaviors
that attempt to make it more sequence-like. These are so rarely
used that historically they have had serious bugs (e.g. off-by-one
errors) that went undetected for several releases.
I claim that it's better to drop these unused features. This will
simplify the implementation, testing, and documentation, and
reduce maintenance and code size.
Proposed Solution
I propose to strip the xrange() object to the bare minimum. The
only retained sequence behaviors are x[i], len(x), and repr(x).
In particular, these behaviors will be dropped:
x[i:j] (slicing)
x*n, n*x (sequence-repeat)
cmp(x1, x2) (comparisons)
i in x (containment test)
x.tolist() method
x.start, x.stop, x.step attributes
I also propose to change the signature of the PyRange_New() C API
to remove the 4th argument (the repetition count).
By implementing a custom iterator type, we could speed up the
common use, but this is optional (the default sequence iterator
does just fine).
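The stripped-down behavior can be sketched as a class. This is a hypothetical modern-Python illustration (the actual change was made in C inside the interpreter):

```python
class MinimalXRange:
    # Supports only x[i], len(x) and repr(x); no slicing, repeat,
    # comparison, containment, tolist(), or public start/stop/step
    # attributes, per the proposal.
    def __init__(self, start, stop=None, step=1):
        if stop is None:
            start, stop = 0, start
        self._start, self._stop, self._step = start, stop, step

    def __len__(self):
        span = self._stop - self._start
        if self._step > 0:
            return max(0, (span + self._step - 1) // self._step)
        return max(0, (span + self._step + 1) // self._step)

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError("index out of range")
        return self._start + i * self._step

    def __repr__(self):
        return "MinimalXRange(%r, %r, %r)" % (
            self._start, self._stop, self._step)
```

Note that `for i in ...` still works: the default sequence iterator calls `__getitem__` with 0, 1, 2, ... until IndexError, which is the "default sequence iterator does just fine" point made above.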
Scope
This PEP affects the xrange() built-in function and the
PyRange_New() C API.
Risks
Somebody's code could be relying on the extended code, and this
code would break. However, given that historically bugs in the
extended code have gone undetected for so long, it's unlikely that
much code is affected.
Transition
For backwards compatibility, the existing functionality will still
be present in Python 2.2, but will trigger a warning. A year
after Python 2.2 final is released (probably in 2.4) the
functionality will be ripped out.
Copyright
This document has been placed in the public domain.
pep-0261 Support for "wide" Unicode characters
| PEP: | 261 |
|---|---|
| Title: | Support for "wide" Unicode characters |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Paul Prescod <paul at prescod.net> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 27-Jun-2001 |
| Python-Version: | 2.2 |
| Post-History: | 27-Jun-2001 |
Abstract
Python 2.1 unicode characters can have ordinals only up to 2**16 - 1.
This range corresponds to a range in Unicode known as the Basic
Multilingual Plane. There are now characters in Unicode that live
on other "planes". The largest addressable character in Unicode
has the ordinal 17 * 2**16 - 1 (0x10ffff). For readability, we
will call this TOPCHAR and call characters in this range "wide
characters".
Glossary
Character
Used by itself, means the addressable units of a Python
Unicode string.
Code point
A code point is an integer between 0 and TOPCHAR.
If you imagine Unicode as a mapping from integers to
characters, each integer is a code point. But the
integers between 0 and TOPCHAR that do not map to
characters are also code points. Some will someday
be used for characters. Some are guaranteed never
to be used for characters.
Codec
A set of functions for translating between physical
encodings (e.g. on disk or coming in from a network)
into logical Python objects.
Encoding
Mechanism for representing abstract characters in terms of
physical bits and bytes. Encodings allow us to store
Unicode characters on disk and transmit them over networks
in a manner that is compatible with other Unicode software.
Surrogate pair
Two physical characters that represent a single logical
character. Part of a convention for representing 32-bit
code points in terms of two 16-bit code points.
Unicode string
A Python type representing a sequence of code points with
"string semantics" (e.g. case conversions, regular
expression compatibility, etc.) Constructed with the
unicode() function.
Proposed Solution
One solution would be to merely increase the maximum ordinal
to a larger value. Unfortunately the only straightforward
implementation of this idea is to use 4 bytes per character.
This has the effect of doubling the size of most Unicode
strings. In order to avoid imposing this cost on every
user, Python 2.2 will allow the 4-byte implementation as a
build-time option. Users can choose whether they care about
wide characters or prefer to preserve memory.
The 4-byte option is called "wide Py_UNICODE". The 2-byte option
is called "narrow Py_UNICODE".
Most things will behave identically in the wide and narrow worlds.
* unichr(i) for 0 <= i < 2**16 (0x10000) always returns a
length-one string.
* unichr(i) for 2**16 <= i <= TOPCHAR will return a
length-one string on wide Python builds. On narrow builds it will
raise ValueError.
ISSUE
Python currently allows \U literals that cannot be
represented as a single Python character. It generates two
Python characters known as a "surrogate pair". Should this
be disallowed on future narrow Python builds?
Pro:
Python already allows the construction of a surrogate pair
for a large Unicode literal character escape sequence.
This is basically designed as a simple way to construct
"wide characters" even in a narrow Python build. It is also
somewhat logical considering that the Unicode-literal syntax
is basically a short-form way of invoking the unicode-escape
codec.
Con:
Surrogates could be easily created this way but the user
still needs to be careful about slicing, indexing, printing
etc. Therefore some have suggested that Unicode
literals should not support surrogates.
ISSUE
Should Python allow the construction of characters that do
not correspond to Unicode code points? Unassigned Unicode
code points should obviously be legal (because they could
be assigned at any time). But code points above TOPCHAR are
guaranteed never to be used by Unicode. Should we allow access
to them anyhow?
Pro:
If a Python user thinks they know what they're doing why
should we try to prevent them from violating the Unicode
spec? After all, we don't stop 8-bit strings from
containing non-ASCII characters.
Con:
Codecs and other Unicode-consuming code will have to be
careful of these characters which are disallowed by the
Unicode specification.
* ord() is always the inverse of unichr()
* There is an integer value in the sys module that describes the
largest ordinal for a character in a Unicode string on the current
interpreter. sys.maxunicode is 2**16-1 (0xffff) on narrow builds
of Python and TOPCHAR on wide builds.
ISSUE: Should there be distinct constants for accessing
TOPCHAR and the real upper bound for the domain of
unichr (if they differ)? There has also been a
suggestion of sys.unicodewidth which can take the
values 'wide' and 'narrow'.
* every Python Unicode character represents exactly one Unicode code
point (i.e. Python Unicode Character = Abstract Unicode character).
* codecs will be upgraded to support "wide characters"
(represented directly in UCS-4, and as variable-length sequences
in UTF-8 and UTF-16). This is the main part of the implementation
left to be done.
* There is a convention in the Unicode world for encoding a 32-bit
code point in terms of two 16-bit code points. These are known
as "surrogate pairs". Python's codecs will adopt this convention
and encode 32-bit code points as surrogate pairs on narrow Python
builds.
ISSUE
Should there be a way to tell codecs not to generate
surrogates and instead treat wide characters as
errors?
Pro:
I might want to write code that works only with
fixed-width characters and does not have to worry about
surrogates.
Con:
No clear proposal of how to communicate this to codecs.
* there are no restrictions on constructing strings that use
code points "reserved for surrogates" improperly. These are
called "isolated surrogates". The codecs should disallow reading
these from files, but you could construct them using string
literals or unichr().
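The surrogate-pair convention mentioned above maps each supplementary code point to two 16-bit units. The arithmetic, per the Unicode standard, sketched here as hypothetical helper functions:

```python
def to_surrogate_pair(cp):
    # Valid only for supplementary-plane code points
    # (0x10000 through TOPCHAR, 0x10FFFF).
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    cp -= 0x10000
    # High surrogate carries the top 10 bits, low surrogate the rest.
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

def from_surrogate_pair(high, low):
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

On narrow builds, Python's UTF-16 codec applies exactly this kind of mapping when encoding wide characters.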
Implementation
There is a new define:
#define Py_UNICODE_SIZE 2
To test whether UCS2 or UCS4 is in use, the derived macro
Py_UNICODE_WIDE should be used, which is defined when UCS-4 is in
use.
There is a new configure option:
--enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses
wchar_t if it fits
--enable-unicode=ucs4 configures a wide Py_UNICODE, and uses
wchar_t if it fits
--enable-unicode same as "=ucs2"
--disable-unicode entirely removes the Unicode functionality.
It is also proposed that one day --enable-unicode will just
default to the width of your platform's wchar_t.
Windows builds will be narrow for a while, based on the fact that
there have been few requests for wide characters, that those requests
are mostly from hard-core programmers with the ability to buy
their own Python, and that Windows itself is strongly biased towards
16-bit characters.
Notes
This PEP does NOT imply that people using Unicode need to use a
4-byte encoding for their files on disk or sent over the network.
It only allows them to do so. For example, ASCII is still a
legitimate (7-bit) Unicode-encoding.
It has been proposed that there should be a module that handles
surrogates in narrow Python builds for programmers. If someone
wants to implement that, it will be another PEP. It might also be
combined with features that allow other kinds of character-,
word- and line- based indexing.
Rejected Suggestions
More or less the status-quo
We could officially say that Python characters are 16-bit and
require programmers to implement wide characters in their
application logic by combining surrogate pairs. This is a heavy
burden because emulating 32-bit characters is likely to be
very inefficient if it is coded entirely in Python. Plus these
abstracted pseudo-strings would not be legal as input to the
regular expression engine.
"Space-efficient Unicode" type
Another class of solution is to use some efficient storage
internally but present an abstraction of wide characters to
the programmer. Any of these would require a much more complex
implementation than the accepted solution. For instance consider
the impact on the regular expression engine. In theory, we could
move to this implementation in the future without breaking Python
code. A future Python could "emulate" wide Python semantics on
narrow Python. Guido is not willing to undertake the
implementation right now.
Two types
We could introduce a 32-bit Unicode type alongside the 16-bit
type. There is a lot of code that expects there to be only a
single Unicode type.
This PEP represents the least-effort solution. Over the next
several years, 32-bit Unicode characters will become more common
and that may either convince us that we need a more sophisticated
solution or (on the other hand) convince us that simply
mandating wide Unicode characters is an appropriate solution.
Right now the two options on the table are do nothing or do
this.
References
Unicode Glossary: http://www.unicode.org/glossary/
Copyright
This document has been placed in the public domain.
pep-0262 A Database of Installed Python Packages
| PEP: | 262 |
|---|---|
| Title: | A Database of Installed Python Packages |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | A.M. Kuchling <amk at amk.ca> |
| Status: | Deferred |
| Type: | Standards Track |
| Created: | 08-Jul-2001 |
| Post-History: | 27-Mar-2002 |
Introduction
This PEP describes a format for a database of the Python software
installed on a system.
(In this document, the term "distribution" is used to mean a set
of code that's developed and distributed together. A "distribution"
is the same as a Red Hat or Debian package, but the term "package"
already has a meaning in Python terminology, meaning "a directory
with an __init__.py file in it.")
Requirements
We need a way to figure out what distributions, and what versions of
those distributions, are installed on a system. We want to provide
features similar to CPAN, APT, or RPM. Required use cases that
should be supported are:
* Is distribution X on a system?
* What version of distribution X is installed?
* Where can the new version of distribution X be found? (This can
be defined as either "a home page where the user can go and
find a download link" or "a place where a program can find
the newest version"; both should probably be supported.)
* What files did distribution X put on my system?
* What distribution did the file x/y/z.py come from?
* Has anyone modified x/y/z.py locally?
* What other distributions does this software need?
* What Python modules does this distribution provide?
Database Location
The database lives in a bunch of files under
<prefix>/lib/python<version>/install-db/. This location will be
called INSTALLDB through the remainder of this PEP.
The structure of the database is deliberately kept simple; each
file in this directory or its subdirectories (if any) describes a
single distribution. Binary packagings of Python software such as
RPMs can then update Python's database by just installing the
corresponding file into the INSTALLDB directory.
The rationale for scanning subdirectories is that we can move to a
directory-based indexing scheme if the database directory contains
too many entries. For example, this would let us transparently
switch from INSTALLDB/Numeric to INSTALLDB/N/Nu/Numeric or some
similar hashing scheme.
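The transparent switch described above could be as small as a path-computation change; a hypothetical sketch (the helper names and the exact hashing layout are invented here, not specified by the PEP):

```python
import os.path

# Hypothetical helpers contrasting the flat layout with the hashed
# layout from the PEP's example; neither is specified normatively.
def flat_path(installdb, name):
    # INSTALLDB/Numeric
    return os.path.join(installdb, name)

def hashed_path(installdb, name):
    # INSTALLDB/N/Nu/Numeric
    return os.path.join(installdb, name[:1], name[:2], name)

assert hashed_path("INSTALLDB", "Numeric").split(os.sep) == \
    ["INSTALLDB", "N", "Nu", "Numeric"]
```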
Database Contents
Each file in INSTALLDB or its subdirectories describes a single
distribution, and has the following contents:
An initial line listing the sections in this file, separated
by whitespace. Currently this will always be 'PKG-INFO FILES
REQUIRES PROVIDES'. This is for future-proofing; if we add a
new section, for example to list documentation files, then
we'd add a DOCS section and list it in the contents. Sections
are always separated by blank lines.
A distribution that uses the Distutils for installation should
automatically update the database. Distributions that roll their
own installation will have to use the database's API to
manually add or update their own entry. System package managers
such as RPM or pkgadd can just create the new file in the
INSTALLDB directory.
Each section of the file is used for a different purpose.
PKG-INFO section
An initial set of RFC-822 headers containing the distribution
information for a file, as described in PEP 241, "Metadata for
Python Software Packages".
FILES section
An entry for each file installed by the
distribution. Generated files such as .pyc and .pyo files are
on this list as well as the original .py files installed by a
distribution; their checksums won't be stored or checked,
though.
Each file's entry is a single tab-delimited line that contains
the following fields:
* The file's full path, as installed on the system.
* The file's size
* The file's permissions. On Windows, this field will always be
'unknown'
* The owner and group of the file, separated by a tab.
On Windows, these fields will both be 'unknown'.
* A SHA1 digest of the file, encoded in hex. For generated files
such as *.pyc files, this field must contain the string "-",
which indicates that the file's checksum should not be verified.
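A sketch of reading one FILES-section entry with the field layout described above (the helper name, dict keys, and sample data are illustrative, not part of the PEP):

```python
# Parse one tab-delimited FILES line: path, size, permissions, owner,
# group, SHA1 digest ("-" for generated files such as .pyc).
def parse_files_line(line):
    path, size, perms, owner, group, digest = line.rstrip("\n").split("\t")
    return {"path": path, "size": int(size), "perms": perms,
            "owner": owner, "group": group, "digest": digest}

entry = parse_files_line(
    "/usr/lib/python2.2/xmllib.py\t36851\t0644\troot\troot\t"
    "da39a3ee5e6b4b0d3255bfef95601890afd80709")
assert entry["size"] == 36851
assert entry["owner"] == "root"
```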
REQUIRES section
This section is a list of strings giving the services required for
this module distribution to run properly. This list includes the
distribution name ("python-stdlib") and module names ("rfc822",
"htmllib", "email", "email.Charset"). It will be specified
by an extra 'requires' argument to the distutils.core.setup()
function. For example:
setup(..., requires=['xml.utils.iso8601'])
Eventually there may be automated tools that look through all of
the code and produce a list of requirements, but it's unlikely
that these tools can handle all possible cases; a manual
way to specify requirements will always be necessary.
PROVIDES section
This section is a list of strings giving the services provided by
an installed distribution. This list includes the distribution name
("python-stdlib") and module names ("rfc822", "htmllib", "email",
"email.Charset").
XXX should files be listed? e.g. $PREFIX/lib/color-table.txt,
to pick up data files, required scripts, etc.
Eventually there may be an option to let module developers add
their own strings to this section. For example, you might add
"XML parser" to this section, and other module distributions could
then list "XML parser" as one of their dependencies to indicate
that multiple different XML parsers can be used. For now this
ability isn't supported because it raises too many issues: do we
need a central registry of legal strings, or just let people put
whatever they like? Etc., etc...
API Description
There's a single fundamental class, InstallationDatabase. The
code for it lives in distutils/install_db.py. (XXX any
suggestions for alternate locations in the standard library, or an
alternate module name?)
The InstallationDatabase returns instances of Distribution that contain
all the information about an installed distribution.
XXX Several of the fields in Distribution are duplicates of ones in
distutils.dist.Distribution. Probably they should be factored out
into the Distribution class proposed here, but can this be done in a
backward-compatible way?
InstallationDatabase has the following interface:
class InstallationDatabase:
def __init__ (self, path=None):
"""InstallationDatabase(path:string)
Read the installation database rooted at the specified path.
If path is None, INSTALLDB is used as the default.
"""
def get_distribution (self, distribution_name):
"""get_distribution(distribution_name:string) : Distribution
Get the object corresponding to a single distribution.
"""
def list_distributions (self):
"""list_distributions() : [Distribution]
Return a list of all distributions installed on the system,
enumerated in no particular order.
"""
def find_distribution (self, path):
"""find_file(path:string) : Distribution
Search and return the distribution containing the file 'path'.
Returns None if the file doesn't belong to any distribution
that the InstallationDatabase knows about.
XXX should this work for directories?
"""
class Distribution:
"""Instance attributes:
name : string
Distribution name
files : {string : (size:int, perms:int, owner:string, group:string,
digest:string)}
Dictionary mapping the path of a file installed by this distribution
to information about the file.
The following fields all come from PEP 241.
version : distutils.version.Version
Version of this distribution
platform : [string]
summary : string
description : string
keywords : string
home_page : string
author : string
author_email : string
license : string
"""
def add_file (self, path):
"""add_file(path:string):None
Record the size, ownership, &c., information for an installed file.
XXX as written, this would stat() the file. Should the size/perms/
checksum all be provided as parameters to this method instead?
"""
def has_file (self, path):
"""has_file(path:string) : Boolean
Returns true if the specified path belongs to a file in this
distribution.
"""
def check_file (self, path):
"""check_file(path:string) : Boolean
Checks whether the file's size, checksum, and ownership match,
returning true if they do.
"""
Deliverables
A description of the database API, to be added to this PEP.
Patches to the Distutils that 1) implement an InstallationDatabase
class, 2) Update the database when a new distribution is installed. 3)
add a simple package management tool, features to be added to this
PEP. (Or should that be a separate PEP?) See [2] for the current
patch.
Open Issues
PJE suggests the installation database "be potentially present on
every directory in sys.path, with the contents merged in sys.path
order. This would allow home-directory or other
alternate-location installs to work, and ease the process of a
distutils install command writing the file." Nice feature: it does
mean that package manager tools can take into account Python
packages that a user has privately installed.
AMK wonders: what does setup.py do if it's told to install
packages to a directory not on sys.path? Does it write an
install-db directory to the directory it's told to write to, or
does it do nothing?
Should the package-database file itself be included in the files
list? (PJE would think yes, but of course it can't contain its
own checksum. AMK can't think of a use case where including the
DB file matters.)
PJE wonders about writing the package DB file
*first*, before installing any other files, so that failed partial
installations can both be backed out, and recognized as broken.
This PEP may have to specify some algorithm for recognizing this
situation.
Should we guarantee the format of installation databases remains
compatible across Python versions, or is it subject to arbitrary
change? Probably we need to guarantee compatibility.
Rejected Suggestions
Instead of using one text file per distribution, one large text
file or an anydbm file could be used. This has been rejected for
a few reasons. First, performance is probably not an extremely
pressing concern as the database is only used when installing or
removing software, a relatively infrequent task. Scalability also
likely isn't a problem, as people may have hundreds of Python
packages installed, but thousands or tens of thousands seems
unlikely. Finally, individual text files are compatible with
installers such as RPM or DPKG because a binary packager can just
drop the new database file into the database directory. If one
large text file or a binary file were used, the Python database
would then have to be updated by running a postinstall script.
On Windows, the permissions and owner/group of a file aren't
stored. Windows does in fact support ownership and access
permissions, but reading and setting them requires the win32all
extensions, and they aren't present in the basic Python installer
for Windows.
References
[1] Michael Muller's patch (posted to the Distutils-SIG around 28
Dec 1999) generates a list of installed files.
[2] A patch to implement this PEP will be tracked as
patch #562100 on SourceForge.
http://www.python.org/sf/562100 .
Code implementing the installation database is currently in
Python CVS in the nondist/sandbox/pep262 directory.
Acknowledgements
Ideas for this PEP originally came from postings by Greg Ward,
Fred L. Drake Jr., Thomas Heller, Mats Wichmann, Phillip J. Eby,
and others.
Many changes and rewrites to this document were suggested by the
readers of the Distutils SIG.
Copyright
This document has been placed in the public domain.
pep-0263 Defining Python Source Code Encodings
| PEP: | 263 |
|---|---|
| Title: | Defining Python Source Code Encodings |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Marc-André Lemburg <mal at lemburg.com>, Martin von Löwis <martin at v.loewis.de> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 06-Jun-2001 |
| Python-Version: | 2.3 |
| Post-History: |
Abstract
This PEP proposes to introduce a syntax to declare the encoding of
a Python source file. The encoding information is then used by the
Python parser to interpret the file using the given encoding. Most
notably this enhances the interpretation of Unicode literals in
the source code and makes it possible to write Unicode literals
using e.g. UTF-8 directly in a Unicode-aware editor.
Problem
In Python 2.1, Unicode literals can only be written using the
Latin-1 based encoding "unicode-escape". This makes the
programming environment rather unfriendly to Python users who live
and work in non-Latin-1 locales such as many of the Asian
countries. Programmers can write their 8-bit strings using their
favorite encoding, but are bound to the "unicode-escape" encoding
for Unicode literals.
Proposed Solution
I propose to make the Python source code encoding both visible and
changeable on a per-source file basis by using a special comment
at the top of the file to declare the encoding.
To make Python aware of this encoding declaration a number of
concept changes are necessary with respect to the handling of
Python source code data.
Defining the Encoding
Python will default to ASCII as standard encoding if no other
encoding hints are given.
To define a source code encoding, a magic comment must
be placed into the source files either as first or second
line in the file, such as:
# coding=<encoding name>
or (using formats recognized by popular editors)
#!/usr/bin/python
# -*- coding: <encoding name> -*-
or
#!/usr/bin/python
# vim: set fileencoding=<encoding name> :
More precisely, the first or second line must match the regular
expression "coding[:=]\s*([-\w.]+)". The first group of this
expression is then interpreted as encoding name. If the encoding
is unknown to Python, an error is raised during compilation. There
must not be any Python statement on the line that contains the
encoding declaration.
To aid with platforms such as Windows, which add Unicode BOM marks
to the beginning of Unicode files, the UTF-8 signature
'\xef\xbb\xbf' will be interpreted as 'utf-8' encoding as well
(even if no magic encoding comment is given).
If a source file uses both the UTF-8 BOM mark signature and a
magic encoding comment, the only allowed encoding for the comment
is 'utf-8'. Any other encoding will cause an error.
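The detection rules above can be sketched as follows (the real tokenizer additionally requires the declaration to appear in a comment and enforces BOM/comment agreement; this sketch only illustrates the regex and the defaults):

```python
import re

# The declaration regex quoted above, applied to the first two lines.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")
BOM_UTF8 = b"\xef\xbb\xbf"

def detect_encoding(source_bytes):
    if source_bytes.startswith(BOM_UTF8):
        return "utf-8"        # BOM implies utf-8 even without a comment
    for line in source_bytes.splitlines()[:2]:
        m = CODING_RE.search(line.decode("latin-1"))
        if m:
            return m.group(1)
    return "ascii"            # the PEP's default when nothing is declared

assert detect_encoding(b"# -*- coding: latin-1 -*-\n") == "latin-1"
assert detect_encoding(BOM_UTF8 + b"print(1)\n") == "utf-8"
assert detect_encoding(b"print(1)\n") == "ascii"
```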
Examples
These are some examples to clarify the different styles for
defining the source code encoding at the top of a Python source
file:
1. With interpreter binary and using Emacs style file encoding
comment:
#!/usr/bin/python
# -*- coding: latin-1 -*-
import os, sys
...
#!/usr/bin/python
# -*- coding: iso-8859-15 -*-
import os, sys
...
#!/usr/bin/python
# -*- coding: ascii -*-
import os, sys
...
2. Without interpreter line, using plain text:
# This Python file uses the following encoding: utf-8
import os, sys
...
3. Text editors might have different ways of defining the file's
encoding, e.g.
#!/usr/local/bin/python
# coding: latin-1
import os, sys
...
4. Without encoding comment, Python's parser will assume ASCII
text:
#!/usr/local/bin/python
import os, sys
...
5. Encoding comments which don't work:
Missing "coding:" prefix:
#!/usr/local/bin/python
# latin-1
import os, sys
...
Encoding comment not on line 1 or 2:
#!/usr/local/bin/python
#
# -*- coding: latin-1 -*-
import os, sys
...
Unsupported encoding:
#!/usr/local/bin/python
# -*- coding: utf-42 -*-
import os, sys
...
Concepts
The PEP is based on the following concepts which would have to be
implemented to enable usage of such a magic comment:
1. The complete Python source file should use a single encoding.
Embedding of differently encoded data is not allowed and will
result in a decoding error during compilation of the Python
source code.
Any encoding which allows processing the first two lines in the
way indicated above is allowed as source code encoding, this
includes ASCII compatible encodings as well as certain
multi-byte encodings such as Shift_JIS. It does not include
encodings which use two or more bytes for all characters, such as
UTF-16. The reason for this is to keep the encoding
detection algorithm in the tokenizer simple.
2. Handling of escape sequences should continue to work as it does
now, but with all possible source code encodings, that is
standard string literals (both 8-bit and Unicode) are subject to
escape sequence expansion while raw string literals only expand
a very small subset of escape sequences.
3. Python's tokenizer/compiler combo will need to be updated to
work as follows:
1. read the file
2. decode it into Unicode assuming a fixed per-file encoding
3. convert it into a UTF-8 byte string
4. tokenize the UTF-8 content
5. compile it, creating Unicode objects from the given Unicode data
and creating string objects from the Unicode literal data
by first reencoding the UTF-8 data into 8-bit string data
using the given file encoding
Note that Python identifiers are restricted to the ASCII
subset of the encoding, and thus need no further conversion
after step 4.
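Steps 1 through 3 above can be illustrated with a Latin-1 source fragment (this is a round-trip demonstration in modern Python, not the actual C tokenizer implementation):

```python
# Step 1: raw bytes as read from a latin-1 encoded file.
raw = b"s = '\xe9t\xe9'\n"
# Step 2: decode into Unicode using the declared per-file encoding.
text = raw.decode("latin-1")
# Step 3: re-encode as UTF-8 for the tokenizer.
utf8 = text.encode("utf-8")

assert text[5:8] == "\u00e9t\u00e9"   # the literal's characters: 'été'
assert utf8.decode("utf-8") == text   # no information is lost in step 3
```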
Implementation
For backwards-compatibility with existing code which currently
uses non-ASCII in string literals without declaring an encoding,
the implementation will be introduced in two phases:
1. Allow non-ASCII in string literals and comments, by internally
treating a missing encoding declaration as a declaration of
"iso-8859-1". This will cause arbitrary byte strings to
correctly round-trip between step 2 and step 5 of the
processing, and provide compatibility with Python 2.2 for
Unicode literals that contain non-ASCII bytes.
A warning will be issued if non-ASCII bytes are found in the
input, once per improperly encoded input file.
2. Remove the warning, and change the default encoding to "ascii".
The builtin compile() API will be enhanced to accept Unicode as
input. 8-bit string input is subject to the standard procedure for
encoding detection as described above.
If a Unicode string with a coding declaration is passed to compile(),
a SyntaxError will be raised.
SUZUKI Hisao is working on a patch; see [2] for details. A patch
implementing only phase 1 is available at [1].
Phases
Implementation of phases 1 and 2 above was completed in 2.3,
except for changing the default encoding to "ascii".
The default encoding was set to "ascii" in version 2.5.
Scope
This PEP intends to provide an upgrade path from the current
(more-or-less) undefined source code encoding situation to a more
robust and portable definition.
References
[1] Phase 1 implementation:
http://python.org/sf/526840
[2] Phase 2 implementation:
http://python.org/sf/534304
History
1.10 and above: see CVS history
1.8: Added '.' to the coding RE.
1.7: Added warnings to phase 1 implementation. Replaced the
Latin-1 default encoding with the interpreter's default
encoding. Added tweaks to compile().
1.4 - 1.6: Minor tweaks
1.3: Worked in comments by Martin v. Loewis:
UTF-8 BOM mark detection, Emacs style magic comment,
two phase approach to the implementation
Copyright
This document has been placed in the public domain.
pep-0264 Future statements in simulated shells
| PEP: | 264 |
|---|---|
| Title: | Future statements in simulated shells |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Michael Hudson <mwh at python.net> |
| Status: | Final |
| Type: | Standards Track |
| Requires: | 236 |
| Created: | 30-Jul-2001 |
| Python-Version: | 2.2 |
| Post-History: | 30-Jul-2001 |
Abstract
As noted in PEP 236, there is no clear way for "simulated
interactive shells" to simulate the behaviour of __future__
statements in "real" interactive shells, i.e. have __future__
statements' effects last the life of the shell.
The PEP also takes the opportunity to clean up the other
unresolved issue mentioned in PEP 236, the inability to stop
compile() inheriting the effect of future statements affecting the
code calling compile().
This PEP proposes to address the first problem by adding an
optional fourth argument to the builtin function "compile", adding
information to the _Feature instances defined in __future__.py and
adding machinery to the standard library modules "codeop" and
"code" to make the construction of such shells easy.
The second problem is dealt with by simply adding *another*
optional argument to compile(), which if non-zero suppresses the
inheriting of future statements' effects.
Specification
I propose adding a fourth, optional, "flags" argument to the
builtin "compile" function. If this argument is omitted,
there will be no change in behaviour from that of Python 2.1.
If it is present it is expected to be an integer, representing
various possible compile time options as a bitfield. The
bitfields will have the same values as the CO_* flags already used
by the C part of Python interpreter to refer to future statements.
compile() shall raise a ValueError exception if it does not
recognize any of the bits set in the supplied flags.
The flags supplied will be bitwise-"or"ed with the flags that
would be set anyway, unless the new fifth optional argument is a
non-zero integer, in which case the flags supplied will be exactly
the set used.
The above-mentioned flags are not currently exposed to Python. I
propose adding .compiler_flag attributes to the _Feature objects
in __future__.py that contain the necessary bits, so one might
write code such as:
import __future__
def compile_generator(func_def):
return compile(func_def, "<input>", "suite",
__future__.generators.compiler_flag)
A recent change means that these same bits can be used to tell if
a code object was compiled with a given feature; for instance
codeob.co_flags & __future__.generators.compiler_flag
will be non-zero if and only if the code object "codeob" was
compiled in an environment where generators were allowed.
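The co_flags test described above can be demonstrated on a current interpreter. Generators, the PEP's example, have long been enabled unconditionally, so this sketch uses the still-optional annotations feature instead:

```python
import __future__

# co_flags records which __future__ features were active at compile
# time; `annotations` is used here only because it remains optional on
# modern interpreters, unlike the PEP's `generators` example.
flag = __future__.annotations.compiler_flag
with_future = compile("from __future__ import annotations\n",
                      "<input>", "exec")
without = compile("x = 1\n", "<input>", "exec")

assert with_future.co_flags & flag        # feature active in this code
assert not (without.co_flags & flag)      # and not in this one
```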
I will also add a .all_feature_flags attribute to the __future__
module, giving a low-effort way of enumerating all the __future__
options supported by the running interpreter.
I also propose adding a pair of classes to the standard library
module codeop.
One - Compile - will sport a __call__ method which will act much
like the builtin "compile" of 2.1 with the difference that after
it has compiled a __future__ statement, it "remembers" it and
compiles all subsequent code with the __future__ option in effect.
It will do this by using the new features of the __future__ module
mentioned above.
Objects of the other class added to codeop - CommandCompiler -
will do the job of the existing codeop.compile_command function,
but in a __future__-aware way.
Finally, I propose to modify the class InteractiveInterpreter in
the standard library module code to use a CommandCompiler to
emulate still more closely the behaviour of the default Python
shell.
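Both classes were in fact added to the standard library's codeop module, so the proposed behaviour can be shown directly:

```python
import codeop

# CommandCompiler returns None while the input is incomplete, a code
# object once it parses, and remembers __future__ statements between
# calls -- exactly the shell-simulation behaviour described above.
cc = codeop.CommandCompiler()
assert cc("if True:") is None     # incomplete: shell should read more input
code = cc("x = 40 + 2")           # complete statement: ready to exec
ns = {}
exec(code, ns)
assert ns["x"] == 42
```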
Backward Compatibility
Should be very few or none; the changes to compile will make no
difference to existing code, nor will adding new functions or
classes to codeop. Existing code using
code.InteractiveInterpreter may change in behaviour, but only for
the better in that the "real" Python shell will be being better
impersonated.
Forward Compatibility
The fiddling that needs to be done to Lib/__future__.py when
adding a __future__ feature will be a touch more complicated.
Everything else should just work.
Issues
I hope the above interface is not too disruptive to implement for
Jython.
Implementation
A series of preliminary implementations are at:
http://sourceforge.net/tracker/?func=detail&atid=305470&aid=449043&group_id=5470
After light massaging by Tim Peters, they have now been checked in.
Copyright
This document has been placed in the public domain.
pep-0265 Sorting Dictionaries by Value
| PEP: | 265 |
|---|---|
| Title: | Sorting Dictionaries by Value |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Grant Griffin <g2 at iowegian.com> |
| Status: | Rejected |
| Type: | Standards Track |
| Created: | 8-Aug-2001 |
| Python-Version: | 2.2 |
| Post-History: |
Abstract
This PEP suggests a "sort by value" operation for dictionaries.
The primary benefit would be in terms of "batteries included"
support for a common Python idiom which, in its current form, is
both difficult for beginners to understand and cumbersome for all
to implement.
BDFL Pronouncement
This PEP is rejected because the need for it has been largely
fulfilled by Py2.4's sorted() builtin function:
>>> sorted(d.iteritems(), key=itemgetter(1), reverse=True)
[('b', 23), ('d', 17), ('c', 5), ('a', 2), ('e', 1)]
or for just the keys:
>>> sorted(d, key=d.__getitem__, reverse=True)
['b', 'd', 'c', 'a', 'e']
Also, Python 2.5's heapq.nlargest() function addresses the common use
case of finding only a few of the highest valued items:
>>> nlargest(2, d.iteritems(), itemgetter(1))
[('b', 23), ('d', 17)]
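In Python 3, iteritems() is gone; the pronouncement's examples can be restated as follows (a modern restatement for reference, not part of the original PEP):

```python
from heapq import nlargest
from operator import itemgetter

d = {'a': 2, 'b': 23, 'c': 5, 'd': 17, 'e': 1}

# Full sort by value, largest first:
assert sorted(d.items(), key=itemgetter(1), reverse=True) == \
    [('b', 23), ('d', 17), ('c', 5), ('a', 2), ('e', 1)]

# Just the keys:
assert sorted(d, key=d.__getitem__, reverse=True) == \
    ['b', 'd', 'c', 'a', 'e']

# Only the top few items:
assert nlargest(2, d.items(), key=itemgetter(1)) == [('b', 23), ('d', 17)]
```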
Motivation
A common use of dictionaries is to count occurrences by setting
the value of d[key] to 1 on its first occurrence, then increment
the value on each subsequent occurrence. This can be done several
different ways, but the get() method is the most succinct:
d[key] = d.get(key, 0) + 1
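Applied to a small input, the idiom yields:

```python
# The get() counting idiom above, run over a list of words.
d = {}
for key in "the quick brown fox jumps over the lazy dog the end".split():
    d[key] = d.get(key, 0) + 1

assert d["the"] == 3
assert d["fox"] == 1
```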
Once all occurrences have been counted, a common use of the
resulting dictionary is to print the occurrences in
occurrence-sorted order, often with the largest value first.
This leads to a need to sort a dictionary's items by value. The
canonical method of doing so in Python is to first use d.items()
to get a list of the dictionary's items, then invert the ordering
of each item's tuple from (key, value) into (value, key), then
sort the list; since Python sorts the list based on the first item
of the tuple, the list of (inverted) items is therefore sorted by
value. If desired, the list can then be reversed, and the tuples
can be re-inverted back to (key, value). (However, in my
experience, the inverted tuple ordering is fine for most purposes,
e.g. printing out the list.)
For example, given an occurrence count of:
>>> d = {'a':2, 'b':23, 'c':5, 'd':17, 'e':1}
we might do:
>>> items = [(v, k) for k, v in d.items()]
>>> items.sort()
>>> items.reverse() # so largest is first
>>> items = [(k, v) for v, k in items]
resulting in:
>>> items
[('b', 23), ('d', 17), ('c', 5), ('a', 2), ('e', 1)]
which shows the list in by-value order, largest first. (In this
case, 'b' was found to have the most occurrences.)
This works fine, but is "hard to use" in two aspects. First,
although this idiom is known to veteran Pythoneers, it is not at
all obvious to newbies -- either in terms of its algorithm
(inverting the ordering of item tuples) or its implementation
(using list comprehensions -- which are an advanced Python
feature.) Second, it requires repeatedly typing a lot of
"grunge", resulting in both tedium and mistakes.
We therefore would rather Python provide a method of sorting
dictionaries by value which would be both easy for newbies to
understand (or, better yet, not to _have to_ understand) and
easier for all to use.
Rationale
As Tim Peters has pointed out, this sort of thing brings on the
problem of trying to be all things to all people. Therefore, we
will limit its scope to try to hit "the sweet spot". Unusual
cases (e.g. sorting via a custom comparison function) can, of
course, be handled "manually" using present methods.
Here are some simple possibilities:
The items() method of dictionaries can be augmented with new
parameters having default values that provide for full
backwards-compatibility:
(1) items(sort_by_values=0, reversed=0)
or maybe just:
(2) items(sort_by_values=0)
since reversing a list is easy enough.
Alternatively, items() could simply let us control the (key, value)
order:
(3) items(values_first=0)
Again, this is fully backwards-compatible. It does less work than
the others, but it at least eases the most complicated/tricky part
of the sort-by-value problem: inverting the order of item tuples.
Using this is very simple:
items = d.items(1)
items.sort()
items.reverse() # (if desired)
The primary drawback of the preceding three approaches is the
additional overhead for the parameter-less "items()" case, due to
having to process default parameters. (However, if one assumes
that items() gets used primarily for creating sort-by-value lists,
this is not really a drawback in practice.)
Alternatively, we might add a new dictionary method which somehow
embodies "sorting". This approach offers two advantages. First,
it avoids adding overhead to the items() method. Second, it is
perhaps more accessible to newbies: when they go looking for a
method for sorting dictionaries, they hopefully run into this one,
and they will not have to understand the finer points of tuple
inversion and list sorting to achieve sort-by-value.
To allow the four basic possibilities of sorting by key/value and in
forward/reverse order, we could add this method:
(4) sorted_items(by_value=0, reversed=0)
I believe the most common case would actually be "by_value=1,
reversed=1", but the defaults values given here might lead to
fewer surprises by users: sorted_items() would be the same as
items() followed by sort().
Finally (as a last resort), we could use:
(5) items_sorted_by_value(reversed=0)
Implementation
The proposed dictionary methods would necessarily be implemented
in C. Presumably, the implementation would be fairly simple since
it involves just adding a few calls to Python's existing
machinery.
Concerns
Aside from the run-time overhead already addressed in
possibilities 1 through 3, concerns with this proposal probably
will fall into the categories of "feature bloat" and/or "code
bloat". However, I believe that several of the suggestions made
here will result in quite minimal bloat, resulting in a good
tradeoff between bloat and "value added".
Tim Peters has noted that implementing this in C might not be
significantly faster than implementing it in Python today.
However, the major benefits intended here are "accessibility" and
"ease of use", not "speed". Therefore, as long as it is not
noticeably slower (in the case of plain items()), speed need not be
a consideration.
References
A related thread called "counting occurrences" appeared on
comp.lang.python in August, 2001. This included examples of
approaches to systematizing the sort-by-value problem by
implementing it as reusable Python functions and classes.
Copyright
This document has been placed in the public domain.
pep-0266 Optimizing Global Variable/Attribute Access
| PEP: | 266 |
|---|---|
| Title: | Optimizing Global Variable/Attribute Access |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Skip Montanaro <skip at pobox.com> |
| Status: | Withdrawn |
| Type: | Standards Track |
| Created: | 13-Aug-2001 |
| Python-Version: | 2.3 |
| Post-History: |
Abstract
The bindings for most global variables and attributes of other
modules typically never change during the execution of a Python
program, but because of Python's dynamic nature, code which
accesses such global objects must run through a full lookup each
time the object is needed. This PEP proposes a mechanism that
allows code that accesses most global objects to treat them as
local objects and places the burden of updating references on the
code that changes the name bindings of such objects.
Introduction
Consider the workhorse function sre_compile._compile. It is the
internal compilation function for the sre module. It consists
almost entirely of a loop over the elements of the pattern being
compiled, comparing opcodes with known constant values and
appending tokens to an output list. Most of the comparisons are
with constants imported from the sre_constants module. This means
there are lots of LOAD_GLOBAL bytecodes in the compiled output of
this module. Just by reading the code it's apparent that the
author intended LITERAL, NOT_LITERAL, OPCODES and many other
symbols to be constants. Still, each time they are involved in an
expression, they must be looked up anew.
Most global accesses are actually to objects that are "almost
constants". This includes global variables in the current module
as well as the attributes of other imported modules. Since they
rarely change, it seems reasonable to place the burden of updating
references to such objects on the code that changes the name
bindings. If sre_constants.LITERAL is changed to refer to another
object, perhaps it would be worthwhile for the code that modifies
the sre_constants module dict to correct any active references to
that object. By doing so, in many cases global variables and the
attributes of many objects could be cached as local variables. If
the bindings between the names given to the objects and the
objects themselves changes rarely, the cost of keeping track of
such objects should be low and the potential payoff fairly large.
In an attempt to gauge the effect of this proposal, I modified the
Pystone benchmark program included in the Python distribution to
cache global functions. Its main function, Proc0, makes calls to
ten different functions inside its for loop. In addition, Func2
calls Func1 repeatedly inside a loop. If local copies of these 11
global identifiers are made before the functions' loops are entered,
performance on this particular benchmark improves by about two per
cent (from 5561 pystones to 5685 on my laptop). It gives some
indication that performance would be improved by caching most
global variable access. Note also that the pystone benchmark
makes essentially no accesses of global module attributes, an
anticipated area of improvement for this PEP.
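The transformation applied to the Pystone benchmark can be illustrated with a small stand-alone sketch (hypothetical function names; the hoisted local turns a global lookup per iteration into a fast local load):

```python
import math

def slow(n):
    # each iteration pays LOAD_GLOBAL math + LOAD_ATTR sin
    total = 0.0
    for i in range(n):
        total += math.sin(i)
    return total

def fast(n):
    # hoist the "almost constant" global attribute into a local
    sin = math.sin
    total = 0.0
    for i in range(n):
        total += sin(i)
    return total
```

Both functions compute the same result; only the lookup cost per iteration differs.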
Proposed Change
I propose that the Python virtual machine be modified to include
TRACK_OBJECT and UNTRACK_OBJECT opcodes. TRACK_OBJECT would
associate a global name or attribute of a global name with a slot
in the local variable array and perform an initial lookup of the
associated object to fill in the slot with a valid value. The
association it creates would be noted by the code responsible for
changing the name-to-object binding to cause the associated local
variable to be updated. The UNTRACK_OBJECT opcode would delete
any association between the name and the local variable slot.
Threads
Operation of this code in threaded programs will be no different
than in unthreaded programs. If you need to lock an object to
access it, you would have had to do that before TRACK_OBJECT would
have been executed and retain that lock until after you stop using
it.
FIXME: I suspect I need more here.
Rationale
Global variables and attributes rarely change. For example, once
a function imports the math module, the binding between the name
"math" and the module it refers to aren't likely to change.
Similarly, if the function that uses the math module refers to its
"sin" attribute, it's unlikely to change. Still, every time the
module wants to call the math.sin function, it must first execute
a pair of instructions:
LOAD_GLOBAL math
LOAD_ATTR sin
If the client module always assumed that math.sin was a local
constant and it was the responsibility of "external forces"
outside the function to keep the reference correct, we might have
code like this:
TRACK_OBJECT math.sin
...
LOAD_FAST math.sin
...
UNTRACK_OBJECT math.sin
If the LOAD_FAST was in a loop the payoff in reduced global loads
and attribute lookups could be significant.
This technique could, in theory, be applied to any global variable
access or attribute lookup. Consider this code:
l = []
for i in range(10):
    l.append(math.sin(i))
return l
Even though l is a local variable, you still pay the cost of
loading l.append ten times in the loop. The compiler (or an
optimizer) could recognize that both math.sin and l.append are
being called in the loop and decide to generate the tracked local
code, avoiding it for the builtin range() function because it's
only called once during loop setup. Performance issues related to
accessing local variables make tracking l.append less attractive
than tracking globals such as math.sin.
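The l.append case can likewise be hand-optimized today by caching the bound method outside the loop (a sketch of the trick the proposed opcodes would automate):

```python
import math

def build(n):
    l = []
    append = l.append   # cache the bound method once, before the loop
    sin = math.sin      # cache the global attribute once
    for i in range(n):
        append(sin(i))
    return l
```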
According to a post to python-dev by Marc-Andre Lemburg [1],
LOAD_GLOBAL opcodes account for over 7% of all instructions
executed by the Python virtual machine. This can be a very
expensive instruction, at least relative to a LOAD_FAST
instruction, which is a simple array index and requires no extra
function calls by the virtual machine. I believe many LOAD_GLOBAL
instructions and LOAD_GLOBAL/LOAD_ATTR pairs could be converted to
LOAD_FAST instructions.
Code that uses global variables heavily often resorts to various
tricks to avoid global variable and attribute lookup. The
aforementioned sre_compile._compile function caches the append
method of the growing output list. Many people commonly abuse
functions' default argument feature to cache global variable
lookups. Both of these schemes are hackish and rarely address all
the available opportunities for optimization. (For example,
sre_compile._compile does not cache the two globals that it uses
most frequently: the builtin len function and the global OPCODES
array that it imports from sre_constants.py.)
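The default-argument trick mentioned above looks like this (a sketch with hypothetical names; the extra parameters bind the globals once, at function definition time):

```python
import math

# _sin and _len are bound when "def" executes, so calls inside the
# function body use LOAD_FAST instead of LOAD_GLOBAL/LOAD_ATTR
def mean_sin(values, _sin=math.sin, _len=len):
    return sum(_sin(v) for v in values) / _len(values)
```

The hack works, but it pollutes the signature and must be repeated in every function, which is exactly what the PEP argues against.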
Questions
Q. What about threads? What if math.sin changes while in cache?
A. I believe the global interpreter lock will protect values from
being corrupted. In any case, the situation would be no worse
than it is today. If one thread modified math.sin after another
thread had already executed "LOAD_GLOBAL math", but before it
executed "LOAD_ATTR sin", the client thread would see the old
value of math.sin.
The idea is this. I use a multi-attribute load below as an
example, not because it would happen very often, but because by
demonstrating the recursive nature with an extra call hopefully
it will become clearer what I have in mind. Suppose a function
defined in module foo wants to access spam.eggs.ham and that
spam is a module imported at the module level in foo:
import spam
...
def somefunc():
    ...
    x = spam.eggs.ham
Upon entry to somefunc, a TRACK_GLOBAL instruction will be
executed:
TRACK_GLOBAL spam.eggs.ham n
"spam.eggs.ham" is a string literal stored in the function's
constants array. "n" is a fastlocals index. "&fastlocals[n]"
is a reference to slot "n" in the executing frame's fastlocals
array, the location in which the spam.eggs.ham reference will
be stored. Here's what I envision happening:
1. The TRACK_GLOBAL instruction locates the object referred to
by the name "spam" and finds it in its module scope. It
then executes a C function like
_PyObject_TrackName(m, "spam.eggs.ham", &fastlocals[n])
where "m" is the module object with an attribute "spam".
2. The module object strips the leading "spam.", stores the
necessary information ("eggs.ham" and &fastlocals[n]) in
case its binding for the name "eggs" changes. It then
locates the object referred to by the key "eggs" in its
dict and recursively calls
_PyObject_TrackName(eggs, "eggs.ham", &fastlocals[n])
3. The eggs object strips the leading "eggs.", stores the
("ham", &fastlocals[n]) info, locates the object in its
namespace called "ham" and calls _PyObject_TrackName once
again:
_PyObject_TrackName(ham, "ham", &fastlocals[n])
4. The "ham" object strips the leading string (no "." this
time, but that's a minor point), sees that the result is
empty, then uses its own value (self, probably) to update
the location it was handed:
Py_XDECREF(fastlocals[n]);
fastlocals[n] = self;
Py_INCREF(fastlocals[n]);
At this point, each object involved in resolving
"spam.eggs.ham" knows which entry in its namespace needs to be
tracked and what location to update if that name changes.
Furthermore, if the one name it is tracking in its local
storage changes, it can call _PyObject_TrackName using the new
object once the change has been made. At the bottom end of
the food chain, the last object will always strip a name, see
the empty string and know that its value should be stuffed
into the location it's been passed.
When the object referred to by the dotted expression
"spam.eggs.ham" is going to go out of scope, an
"UNTRACK_GLOBAL spam.eggs.ham n" instruction is executed. It
has the effect of deleting all the tracking information that
TRACK_GLOBAL established.
The tracking operation may seem expensive, but recall that the
objects being tracked are assumed to be "almost constant", so
the setup cost will be traded off against hopefully multiple
local instead of global loads. For globals with attributes
the tracking setup cost grows but is offset by avoiding the
extra LOAD_ATTR cost. The TRACK_GLOBAL instruction needs to
perform a PyDict_GetItemString for the first name in the chain
to determine where the top-level object resides. Each object
in the chain has to store a string and an address somewhere,
probably in a dict that uses storage locations as keys
(e.g. the &fastlocals[n]) and strings as values. (This dict
could possibly be a central dict of dicts whose keys are
object addresses instead of a per-object dict.) It shouldn't
be the other way around because multiple active frames may
want to track "spam.eggs.ham", but only one frame will want to
associate that name with one of its fast locals slots.
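The recursive hand-off can be simulated in pure Python over nested dict namespaces (a hypothetical sketch of _PyObject_TrackName, not the proposed C implementation; `slot` plays the role of &fastlocals[n] and `registry` the per-object tracking table):

```python
def track_name(namespace, dotted, slot, registry):
    # record, for this namespace, which key must trigger an update of slot
    head, _, rest = dotted.partition(".")
    registry.setdefault(id(namespace), []).append((head, slot))
    obj = namespace[head]
    if rest:
        # hand the remainder of the dotted name to the next object
        track_name(obj, rest, slot, registry)
    else:
        # bottom of the food chain: stuff the value into the slot
        slot[0] = obj

# modules and objects modeled as plain dicts for the sketch
foo_globals = {"spam": {"eggs": {"ham": 42}}}
slot, registry = [None], {}
track_name(foo_globals, "spam.eggs.ham", slot, registry)
```

After the call, `slot[0]` holds the tracked value and each of the three namespaces in the chain has registered the one key it must watch.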
Unresolved Issues
Threading -
What about this (dumb) code?
l = []
lock = threading.Lock()
...
def fill_l():
    for i in range(1000):
        lock.acquire()
        l.append(math.sin(i))
        lock.release()
...
def consume_l():
    while 1:
        lock.acquire()
        if l:
            elt = l.pop()
        lock.release()
        fiddle(elt)
It's not clear from a static analysis of the code what the lock is
protecting. (You can't tell at compile-time that threads are even
involved, can you?) Would or should it affect attempts to track
"l.append" or "math.sin" in the fill_l function?
If we annotate the code with mythical track_object and untrack_object
builtins (I'm not proposing this, just illustrating where stuff would
go!), we get
l = []
lock = threading.Lock()
...
def fill_l():
    track_object("l.append", append)
    track_object("math.sin", sin)
    for i in range(1000):
        lock.acquire()
        append(sin(i))
        lock.release()
    untrack_object("math.sin", sin)
    untrack_object("l.append", append)
...
def consume_l():
    while 1:
        lock.acquire()
        if l:
            elt = l.pop()
        lock.release()
        fiddle(elt)
Is that correct both with and without threads (or at least equally
incorrect with and without threads)?
Nested Scopes -
The presence of nested scopes will affect where TRACK_GLOBAL finds
a global variable, but shouldn't affect anything after that. (I
think.)
Missing Attributes -
Suppose I am tracking the object referred to by "spam.eggs.ham"
and "spam.eggs" is rebound to an object that does not have a "ham"
attribute. It's clear this will be an AttributeError if the
programmer attempts to resolve "spam.eggs.ham" in the current
Python virtual machine, but suppose the programmer has anticipated
this case:
if hasattr(spam.eggs, "ham"):
    print spam.eggs.ham
elif hasattr(spam.eggs, "bacon"):
    print spam.eggs.bacon
else:
    print "what? no meat?"
You can't raise an AttributeError when the tracking information is
recalculated. If it does not raise AttributeError and instead
lets the tracking stand, it may be setting the programmer up for a
very subtle error.
One solution to this problem would be to track the shortest
possible root of each dotted expression the function refers to
directly. In the above example, "spam.eggs" would be tracked, but
"spam.eggs.ham" and "spam.eggs.bacon" would not.
Who does the dirty work? -
In the Questions section I postulated the existence of a
_PyObject_TrackName function. While the API is fairly easy to
specify, the implementation behind-the-scenes is not so obvious.
A central dictionary could be used to track the name/location
mappings, but it appears that all setattr functions might need to
be modified to accommodate this new functionality.
If all types used the PyObject_GenericSetAttr function to set
attributes that would localize the update code somewhat. They
don't however (which is not too surprising), so it seems that all
getattrfunc and getattrofunc functions will have to be updated.
In addition, this would place an absolute requirement on C
extension module authors to call some function when an attribute
changes value (PyObject_TrackUpdate?).
Finally, it's quite possible that some attributes will be set by
side effect and not by any direct call to a setattr method of some
sort. Consider a device interface module that has an interrupt
routine that copies the contents of a device register into a slot
in the object's struct whenever it changes. In these situations,
more extensive modifications would have to be made by the module
author. To identify such situations at compile time would be
impossible. I think an extra slot could be added to PyTypeObjects
to indicate if an object's code is safe for global tracking. It
would have a default value of 0 (Py_TRACKING_NOT_SAFE). If an
extension module author has implemented the necessary tracking
support, that field could be initialized to 1 (Py_TRACKING_SAFE).
_PyObject_TrackName could check that field and issue a warning if
it is asked to track an object that the author has not explicitly
said was safe for tracking.
Discussion
Jeremy Hylton has an alternate proposal on the table [2]. His
proposal seeks to create a hybrid dictionary/list object for use
in global name lookups that would make global variable access look
more like local variable access. While there is no C code
available to examine, the Python implementation given in his
proposal still appears to require dictionary key lookup. It
doesn't appear that his proposal could speed local variable
attribute lookup, which might be worthwhile in some situations if
potential performance burdens could be addressed.
Backwards Compatibility
I don't believe there will be any serious issues of backward
compatibility. Obviously, Python bytecode that contains
TRACK_OBJECT opcodes could not be executed by earlier versions of
the interpreter, but breakage at the bytecode level is often
assumed between versions.
Implementation
TBD. This is where I need help. I believe there should be either
a central name/location registry or the code that modifies object
attributes should be modified, but I'm not sure the best way to go
about this. If you look at the code that implements the
STORE_GLOBAL and STORE_ATTR opcodes, it seems likely that some
changes will be required to PyDict_SetItem and PyObject_SetAttr or
their String variants. Ideally, there'd be a fairly central place
to localize these changes. If you begin considering tracking
attributes of local variables you get into issues of modifying
STORE_FAST as well, which could be a problem, since the name
bindings for local variables are changed much more frequently. (I
think an optimizer could avoid inserting the tracking code for the
attributes for any local variables where the variable's name
binding changes.)
Performance
I believe (though I have no code to prove it at this point) that
implementing TRACK_OBJECT will generally not be much more
expensive than a single LOAD_GLOBAL instruction or a
LOAD_GLOBAL/LOAD_ATTR pair. An optimizer should be able to avoid
converting LOAD_GLOBAL and LOAD_GLOBAL/LOAD_ATTR to the new scheme
unless the object access occurred within a loop. Further down the
line, a register-oriented replacement for the current Python
virtual machine [3] could conceivably eliminate most of the
LOAD_FAST instructions as well.
The number of tracked objects should be relatively small. All
active frames of all active threads could conceivably be tracking
objects, but this seems small compared to the number of functions
defined in a given application.
References
[1] http://mail.python.org/pipermail/python-dev/2000-July/007609.html
[2] http://www.zope.org/Members/jeremy/CurrentAndFutureProjects/FastGlobalsPEP
[3] http://www.musi-cal.com/~skip/python/rattlesnake20010813.tar.gz
Copyright
This document has been placed in the public domain.
pep-0267 Optimized Access to Module Namespaces
| PEP: | 267 |
|---|---|
| Title: | Optimized Access to Module Namespaces |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Jeremy Hylton <jeremy at alum.mit.edu> |
| Status: | Deferred |
| Type: | Standards Track |
| Created: | 23-May-2001 |
| Python-Version: | 2.2 |
| Post-History: |  |
Deferral
While this PEP is a nice idea, no-one has yet emerged to do the work of
hashing out the differences between this PEP, PEP 266 and PEP 280.
Hence, it is being deferred.
Abstract
This PEP proposes a new implementation of global module namespaces
and the builtin namespace that speeds name resolution. The
implementation would use an array of object pointers for most
operations in these namespaces. The compiler would assign indices
for global variables and module attributes at compile time.
The current implementation represents these namespaces as
dictionaries. A global name incurs a dictionary lookup each time
it is used; a builtin name incurs two dictionary lookups, a failed
lookup in the global namespace and a second lookup in the builtin
namespace.
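The two-lookup cost for builtins follows from the semantics of name resolution itself, which can be sketched with plain dicts and the builtins module (a present-day illustration):

```python
import builtins

def resolve(name, module_globals):
    # first lookup: the module's global namespace
    try:
        return module_globals[name]
    except KeyError:
        # second (fallback) lookup: the builtin namespace
        return getattr(builtins, name)
```

A builtin like len always pays for the failed first lookup; a module-level binding of the same name shadows it.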
This implementation should speed Python code that uses
module-level functions and variables. It should also eliminate
awkward coding styles that have evolved to speed access to these
names.
The implementation is complicated because the global and builtin
namespaces can be modified dynamically in ways that are impossible
for the compiler to detect. (Example: A module's namespace is
modified by a script after the module is imported.) As a result,
the implementation must maintain several auxiliary data structures
to preserve these dynamic features.
Introduction
This PEP proposes a new implementation of attribute access for
module objects that optimizes access to module variables known at
compile time. The module will store these variables in an array
and provide an interface to lookup attributes using array offsets.
For globals, builtins, and attributes of imported modules, the
compiler will generate code that uses the array offsets for fast
access.
[describe the key parts of the design: dlict, compiler support,
stupid name trick workarounds, optimization of other module's
globals]
The implementation will preserve existing semantics for module
namespaces, including the ability to modify module namespaces at
runtime in ways that affect the visibility of builtin names.
DLict design
The namespaces are implemented using a data structure that has
sometimes gone under the name dlict. It is a dictionary that has
numbered slots for some dictionary entries. The type must be
implemented in C to achieve acceptable performance. The new
type-class unification work should make this fairly easy. The
DLict will presumably be a subclass of dictionary with an
alternate storage module for some keys.
A Python implementation is included here to illustrate the basic
design:
"""A dictionary-list hybrid"""
import types
class DLict:
def __init__(self, names):
assert isinstance(names, types.DictType)
self.names = {}
self.list = [None] * size
self.empty = [1] * size
self.dict = {}
self.size = 0
def __getitem__(self, name):
i = self.names.get(name)
if i is None:
return self.dict[name]
if self.empty[i] is not None:
raise KeyError, name
return self.list[i]
def __setitem__(self, name, val):
i = self.names.get(name)
if i is None:
self.dict[name] = val
else:
self.empty[i] = None
self.list[i] = val
self.size += 1
def __delitem__(self, name):
i = self.names.get(name)
if i is None:
del self.dict[name]
else:
if self.empty[i] is not None:
raise KeyError, name
self.empty[i] = 1
self.list[i] = None
self.size -= 1
def keys(self):
if self.dict:
return self.names.keys() + self.dict.keys()
else:
return self.names.keys()
def values(self):
if self.dict:
return self.names.values() + self.dict.values()
else:
return self.names.values()
def items(self):
if self.dict:
return self.names.items()
else:
return self.names.items() + self.dict.items()
def __len__(self):
return self.size + len(self.dict)
def __cmp__(self, dlict):
c = cmp(self.names, dlict.names)
if c != 0:
return c
c = cmp(self.size, dlict.size)
if c != 0:
return c
for i in range(len(self.names)):
c = cmp(self.empty[i], dlict.empty[i])
if c != 0:
return c
if self.empty[i] is None:
c = cmp(self.list[i], dlict.empty[i])
if c != 0:
return c
return cmp(self.dict, dlict.dict)
def clear(self):
self.dict.clear()
for i in range(len(self.names)):
if self.empty[i] is None:
self.empty[i] = 1
self.list[i] = None
def update(self):
pass
def load(self, index):
"""dlict-special method to support indexed access"""
if self.empty[index] is None:
return self.list[index]
else:
raise KeyError, index # XXX might want reverse mapping
def store(self, index, val):
"""dlict-special method to support indexed access"""
self.empty[index] = None
self.list[index] = val
def delete(self, index):
"""dlict-special method to support indexed access"""
self.empty[index] = 1
self.list[index] = None
Compiler issues
The compiler currently collects the names of all global variables
in a module. These are names bound at the module level or bound
in a class or function body that declares them to be global.
The compiler would assign indices for each global name and add the
names and indices of the globals to the module's code object.
Each code object would then be bound irrevocably to the module it
was defined in. (Not sure if there are some subtle problems with
this.)
For attributes of imported modules, the module will store an
indirection record. Internally, the module will store a pointer
to the defining module and the offset of the attribute in the
defining module's global variable array. The offset would be
initialized the first time the name is looked up.
Runtime model
The PythonVM will be extended with new opcodes to access globals
and module attributes via a module-level array.
A function object would need to point to the module that defined
it in order to provide access to the module-level global array.
For module attributes stored in the dlict (call them static
attributes), the get/delattr implementation would need to track
access to these attributes using the old by-name interface. If a
static attribute is updated dynamically, e.g.
mod.__dict__["foo"] = 2
the implementation would need to update the array slot instead of
the backup dict.
Backwards compatibility
The dlict will need to maintain meta-information about whether a
slot is currently used or not. It will also need to maintain a
pointer to the builtin namespace. When a name is not currently
used in the global namespace, the lookup will have to fail over to
the builtin namespace.
In the reverse case, each module may need a special accessor
function for the builtin namespace that checks to see if a global
shadowing the builtin has been added dynamically. This check
would only occur if there was a dynamic change to the module's
dlict, i.e. when a name is bound that wasn't discovered at
compile-time.
These mechanisms would have little if any cost for the common case
where a module's global namespace is not modified in strange
ways at runtime. They would add overhead for modules that did
unusual things with global names, but this is an uncommon practice
and probably one worth discouraging.
It may be desirable to disable dynamic additions to the global
namespace in some future version of Python. If so, the new
implementation could provide warnings.
Related PEPs
PEP 266, Optimizing Global Variable/Attribute Access, proposes a
different mechanism for optimizing access to global variables as
well as attributes of objects. The mechanism uses two new opcodes
TRACK_OBJECT and UNTRACK_OBJECT to create a slot in the local
variables array that aliases the global or object attribute. If
the object being aliased is rebound, the rebind operation is
responsible for updating the aliases.
The object tracking approach applies to a wider range of
objects than just modules. It may also have a higher runtime cost,
because each function that uses a global or object attribute must
execute extra opcodes to register its interest in an object and
unregister on exit; the cost of registration is unclear, but
presumably involves a dynamically resizable data structure to hold
a list of callbacks.
The implementation proposed here avoids the need for registration,
because it does not create aliases. Instead it allows functions
that reference a global variable or module attribute to retain a
pointer to the location where the original binding is stored. A
second advantage is that the initial lookup is performed once per
module rather than once per function call.
Copyright
This document has been placed in the public domain.
pep-0268 Extended HTTP functionality and WebDAV
| PEP: | 268 |
|---|---|
| Title: | Extended HTTP functionality and WebDAV |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | gstein at lyra.org (Greg Stein) |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 20-Aug-2001 |
| Python-Version: | 2.x |
| Post-History: | 21-Aug-2001 |
Contents
Rejection Notice
This PEP has been rejected. It has failed to generate sufficient community support in the six years since its proposal.
Abstract
This PEP discusses new modules and extended functionality for Python's HTTP support. Notably, the addition of authenticated requests, proxy support, authenticated proxy usage, and WebDAV [1] capabilities.
Rationale
Python has been quite popular as a result of its "batteries included" positioning. One of the most heavily used protocols, HTTP (see RFC 2616), has been included with Python for years (httplib). However, this support has not kept up with the full needs and requirements of many HTTP-based applications and systems. In addition, new protocols based on HTTP, such as WebDAV and XML-RPC, are becoming useful and are seeing increasing usage. Supplying this functionality meets Python's "batteries included" role and also keeps Python at the leading edge of new technologies.
While authentication and proxy support are two very notable features missing from Python's core HTTP processing, they are minimally handled as part of Python's URL handling (urllib and urllib2). However, applications that need fine-grained or sophisticated HTTP handling cannot make use of the features while they reside in urllib. Refactoring these features into a location where they can be directly associated with an HTTP connection will improve their utility for both urllib and for sophisticated applications.
The motivation for this PEP was from several people requesting these features directly, and from a number of feature requests on SourceForge. Since the exact form of the modules to be provided and the classes/architecture used could be subject to debate, this PEP was created to provide a focal point for those discussions.
Specification
Two modules will be added to the standard library: httpx (HTTP extended functionality), and davlib (WebDAV library).
[ suggestions for module names are welcome; davlib has some precedence, but something like webdav might be desirable ]
HTTP Authentication
The httpx module will provide a mixin for performing HTTP authentication (for both proxy and origin server authentication). This mixin (httpx.HandleAuthentication) can be combined with the HTTPConnection and the HTTPSConnection classes (the mixin may possibly work with the HTTP and HTTPS compatibility classes, but that is not a requirement).
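The proposed composition can be sketched against today's http.client (hypothetical class names; httpx was never added to the standard library, so this only illustrates the cooperative-override pattern the PEP describes):

```python
import http.client

class HandleAuthentication:
    """Sketch of the proposed mixin: intercept 401/407 responses."""
    def getresponse(self):
        resp = super().getresponse()
        if resp.status in (401, 407):
            pass  # consult registered authenticators, possibly resend
        return resp

class AuthHTTPConnection(HandleAuthentication, http.client.HTTPConnection):
    pass

conn = AuthHTTPConnection("www.example.com")  # no network I/O until request()
```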
The mixin will delegate the authentication process to one or more "authenticator" objects, allowing multiple connections to share authenticators. The use of a separate object allows for a long term connection to an authentication system (e.g. LDAP). An authenticator for the Basic and Digest mechanisms (see RFC 2617) will be provided. User-supplied authenticator subclasses can be registered and used by the connections.
A "credentials" object (httpx.Credentials) is also associated with the mixin, and stores the credentials (e.g. username and password) needed by the authenticators. Subclasses of Credentials can be created to hold additional information (e.g. NT domain).
The mixin overrides the getresponse() method to detect 401 (Unauthorized) and 407 (Proxy Authentication Required) responses. When this is found, the response object, the connection, and the credentials are passed to the authenticator corresponding with the authentication scheme specified in the response (multiple authenticators are tried in decreasing order of security if multiple schemes are in the response). Each authenticator can examine the response headers and decide whether and how to resend the request with the correct authentication headers. If no authenticator can successfully handle the authentication, then an exception is raised.
Resending a request, with the appropriate credentials, is one of the more difficult portions of the authentication system. The difficulty arises in recording what was sent originally: the request line, the headers, and the body. By overriding putrequest, putheader, and endheaders, we can capture all but the body. Once the endheaders method is called, then we capture all calls to send() (until the next putrequest method call) to hold the body content. The mixin will have a configurable limit for the amount of data to hold in this fashion (e.g. only hold up to 100k of body content). Assuming that the entire body has been stored, then we can resend the request with the appropriate authentication information.
If the body is too large to be stored, then the getresponse() simply returns the response object, indicating the 401 or 407 error. Since the authentication information has been computed and cached (into the Credentials object; see below), the caller can simply regenerate the request. The mixin will attach the appropriate credentials.
A "protection space" (see RFC 2617, section 1.2) is defined as a tuple of the host, port, and authentication realm. When a request is initially sent to an HTTP server, we do not know the authentication realm (the realm is only returned when authentication fails). However, we do have the path from the URL, and that can be useful in determining the credentials to send to the server. The Basic authentication scheme is typically set up hierarchically: the credentials for /path can be tried for /path/subpath. The Digest authentication scheme has explicit support for the hierarchical setup. The httpx.Credentials object will store credentials for multiple protection spaces, and can be looked up in two differents ways:
- looked up using (host, port, path) -- this lookup scheme is used when generating a request for a path where we don't know the authentication realm.
- looked up using (host, port, realm) -- this mechanism is used during the authentication process when the server has specified that the Request-URI resides within a specific realm.
The HandleAuthentication mixin will override putrequest() to automatically insert credentials, if available. The URL from the putrequest is used to determine the appropriate authentication information to use.
It is also important to note that two sets of credentials are used, and stored by the mixin. One set for any proxy that may be used, and one used for the target origin server. Since proxies do not have paths, the protection spaces in the proxy credentials will always use "/" for storing and looking up via a path.
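A credentials store supporting both lookup schemes might be sketched as follows (hypothetical attribute and method names; the hierarchical walk mirrors the Basic-auth behavior described above):

```python
class Credentials:
    """Sketch: credentials keyed by both protection-space forms."""
    def __init__(self):
        self.by_realm = {}   # (host, port, realm) -> (user, password)
        self.by_path = {}    # (host, port, path)  -> (user, password)

    def add(self, host, port, realm, path, user, password):
        self.by_realm[(host, port, realm)] = (user, password)
        self.by_path[(host, port, path)] = (user, password)

    def lookup_by_realm(self, host, port, realm):
        return self.by_realm.get((host, port, realm))

    def lookup_by_path(self, host, port, path):
        # Basic auth is typically hierarchical: credentials for /path
        # may be tried for /path/subpath, so walk up the path
        while path:
            cred = self.by_path.get((host, port, path))
            if cred is not None:
                return cred
            path = path.rsplit("/", 1)[0]
        return None

creds = Credentials()
creds.add("www.example.com", 80, "private", "/docs", "alice", "secret")
```

Proxies, having no paths, would store and look up their credentials under "/" as described above.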
Proxy Handling
The httpx module will provide a mixin for using a proxy to perform HTTP(S) operations. This mixin (httpx.UseProxy) can be combined with the HTTPConnection and the HTTPSConnection classes (the mixin may possibly work with the HTTP and HTTPS compatibility classes, but that is not a requirement).
The mixin will record the (host, port) of the proxy to use. XXX will be overridden to use this host/port combination for connections and to rewrite request URLs into the absoluteURIs referring to the origin server (these URIs are passed to the proxy server).
Proxy authentication is handled by the httpx.HandleAuthentication class since a user may directly use HTTP(S)Connection to speak with proxies.
WebDAV Features
The davlib module will provide a mixin for sending WebDAV requests to a WebDAV-enabled server. This mixin (davlib.DAVClient) can be combined with the HTTPConnection and the HTTPSConnection classes (the mixin may possibly work with the HTTP and HTTPS compatibility classes, but that is not a requirement).
The mixin provides methods to perform the various request methods defined by HTTP in RFC 2616 and by WebDAV in RFC 2518.
A custom response object is used to decode 207 (Multi-Status) responses. The response object will use the standard library's xml package to parse the multistatus XML information, producing a simple structure of objects to hold the multistatus data. Multiple parsing schemes will be tried, in order of decreasing speed.
Reference Implementation
The actual (future/final) implementation is being developed in the /nondist/sandbox/Lib directory, until it is accepted and moved into the main Lib directory.
Copyright
This document has been placed in the public domain.
pep-0269 Pgen Module for Python
| PEP: | 269 |
|---|---|
| Title: | Pgen Module for Python |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Jonathan Riehl <jriehl at spaceship.com> |
| Status: | Deferred |
| Type: | Standards Track |
| Created: | 24-Aug-2001 |
| Python-Version: | 2.2 |
| Post-History: |
Abstract
Much like the parser module exposes the Python parser, this PEP
proposes that the parser generator used to create the Python
parser, pgen, be exposed as a module in Python.
Rationale
Through the course of Pythonic history, there have been numerous
discussions about the creation of a Python compiler [1]. These
have resulted in several implementations of Python parsers, most
notably the parser module currently provided in the Python
standard library [2] and Jeremy Hylton's compiler module [3].
However, while multiple language changes have been proposed
[4][5], experimentation with the Python syntax has lacked the
benefit of a Python binding to the actual parser generator used to
build Python.
By providing a Python wrapper analogous to Fred Drake Jr.'s parser
wrapper, but targeted at the pgen library, the following
assertions are made:
1. Reference implementations of syntax changes will be easier to
develop. Currently, a reference implementation of a syntax
change would require the developer to use the pgen tool from
the command line. The resulting parser data structure would
then either have to be reworked to interface with a custom
CPython implementation, or wrapped as a C extension module.
2. Reference implementations of syntax changes will be easier to
distribute. Since the parser generator will be available in
Python, it should follow that the resulting parser will be
accessible from Python. Therefore, reference implementations
should be available as pure Python code, versus using custom
versions of the existing CPython distribution, or as compilable
extension modules.
3. Reference implementations of syntax changes will be easier to
discuss with a larger audience. This somewhat falls out of the
second assertion, since the community of Python users is most
likely larger than the community of CPython developers.
4. Development of small languages in Python will be further
enhanced, since the additional module will be a fully
functional LL(1) parser generator.
Specification
The proposed module will be called pgen. The pgen module will
contain the following functions:
parseGrammarFile (fileName) -> AST
The parseGrammarFile() function will read the file pointed to
by fileName and create an AST object. The AST nodes will
contain the nonterminal, numeric values of the parser
generator meta-grammar. The output AST will be an instance of
the AST extension class as provided by the parser module.
Syntax errors in the input file will cause the SyntaxError
exception to be raised.
parseGrammarString (text) -> AST
The parseGrammarString() function will follow the semantics of
the parseGrammarFile(), but accept the grammar text as a
string for input, as opposed to the file name.
buildParser (grammarAst) -> DFA
The buildParser() function will accept an AST object for input
and return a DFA (deterministic finite automaton) data
structure. The DFA data structure will be a C extension
class, much like the AST structure is provided in the parser
module. If the input AST does not conform to the nonterminal
codes defined for the pgen meta-grammar, buildParser() will
throw a ValueError exception.
parseFile (fileName, dfa, start) -> AST
The parseFile() function will essentially be a wrapper for the
PyParser_ParseFile() C API function. The wrapper code will
accept the DFA C extension class, and the file name. An AST
instance that conforms to the lexical values in the token
module and the nonterminal values contained in the DFA will be
output.
parseString (text, dfa, start) -> AST
The parseString() function will operate in a similar fashion
to the parseFile() function, but accept the parse text as an
argument. Much like parseFile() will wrap the
PyParser_ParseFile() C API function, parseString() will wrap
the PyParser_ParseString() function.
symbolToStringMap (dfa) -> dict
The symbolToStringMap() function will accept a DFA instance
and return a dictionary object that maps from the DFA's
numeric values for its nonterminals to the string names of the
nonterminals as found in the original grammar specification
for the DFA.
stringToSymbolMap (dfa) -> dict
The stringToSymbolMap() function outputs a dictionary mapping
the nonterminal names of the input DFA to their corresponding
numeric values.
Extra credit will be awarded if the map generation functions and
parsing functions are also methods of the DFA extension class.
Implementation Plan
A cunning plan has been devised to accomplish this enhancement:
1. Rename the pgen functions to conform to the CPython naming
standards. This action may involve adding some header files to
the Include subdirectory.
2. Move the pgen C modules in the Makefile.pre.in from unique pgen
elements to the Python C library.
3. Make any needed changes to the parser module so the AST
extension class understands that there are AST types it may not
understand. Cursory examination of the AST extension class
shows that it keeps track of whether the tree is a suite or an
expression.
4. Code an additional C module in the Modules directory. The C
extension module will implement the DFA extension class and the
functions outlined in the previous section.
5. Add the new module to the build process. Black magic, indeed.
Limitations
Under this proposal, would-be designers of Python 3000 will still
be constrained to Python's lexical conventions. The addition,
subtraction or modification of the Python lexer is outside the
scope of this PEP.
Reference Implementation
No reference implementation is currently provided. A patch
was provided at some point in
http://sourceforge.net/tracker/index.php?func=detail&aid=599331&group_id=5470&atid=305470
but that patch is no longer maintained.
References
[1] The (defunct) Python Compiler-SIG
http://www.python.org/sigs/compiler-sig/
[2] Parser Module Documentation
http://docs.python.org/library/parser.html
[3] Hylton, Jeremy.
http://docs.python.org/library/compiler.html
[4] Pelletier, Michel. "Python Interface Syntax", PEP-245.
http://www.python.org/dev/peps/pep-0245/
[5] The Python Types-SIG
http://www.python.org/sigs/types-sig/
Copyright
This document has been placed in the public domain.
pep-0270 uniq method for list objects
| PEP: | 270 |
|---|---|
| Title: | uniq method for list objects |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Jason Petrone <jp at demonseed.net> |
| Status: | Rejected |
| Type: | Standards Track |
| Created: | 21-Aug-2001 |
| Python-Version: | 2.2 |
| Post-History: |
Notice
This PEP is withdrawn by the author. He writes:
Removing duplicate elements from a list is a common task, but
there are only two reasons I can see for making it a built-in.
The first is if it could be done much faster, which isn't the
case. The second is if it makes it significantly easier to
write code. The introduction of sets.py eliminates this
situation since creating a sequence without duplicates is just
a matter of choosing a different data structure: a set instead
of a list.
As described in PEP 218, sets are being added to the standard
library for Python 2.3.
Abstract
This PEP proposes adding a method for removing duplicate elements to
the list object.
Rationale
Removing duplicates from a list is a common task. I think it is
useful and general enough to belong as a method in list objects.
It also has potential for faster execution when implemented in C,
especially if optimization using hashing or sorting cannot be used.
On comp.lang.python there are many, many posts [1] asking about
the best way to do this task. It's a little tricky to implement
optimally and it would be nice to save people the trouble of
figuring it out themselves.
Considerations
Tim Peters suggests trying to use a hash table, then trying to
sort, and finally falling back on brute force[2]. Should uniq
maintain list order at the expense of speed?
Is it spelled 'uniq' or 'unique'?
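The tiered strategy Tim Peters suggests can be sketched in pure Python. This is a sketch of the idea only; the PEP's reference implementation was C code in listobject.c, and the helper name below is hypothetical.

```python
def uniq(seq):
    # 1. Try hashing: O(n), preserves original order.
    try:
        seen = set()
        result = []
        for x in seq:
            if x not in seen:
                seen.add(x)
                result.append(x)
        return result
    except TypeError:
        pass  # some element is unhashable
    # 2. Try sorting: O(n log n), but does not preserve original order.
    try:
        srt = sorted(seq)
    except TypeError:
        pass  # elements are not comparable either
    else:
        result = []
        for x in srt:
            if not result or x != result[-1]:
                result.append(x)
        return result
    # 3. Brute force: O(n**2), preserves order, always works.
    result = []
    for x in seq:
        if x not in result:
            result.append(x)
    return result
```

Note the order-preservation question raised above: only the hashing and brute-force tiers keep the list's original order.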
Reference Implementation
I've written the brute force version. It's about 20 lines of code
in listobject.c. Adding support for hash table and sorted
duplicate removal would only take another hour or so.
References
[1] http://groups.google.com/groups?as_q=duplicates&as_ugroup=comp.lang.python
[2] Tim Peters' unique() entry in the Python cookbook:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52560/index_txt
Copyright
This document has been placed in the public domain.
pep-0271 Prefixing sys.path by command line option
| PEP: | 271 |
|---|---|
| Title: | Prefixing sys.path by command line option |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Frédéric B. Giacometti <fred at arakne.com> |
| Status: | Rejected |
| Type: | Standards Track |
| Created: | 15-Aug-2001 |
| Python-Version: | 2.2 |
| Post-History: |
Abstract
At present, setting the PYTHONPATH environment variable is the
only method for defining additional Python module search
directories.
This PEP introduces the '-P' valued option to the python command
as an alternative to PYTHONPATH.
Rationale
On Unix:
python -P $SOMEVALUE
will be equivalent to
env PYTHONPATH=$SOMEVALUE python
On Windows 2K:
python -P %SOMEVALUE%
will (almost) be equivalent to
set __PYTHONPATH=%PYTHONPATH% && set PYTHONPATH=%SOMEVALUE%\
&& python && set PYTHONPATH=%__PYTHONPATH%
Other Information
This option is equivalent to the 'java -classpath' option.
When to use this option
This option is intended to ease and make more robust the use of
Python in test or build scripts, for instance.
Reference Implementation
A patch implementing this is available from SourceForge:
http://sourceforge.net/tracker/download.php?group_id=5470&atid=305470&file_id=6916&aid=429614
with the patch discussion at:
http://sourceforge.net/tracker/?func=detail&atid=305470&aid=429614&group_id=5470
Copyright
This document has been placed in the public domain.
pep-0272 API for Block Encryption Algorithms v1.0
| PEP: | 272 |
|---|---|
| Title: | API for Block Encryption Algorithms v1.0 |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | A.M. Kuchling <amk at amk.ca> |
| Status: | Final |
| Type: | Informational |
| Created: | 18-Sep-2001 |
| Post-History: | 17-Apr-2002, 29-May-2002 |
Abstract
This document specifies a standard API for secret-key block
encryption algorithms such as DES or Rijndael, making it easier to
switch between different algorithms and implementations.
Introduction
Encryption algorithms transform their input data (called
plaintext) in some way that is dependent on a variable key,
producing ciphertext. The transformation can easily be reversed
if and only if one knows the key. The key is a sequence of bits
chosen from some very large space of possible keys. There are two
classes of encryption algorithms: block ciphers and stream ciphers.
Block ciphers encrypt multibyte inputs of a fixed size (frequently
8 or 16 bytes long), and can be operated in various feedback
modes. The feedback modes supported in this specification are:
Number Constant Description
1 MODE_ECB Electronic Code Book
2 MODE_CBC Cipher Block Chaining
3 MODE_CFB Cipher Feedback
5 MODE_OFB Output Feedback
6 MODE_CTR Counter
These modes are to be implemented as described in NIST publication
SP 800-38A [1]. Descriptions of the first three feedback modes can
also be found in Bruce Schneier's book _Applied
Cryptography_ [2].
(The numeric value 4 is reserved for MODE_PGP, a variant of CFB
described in RFC 2440: "OpenPGP Message Format" [3]. This mode
isn't considered important enough to make it worth requiring it
for all block encryption ciphers, though supporting it is a nice
extra feature.)
In a strict formal sense, stream ciphers encrypt data bit-by-bit;
practically, stream ciphers work on a character-by-character
basis. This PEP only aims at specifying an interface for block
ciphers, though stream ciphers can support the interface described
here by fixing 'block_size' to 1. Feedback modes also don't make
sense for stream ciphers, so the only reasonable feedback mode
would be ECB mode.
Specification
Encryption modules can add additional functions, methods, and
attributes beyond those described in this PEP, but all of the
features described in this PEP must be present for a module to
claim compliance with it.
Secret-key encryption modules should define one function:
new(key, mode, [IV], **kwargs)
Returns a ciphering object, using the secret key contained in the
string 'key', and using the feedback mode 'mode', which must be
one of the constants from the table above.
If 'mode' is MODE_CBC or MODE_CFB, 'IV' must be provided and must
be a string of the same length as the block size. Not providing a
value of 'IV' will result in a ValueError exception being raised.
Depending on the algorithm, a module may support additional
keyword arguments to this function. Some keyword arguments are
specified by this PEP, and modules are free to add additional
keyword arguments. If a value isn't provided for a given keyword,
a secure default value should be used. For example, if an
algorithm has a selectable number of rounds between 1 and 16, and
1-round encryption is insecure and 8-round encryption is believed
secure, the default value for 'rounds' should be 8 or more.
(Module implementors can choose a very slow but secure value, too,
such as 16 in this example. This decision is left up to the
implementor.)
The following table lists keyword arguments defined by this PEP:
Keyword Meaning
counter Callable object that returns counter blocks
(see below; CTR mode only)
rounds Number of rounds of encryption to use
segment_size Size of data and ciphertext segments,
measured in bits (see below; CFB mode only)
The Counter feedback mode requires a sequence of input blocks,
called counters, that are used to produce the output. When 'mode'
is MODE_CTR, the 'counter' keyword argument must be provided, and
its value must be a callable object, such as a function or method.
Successive calls to this callable object must return a sequence of
strings, each of length 'block_size', that never repeats.
(Appendix B of the NIST publication gives a way to
generate such a sequence, but that's beyond the scope of this
PEP.)
The CFB mode operates on segments of the plaintext and ciphertext
that are 'segment_size' bits long. Therefore, when using this
mode, the input and output strings must be a multiple of
'segment_size' bits in length. 'segment_size' must be an integer
between 1 and block_size*8, inclusive. (The factor of 8 comes
from 'block_size' being measured in bytes and not in bits). The
default value for this parameter should be block_size*8.
Implementors are allowed to constrain 'segment_size' to be a
multiple of 8 for simplicity, but they're encouraged to support
arbitrary values for generality.
Secret-key encryption modules should define two variables:
block_size
An integer value; the size of the blocks encrypted by this
module, measured in bytes. For all feedback modes, the length
of strings passed to the encrypt() and decrypt() must be a
multiple of the block size.
key_size
An integer value; the size of the keys required by this
module, measured in bytes. If key_size is None, then the
algorithm accepts variable-length keys. This may mean the
module accepts keys of any random length, or that there are a
few different possible lengths, e.g. 16, 24, or 32 bytes. You
cannot pass a key of length 0 (that is, the null string '') as
a variable-length key.
Cipher objects should have two attributes:
block_size
An integer value equal to the size of the blocks encrypted by
this object. For algorithms with a variable block size, this
value is equal to the block size selected for this object.
IV
Contains the initial value which will be used to start a
cipher feedback mode; it will always be a string exactly one
block in length. After encrypting or decrypting a string,
this value is updated to reflect the modified feedback text.
It is read-only, and cannot be assigned a new value.
Cipher objects require the following methods:
decrypt(string)
Decrypts 'string', using the key-dependent data in the object
and with the appropriate feedback mode. The string's length
must be an exact multiple of the algorithm's block size or, in
CFB mode, of the segment size. Returns a string containing
the plaintext.
encrypt(string)
Encrypts a non-empty string, using the key-dependent data in
the object, and with the appropriate feedback mode. The
string's length must be an exact multiple of the algorithm's
block size or, in CFB mode, of the segment size. Returns a
string containing the ciphertext.
Here's an example, using a module named 'DES':
>>> import DES
>>> obj = DES.new('abcdefgh', DES.MODE_ECB)
>>> plaintext = "Guido van Rossum is a space alien."
>>> len(plaintext)
34
>>> obj.encrypt(plaintext)
Traceback (innermost last):
File "<stdin>", line 1, in ?
ValueError: Strings for DES must be a multiple of 8 in length
>>> ciphertext = obj.encrypt(plaintext+'XXXXXX') # Add padding
>>> ciphertext
'\021,\343Nq\214DY\337T\342pA\372\255\311s\210\363,\300j\330\250\312\347\342I\3215w\03561\303dgb/\006'
>>> obj.decrypt(ciphertext)
'Guido van Rossum is a space alien.XXXXXX'
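The DES module above is illustrative. As a sketch of the full API shape this PEP specifies, here is a hypothetical compliant module built on an insecure repeating-XOR "cipher"; it demonstrates the interface only and must never be used for real encryption.

```python
# Toy PEP 272-style module: XOR "cipher" for API demonstration only.
MODE_ECB = 1
MODE_CBC = 2

block_size = 8   # bytes per block, as the module-level variable specifies
key_size = 8     # this toy module uses fixed-length keys


def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


class _XORCipher:
    block_size = 8

    def __init__(self, key, mode, IV=None):
        if len(key) != key_size:
            raise ValueError("key must be %d bytes" % key_size)
        if mode == MODE_CBC and (IV is None or len(IV) != block_size):
            raise ValueError("MODE_CBC requires a one-block IV")
        self._key = key
        self._mode = mode
        self.IV = IV

    def encrypt(self, data):
        if len(data) % block_size:
            raise ValueError("data must be a multiple of the block size")
        out = []
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            if self._mode == MODE_CBC:
                block = _xor(block, self.IV)
            cipher = _xor(block, self._key)
            if self._mode == MODE_CBC:
                self.IV = cipher   # IV tracks the feedback text, per the PEP
            out.append(cipher)
        return b"".join(out)

    def decrypt(self, data):
        if len(data) % block_size:
            raise ValueError("data must be a multiple of the block size")
        out = []
        for i in range(0, len(data), block_size):
            cipher = data[i:i + block_size]
            plain = _xor(cipher, self._key)
            if self._mode == MODE_CBC:
                plain = _xor(plain, self.IV)
                self.IV = cipher
            out.append(plain)
        return b"".join(out)


def new(key, mode, IV=None, **kwargs):
    return _XORCipher(key, mode, IV)
```

Any module following this shape can be swapped for another: only the strength of the underlying cipher differs, which is the point of the common API.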
References
[1] NIST publication SP 800-38A, "Recommendation for Block Cipher
Modes of Operation" (http://csrc.nist.gov/encryption/modes/)
[2] Applied Cryptography
[3] RFC2440: "OpenPGP Message Format" (http://rfc2440.x42.com,
http://www.faqs.org/rfcs/rfc2440.html)
Changes
2002-04: Removed references to stream ciphers; retitled PEP;
prefixed feedback mode constants with MODE_; removed PGP feedback
mode; added CTR and OFB feedback modes; clarified where numbers
are measured in bytes and where in bits.
2002-09: Clarified the discussion of key length by using
"variable-length keys" instead of "arbitrary-length".
Acknowledgements
Thanks to the readers of the python-crypto list for their comments on
this PEP.
Copyright
This document has been placed in the public domain.
pep-0273 Import Modules from Zip Archives
| PEP: | 273 |
|---|---|
| Title: | Import Modules from Zip Archives |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | James C. Ahlstrom <jim at interet.com> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 11-Oct-2001 |
| Python-Version: | 2.3 |
| Post-History: | 26-Oct-2001 |
Abstract
This PEP adds the ability to import Python modules
*.py, *.py[co] and packages from zip archives. The
same code is used to speed up normal directory imports
provided os.listdir is available.
Note
Zip imports were added to Python 2.3, but the final implementation
uses an approach different from the one described in this PEP.
The 2.3 implementation is SourceForge patch #652586, which adds
new import hooks described in PEP 302.
The rest of this PEP is therefore only of historical interest.
Specification
Currently, sys.path is a list of directory names as strings. If
this PEP is implemented, an item of sys.path can be a string
naming a zip file archive. The zip archive can contain a
subdirectory structure to support package imports. The zip
archive satisfies imports exactly as a subdirectory would.
The implementation is in C code in the Python core and works on
all supported Python platforms.
Any files may be present in the zip archive, but only files
*.py and *.py[co] are available for import. Zip import of
dynamic modules (*.pyd, *.so) is disallowed.
Just as sys.path currently has default directory names, a default
zip archive name is added too. Otherwise there is no way to
import all Python library files from an archive.
Subdirectory Equivalence
The zip archive must be treated exactly as a subdirectory tree so
we can support package imports based on current and future rules.
All zip data is taken from the Central Directory, the data must be
correct, and brain dead zip files are not accommodated.
Suppose sys.path contains "/A/B/SubDir" and "/C/D/E/Archive.zip",
and we are trying to import modfoo from the Q package. Then
import.c will generate a list of paths and extensions and will
look for the file. The list of generated paths does not change
for zip imports. Suppose import.c generates the path
"/A/B/SubDir/Q/R/modfoo.pyc". Then it will also generate the path
"/C/D/E/Archive.zip/Q/R/modfoo.pyc". Finding the SubDir path is
exactly equivalent to finding "Q/R/modfoo.pyc" in the archive.
Suppose you zip up /A/B/SubDir/* and all its subdirectories. Then
your zip file will satisfy imports just as your subdirectory did.
Well, not quite. You can't satisfy dynamic modules from a zip
file. Dynamic modules have extensions like .dll, .pyd, and .so.
They are operating system dependent, and probably can't be loaded
except from a file. It might be possible to extract the dynamic
module from the zip file, write it to a plain file and load it.
But that would mean creating temporary files, and dealing with all
the dynload_*.c, and that's probably not a good idea.
When trying to import *.pyc, if it is not available then
*.pyo will be used instead. And vice versa when looking for *.pyo.
If neither *.pyc nor *.pyo is available, or if the magic numbers
are invalid, then *.py will be compiled and used to satisfy the
import, but the compiled file will not be saved. Python would
normally write it to the same directory as *.py, but surely we
don't want to write to the zip file. We could write to the
directory of the zip archive, but that would clutter it up, not
good if it is /usr/bin for example.
Failing to write the compiled files will make zip imports very slow,
and the user will probably not figure out what is wrong. So it
is best to put *.pyc and *.pyo in the archive with the *.py.
Efficiency
The only way to find files in a zip archive is linear search. So
for each zip file in sys.path, we search for its names once, and
put the names plus other relevant data into a static Python
dictionary. The key is the archive name from sys.path joined with
the file name (including any subdirectories) within the archive.
This is exactly the name generated by import.c, and makes lookup
easy.
This same mechanism is used to speed up directory (non-zip) imports.
See below.
zlib
Compressed zip archives require zlib for decompression. Prior to
any other imports, we attempt an import of zlib. Import of
compressed files will fail with a message "missing zlib" unless
zlib is available.
Booting
Python imports site.py itself, and this imports os, nt, ntpath,
stat, and UserDict. It also imports sitecustomize.py which may
import more modules. Zip imports must be available before site.py
is imported.
Just as there are default directories in sys.path, there must be
one or more default zip archives too.
The problem is what the name should be. The name should be linked
with the Python version, so the Python executable can correctly
find its corresponding libraries even when there are multiple
Python versions on the same machine.
We add one name to sys.path. On Unix, the directory is
sys.prefix + "/lib", and the file name is
"python%s%s.zip" % (sys.version[0], sys.version[2]).
So for Python 2.2 and prefix /usr/local, the path
/usr/local/lib/python2.2/ is already on sys.path, and
/usr/local/lib/python22.zip would be added.
On Windows, the file is the full path to python22.dll, with
"dll" replaced by "zip". The zip archive name is always inserted
as the second item in sys.path. The first is the directory of the
main.py (thanks Tim).
Directory Imports
The static Python dictionary used to speed up zip imports can be
used to speed up normal directory imports too. For each item in
sys.path that is not a zip archive, we call os.listdir, and add
the directory contents to the dictionary. Then instead of calling
fopen() in a double loop, we just check the dictionary. This
greatly speeds up imports. If os.listdir doesn't exist, the
dictionary is not used.
Benchmarks
| Case | Original 2.2a3 | Using os.listdir | Zip Uncomp | Zip Compr |
|---|---|---|---|---|
| 1 | 3.2 2.5 3.2->1.02 | 2.3 2.5 2.3->0.87 | 1.66->0.93 | 1.5->1.07 |
| 2 | 2.8 3.9 3.0->1.32 | Same as Case 1. | | |
| 3 | 5.7 5.7 5.7->5.7 | 2.1 2.1 2.1->1.8 | 1.25->0.99 | 1.19->1.13 |
| 4 | 9.4 9.4 9.3->9.35 | Same as Case 3. | | |
Case 1: Local drive C:, sys.path has its default value.
Case 2: Local drive C:, directory with files is at the end of sys.path.
Case 3: Network drive, sys.path has its default value.
Case 4: Network drive, directory with files is at the end of sys.path.
Benchmarks were performed on a Pentium 4 clone, 1.4 GHz, 256 Meg.
The machine was running Windows 2000 with a Linux/Samba network server.
Times are in seconds, and are the time to import about 100 Lib modules.
Case 2 and 4 have the "correct" directory moved to the end of sys.path.
"Uncomp" means uncompressed zip archive, "Compr" means compressed.
Initial times are after a re-boot of the system; the time after
"->" is the time after repeated runs. Times to import from C:
after a re-boot are rather highly variable for the "Original" case,
but are more realistic.
Custom Imports
The logic demonstrates the ability to import using default searching
until a needed Python module (in this case, os) becomes available.
This can be used to bootstrap custom importers. For example, if
"importer()" in __init__.py exists, then it could be used for imports.
The "importer()" can freely import os and other modules, and these
will be satisfied from the default mechanism. This PEP does not
define any custom importers, and this note is for information only.
Implementation
A C implementation is available as SourceForge patch 492105.
Superseded by patch 652586 and current CVS.
http://python.org/sf/492105
A newer version (updated for recent CVS by Paul Moore) is 645650.
Superseded by patch 652586 and current CVS.
http://python.org/sf/645650
A competing implementation by Just van Rossum is 652586, which is
the basis for the final implementation of PEP 302. PEP 273 has
been implemented using PEP 302's import hooks.
http://python.org/sf/652586
Copyright
This document has been placed in the public domain.
pep-0274 Dict Comprehensions
| PEP: | 274 |
|---|---|
| Title: | Dict Comprehensions |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Barry Warsaw <barry at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 25-Oct-2001 |
| Python-Version: | 2.7, 3.0 (originally 2.3) |
| Post-History: | 29-Oct-2001 |
Abstract
PEP 202 introduces a syntactical extension to Python called the
"list comprehension"[1]. This PEP proposes a similar syntactical
extension called the "dictionary comprehension" or "dict
comprehension" for short. You can use dict comprehensions in ways
very similar to list comprehensions, except that they produce
Python dictionary objects instead of list objects.
Resolution
This PEP was originally written for inclusion in Python 2.3. It
was withdrawn after observation that substantially all of its
benefits were subsumed by generator expressions coupled with the
dict() constructor.
However, Python 2.7 and 3.0 introduce this exact feature, as well
as the closely related set comprehensions. On 2012-04-09, the PEP
was changed to reflect this reality by updating its Status to
Accepted, and updating the Python-Version field. The Open
Questions section was also removed, since those questions had long
been resolved by the current implementation.
Proposed Solution
Dict comprehensions are just like list comprehensions, except that
you group the expression using curly braces instead of square
brackets. Also, the left part before the 'for' keyword expresses
both a key and a value, separated by a colon. The notation is
specifically designed to remind you of list comprehensions as
applied to dictionaries.
Rationale
There are times when you have some data arranged as a sequence of
length-2 sequences, and you want to turn that into a dictionary.
In Python 2.2, the dict() constructor accepts an argument that is
a sequence of length-2 sequences, used as (key, value) pairs to
initialize a new dictionary object.
However, the act of turning some data into a sequence of length-2
sequences can be inconvenient or inefficient from a memory or
performance standpoint. Also, for some common operations, such as
turning a list of things into a set of things for quick duplicate
removal or set inclusion tests, a better syntax can help code
clarity.
As with list comprehensions, an explicit for loop can always be
used (and in fact was the only way to do it in earlier versions of
Python). But as with list comprehensions, dict comprehensions can
provide a more syntactically succinct idiom than the traditional
for loop.
Semantics
The semantics of dict comprehensions can actually be demonstrated
in stock Python 2.2, by passing a list comprehension to the
built-in dictionary constructor:
>>> dict([(i, chr(65+i)) for i in range(4)])
is semantically equivalent to
>>> {i : chr(65+i) for i in range(4)}
The dictionary constructor approach has two distinct disadvantages
from the proposed syntax though. First, it isn't as legible as a
dict comprehension. Second, it forces the programmer to create an
in-core list object first, which could be expensive.
Examples
>>> print {i : chr(65+i) for i in range(4)}
{0 : 'A', 1 : 'B', 2 : 'C', 3 : 'D'}
>>> print {k : v for k, v in someDict.iteritems()} == someDict.copy()
1
>>> print {x.lower() : 1 for x in list_of_email_addrs}
{'barry@zope.com' : 1, 'barry@python.org' : 1, 'guido@python.org' : 1}
>>> def invert(d):
... return {v : k for k, v in d.iteritems()}
...
>>> d = {0 : 'A', 1 : 'B', 2 : 'C', 3 : 'D'}
>>> print invert(d)
{'A' : 0, 'B' : 1, 'C' : 2, 'D' : 3}
>>> {(k, v): k+v for k in range(4) for v in range(4)}
{(3, 3): 6, (3, 2): 5, (3, 1): 4, (0, 1): 1, (2, 1): 3,
 (0, 2): 2, (3, 0): 3, (0, 3): 3, (1, 1): 2, (1, 0): 1,
 (0, 0): 0, (1, 2): 3, (2, 0): 2, (1, 3): 4, (2, 2): 4,
 (2, 3): 5}
Implementation
All implementation details were resolved in the Python 2.7 and 3.0
time-frame.
References
[1] PEP 202, List Comprehensions
http://www.python.org/dev/peps/pep-0202/
Copyright
This document has been placed in the public domain.
pep-0275 Switching on Multiple Values
| PEP: | 275 |
|---|---|
| Title: | Switching on Multiple Values |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Marc-André Lemburg <mal at lemburg.com> |
| Status: | Rejected |
| Type: | Standards Track |
| Created: | 10-Nov-2001 |
| Python-Version: | 2.6 |
| Post-History: |
Rejection Notice
A similar PEP for Python 3000, PEP 3103 [2], was already rejected,
so this proposal has no chance of being accepted either.
Abstract
This PEP proposes strategies to enhance Python's performance
with respect to handling switching on a single variable having
one of multiple possible values.
Problem
Up to Python 2.5, the typical way of writing multi-value switches
has been to use long switch constructs of the following type:
if x == 'first state':
...
elif x == 'second state':
...
elif x == 'third state':
...
elif x == 'fourth state':
...
else:
# default handling
...
This works fine for short switch constructs, since the overhead of
repeated loading of a local (the variable x in this case) and
comparing it to some constant is low (it has a complexity of O(n)
on average). However, when using such a construct to write a state
machine, such as is needed for writing parsers, the number of
possible states can easily reach 10 or more cases.
The current solution to this problem lies in using a dispatch
table to find the case implementing method to execute depending on
the value of the switch variable (this can be tuned to have a
complexity of O(1) on average, e.g. by using perfect hash
tables). This works well for state machines which require complex
and lengthy processing in the different case methods. It does not
perform well for ones which only process one or two instructions
per case, e.g.
def handle_data(self, data):
self.stack.append(data)
A nice example of this is the state machine implemented in
pickle.py which is used to serialize Python objects. Other
prominent cases include XML SAX parsers and Internet protocol
handlers.
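The dispatch-table technique described above can be sketched in a few
lines of Python (the state names and handler bodies here are
illustrative, not taken from pickle.py):

```python
# Minimal dispatch-table sketch: one dict lookup replaces a chain of
# == comparisons, giving O(1) dispatch on average.  The state names
# and handlers are hypothetical.
def handle_data(stack, arg):
    stack.append(arg)

def handle_pop(stack, arg):
    stack.pop()

HANDLERS = {'data': handle_data, 'pop': handle_pop}

def dispatch(state, stack, arg=None):
    HANDLERS[state](stack, arg)

stack = []
dispatch('data', stack, 'spam')
dispatch('data', stack, 'eggs')
dispatch('pop', stack)
```

As the PEP notes, when each handler is this short, the per-call
function overhead dominates and the dispatch table loses its appeal.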
Proposed Solutions
This PEP proposes two different but not necessarily conflicting
solutions:
1. Adding an optimization to the Python compiler and VM
which detects the above if-elif-else construct and
generates special opcodes for it which use a read-only
dictionary for storing jump offsets.
2. Adding new syntax to Python which mimics the C style
switch statement.
The first solution has the benefit of not relying on adding new
keywords to the language, while the second looks cleaner. Both
involve some run-time overhead to assure that the switching
variable is immutable and hashable.
Both solutions use a dictionary lookup to find the right
jump location, so they both share the same problem space in
terms of requiring that both the switch variable and the
constants be compatible with the dictionary implementation
(hashable, comparable, a==b => hash(a)==hash(b)).
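The invariant in parentheses is exactly the one Python's dict already
relies on; a quick illustration of why it matters for a dict-based
jump table:

```python
# a == b implies hash(a) == hash(b): equal values of different types
# must land in the same dict slot, so a jump table keyed by 1 is
# also found by 1.0.
assert 1 == 1.0 and hash(1) == hash(1.0)
jump_table = {1: 'case one'}
found = jump_table[1.0]   # 1.0 finds the slot keyed by 1
```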
Solution 1: Optimizing if-elif-else
Implementation:
It should be possible for the compiler to detect an
if-elif-else construct which has the following signature:
if x == 'first':...
elif x == 'second':...
else:...
i.e. the left hand side always references the same variable,
the right hand side a hashable immutable builtin type. The
right hand sides need not be all of the same type, but they
should be comparable to the type of the left hand switch
variable.
The compiler could then setup a read-only (perfect) hash
table, store it in the constants and add an opcode SWITCH in
front of the standard if-elif-else byte code stream which
triggers the following run-time behaviour:
At runtime, SWITCH would check x for being one of the
well-known immutable types (strings, unicode, numbers) and
use the hash table for finding the right opcode snippet. If
this condition is not met, the interpreter should revert to
the standard if-elif-else processing by simply skipping the
SWITCH opcode and proceeding with the usual if-elif-else byte
code stream.
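The run-time behaviour of the proposed SWITCH opcode can be modelled
in pure Python (a hypothetical sketch of the semantics, not the
actual opcode implementation):

```python
# Hypothetical model of the proposed SWITCH behaviour: use the
# precomputed jump table only for well-known immutable types
# (exact types, no subtypes, as the PEP requires); otherwise revert
# to the equivalent of ordinary if-elif-else processing.
WELL_KNOWN = (int, float, str)

def switch(x, table, default):
    """table maps constants to zero-argument case functions."""
    if type(x) in WELL_KNOWN:
        return table.get(x, default)()      # O(1) dict dispatch
    # condition not met: fall back to sequential comparison
    for const, case in table.items():
        if x == const:
            return case()
    return default()

result = switch('second',
                {'first': lambda: 1, 'second': lambda: 2},
                lambda: -1)
```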
Issues:
The new optimization should not change the current Python
semantics (by reducing the number of __cmp__ calls and adding
__hash__ calls in if-elif-else constructs which are affected
by the optimization). To assure this, switching can only
safely be implemented either if a "from __future__" style
flag is used, or the switching variable is one of the builtin
immutable types: int, float, string, unicode, etc. (not
subtypes, since it's not clear whether these are still
immutable or not)
To prevent post-modifications of the jump-table dictionary
(which could be used to reach protected code), the jump-table
will have to be a read-only type (e.g. a read-only
dictionary).
The optimization should only be used for if-elif-else
constructs which have a minimum number of n cases (where n is
a number which has yet to be defined depending on performance
tests).
Solution 2: Adding a switch statement to Python
New Syntax:
switch EXPR:
case CONSTANT:
SUITE
case CONSTANT:
SUITE
...
else:
SUITE
(modulo indentation variations)
The "else" part is optional. If no else part is given and
none of the defined cases matches, no action is taken and
the switch statement is ignored. This is in line with the
current if-behaviour. A user who wants to signal this
situation using an exception can define an else-branch
which then implements the intended action.
Note that the constants need not be all of the same type, but
they should be comparable to the type of the switch variable.
Implementation:
The compiler would have to compile this into byte code
similar to this:
def whatis(x):
switch(x):
case 'one':
print '1'
case 'two':
print '2'
case 'three':
print '3'
else:
print "D'oh!"
into (omitting POP_TOP's and SET_LINENO's):
6 LOAD_FAST 0 (x)
9 LOAD_CONST 1 (switch-table-1)
12 SWITCH 26 (to 38)
14 LOAD_CONST 2 ('1')
17 PRINT_ITEM
18 PRINT_NEWLINE
19 JUMP 43
22 LOAD_CONST 3 ('2')
25 PRINT_ITEM
26 PRINT_NEWLINE
27 JUMP 43
30 LOAD_CONST 4 ('3')
33 PRINT_ITEM
34 PRINT_NEWLINE
35 JUMP 43
38 LOAD_CONST 5 ("D'oh!")
41 PRINT_ITEM
42 PRINT_NEWLINE
>>43 LOAD_CONST 0 (None)
46 RETURN_VALUE
Where the 'SWITCH' opcode would jump to 14, 22, 30 or 38
depending on 'x'.
Thomas Wouters has written a patch which demonstrates the
above. You can download it from [1].
Issues:
The switch statement should not implement fall-through
behaviour (as does the switch statement in C). Each case
defines a complete and independent suite, much like in an
if-elif-else statement. This also enables using break in
switch statements inside loops.
If the interpreter finds that the switch variable x is
not hashable, it should raise a TypeError at run-time
pointing out the problem.
There have been other proposals for the syntax which reuse
existing keywords and avoid adding two new ones ("switch" and
"case"). Others have argued that the keywords should use new
terms to avoid confusion with the C keywords of the same name
but slightly different semantics (e.g. fall-through without
break). Some of the proposed variants:
case EXPR:
of CONSTANT:
SUITE
of CONSTANT:
SUITE
else:
SUITE
case EXPR:
if CONSTANT:
SUITE
if CONSTANT:
SUITE
else:
SUITE
when EXPR:
in CONSTANT_TUPLE:
SUITE
in CONSTANT_TUPLE:
SUITE
...
else:
SUITE
The switch statement could be extended to allow multiple
values for one section (e.g. case 'a', 'b', 'c': ...). Another
proposed extension would allow ranges of values (e.g. case
10..14: ...). These should probably be postponed, but already
kept in mind when designing and implementing a first version.
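The "multiple values per case" extension maps naturally onto the
dictionary model: each value in a case group points at the same jump
target. A sketch (the handler names are made up for illustration):

```python
# Sketch of 'case a, b, c:' semantics using a dict: every value in a
# case group is mapped to the same handler function.
def build_switch(groups, default):
    table = {}
    for values, handler in groups:
        for value in values:
            table[value] = handler
    return lambda x: table.get(x, default)()

classify = build_switch(
    [(('a', 'e', 'i', 'o', 'u'), lambda: 'vowel')],
    lambda: 'consonant',
)
```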
Examples:
The following examples all use a new syntax as proposed by
solution 2. However, all of these examples would work with
solution 1 as well.
switch EXPR: switch x:
case CONSTANT: case "first":
SUITE print x
case CONSTANT: case "second":
SUITE x = x**2
... print x
else: else:
SUITE print "whoops!"
case EXPR: case x:
of CONSTANT: of "first":
SUITE print x
of CONSTANT: of "second":
SUITE print x**2
else: else:
SUITE print "whoops!"
case EXPR: case state:
if CONSTANT: if "first":
SUITE state = "second"
if CONSTANT: if "second":
SUITE state = "third"
else: else:
SUITE state = "first"
when EXPR: when state:
in CONSTANT_TUPLE: in ("first", "second"):
SUITE print state
in CONSTANT_TUPLE: state = next_state(state)
SUITE in ("seventh",):
... print "done"
else: break # out of loop!
SUITE else:
print "middle state"
state = next_state(state)
Here's another nice application found by Jack Jansen (switching
on argument types):
switch type(x).__name__:
case 'int':
SUITE
case 'string':
SUITE
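A runnable equivalent of the type-switching example, written with a
plain dictionary rather than the proposed syntax (note that the
PEP-era type name 'string' corresponds to 'str' in modern Python):

```python
# Dict-based equivalent of switching on type(x).__name__.
def describe(x):
    cases = {
        'int': lambda: 'an int',
        'str': lambda: 'a string',
    }
    return cases.get(type(x).__name__, lambda: 'something else')()
```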
Scope
XXX Explain "from __future__ import switch"
Credits
Martin von Löwis (issues with the optimization idea)
Thomas Wouters (switch statement + byte code compiler example)
Skip Montanaro (dispatching ideas, examples)
Donald Beaudry (switch syntax)
Greg Ewing (switch syntax)
Jack Jansen (type switching examples)
References
[1] https://sourceforge.net/tracker/index.php?func=detail&aid=481118&group_id=5470&atid=305470
[2] http://www.python.org/dev/peps/pep-3103
Copyright
This document has been placed in the public domain.
pep-0276 Simple Iterator for ints
| PEP: | 276 |
|---|---|
| Title: | Simple Iterator for ints |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Jim Althoff <james_althoff at i2.com> |
| Status: | Rejected |
| Type: | Standards Track |
| Created: | 12-Nov-2001 |
| Python-Version: | 2.3 |
| Post-History: |
Abstract
Python 2.1 added new functionality to support iterators[1].
Iterators have proven to be useful and convenient in many coding
situations. It is noted that the implementation of Python's
for-loop control structure uses the iterator protocol as of
release 2.1. It is also noted that Python provides iterators for
the following builtin types: lists, tuples, dictionaries, strings,
and files. This PEP proposes the addition of an iterator for the
builtin type int (types.IntType). Such an iterator would simplify
the coding of certain for-loops in Python.
BDFL Pronouncement
This PEP was rejected on 17 June 2005 with a note to python-dev.
Much of the original need was met by the enumerate() function which
was accepted for Python 2.3.
Also, the proposal both allowed and encouraged misuses such as:
>>> for i in 3: print i
0
1
2
Likewise, it was not helpful that the proposal would disable the
syntax error in statements like:
x, = 1
Specification
Define an iterator for types.intType (i.e., the builtin type
"int") that is returned from the builtin function "iter" when
called with an instance of types.intType as the argument.
The returned iterator has the following behavior:
- Assume that object i is an instance of types.intType (the
builtin type int) and that i > 0
- iter(i) returns an iterator object
- said iterator object iterates through the sequence of ints
0,1,2,...,i-1
Example:
iter(5) returns an iterator object that iterates through the
sequence of ints 0,1,2,3,4
- if i <= 0, iter(i) returns an "empty" iterator, i.e., one that
throws StopIteration upon the first call of its "next" method
In other words, the conditions and semantics of said iterator are
consistent with the conditions and semantics of the range() and
xrange() functions.
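These semantics can be sketched with an int subclass, the same route
the PEP author mentions having used for experimentation (the class
name here is made up):

```python
# Hypothetical int subclass showing the proposed semantics:
# iterating i yields 0, 1, ..., i-1, and i <= 0 gives an empty
# iterator, matching range()/xrange().
class IterInt(int):
    def __iter__(self):
        return iter(range(max(self, 0)))

values = list(IterInt(5))    # [0, 1, 2, 3, 4]
empty = list(IterInt(-3))    # []
```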
Note that the sequence 0,1,2,...,i-1 associated with the int i is
considered "natural" in the context of Python programming because
it is consistent with the builtin indexing protocol of sequences
in Python. Python lists and tuples, for example, are indexed
starting at 0 and ending at len(object)-1 (when using positive
indices). In other words, such objects are indexed with the
sequence 0,1,2,...,len(object)-1
Rationale
A common programming idiom is to take a collection of objects and
apply some operation to each item in the collection in some
established sequential order. Python provides the "for in"
looping control structure for handling this common idiom. Cases
arise, however, where it is necessary (or more convenient) to
access each item in an "indexed" collection by iterating through
each index and accessing each item in the collection using the
corresponding index.
For example, one might have a two-dimensional "table" object where one
requires the application of some operation to the first column of
each row in the table. Depending on the implementation of the table
it might not be possible to access first each row and then each
column as individual objects. It might, rather, be possible to
access a cell in the table using a row index and a column index.
In such a case it is necessary to use an idiom where one iterates
through a sequence of indices (indexes) in order to access the
desired items in the table. (Note that the commonly used
DefaultTableModel class in Java-Swing-Jython has this very protocol).
Another common example is where one needs to process two or more
collections in parallel. Another example is where one needs to
access, say, every second item in a collection.
There are many other examples where access to items in a
collection is facilitated by a computation on an index thus
necessitating access to the indices rather than direct access to
the items themselves.
Let's call this idiom the "indexed for-loop" idiom. Some
programming languages provide builtin syntax for handling this
idiom. In Python the common convention for implementing the
indexed for-loop idiom is to use the builtin range() or xrange()
function to generate a sequence of indices as in, for example:
for rowcount in range(table.getRowCount()):
print table.getValueAt(rowcount, 0)
or
for rowcount in xrange(table.getRowCount()):
print table.getValueAt(rowcount, 0)
From time to time there are discussions in the Python community
about the indexed for-loop idiom. It is sometimes argued that the
need for using the range() or xrange() function for this design
idiom is:
- Not obvious (to new-to-Python programmers),
- Error prone (easy to forget, even for experienced Python
programmers)
- Confusing and distracting for those who feel compelled to understand
the differences and recommended usage of xrange() vis-a-vis range()
- Unwieldy, especially when combined with the len() function,
i.e., xrange(len(sequence))
- Not as convenient as equivalent mechanisms in other languages,
- Annoying, a "wart", etc.
And from time to time proposals are put forth for ways in which
Python could provide a better mechanism for this idiom. Recent
examples include PEP 204, "Range Literals", and PEP 212, "Loop
Counter Iteration".
Most often, such proposal include changes to Python's syntax and
other "heavyweight" changes.
Part of the difficulty here is that advocating new syntax implies
a comprehensive solution for "general indexing" that has to
include aspects like:
- starting index value
- ending index value
- step value
- open intervals versus closed intervals versus half opened intervals
Finding a new syntax that is comprehensive, simple, general,
Pythonic, appealing to many, easy to implement, not in conflict
with existing structures, not excessively overloading of existing
structures, etc. has proven to be more difficult than one might
anticipate.
The proposal outlined in this PEP tries to address the problem by
suggesting a simple "lightweight" solution that helps the most
common case by using a proven mechanism that is already available
(as of Python 2.1): namely, iterators.
Because for-loops already use "iterator" protocol as of Python
2.1, adding an iterator for types.IntType as proposed in this PEP
would enable by default the following shortcut for the indexed
for-loop idiom:
for rowcount in table.getRowCount():
print table.getValueAt(rowcount, 0)
The following benefits for this approach vis-a-vis the current
mechanism of using the range() or xrange() functions are claimed
to be:
- Simpler,
- Less cluttered,
- Focuses on the problem at hand without the need to resort to
secondary implementation-oriented functions (range() and
xrange())
And compared to other proposals for change:
- Requires no new syntax
- Requires no new keywords
- Takes advantage of the new and well-established iterator mechanism
And generally:
- Is consistent with iterator-based "convenience" changes already
included (as of Python 2.1) for other builtin types such as:
lists, tuples, dictionaries, strings, and files.
Backwards Compatibility
The proposed mechanism is generally backwards compatible as it
calls for neither new syntax nor new keywords. All existing,
valid Python programs should continue to work unmodified.
However, this proposal is not perfectly backwards compatible in
the sense that certain statements that are currently invalid
would, under the current proposal, become valid.
Tim Peters has pointed out two such examples:
1) The common case where one forgets to include range() or
xrange(), for example:
for rowcount in table.getRowCount():
print table.getValueAt(rowcount, 0)
in Python 2.2 raises a TypeError exception.
Under the current proposal, the above statement would be valid
and would work as (presumably) intended. Presumably, this is a
good thing.
As noted by Tim, this is the common case of the "forgotten
range" mistake (which one currently corrects by adding a call
to range() or xrange()).
2) The (hopefully) very uncommon case where one makes a typing
mistake when using tuple unpacking. For example:
x, = 1
in Python 2.2 raises a TypeError exception.
Under the current proposal, the above statement would be valid
and would set x to 0. The PEP author has no data as to how
common this typing error is nor how difficult it would be to
catch such an error under the current proposal. He imagines
that it does not occur frequently and that it would be
relatively easy to correct should it happen.
Issues:
Extensive discussions concerning PEP 276 on the Python interest
mailing list suggests a range of opinions: some in favor, some
neutral, some against. Those in favor tend to agree with the
claims above of the usefulness, convenience, ease of learning,
and simplicity of a simple iterator for integers.
Issues with PEP 276 include:
- Using range/xrange is fine as is.
Response: Some posters feel this way. Others disagree.
- Some feel that iterating over the sequence "0, 1, 2, ..., n-1"
for an integer n is not intuitive. "for i in 5:" is considered
(by some) to be "non-obvious", for example. Some dislike this
usage because it doesn't have "the right feel". Some dislike it
because they believe that this type of usage forces one to view
integers as sequences, and this seems wrong to them. Some
dislike it because they prefer to view for-loops as dealing
with explicit sequences rather than with arbitrary iterators.
Response: Some like the proposed idiom and see it as simple,
elegant, easy to learn, and easy to use. Some are neutral on
this issue. Others, as noted, dislike it.
- Is it obvious that iter(5) maps to the sequence 0,1,2,3,4?
Response: Given, as noted above, that Python has a strong
convention for indexing sequences starting at 0 and stopping at
(inclusively) the index whose value is one less than the length
of the sequence, it is argued that the proposed sequence is
reasonably intuitive to the Python programmer while being useful
and practical. More importantly, it is argued that once learned
this convention is very easy to remember. Note that the doc
string for the range function makes a reference to the
natural and useful association between range(n) and the indices
for a list whose length is n.
- Possible ambiguity
for i in 10: print i
might be mistaken for
for i in (10,): print i
Response: This is exactly the same situation with strings in
current Python (replace 10 with 'spam' in the above, for
example).
- Too general: in the newest releases of Python there are
contexts -- as with for-loops -- where iterators are called
implicitly. Some fear that having an iterator invoked for
an integer in one of the context (excluding for-loops) might
lead to unexpected behavior and bugs. The "x, = 1" example
noted above is a case in point.
Response: From the author's perspective the examples of the
above that were identified in the PEP 276 discussions did
not appear to be ones that would be accidentally misused
in ways that would lead to subtle and hard-to-detect errors.
In addition, it seems that there is a way to deal with this
issue by using a variation of what is outlined in the
specification section of this proposal. Instead of adding
an __iter__ method to class int, change the for-loop handling
code to convert (in essence) from
for i in n: # when isinstance(n,int) is 1
to
for i in xrange(n):
This approach gives the same results in a for-loop as an
__iter__ method would but would prevent iteration on integer
values in any other context. Lists and tuples, for example,
don't have __iter__ and are handled with special code.
Integer values would be one more special case.
- "i in n" seems very unnatural.
Response: Some feel that "i in len(mylist)" would be easily
understandable and useful. Some don't like it, particularly
when a literal is used as in "i in 5". If the variant
mentioned in the response to the previous issue is implemented,
this issue is moot. If not, then one could also address this
issue by defining a __contains__ method in class int that would
always raise a TypeError. This would then make the behavior of
"i in n" identical to that of current Python.
- Might dissuade newbies from using the indexed for-loop idiom when
the standard "for item in collection:" idiom is clearly better.
Response: The standard idiom is so nice when it fits that it
needs neither extra "carrot" nor "stick". On the other hand,
one does notice cases of overuse/misuse of the standard idiom
(due, most likely, to the awkwardness of the indexed for-loop
idiom), as in:
for item in sequence:
print sequence.index(item)
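The misuse above is better served by enumerate(), which the BDFL
rejection notice cites as meeting much of the original need:

```python
# enumerate() pairs each item with its index, avoiding the O(n)
# (and duplicate-unsafe) sequence.index() call in the loop body.
sequence = ['a', 'b', 'a']
pairs = [(i, item) for i, item in enumerate(sequence)]
```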
- Why not propose even bigger changes?
The majority of disagreement with PEP 276 came from those who
favor much larger changes to Python to address the more general
problem of specifying a sequence of integers where such
a specification is general enough to handle the starting value,
ending value, and stepping value of the sequence and also
addresses variations of open, closed, and half-open (half-closed)
integer intervals. Many suggestions of such were discussed.
These include:
- adding Haskell-like notation for specifying a sequence of
integers in a literal list,
- various uses of slicing notation to specify sequences,
- changes to the syntax of for-in loops to allow the use of
relational operators in the loop header,
- creation of an integer-interval class along with methods that
overload relational operators or division operators
to provide "slicing" on integer-interval objects,
- and more.
It should be noted that there was much debate but not an
overwhelming consensus for any of these larger-scale suggestions.
Clearly, PEP 276 does not propose such a large-scale change
and instead focuses on a specific problem area. Towards the
end of the discussion period, several posters expressed favor
for the narrow focus and simplicity of PEP 276 vis-a-vis the more
ambitious suggestions that were advanced. There did appear to be
consensus for the need for a PEP for any such larger-scale,
alternative suggestion. In light of this recognition, details of
the various alternative suggestions are not discussed here further.
Implementation
An implementation is not available at this time but is expected
to be straightforward. The author has implemented a subclass of
int with an __iter__ method (written in Python) as a means to test
out the ideas in this proposal, however.
References
[1] PEP 234, Iterators
http://www.python.org/dev/peps/pep-0234/
[2] PEP 204, Range Literals
http://www.python.org/dev/peps/pep-0204/
[3] PEP 212, Loop Counter Iteration
http://www.python.org/dev/peps/pep-0212/
Copyright
This document has been placed in the public domain.
pep-0277 Unicode file name support for Windows NT
| PEP: | 277 |
|---|---|
| Title: | Unicode file name support for Windows NT |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Neil Hodgson <neilh at scintilla.org> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 11-Jan-2002 |
| Python-Version: | 2.3 |
| Post-History: |
Abstract
This PEP discusses supporting access to all files possible on
Windows NT by passing Unicode file names directly to the system's
wide-character functions.
Rationale
Python 2.2 on Win32 platforms converts Unicode file names passed
to open and to functions in the os module into the 'mbcs' encoding
before passing the result to the operating system. This is often
successful in the common case where the script is operating with
the locale set to the same value as when the file was created.
Most machines are set up as one locale and rarely if ever changed
from this locale. For some users, locale is changed more often
and on servers there are often files saved by users using
different locales.
On Windows NT and descendent operating systems, including Windows
2000 and Windows XP, wide-character APIs are available that
provide direct access to all file names, including those that are
not representable using the current locale. The purpose of this
proposal is to provide access to these wide-character APIs through
the standard Python file object and posix module and so provide
access to all files on Windows NT.
Specification
On Windows platforms which provide wide-character file APIs, when
Unicode arguments are provided to file APIs, wide-character calls
are made instead of the standard C library and posix calls.
The Python file object is extended to use a Unicode file name
argument directly rather than converting it. This affects the
file object constructor file(filename[, mode[, bufsize]]) and also
the open function which is an alias of this constructor. When a
Unicode filename argument is used here then the name attribute of
the file object will be Unicode. The representation of a file
object, repr(f) will display Unicode file names as an escaped
string in a similar manner to the representation of Unicode
strings.
The posix module contains functions that take file or directory
names: chdir, listdir, mkdir, open, remove, rename, rmdir, stat,
and _getfullpathname. These will use Unicode arguments directly
rather than converting them. For the rename function, this
behaviour is triggered when either of the arguments is Unicode;
the other argument is then converted to Unicode using the default
encoding.
The listdir function currently returns a list of strings. Under
this proposal, it will return a list of Unicode strings when its
path argument is Unicode.
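This str-in/str-out convention is the ancestor of the behaviour in
modern Python 3, where listdir's result type still follows its
argument type; a quick sketch (not part of the original proposal):

```python
# In Python 3 the Unicode path is the default: a str argument yields
# str names, while a bytes argument yields bytes names.
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, 'x.txt'), 'w').close()
    str_names = os.listdir(d)                  # ['x.txt']
    bytes_names = os.listdir(os.fsencode(d))   # [b'x.txt']
```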
Restrictions
On the consumer Windows operating systems, Windows 95, Windows 98,
and Windows ME, there are no wide-character file APIs so behaviour
is unchanged under this proposal. It may be possible in the
future to extend this proposal to cover these operating systems as
the VFAT-32 file system used by them does support Unicode file
names but access is difficult and so implementing this would
require much work. The "Microsoft Layer for Unicode" could be a
starting point for implementing this.
Python can be compiled with the size of Unicode characters set to
4 bytes rather than 2 by defining PY_UNICODE_TYPE to be a 4 byte
type and Py_UNICODE_SIZE to be 4. As the Windows API does not
accept 4 byte characters, the features described in this proposal
will not work in this mode so the implementation falls back to the
current 'mbcs' encoding technique. This restriction could be lifted
in the future by performing extra conversions using
PyUnicode_AsWideChar but for now that would add too much
complexity for a very rarely used feature.
Reference Implementation
An experimental implementation is available from
[2] http://scintilla.sourceforge.net/winunichanges.zip
[3] An updated version is available at
http://python.org/sf/594001
References
[1] Microsoft Windows APIs
http://msdn.microsoft.com/
Copyright
This document has been placed in the public domain.
pep-0278 Universal Newline Support
| PEP: | 278 |
|---|---|
| Title: | Universal Newline Support |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Jack Jansen <jack at cwi.nl> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 14-Jan-2002 |
| Python-Version: | 2.3 |
| Post-History: |
Abstract
This PEP discusses a way in which Python can support I/O on files
which have a newline format that is not the native format on the
platform, so that Python on each platform can read and import
files with CR (Macintosh), LF (Unix) or CR LF (Windows) line
endings.
It is more and more common to come across files that have an end
of line that does not match the standard on the current platform:
files downloaded over the net, remotely mounted filesystems on a
different platform, Mac OS X with its double standard of Mac and
Unix line endings, etc.
Many tools such as editors and compilers already handle this
gracefully; it would be good if Python did so too.
Specification
Universal newline support is enabled by default,
but can be disabled during the configure of Python.
In a Python with universal newline support the feature is
automatically enabled for all import statements and execfile()
calls. There is no special support for eval() or exec.
In a Python with universal newline support open() the mode
parameter can also be "U", meaning "open for input as a text file
with universal newline interpretation". Mode "rU" is also allowed,
for symmetry with "rb". Mode "U" cannot be
combined with other mode flags such as "+". Any line ending in the
input file will be seen as a '\n' in Python, so little other code has
to change to handle universal newlines.
Conversion of newlines happens in all calls that read data: read(),
readline(), readlines(), etc.
There is no special support for output to file with a different
newline convention, and so mode "wU" is also illegal.
A file object that has been opened in universal newline mode gets
a new attribute "newlines" which reflects the newline convention
used in the file. The value for this attribute is one of None (no
newline read yet), "\r", "\n", "\r\n" or a tuple containing all the
newline types seen.
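This design, including the "newlines" attribute, survives in Python
3's io module; a sketch using TextIOWrapper, which performs the same
translation (not part of the original proposal):

```python
# Universal newline reading in modern io: any of \r, \n, \r\n is
# seen as '\n', and the 'newlines' attribute records which
# conventions were encountered.
import io

raw = io.BytesIO(b'one\r\ntwo\nthree\rfour')
f = io.TextIOWrapper(raw, encoding='ascii', newline=None)
lines = f.readlines()          # ['one\n', 'two\n', 'three\n', 'four']
conventions = set(f.newlines)  # all three conventions were seen
```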
Rationale
Universal newline support is implemented in C, not in Python.
This is done because we want files with a foreign newline
convention to be import-able, so a Python Lib directory can be
shared over a remote file system connection, or between MacPython
and Unix-Python on Mac OS X. For this to be feasible the
universal newline convention needs to have a reasonably small
impact on performance, which means a Python implementation is not
an option as it would bog down all imports. And because of files
with multiple newline conventions, which Visual C++ and other
Windows tools will happily produce, doing a quick check for the
newlines used in a file (handing off the import to C code if a
platform-local newline is seen) will not work. Finally, a C
implementation also allows tracebacks and such (which open the
Python source module) to be handled easily.
There is no output implementation of universal newlines, Python
programs are expected to handle this by themselves or write files
with platform-local convention otherwise. The reason for this is
that input is the difficult case, outputting different newlines to
a file is already easy enough in Python.
Also, an output implementation would be much more difficult than an
input implementation, surprisingly: a lot of output is done through
PyXXX_Print() methods, and at this point the file object is not
available anymore, only a FILE *. So, an output implementation would
need to somehow go from the FILE* to the file object, because that
is where the current newline delimiter is stored.
The input implementation has no such problem: there are no cases in
the Python source tree where files are partially read from C,
partially from Python, and such cases are expected to be rare in
extension modules. If such cases exist the only problem is that the
newlines attribute of the file object is not updated during the
fread() or fgets() calls that are done direct from C.
A partial output implementation, where strings passed to fp.write()
would be converted to use fp.newlines as their line terminator but
all other output would not be, is far too surprising, in my view.
Because there is no output support for universal newlines there is
also no support for a mode "rU+": the surprise factor of the
previous paragraph would hold to an even stronger degree.
There is no support for universal newlines in strings passed to
eval() or exec. It is envisioned that such strings always have the
standard \n line feed, if the strings come from a file that file can
be read with universal newlines.
I think there are no special issues with unicode. utf-16 shouldn't
pose any new problems, as such files need to be opened in binary
mode anyway. Interaction with utf-8 is fine too: values 0x0a and 0x0d
cannot occur as part of a multibyte sequence.
Universal newline files should work fine with iterators and
xreadlines() as these eventually call the normal file
readline/readlines methods.
While universal newlines are automatically enabled for import they
are not for opening, where you have to specifically say open(...,
"U"). This is open to debate, but here are a few reasons for this
design:
- Compatibility. Programs which already do their own
interpretation of \r\n in text files would break. Examples of such
programs would be editors which warn you when you open a file with
a different newline convention. If universal newlines were made the
default such an editor would silently convert your line endings to
the local convention on save. Programs which open binary files as
text files on Unix would also break (but it could be argued they
deserve it :-).
- Interface clarity. Universal newlines are only supported for
input files, not for input/output files, as the semantics would
become muddy. Would you write Mac newlines if all reads so far
had encountered Mac newlines? But what if you then later read a
Unix newline?
The newlines attribute is included so that programs that really
care about the newline convention, such as text editors, can
examine what was in a file. They can then save (a copy of) the
file with the same newline convention (or, in case of a file with
mixed newlines, ask the user what to do, or output in platform
convention).
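As a footnote beyond this PEP: the "newlines" attribute described above survives in today's io module. A small sketch (modern Python 3 shown) of a text file opened with newline=None, which performs universal newline translation by default and records which conventions were actually seen:

```python
import os
import tempfile

# Write a file with mixed line endings, then read it back in
# universal-newlines mode and inspect f.newlines.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(b"one\r\ntwo\nthree\r")
with open(path, newline=None) as f:    # universal newlines (the default)
    data = f.read()
    seen = f.newlines                  # a tuple when conventions are mixed
os.remove(path)

assert data == "one\ntwo\nthree\n"
assert set(seen) == {"\r", "\n", "\r\n"}
```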
Feedback is explicitly solicited on one item in the reference
implementation: whether or not the universal newlines routines
should grab the global interpreter lock. Currently they do not,
but this could be considered living dangerously, as they may
modify fields in a FileObject. But as these routines are
replacements for fgets() and fread() as well it may be difficult
to decide whether or not the lock is held when the routine is
called. Moreover, the only danger is that if two threads read the
same FileObject at the same time an extraneous newline may be seen
or the "newlines" attribute may inadvertently be set to mixed. I
would argue that if you read the same FileObject in two threads
simultaneously you are asking for trouble anyway.
Note that no globally accessible pointers are manipulated in the
fgets() or fread() replacement routines, just some integer-valued
flags, so the chances of core dumps are zero (he said:-).
Universal newline support can be disabled during configure because it does
have a small performance penalty, and moreover the implementation has
not been tested on all conceivable platforms yet. It might also be silly
on some platforms (WinCE or Palm devices, for instance). If universal
newline support is not enabled then file objects do not have the "newlines"
attribute, so testing whether the current Python has it can be done with a
simple
    if hasattr(open, 'newlines'):
        print 'We have universal newline support'
Note that this test uses the open() function rather than the file
type so that it won't fail for versions of Python where the file
type was not available (the file type was added to the built-in
namespace in the same release as the universal newline feature was
added).
Additionally, note that this test fails again on Python versions
>= 2.5, when open() was made a function again and is not synonymous
with the file type anymore.
Reference Implementation
A reference implementation is available in SourceForge patch
#476814: http://www.python.org/sf/476814
References
None.
Copyright
This document has been placed in the public domain.
pep-0279 The enumerate() built-in function
| PEP: | 279 |
|---|---|
| Title: | The enumerate() built-in function |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Raymond Hettinger <python at rcn.com> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 30-Jan-2002 |
| Python-Version: | 2.3 |
| Post-History: |
Abstract
This PEP introduces a new built-in function, enumerate(), to
simplify a commonly used looping idiom. It provides all iterable
collections with the same advantage that iteritems() affords to
dictionaries -- a compact, readable, reliable index notation.
Rationale
Python 2.2 introduced the concept of an iterable interface as
proposed in PEP 234 [3]. The iter() factory function was provided
as a common calling convention, and deep changes were made to use
iterators as a unifying theme throughout Python. The unification
came in the form of establishing a common iterable interface for
mappings, sequences, and file objects.
Generators, as proposed in PEP 255 [1], were introduced as a means
for making it easier to create iterators, especially ones with
complex internal execution or variable states. The availability
of generators makes it possible to improve on the loop counter
ideas in PEP 212 [2]. Those ideas provided a clean syntax for
iteration with indices and values, but did not apply to all
iterable objects. Also, that approach did not have the memory
friendly benefit provided by generators which do not evaluate the
entire sequence all at once.
The new proposal is to add a built-in function, enumerate() which
was made possible once iterators and generators became available.
It provides all iterables with the same advantage that iteritems()
affords to dictionaries -- a compact, readable, reliable index
notation. Like zip(), it is expected to become a commonly used
looping idiom.
This suggestion is designed to take advantage of the existing
implementation and require little additional effort to
incorporate. It is backwards compatible and requires no new
keywords. The proposal will go into Python 2.3 when generators
become final and are not imported from __future__.
BDFL Pronouncements
The new built-in function is ACCEPTED.
Specification for a new built-in:
    def enumerate(collection):
        'Generates an indexed series:  (0,coll[0]), (1,coll[1]) ...'
        i = 0
        it = iter(collection)
        while 1:
            yield (i, it.next())
            i += 1
Note A: PEP 212 Loop Counter Iteration [2] discussed several
proposals for achieving indexing. Some of the proposals only work
for lists unlike the above function which works for any generator,
xrange, sequence, or iterable object. Also, those proposals were
presented and evaluated in the world prior to Python 2.2 which did
not include generators. As a result, the non-generator version in
PEP 212 had the disadvantage of consuming memory with a giant list
of tuples. The generator version presented here is fast and
light, works with all iterables, and allows users to abandon the
sequence in mid-stream with no loss of computation effort.
There are other PEPs which touch on related issues: integer
iterators, integer for-loops, and one for modifying the arguments
to range and xrange. The enumerate() proposal does not preclude
the other proposals and it still meets an important need even if
those are adopted -- the need to count items in any iterable. The
other proposals give a means of producing an index but not the
corresponding value. This is especially problematic if a sequence
is given which doesn't support random access such as a file
object, generator, or sequence defined with __getitem__.
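As the paragraph above notes, enumerate() pairs indices with values even for iterables that support no random access at all. A small illustration (modern Python shown) using a generator:

```python
# A generator has no indexing, yet enumerate() still counts it.
def squares():
    for n in range(3):
        yield n * n

assert list(enumerate(squares())) == [(0, 0), (1, 1), (2, 4)]
```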
Note B: Almost all of the PEP reviewers welcomed the function but
were divided as to whether there should be any built-ins. The
main argument for a separate module was to slow the rate of
language inflation. The main argument for a built-in was that the
function is destined to be part of a core programming style,
applicable to any object with an iterable interface. Just as
zip() solves the problem of looping over multiple sequences, the
enumerate() function solves the loop counter problem.
If only one built-in is allowed, then enumerate() is the most
important general purpose tool, solving the broadest class of
problems while improving program brevity, clarity and reliability.
Note C: Various alternative names were discussed:
iterindexed()-- five syllables is a mouthful
index() -- nice verb but could be confused with the .index() method
indexed() -- widely liked however adjectives should be avoided
indexer() -- noun did not read well in a for-loop
count() -- direct and explicit but often used in other contexts
itercount() -- direct, explicit and hated by more than one person
iteritems() -- conflicts with key:value concept for dictionaries
itemize() -- confusing because amap.items() != list(itemize(amap))
enum() -- pithy; less clear than enumerate; too similar to enum
in other languages where it has a different meaning
All of the names involving 'count' had the further disadvantage of
implying that the count would begin from one instead of zero.
All of the names involving 'index' clashed with usage in database
languages where indexing implies a sorting operation rather than
linear sequencing.
Note D: This function was originally proposed with optional start
and stop arguments. GvR pointed out that the function call
enumerate(seqn,4,6) had an alternate, plausible interpretation as
a slice that would return the fourth and fifth elements of the
sequence. To avoid the ambiguity, the optional arguments were
dropped even though it meant losing flexibility as a loop counter.
That flexibility was most important for the common case of
counting from one, as in:
    for linenum, line in enumerate(source, 1):
        print linenum, line
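Historical note beyond this PEP: Python 2.6 later restored a single optional start argument to enumerate(), which recovers exactly this count-from-one case without the two-extra-argument slice ambiguity GvR objected to.

```python
# enumerate() with the later-added start argument (modern Python).
source = ["alpha", "beta", "gamma"]
assert list(enumerate(source, 1)) == [(1, "alpha"), (2, "beta"), (3, "gamma")]
```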
Comments from GvR: filter and map should die and be subsumed into list
comprehensions, not grow more variants. I'd rather introduce
built-ins that do iterator algebra (e.g. the iterzip that I've
often used as an example).
I like the idea of having some way to iterate over a sequence
and its index set in parallel. It's fine for this to be a
built-in.
I don't like the name "indexed"; adjectives do not make good
function names. Maybe iterindexed()?
Comments from Ka-Ping Yee: I'm also quite happy with everything you
proposed ... and the extra built-ins (really 'indexed' in
particular) are things I have wanted for a long time.
Comments from Neil Schemenauer: The new built-ins sound okay. Guido
may be concerned with increasing the number of built-ins too
much. You might be better off selling them as part of a
module. If you use a module then you can add lots of useful
functions (Haskell has lots of them that we could steal).
Comments from Magnus Lie Hetland: I think indexed would be a useful and
natural built-in function. I would certainly use it a lot. I
like indexed() a lot; +1. I'm quite happy to have it make PEP
281 obsolete. Adding a separate module for iterator utilities
seems like a good idea.
Comments from the Community: The response to the enumerate() proposal
has been close to 100% favorable. Almost everyone loves the
idea.
Author response: Prior to these comments, four built-ins were proposed.
After the comments, xmap, xfilter and xzip were withdrawn. The
one that remains is vital for the language and is proposed by
itself. Indexed() is trivially easy to implement and can be
documented in minutes. More importantly, it is useful in
everyday programming which does not otherwise involve explicit
use of generators.
This proposal originally included another function iterzip().
That was subsequently implemented as the izip() function in
the itertools module.
References
[1] PEP 255 Simple Generators
http://www.python.org/dev/peps/pep-0255/
[2] PEP 212 Loop Counter Iteration
http://www.python.org/dev/peps/pep-0212/
[3] PEP 234 Iterators
http://www.python.org/dev/peps/pep-0234/
Copyright
This document has been placed in the public domain.
pep-0280 Optimizing access to globals
| PEP: | 280 |
|---|---|
| Title: | Optimizing access to globals |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Guido van Rossum <guido at python.org> |
| Status: | Deferred |
| Type: | Standards Track |
| Created: | 10-Feb-2002 |
| Python-Version: | 2.3 |
| Post-History: |
Deferral
While this PEP is a nice idea, no-one has yet emerged to do the work of
hashing out the differences between this PEP, PEP 266 and PEP 267.
Hence, it is being deferred.
Abstract
This PEP describes yet another approach to optimizing access to
module globals, providing an alternative to PEP 266 (Optimizing
Global Variable/Attribute Access by Skip Montanaro) and PEP 267
(Optimized Access to Module Namespaces by Jeremy Hylton).
The expectation is that eventually one approach will be picked and
implemented; possibly multiple approaches will be prototyped
first.
Description
(Note: Jason Orendorff writes: """I implemented this once, long
ago, for Python 1.5-ish, I believe. I got it to the point where
it was only 15% slower than ordinary Python, then abandoned it.
;) In my implementation, "cells" were real first-class objects,
and "celldict" was a copy-and-hack version of dictionary. I
forget how the rest worked.""" Reference:
http://mail.python.org/pipermail/python-dev/2002-February/019876.html)
Let a cell be a really simple Python object, containing a pointer
to a Python object and a pointer to a cell. Both pointers may be
NULL. A Python implementation could be:
    class cell(object):
        def __init__(self):
            self.objptr = NULL
            self.cellptr = NULL
The cellptr attribute is used for chaining cells together for
searching built-ins; this will be explained later.
Let a celldict be a mapping from strings (the names of a module's
globals) to objects (the values of those globals), implemented
using a dict of cells. A Python implementation could be:
    class celldict(object):

        def __init__(self):
            self.__dict = {} # dict of cells

        def getcell(self, key):
            c = self.__dict.get(key)
            if c is None:
                c = cell()
                self.__dict[key] = c
            return c

        def cellkeys(self):
            return self.__dict.keys()

        def __getitem__(self, key):
            c = self.__dict.get(key)
            if c is None:
                raise KeyError, key
            value = c.objptr
            if value is NULL:
                raise KeyError, key
            else:
                return value

        def __setitem__(self, key, value):
            c = self.__dict.get(key)
            if c is None:
                c = cell()
                self.__dict[key] = c
            c.objptr = value

        def __delitem__(self, key):
            c = self.__dict.get(key)
            if c is None or c.objptr is NULL:
                raise KeyError, key
            c.objptr = NULL

        def keys(self):
            return [k for k, c in self.__dict.iteritems()
                    if c.objptr is not NULL]

        def items(self):
            return [(k, c.objptr) for k, c in self.__dict.iteritems()
                    if c.objptr is not NULL]

        def values(self):
            return [c.objptr for c in self.__dict.itervalues()
                    if c.objptr is not NULL]

        def clear(self):
            for c in self.__dict.values():
                c.objptr = NULL

        # Etc.
It is possible that a cell exists corresponding to a given key,
but the cell's objptr is NULL; let's call such a cell empty. When
the celldict is used as a mapping, it is as if empty cells don't
exist. However, once added, a cell is never deleted from a
celldict, and it is possible to get at empty cells using the
getcell() method.
The celldict implementation never uses the cellptr attribute of
cells.
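The empty-cell semantics above can be modeled as a small runnable sketch (modern Python 3 shown; None stands in for the C-level NULL, and the capitalized names are illustrative, chosen to avoid shadowing the PEP's pseudo-code):

```python
NULL = None

class Cell:
    def __init__(self):
        self.objptr = NULL    # the value, or NULL when the cell is empty
        self.cellptr = NULL   # chain toward the builtins' cell

class CellDict:
    def __init__(self):
        self._cells = {}

    def getcell(self, key):
        # Empty cells are created on demand and never removed.
        if key not in self._cells:
            self._cells[key] = Cell()
        return self._cells[key]

    def __setitem__(self, key, value):
        self.getcell(key).objptr = value

    def __getitem__(self, key):
        c = self._cells.get(key)
        if c is None or c.objptr is NULL:
            raise KeyError(key)     # empty cells are invisible to mapping use
        return c.objptr

    def __delitem__(self, key):
        c = self._cells.get(key)
        if c is None or c.objptr is NULL:
            raise KeyError(key)
        c.objptr = NULL             # the cell itself survives deletion

d = CellDict()
d["x"] = 42
cell_x = d.getcell("x")
del d["x"]
# As a mapping, "x" is gone, but the (now empty) cell is still reachable:
assert d.getcell("x") is cell_x and cell_x.objptr is NULL
```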
We change the module implementation to use a celldict for its
__dict__. The module's getattr, setattr and delattr operations
now map to getitem, setitem and delitem on the celldict. The type
of <module>.__dict__ and globals() is probably the only backwards
incompatibility.
When a module is initialized, its __builtins__ is initialized from
the __builtin__ module's __dict__, which is itself a celldict.
For each cell in __builtins__, the new module's __dict__ adds a
cell with a NULL objptr, whose cellptr points to the corresponding
cell of __builtins__. Python pseudo-code (ignoring rexec):
    import __builtin__

    class module(object):

        def __init__(self):
            self.__dict__ = d = celldict()
            d['__builtins__'] = bd = __builtin__.__dict__
            for k in bd.cellkeys():
                c = self.__dict__.getcell(k)
                c.cellptr = bd.getcell(k)

        def __getattr__(self, k):
            try:
                return self.__dict__[k]
            except KeyError:
                raise AttributeError, k

        def __setattr__(self, k, v):
            self.__dict__[k] = v

        def __delattr__(self, k):
            del self.__dict__[k]
The compiler generates LOAD_GLOBAL_CELL <i> (and STORE_GLOBAL_CELL
<i> etc.) opcodes for references to globals, where <i> is a small
index with meaning only within one code object like the const
index in LOAD_CONST. The code object has a new tuple, co_globals,
giving the names of the globals referenced by the code indexed by
<i>. No new analysis is required to be able to do this.
When a function object is created from a code object and a celldict,
the function object creates an array of cell pointers by asking the
celldict for cells corresponding to the names in the code object's
co_globals. If the celldict doesn't already have a cell for a
particular name, it creates an empty one. This array of cell
pointers is stored on the function object as func_cells. When a
function object is created from a regular dict instead of a
celldict, func_cells is a NULL pointer.
When the VM executes a LOAD_GLOBAL_CELL <i> instruction, it gets
cell number <i> from func_cells. It then looks in the cell's
PyObject pointer, and if not NULL, that's the global value. If it
is NULL, it follows the cell's cell pointer to the next cell, if it
is not NULL, and looks in the PyObject pointer in that cell. If
that's also NULL, or if there is no second cell, NameError is
raised. (It could follow the chain of cell pointers until a NULL
cell pointer is found; but I have no use for this.) Similarly for
STORE_GLOBAL_CELL <i>, except it doesn't follow the cell pointer
chain -- it always stores in the first cell.
There are fallbacks in the VM for the case where the function's
globals aren't a celldict, and hence func_cells is NULL. In that
case, the code object's co_globals is indexed with <i> to find the
name of the corresponding global and this name is used to index the
function's globals dict.
Additional Ideas
- Never make func_cells a NULL pointer; instead, make up an array
of empty cells, so that LOAD_GLOBAL_CELL can index func_cells
without a NULL check.
- Make c.cellptr equal to c when a cell is created, so that
LOAD_GLOBAL_CELL can always dereference c.cellptr without a NULL
check.
With these two additional ideas added, here's Python pseudo-code
for LOAD_GLOBAL_CELL:
    def LOAD_GLOBAL_CELL(self, i):
        # self is the frame
        c = self.func_cells[i]
        obj = c.objptr
        if obj is not NULL:
            return obj              # Existing global
        return c.cellptr.objptr     # Built-in or NULL
- Be more aggressive: put the actual values of builtins into module
dicts, not just pointers to cells containing the actual values.
There are two points to this: (1) Simplify and speed access, which
is the most common operation. (2) Support faithful emulation of
extreme existing corner cases.
WRT #2, the set of builtins in the scheme above is captured at the
time a module dict is first created. Mutations to the set of builtin
names following that don't get reflected in the module dicts. Example:
consider files main.py and cheater.py:
    [main.py]
    import cheater

    def f():
        cheater.cheat()
        return pachinko()

    print f()

    [cheater.py]
    def cheat():
        import __builtin__
        __builtin__.pachinko = lambda: 666
If main.py is run under Python 2.2 (or before), 666 is printed. But
under the proposal, __builtin__.pachinko doesn't exist at the time
main's __dict__ is initialized. When the function object for
f is created, main.__dict__ grows a pachinko cell mapping to two
NULLs. When cheat() is called, __builtin__.__dict__ grows a pachinko
cell too, but main.__dict__ doesn't know -- and will never know -- about
that. When f's return statement references pachinko, it will still find
the double-NULLs in main.__dict__'s pachinko cell, and so raise
NameError.
A similar (in cause) break in compatibility can occur if a module
global foo is del'ed, but a builtin foo was created prior to that
but after the module dict was first created. Then the builtin foo
becomes visible in the module under 2.2 and before, but remains
invisible under the proposal.
Mutating builtins is extremely rare (most programs never mutate the
builtins, and it's hard to imagine a plausible use for frequent
mutation of the builtins -- I've never seen or heard of one), so it
doesn't matter how expensive mutating the builtins becomes. OTOH,
referencing globals and builtins is very common. Combining those
observations suggests a more aggressive caching of builtins in module
globals, speeding access at the expense of making mutations of the
builtins (potentially much) more expensive to keep the caches in
synch.
Much of the scheme above remains the same, and most of the rest is
just a little different. A cell changes to:
    class cell(object):
        def __init__(self, obj=NULL, builtin=0):
            self.objptr = obj
            self.builtinflag = builtin
and a celldict maps strings to this version of cells. builtinflag
is true when and only when objptr contains a value obtained from
the builtins; in other words, it's true when and only when a cell
is acting as a cached value. When builtinflag is false, objptr is
the value of a module global (possibly NULL). celldict changes to:
    class celldict(object):

        def __init__(self, builtindict=()):
            self.basedict = builtindict
            self.__dict = d = {}
            for k, v in builtindict.items():
                d[k] = cell(v, 1)

        def __getitem__(self, key):
            c = self.__dict.get(key)
            if c is None or c.objptr is NULL or c.builtinflag:
                raise KeyError, key
            return c.objptr

        def __setitem__(self, key, value):
            c = self.__dict.get(key)
            if c is None:
                c = cell()
                self.__dict[key] = c
            c.objptr = value
            c.builtinflag = 0

        def __delitem__(self, key):
            c = self.__dict.get(key)
            if c is None or c.objptr is NULL or c.builtinflag:
                raise KeyError, key
            c.objptr = NULL
            # We may have unmasked a builtin.  Note that because
            # we're checking the builtin dict for that *now*, this
            # still works if the builtin first came into existence
            # after we were constructed.  Note too that del on
            # namespace dicts is rare, so the expense of this check
            # shouldn't matter.
            if key in self.basedict:
                c.objptr = self.basedict[key]
                assert c.objptr is not NULL # else "in" lied
                c.builtinflag = 1
            else:
                # There is no builtin with the same name.
                assert not c.builtinflag

        def keys(self):
            return [k for k, c in self.__dict.iteritems()
                    if c.objptr is not NULL and not c.builtinflag]

        def items(self):
            return [(k, c.objptr) for k, c in self.__dict.iteritems()
                    if c.objptr is not NULL and not c.builtinflag]

        def values(self):
            return [c.objptr for c in self.__dict.itervalues()
                    if c.objptr is not NULL and not c.builtinflag]

        def clear(self):
            for c in self.__dict.values():
                if not c.builtinflag:
                    c.objptr = NULL

        # Etc.
The speed benefit comes from simplifying LOAD_GLOBAL_CELL, which
I expect is executed more frequently than all other namespace
operations combined:
    def LOAD_GLOBAL_CELL(self, i):
        # self is the frame
        c = self.func_cells[i]
        return c.objptr    # may be NULL (also true before)
That is, accessing builtins and accessing module globals are equally
fast. For module globals, a NULL-pointer test+branch is saved. For
builtins, an additional pointer chase is also saved.
The other part needed to make this fly is expensive, propagating
mutations of builtins into the module dicts that were initialized
from the builtins. This is much like, in 2.2, propagating changes
in new-style base classes to their descendants: the builtins need to
maintain a list of weakrefs to the modules (or module dicts)
initialized from the builtin's dict. Given a mutation to the builtin
dict (adding a new key, changing the value associated with an
existing key, or deleting a key), traverse the list of module dicts
and make corresponding mutations to them. This is straightforward;
for example, if a key is deleted from builtins, execute
reflect_bltin_del in each module:
    def reflect_bltin_del(self, key):
        c = self.__dict.get(key)
        assert c is not None # else we were already out of synch
        if c.builtinflag:
            # Put us back in synch.
            c.objptr = NULL
            c.builtinflag = 0
        # Else we're shadowing the builtin, so don't care that
        # the builtin went away.
Note that c.builtinflag protects from us erroneously deleting a
module global of the same name. Adding a new (key, value) builtin
pair is similar:
    def reflect_bltin_new(self, key, value):
        c = self.__dict.get(key)
        if c is None:
            # Never heard of it before:  cache the builtin value.
            self.__dict[key] = cell(value, 1)
        elif c.objptr is NULL:
            # This used to exist in the module or the builtins,
            # but doesn't anymore; rehabilitate it.
            assert not c.builtinflag
            c.objptr = value
            c.builtinflag = 1
        else:
            # We're shadowing it already.
            assert not c.builtinflag
Changing the value of an existing builtin:
    def reflect_bltin_change(self, key, newvalue):
        c = self.__dict.get(key)
        assert c is not None # else we were already out of synch
        if c.builtinflag:
            # Put us back in synch.
            c.objptr = newvalue
        # Else we're shadowing the builtin, so don't care that
        # the builtin changed.
FAQs
Q. Will it still be possible to:
a) install new builtins in the __builtin__ namespace and have
them available in all already loaded modules right away ?
b) override builtins (e.g. open()) with my own copies
(e.g. to increase security) in a way that makes these new
copies override the previous ones in all modules ?
A. Yes, this is the whole point of this design. In the original
approach, when LOAD_GLOBAL_CELL finds a NULL in the second
cell, it should go back to see if the __builtins__ dict has
been modified (the pseudo code doesn't have this yet). Tim's
"more aggressive" alternative also takes care of this.
Q. How does the new scheme get along with the restricted execution
model?
A. It is intended to support that fully.
Q. What happens when a global is deleted?
A. The module's celldict would have a cell with a NULL objptr for
that key. This is true in both variations, but the "aggressive"
variation goes on to see whether this unmasks a builtin of the
same name, and if so copies its value (just a pointer-copy of the
ultimate PyObject*) into the cell's objptr and sets the cell's
builtinflag to true.
Q. What would the C code for LOAD_GLOBAL_CELL look like?
A. The first version, with the first two bullets under "Additional
ideas" incorporated, could look like this:
    case LOAD_GLOBAL_CELL:
        cell = func_cells[oparg];
        x = cell->objptr;
        if (x == NULL) {
            x = cell->cellptr->objptr;
            if (x == NULL) {
                ... error recovery ...
                break;
            }
        }
        Py_INCREF(x);
        PUSH(x);
        continue;
We could even write it like this (idea courtesy of Ka-Ping Yee):
    case LOAD_GLOBAL_CELL:
        cell = func_cells[oparg];
        x = cell->cellptr->objptr;
        if (x != NULL) {
            Py_INCREF(x);
            PUSH(x);
            continue;
        }
        ... error recovery ...
        break;
In modern CPU architectures, this reduces the number of
branches taken for built-ins, which might be a really good
thing, while any decent memory cache should realize that
cell->cellptr is the same as cell for regular globals and hence
this should be very fast in that case too.
For the aggressive variant:
    case LOAD_GLOBAL_CELL:
        cell = func_cells[oparg];
        x = cell->objptr;
        if (x != NULL) {
            Py_INCREF(x);
            PUSH(x);
            continue;
        }
        ... error recovery ...
        break;
Q. What happens in the module's top-level code where there is
presumably no func_cells array?
A. We could do some code analysis and create a func_cells array,
or we could use LOAD_NAME which should use PyMapping_GetItem on
the globals dict.
Graphics
Ka-Ping Yee supplied a drawing of the state of things after
"import spam", where spam.py contains:
    import eggs

    i = -2
    max = 3

    def foo(n):
        y = abs(i) + max
        return eggs.ham(y + n)
The drawing is at http://web.lfw.org/repo/cells.gif; a larger
version is at http://lfw.org/repo/cells-big.gif; the source is at
http://lfw.org/repo/cells.ai.
Comparison
XXX Here, a comparison of the three approaches could be added.
Copyright
This document has been placed in the public domain.
pep-0281 Loop Counter Iteration with range and xrange
| PEP: | 281 |
|---|---|
| Title: | Loop Counter Iteration with range and xrange |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Magnus Lie Hetland <magnus at hetland.org> |
| Status: | Rejected |
| Type: | Standards Track |
| Created: | 11-Feb-2002 |
| Python-Version: | 2.3 |
| Post-History: |
Abstract
This PEP describes yet another way of exposing the loop counter in for-loops. It basically proposes that the functionality of the function indices() from PEP 212 [1] be included in the existing functions range() and xrange().
Pronouncement
In commenting on PEP 279's enumerate() function, this PEP's author offered, "I'm quite happy to have it make PEP 281 obsolete." Subsequently, PEP 279 was accepted into Python 2.3. On 17 June 2005, the BDFL concurred with it being obsolete and hereby rejected the PEP. For the record, he found some of the examples somewhat jarring in appearance:

    >>> range(range(5), range(10), range(2))
    [5, 7, 9]
Motivation
It is often desirable to loop over the indices of a sequence. PEP
212 describes several ways of doing this, including adding a
built-in function called indices, conceptually defined as
    def indices(sequence):
        return range(len(sequence))
On the assumption that adding functionality to an existing built-in
function may be less intrusive than adding a new built-in function,
this PEP proposes adding this functionality to the existing
functions range() and xrange().
Specification
It is proposed that all three arguments to the built-in functions
range() and xrange() are allowed to be objects with a length
(i.e. objects implementing the __len__ method). If an argument
cannot be interpreted as an integer (i.e. it has no __int__
method), its length will be used instead.
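The coercion rule just specified can be sketched in a few lines (a hypothetical helper written in modern Python; the names _as_int and proposed_range are illustrative, not part of the PEP):

```python
def _as_int(obj):
    """Use the object as an integer if possible, else fall back to its length."""
    try:
        return int(obj)     # objects interpretable as integers
    except TypeError:
        return len(obj)     # otherwise, use the length as proposed

def proposed_range(*args):
    return list(range(*map(_as_int, args)))

assert proposed_range(range(10)) == list(range(10))
assert proposed_range(range(5), range(10), range(2)) == [5, 7, 9]
```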
Examples:
>>> range(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> range(range(5), range(10))
[5, 6, 7, 8, 9]
>>> range(range(5), range(10), range(2))
[5, 7, 9]
>>> list(xrange(range(10)))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(xrange(xrange(10)))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    # Number the lines of a file:
    lines = file.readlines()
    for num in range(lines):
        print num, lines[num]
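What the file-numbering example asks of range() is exactly what PEP 279's enumerate(), the alternative that ultimately superseded this PEP, expresses directly (modern Python shown):

```python
# Numbering lines with enumerate() instead of range(len(...)).
lines = ["first line\n", "second line\n"]
numbered = [(num, line) for num, line in enumerate(lines)]
assert numbered == [(0, "first line\n"), (1, "second line\n")]
```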
Alternatives
A natural alternative to the above specification is allowing
xrange() to access its arguments in a lazy manner. Thus, instead
of using their length explicitly, xrange can return one index for
each element of the stop argument until the end is reached. A
similar lazy treatment makes little sense for the start and step
arguments since their length must be calculated before iteration
can begin. (Actually, the length of the step argument isn't needed
until the second element is returned.)
A pseudo-implementation (using only the stop argument, and assuming
that it is iterable) is:
    def xrange(stop):
        i = 0
        for x in stop:
            yield i
            i += 1
Testing whether to use int() or lazy iteration could be done by
checking for an __iter__ attribute. (This example assumes the
presence of generators, but could easily have been implemented as a
plain iterator object.)
It may be questionable whether this feature is truly useful, since
one would not be able to access the elements of the iterable object
inside the for loop through indexing.
Example:
    # Printing the numbers of the lines of a file:
    for num in range(file):
        print num    # The line itself is not accessible
A more controversial alternative (to deal with this) would be to
let range() behave like the function irange() of PEP 212 when
supplied with a sequence.
Example:
>>> range(5)
[0, 1, 2, 3, 4]
>>> range('abcde')
[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e')]
Backwards Compatibility
The proposal could cause backwards incompatibilities if arguments are used which implement both __int__ and __len__ (or __iter__ in the case of lazy iteration with xrange). The author does not believe that this is a significant problem.
References and Footnotes
[1] PEP 212, Loop Counter Iteration http://www.python.org/dev/peps/pep-0212/
Copyright
This document has been placed in the public domain.
pep-0282 A Logging System
| PEP: | 282 |
|---|---|
| Title: | A Logging System |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | vinay_sajip at red-dove.com (Vinay Sajip), Trent Mick <trentm at activestate.com> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 4-Feb-2002 |
| Python-Version: | 2.3 |
| Post-History: |
Abstract
This PEP describes a proposed logging package for Python's
standard library.
Basically the system involves the user creating one or more logger
objects on which methods are called to log debugging notes,
general information, warnings, errors etc. Different logging
'levels' can be used to distinguish important messages from less
important ones.
A registry of named singleton logger objects is maintained so that
1) different logical logging streams (or 'channels') exist
(say, one for 'zope.zodb' stuff and another for
'mywebsite'-specific stuff)
2) one does not have to pass logger object references around.
The system is configurable at runtime. This configuration
mechanism allows one to tune the level and type of logging done
while not touching the application itself.
Motivation
If a single logging mechanism is enshrined in the standard
library, 1) logging is more likely to be done 'well', and 2)
multiple libraries will be able to be integrated into larger
applications which can be logged reasonably coherently.
Influences
This proposal was put together after having studied the
following logging packages:
o java.util.logging in JDK 1.4 (a.k.a. JSR047) [1]
o log4j [2]
o the Syslog package from the Protomatter project [3]
o MAL's mx.Log package [4]
Simple Example
This shows a very simple example of how the logging package can be
used to generate simple logging output on stderr.
--------- mymodule.py -------------------------------
import logging

log = logging.getLogger("MyModule")

def doIt():
    log.debug("Doin' stuff...")
    # do stuff...
    raise TypeError, "Bogus type error for testing"
-----------------------------------------------------

--------- myapp.py ----------------------------------
import mymodule, logging

logging.basicConfig()
log = logging.getLogger("MyApp")

log.info("Starting my app")
try:
    mymodule.doIt()
except Exception, e:
    log.exception("There was a problem.")
log.info("Ending my app")
-----------------------------------------------------
% python myapp.py
INFO:MyApp: Starting my app
DEBUG:MyModule: Doin' stuff...
ERROR:MyApp: There was a problem.
Traceback (most recent call last):
File "myapp.py", line 9, in ?
mymodule.doIt()
File "mymodule.py", line 7, in doIt
raise TypeError, "Bogus type error for testing"
TypeError: Bogus type error for testing
INFO:MyApp: Ending my app
The above example shows the default output format. All
aspects of the output format should be configurable, so that
you could have output formatted like this:
    2002-04-19 07:56:58,174 MyModule DEBUG - Doin' stuff...
or just
    Doin' stuff...
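In the module as it shipped, an output layout like the first one
above is selected with an ordinary format string; a minimal sketch
in modern Python 3 syntax (the logger name, message and StringIO
capture are illustrative):

```python
import io
import logging

# A handler/formatter pair producing the timestamped layout shown
# above; basicConfig(format=...) accepts the same format string.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s - %(message)s"))

log = logging.getLogger("MyModule")
log.addHandler(handler)
log.setLevel(logging.DEBUG)
log.debug("Doin' stuff...")

assert "MyModule DEBUG - Doin' stuff..." in stream.getvalue()
```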
Control Flow
Applications make logging calls on *Logger* objects. Loggers are
organized in a hierarchical namespace and child Loggers inherit
some logging properties from their parents in the namespace.
Logger names fit into a "dotted name" namespace, with dots
(periods) indicating sub-namespaces. The namespace of logger
objects therefore corresponds to a single tree data structure.
"" is the root of the namespace
"Zope" would be a child node of the root
"Zope.ZODB" would be a child node of "Zope"
These Logger objects create *LogRecord* objects which are passed
to *Handler* objects for output. Both Loggers and Handlers may
use logging *levels* and (optionally) *Filters* to decide if they
are interested in a particular LogRecord. When it is necessary to
output a LogRecord externally, a Handler can (optionally) use a
*Formatter* to localize and format the message before sending it
to an I/O stream.
Each Logger keeps track of a set of output Handlers. By default
all Loggers also send their output to all Handlers of their
ancestor Loggers. Loggers may, however, also be configured to
ignore Handlers higher up the tree.
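The hierarchy and propagation rules above can be exercised
directly (a sketch in modern syntax, reusing the logger names from
the example above):

```python
import logging

# getLogger() names form a tree: "Zope.ZODB" is a child of "Zope",
# and its records propagate to ancestor handlers by default.
zope = logging.getLogger("Zope")
zodb = logging.getLogger("Zope.ZODB")
assert zodb.parent is zope

# Configure a logger to ignore handlers higher up the tree:
zodb.propagate = False
```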
The APIs are structured so that calls on the Logger APIs can be
cheap when logging is disabled. If logging is disabled for a
given log level, then the Logger can make a cheap comparison test
and return. If logging is enabled for a given log level, the
Logger is still careful to minimize costs before passing the
LogRecord into the Handlers. In particular, localization and
formatting (which are relatively expensive) are deferred until the
Handler requests them.
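The deferred formatting can be observed by logging an object whose
str() is costly; a sketch in modern syntax (the Expensive class is
illustrative):

```python
import logging

calls = []

class Expensive:
    """str() of this object is costly; logging defers calling it."""
    def __str__(self):
        calls.append(1)
        return "big report"

log = logging.getLogger("defer.demo")
log.setLevel(logging.INFO)  # DEBUG requests are disabled

# The cheap level comparison fails, so the %-merge of msg and args
# never runs and Expensive.__str__ is never called.
log.debug("report: %s", Expensive())
assert calls == []
```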
The overall Logger hierarchy can also have a level associated with
it, which takes precedence over the levels of individual Loggers.
This is done through a module-level function:
    def disable(lvl):
        """
        Do not generate any LogRecords for requests with a severity
        less than 'lvl'.
        """
        ...
Levels
The logging levels, in increasing order of importance, are:
DEBUG
INFO
WARN
ERROR
CRITICAL
The term CRITICAL is used in preference to FATAL, which is used by
log4j. The levels are conceptually the same - that of a serious,
or very serious, error. However, FATAL implies death, which in
Python implies a raised and uncaught exception, traceback, and
exit. Since the logging module does not enforce such an outcome
from a FATAL-level log entry, it makes sense to use CRITICAL in
preference to FATAL.
These are just integer constants, to allow simple comparison of
importance. Experience has shown that too many levels can be
confusing, as they lead to subjective interpretation of which
level should be applied to any particular log request.
Although the above levels are strongly recommended, the logging
system should not be prescriptive. Users may define their own
levels, as well as the textual representation of any levels. User
defined levels must, however, obey the constraints that they are
all positive integers and that they increase in order of
increasing severity.
User-defined logging levels are supported through two module-level
functions:
    def getLevelName(lvl):
        """Return the text for level 'lvl'."""
        ...

    def addLevelName(lvl, lvlName):
        """
        Add the level 'lvl' with associated text 'lvlName', or set
        the textual representation of existing level 'lvl' to be
        'lvlName'.
        """
        ...
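A brief sketch of these functions in modern syntax (TRACE is an
illustrative user-defined level, obeying the constraints above: a
positive integer, ordered by severity):

```python
import logging

# Register a custom level below DEBUG (whose value is 10).
TRACE = 5
logging.addLevelName(TRACE, "TRACE")
assert logging.getLevelName(TRACE) == "TRACE"

log = logging.getLogger("trace.demo")
log.setLevel(TRACE)
log.log(TRACE, "very detailed message")
```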
Loggers
Each Logger object keeps track of a log level (or threshold) that
it is interested in, and discards log requests below that level.
A *Manager* class instance maintains the hierarchical namespace of
named Logger objects. Generations are denoted with dot-separated
names: Logger "foo" is the parent of Loggers "foo.bar" and
"foo.baz".
The Manager class instance is a singleton and is not directly
exposed to users, who interact with it using various module-level
functions.
The general logging method is:
    class Logger:
        def log(self, lvl, msg, *args, **kwargs):
            """Log 'str(msg) % args' at logging level 'lvl'."""
            ...
However, convenience functions are defined for each logging level:
    class Logger:
        def debug(self, msg, *args, **kwargs): ...
        def info(self, msg, *args, **kwargs): ...
        def warn(self, msg, *args, **kwargs): ...
        def error(self, msg, *args, **kwargs): ...
        def critical(self, msg, *args, **kwargs): ...
Only one keyword argument is recognized at present - "exc_info".
If true, the caller wants exception information to be provided in
the logging output. This mechanism is only needed if exception
information needs to be provided at *any* logging level. In the
more common case, where exception information needs to be added to
the log only when errors occur, i.e. at the ERROR level, then
another convenience method is provided:
    class Logger:
        def exception(self, msg, *args): ...
This should only be called in the context of an exception handler,
and is the preferred way of indicating a desire for exception
information in the log. The other convenience methods are
intended to be called with exc_info only in the unusual situation
where you might want to provide exception information in the
context of an INFO message, for example.
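A sketch of exception() in modern syntax, capturing the output in
a StringIO-backed handler for demonstration:

```python
import io
import logging

stream = io.StringIO()
log = logging.getLogger("exc.demo")
log.addHandler(logging.StreamHandler(stream))

try:
    1 / 0
except ZeroDivisionError:
    # exception() logs at ERROR level and appends the current
    # traceback; it must be called from an exception handler.
    log.exception("There was a problem.")

assert "There was a problem." in stream.getvalue()
assert "ZeroDivisionError" in stream.getvalue()
```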
The "msg" argument shown above will normally be a format string;
however, it can be any object x for which str(x) returns the
format string. This facilitates, for example, the use of an
object which fetches a locale- specific message for an
internationalized/localized application, perhaps using the
standard gettext module. An outline example:
class Message:
"""Represents a message"""
def __init__(self, id):
"""Initialize with the message ID"""
def __str__(self):
"""Return an appropriate localized message text"""
...
logger.info(Message("abc"), ...)
Gathering and formatting data for a log message may be expensive,
and a waste if the logger was going to discard the message anyway.
To see if a request will be honoured by the logger, the
isEnabledFor() method can be used:
    class Logger:
        def isEnabledFor(self, lvl):
            """
            Return true if requests at level 'lvl' will NOT be
            discarded.
            """
            ...
so instead of this expensive and possibly wasteful DOM to XML
conversion:
    ...
    hamletStr = hamletDom.toxml()
    log.info(hamletStr)
    ...
one can do this:
    if log.isEnabledFor(logging.INFO):
        hamletStr = hamletDom.toxml()
        log.info(hamletStr)
When new loggers are created, they are initialized with a level
which signifies "no level". A level can be set explicitly using
the setLevel() method:
    class Logger:
        def setLevel(self, lvl): ...
If a logger's level is not set, the system consults all its
ancestors, walking up the hierarchy until an explicitly set level
is found. That is regarded as the "effective level" of the
logger, and can be queried via the getEffectiveLevel() method:
    def getEffectiveLevel(self): ...
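A sketch of level inheritance in modern syntax (the logger names
are illustrative):

```python
import logging

parent = logging.getLogger("app")
child = logging.getLogger("app.db")

parent.setLevel(logging.WARN)
# The child has no level of its own, so the system walks up the
# hierarchy and finds the parent's explicitly set WARN level.
assert child.getEffectiveLevel() == logging.WARN

child.setLevel(logging.DEBUG)
assert child.getEffectiveLevel() == logging.DEBUG
```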
Loggers are never instantiated directly. Instead, a module-level
function is used:
    def getLogger(name=None): ...
If no name is specified, the root logger is returned. Otherwise,
if a logger with that name exists, it is returned. If not, a new
logger is initialized and returned. Here, "name" is synonymous
with "channel name".
Users can specify a custom subclass of Logger to be used by the
system when instantiating new loggers:
    def setLoggerClass(klass): ...
The passed class should be a subclass of Logger, and its __init__
method should call Logger.__init__.
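A sketch in modern syntax (AuditLogger is an illustrative subclass
honouring the contract above):

```python
import logging

class AuditLogger(logging.Logger):
    """Custom logger class; __init__ chains up as required."""
    def __init__(self, name):
        logging.Logger.__init__(self, name)
        self.audit_trail = []

logging.setLoggerClass(AuditLogger)
log = logging.getLogger("audited.channel")
assert isinstance(log, AuditLogger)

logging.setLoggerClass(logging.Logger)  # restore the default
```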
Handlers
Handlers are responsible for doing something useful with a given
LogRecord. The following core Handlers will be implemented:
- StreamHandler: A handler for writing to a file-like object.
- FileHandler: A handler for writing to a single file or set
of rotating files.
- SocketHandler: A handler for writing to remote TCP ports.
- DatagramHandler: A handler for writing to UDP sockets, for
low-cost logging. Jeff Bauer already had such a system [5].
- MemoryHandler: A handler that buffers log records in memory
until the buffer is full or a particular condition occurs
[1].
- SMTPHandler: A handler for sending to email addresses via SMTP.
- SysLogHandler: A handler for writing to Unix syslog via UDP.
- NTEventLogHandler: A handler for writing to event logs on
Windows NT, 2000 and XP.
- HTTPHandler: A handler for writing to a Web server with
either GET or POST semantics.
Handlers can also have levels set for them using the
setLevel() method:
    def setLevel(self, lvl): ...
The FileHandler can be set up to create a rotating set of log
files. In this case, the file name passed to the constructor is
taken as a "base" file name. Additional file names for the
rotation are created by appending .1, .2, etc. to the base file
name, up to a maximum as specified when rollover is requested.
The setRollover method is used to specify a maximum size for a log
file and a maximum number of backup files in the rotation.
    def setRollover(maxBytes, backupCount): ...
If maxBytes is specified as zero, no rollover ever occurs and the
log file grows indefinitely. If a non-zero size is specified,
when that size is about to be exceeded, rollover occurs. The
rollover method ensures that the base file name is always the most
recent, .1 is the next most recent, .2 the next most recent after
that, and so on.
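The setRollover() API sketched here did not survive unchanged; in
the module as it shipped, the same maxBytes/backupCount behaviour
is provided by logging.handlers.RotatingFileHandler. An
illustrative sketch (the file location is arbitrary):

```python
import logging
import logging.handlers
import os
import tempfile

# maxBytes/backupCount play the role of setRollover's arguments:
# the base name is always the newest file, .1 the next, and so on.
base = os.path.join(tempfile.mkdtemp(), "app.log")
handler = logging.handlers.RotatingFileHandler(
    base, maxBytes=1024, backupCount=3)

log = logging.getLogger("rotate.demo")
log.addHandler(handler)
log.warning("this record goes to %s", base)
handler.close()

with open(base) as f:
    assert "this record goes to" in f.read()
```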
There are many additional handlers implemented in the test/example
scripts provided with [6] - for example, XMLHandler and
SOAPHandler.
LogRecords
A LogRecord acts as a receptacle for information about a
logging event. It is little more than a dictionary, though it
does define a getMessage method which merges a message with
optional runtime arguments.
Formatters
A Formatter is responsible for converting a LogRecord to a string
representation. A Handler may call its Formatter before writing a
record. The following core Formatters will be implemented:
- Formatter: Provide printf-like formatting, using the % operator.
- BufferingFormatter: Provide formatting for multiple
messages, with header and trailer formatting support.
Formatters are associated with Handlers by calling setFormatter()
on a handler:
    def setFormatter(self, form): ...
Formatters use the % operator to format the logging message. The
format string should contain %(name)x and the attribute dictionary
of the LogRecord is used to obtain message-specific data. The
following attributes are provided:
%(name)s Name of the logger (logging channel)
%(levelno)s Numeric logging level for the message (DEBUG,
INFO, WARN, ERROR, CRITICAL)
%(levelname)s Text logging level for the message ("DEBUG", "INFO",
"WARN", "ERROR", "CRITICAL")
%(pathname)s Full pathname of the source file where the logging
call was issued (if available)
%(filename)s Filename portion of pathname
%(module)s Module from which logging call was made
%(lineno)d Source line number where the logging call was issued
(if available)
%(created)f Time when the LogRecord was created (time.time()
return value)
%(asctime)s Textual time when the LogRecord was created
%(msecs)d Millisecond portion of the creation time
%(relativeCreated)d Time in milliseconds when the LogRecord was created,
relative to the time the logging module was loaded
(typically at application startup time)
%(thread)d Thread ID (if available)
%(message)s The result of record.getMessage(), computed just as
the record is emitted
If a formatter sees that the format string includes "%(asctime)s",
the creation time is formatted into the LogRecord's asctime
attribute. To allow flexibility in formatting dates, Formatters
are initialized with a format string for the message as a whole,
and a separate format string for date/time. The date/time format
string should be in time.strftime format. The default value for
the message format is "%(message)s". The default date/time format
is ISO8601.
The formatter uses a class attribute, "converter", to indicate how
to convert a time from seconds to a tuple. By default, the value
of "converter" is "time.localtime". If needed, a different
converter (e.g. "time.gmtime") can be set on an individual
formatter instance, or the class attribute changed to affect all
formatter instances.
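A sketch of switching one formatter instance to UTC in modern
syntax (the record contents are illustrative):

```python
import logging
import time

# Replace the converter on a single instance; the class attribute
# default (time.localtime) is untouched for other formatters.
fmt = logging.Formatter("%(asctime)s %(message)s")
fmt.converter = time.gmtime

record = logging.LogRecord(
    "demo", logging.INFO, "demo.py", 1, "timestamped", None, None)
line = fmt.format(record)
assert line.endswith("timestamped")
```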
Filters
When level-based filtering is insufficient, a Filter can be called
by a Logger or Handler to decide if a LogRecord should be output.
Loggers and Handlers can have multiple filters installed, and any
one of them can veto a LogRecord being output.
    class Filter:
        def filter(self, record):
            """
            Return a value indicating true if the record is to be
            processed. Possibly modify the record, if deemed
            appropriate by the filter.
            """
The default behaviour allows a Filter to be initialized with a
Logger name. This will only allow through events which are
generated using the named logger or any of its children. For
example, a filter initialized with "A.B" will allow events logged
by loggers "A.B", "A.B.C", "A.B.C.D", "A.B.D" etc. but not "A.BB",
"B.A.B" etc. If initialized with the empty string, all events are
passed by the Filter. This filter behaviour is useful when it is
desired to focus attention on one particular area of an
application; the focus can be changed simply by changing a filter
attached to the root logger.
There are many examples of Filters provided in [6].
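The name-based rule above can be checked directly against
logging.Filter (a sketch in modern syntax; the helper function is
illustrative):

```python
import logging

# A Filter initialized with "A.B" passes only records logged
# through "A.B" or its children, per the rules above.
flt = logging.Filter("A.B")

def allowed(name):
    record = logging.LogRecord(
        name, logging.INFO, "demo.py", 1, "m", None, None)
    return bool(flt.filter(record))

assert allowed("A.B") and allowed("A.B.C") and allowed("A.B.C.D")
assert not allowed("A.BB") and not allowed("B.A.B")
```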
Configuration
The main benefit of a logging system like this is that one can
control how much and what logging output one gets from an
application without changing that application's source code.
Therefore, although configuration can be performed through the
logging API, it must also be possible to change the logging
configuration without changing an application at all. For
long-running programs like Zope, it should be possible to change
the logging configuration while the program is running.
Configuration includes the following:
- What logging level a logger or handler should be interested in.
- What handlers should be attached to which loggers.
- What filters should be attached to which handlers and loggers.
- Specifying attributes specific to certain handlers and filters.
In general each application will have its own requirements for how
a user may configure logging output. However, each application
will specify the required configuration to the logging system
through a standard mechanism.
The most simple configuration is that of a single handler, writing
to stderr, attached to the root logger. This configuration is set
up by calling the basicConfig() function once the logging module
has been imported.
    def basicConfig(): ...
For more sophisticated configurations, this PEP makes no specific
proposals, for the following reasons:
- A specific proposal may be seen as prescriptive.
- Without the benefit of wide practical experience in the
Python community, there is no way to know whether any given
configuration approach is a good one. That practice can't
really come until the logging module is used, and that means
until *after* Python 2.3 has shipped.
- There is a likelihood that different types of applications
may require different configuration approaches, so that no
"one size fits all".
The reference implementation [6] has a working configuration file
format, implemented for the purpose of proving the concept and
suggesting one possible alternative. It may be that separate
extension modules, not part of the core Python distribution, are
created for logging configuration and log viewing, supplemental
handlers and other features which are not of interest to the bulk
of the community.
Thread Safety
The logging system should support thread-safe operation without
any special action needing to be taken by its users.
Module-Level Functions
To support use of the logging mechanism in short scripts and small
applications, module-level functions debug(), info(), warn(),
error(), critical() and exception() are provided. These work in
the same way as the correspondingly named methods of Logger - in
fact they delegate to the corresponding methods on the root
logger. A further convenience provided by these functions is that
if no configuration has been done, basicConfig() is automatically
called.
At application exit, all handlers can be flushed by calling the function
    def shutdown(): ...
This will flush and close all handlers.
Implementation
The reference implementation is Vinay Sajip's logging module [6].
Packaging
The reference implementation is implemented as a single module.
This offers the simplest interface - all users have to do is
"import logging" and they are in a position to use all the
functionality available.
References
[1] java.util.logging
http://java.sun.com/j2se/1.4/docs/guide/util/logging/
[2] log4j: a Java logging package
http://jakarta.apache.org/log4j/docs/index.html
[3] Protomatter's Syslog
http://protomatter.sourceforge.net/1.1.6/index.html
http://protomatter.sourceforge.net/1.1.6/javadoc/com/protomatter/syslog/syslog-whitepaper.html
[4] MAL mentions his mx.Log logging module:
http://mail.python.org/pipermail/python-dev/2002-February/019767.html
[5] Jeff Bauer's Mr. Creosote
http://starship.python.net/crew/jbauer/creosote/
[6] Vinay Sajip's logging module.
http://www.red-dove.com/python_logging.html
Copyright
This document has been placed in the public domain.
pep-0283 Python 2.3 Release Schedule
| PEP: | 283 |
|---|---|
| Title: | Python 2.3 Release Schedule |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Guido van Rossum |
| Status: | Final |
| Type: | Informational |
| Created: | 27-Feb-2002 |
| Python-Version: | 2.3 |
| Post-History: | 27-Feb-2002 |
Abstract
This document describes the development and release schedule for
Python 2.3. The schedule primarily concerns itself with PEP-sized
items. Small features may be added up to and including the first
beta release. Bugs may be fixed until the final release.
There will be at least two alpha releases, two beta releases, and
one release candidate. Alpha and beta releases will be spaced at
least 4 weeks apart (except if an emergency release must be made
to correct a blunder in the previous release; then the blunder
release does not count). Release candidates will be spaced at
least one week apart (excepting again blunder corrections).
alpha 1 -- 31 Dec 2002
alpha 2 -- 19 Feb 2003
beta 1 -- 25 Apr 2003
beta 2 -- 29 Jun 2003
candidate 1 -- 18 Jul 2003
candidate 2 -- 24 Jul 2003
final -- 29 Jul 2003
Release Manager
Barry Warsaw, Jeremy Hylton, Tim Peters
Completed features for 2.3
This list is not complete. See Doc/whatsnew/whatsnew23.tex in CVS
for more, and of course Misc/NEWS for the full list.
- Tk 8.4 update.
- The bool type and its constants, True and False (PEP 285).
- PyMalloc was greatly enhanced and is enabled by default.
- Universal newline support (PEP 278).
- PEP 263 Defining Python Source Code Encodings Lemburg
Implemented (at least phase 1, which is all that's planned for
2.3).
- Extended slice notation for all built-in sequences. The patch
by Michael Hudson is now all checked in.
- Speed up list iterations by filling tp_iter and other tweaks.
See http://www.python.org/sf/560736; also done for xrange and
tuples.
- Timeout sockets. http://www.python.org/sf/555085
- Stage B0 of the int/long integration (PEP 237). This means
issuing a FutureWarning about situations where hex or oct
conversions or left shifts returns a different value for an int
than for a long with the same value. The semantics do *not*
change in Python 2.3; that will happen in Python 2.4.
- Nuke SET_LINENO from all code objects (providing a different way
to set debugger breakpoints). This can boost pystone by >5%.
http://www.python.org/sf/587993, now checked in. (Unfortunately
the pystone boost didn't happen. What happened?)
- Write a pymemcompat.h that people can bundle with their
extensions and then use the 2.3 memory interface with all
Pythons in the range 1.5.2 to 2.3. (Michael Hudson checked in
Misc/pymemcompat.h.)
- Add a new concept, "pending deprecation", with associated
warning PendingDeprecationWarning. This warning is normally
suppressed, but can be enabled by a suitable -W option. Only a
few things use this at this time.
- Warn when an extension type's tp_compare returns anything except
-1, 0 or 1. http://www.python.org/sf/472523
- Warn for assignment to None (in various forms).
- PEP 218 Adding a Built-In Set Object Type Wilson
Alex Martelli contributed a new version of Greg Wilson's
prototype, and I've reworked that quite a bit. It's in the
standard library now as the module "sets", although some details
may still change until the first beta release. (There are no
plans to make this a built-in type, for now.)
- PEP 293 Codec error handling callbacks Dörwald
Fully implemented. Error handling in unicode.encode or
str.decode can now be customized.
- PEP 282 A Logging System Mick
Vinay Sajip's implementation has been packagized and imported.
(Documentation and unit tests still pending.)
http://www.python.org/sf/578494
- A modified MRO (Method Resolution Order) algorithm. Consensus
is that we should adopt C3. Samuele Pedroni has contributed a
draft implementation in C, see http://www.python.org/sf/619475
This has now been checked in.
- A new command line option parser. Greg Ward's Optik package
(http://optik.sf.net) has been adopted, converted to a single
module named optparse. See also
http://www.python.org/sigs/getopt-sig/
- A standard datetime type. This started as a wiki:
http://www.zope.org/Members/fdrake/DateTimeWiki/FrontPage . A
prototype was coded in nondist/sandbox/datetime/. Tim Peters
has finished the C implementation and checked it in.
- PEP 273 Import Modules from Zip Archives Ahlstrom
Implemented as a part of the PEP 302 implementation work.
- PEP 302 New Import Hooks JvR
Implemented (though the 2.3a1 release contained some bugs that
have been fixed post-release).
- A new pickling protocol. See PEP 307.
- PEP 305 (CSV File API, by Skip Montanaro et al.) is in; this is
the csv module.
- Raymond Hettinger's itertools module is in.
- PEP 311 (Simplified GIL Acquisition for Extensions, by Mark
Hammond) has been included in beta 1.
- Two new PyArg_Parse*() format codes: 'k' returns an unsigned C
long int that receives the lower LONG_BIT bits of the Python
argument, truncating without range checking; 'K' returns an
unsigned C long long int that receives the lower LONG_LONG_BIT
bits, truncating without range checking. (SF 595026; Thomas
Heller did this work.)
- A new version of IDLE was imported from the IDLEfork project
(http://idlefork.sf.net). The code now lives in the idlelib
package in the standard library and the idle script is installed
by setup.py.
Planned features for 2.3
Too late for anything more to get done here.
Ongoing tasks
The following are ongoing TO-DO items which we should attempt to
work on without hoping for completion by any particular date.
- Documentation: complete the distribution and installation
manuals.
- Documentation: complete the documentation for new-style
classes.
- Look over the Demos/ directory and update where required (Andrew
Kuchling has done a lot of this)
- New tests.
- Fix doc bugs on SF.
- Remove use of deprecated features in the core.
- Document deprecated features appropriately.
- Mark deprecated C APIs with Py_DEPRECATED.
- Deprecate modules which are unmaintained, or perhaps make a new
category for modules 'Unmaintained'
- In general, lots of cleanup so it is easier to move forward.
Open issues
There are some issues that may need more work and/or thought
before the final release (and preferably before the first beta
release): No issues remaining.
Features that did not make it into Python 2.3
- The import lock could use some redesign. (SF 683658.)
- Set API issues; is the sets module perfect?
I expect it's good enough to stop polishing it until we've had
more widespread user experience.
- A nicer API to open text files, replacing the ugly (in some
people's eyes) "U" mode flag. There's a proposal out there to
have a new built-in type textfile(filename, mode, encoding).
(Shouldn't it have a bufsize argument too?)
Ditto.
- New widgets for Tkinter???
Has anyone gotten the time for this? *Are* there any new
widgets in Tk 8.4? Note that we've got better Tix support
already (though not on Windows yet).
- Fredrik Lundh's basetime proposal:
http://effbot.org/ideas/time-type.htm
I believe this is dead now.
- PEP 304 (Controlling Generation of Bytecode Files by Montanaro)
seems to have lost steam.
- For a class defined inside another class, the __name__ should be
"outer.inner", and pickling should work. (SF 633930. I'm no
longer certain this is easy or even right.)
- reST is going to be used a lot in Zope3. Maybe it could become
a standard library module? (Since reST's author thinks it's too
unstable, I'm inclined not to do this.)
- Decide on a clearer deprecation policy (especially for modules)
and act on it. For a start, see this message from Neal Norwitz:
http://mail.python.org/pipermail/python-dev/2002-April/023165.html
There seems insufficient interest in moving this further in an
organized fashion, and it's not particularly important.
- Provide alternatives for common uses of the types module;
Skip Montanaro has posted a proto-PEP for this idea:
http://mail.python.org/pipermail/python-dev/2002-May/024346.html
There hasn't been any progress on this, AFAICT.
- Use pending deprecation for the types and string modules. This
requires providing alternatives for the parts that aren't
covered yet (e.g. string.whitespace and types.TracebackType).
It seems we can't get consensus on this.
- Deprecate the buffer object.
http://mail.python.org/pipermail/python-dev/2002-July/026388.html
http://mail.python.org/pipermail/python-dev/2002-July/026408.html
It seems that this is never going to be resolved.
- PEP 269 Pgen Module for Python Riehl
(Some necessary changes are in; the pgen module itself needs to
mature more.)
- Add support for the long-awaited Python catalog. Kapil
Thangavelu has a Zope-based implementation that he demoed at
OSCON 2002. Now all we need is a place to host it and a person
to champion it. (Some changes to distutils to support this are
in, at least.)
- PEP 266 Optimizing Global Variable/Attribute Access Montanaro
PEP 267 Optimized Access to Module Namespaces Hylton
PEP 280 Optimizing access to globals van Rossum
These are basically three friendly competing proposals. Jeremy
has made a little progress with a new compiler, but it's going
slow and the compiler is only the first step. Maybe we'll be
able to refactor the compiler in this release. I'm tempted to
say we won't hold our breath. In the mean time, Oren Tirosh has
a much simpler idea that may give a serious boost to the
performance of accessing globals and built-ins, by optimizing
and inlining the dict access:
http://tothink.com/python/fastnames/
- Lazily tracking tuples?
http://mail.python.org/pipermail/python-dev/2002-May/023926.html
http://www.python.org/sf/558745
Not much enthusiasm I believe.
- PEP 286 Enhanced Argument Tuples von Loewis
I haven't had the time to review this thoroughly. It seems a
deep optimization hack (also makes better correctness guarantees
though).
- Make 'as' a keyword. It has been a pseudo-keyword long enough.
Too much effort to bother.
Copyright
This document has been placed in the public domain.
pep-0284 Integer for-loops
| PEP: | 284 |
|---|---|
| Title: | Integer for-loops |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | David Eppstein <eppstein at ics.uci.edu>, Greg Ewing <greg.ewing at canterbury.ac.nz> |
| Status: | Rejected |
| Type: | Standards Track |
| Created: | 1-Mar-2002 |
| Python-Version: | 2.3 |
| Post-History: |
Abstract
This PEP proposes to simplify iteration over intervals of
integers, by extending the range of expressions allowed after a
"for" keyword to allow three-way comparisons such as
    for lower <= var < upper:

in place of the current

    for item in list:
syntax. The resulting loop or list iteration will loop over all
values of var that make the comparison true, starting from the
left endpoint of the given interval.
Pronouncement
This PEP is rejected. There were a number of fixable issues with
the proposal (see the fixups listed in Raymond Hettinger's
python-dev post on 18 June 2005). However, even with the fixups the
proposal did not garner support. Specifically, Guido did not buy
the premise that the range() format needed fixing, "The whole point
(15 years ago) of range() was to *avoid* needing syntax to specify a
loop over numbers. I think it's worked out well and there's nothing
that needs to be fixed (except range() needs to become an iterator,
which it will in Python 3.0)."
Rationale
One of the most common uses of for-loops in Python is to iterate
over an interval of integers. Python provides functions range()
and xrange() to generate lists and iterators for such intervals,
which work best for the most frequent case: half-open intervals
increasing from zero. However, the range() syntax is more awkward
for open or closed intervals, and lacks symmetry when reversing
the order of iteration. In addition, the call to an unfamiliar
function makes it difficult for newcomers to Python to understand
code that uses range() or xrange().
The perceived lack of a natural, intuitive integer iteration
syntax has led to heated debate on python-list, and spawned at
least four PEPs before this one. PEP 204 [1] (rejected) proposed
to re-use Python's slice syntax for integer ranges, leading to a
terser syntax but not solving the readability problem of
multi-argument range(). PEP 212 [2] (deferred) proposed several
syntaxes for directly converting a list to a sequence of integer
indices, in place of the current idiom
    range(len(list))
for such conversion, and PEP 281 [3] proposes to simplify the same
idiom by allowing it to be written as
    range(list)
PEP 276 [4] proposes to allow automatic conversion of integers to
iterators, simplifying the most common half-open case but not
addressing the complexities of other types of interval.
Additional alternatives have been discussed on python-list.
The solution described here is to allow a three-way comparison
after a "for" keyword, both in the context of a for-loop and of a
list comprehension:
    for lower <= var < upper:
This would cause iteration over an interval of consecutive
integers, beginning at the left bound in the comparison and ending
at the right bound. The exact comparison operations used would
determine whether the interval is open or closed at either end and
whether the integers are considered in ascending or descending
order.
This syntax closely matches standard mathematical notation, so is
likely to be more familiar to Python novices than the current
range() syntax. Open and closed interval endpoints are equally
easy to express, and the reversal of an integer interval can be
formed simply by swapping the two endpoints and reversing the
comparisons. In addition, the semantics of such a loop would
closely resemble one way of interpreting the existing Python
for-loops:
for item in list
iterates over exactly those values of item that cause the
expression
item in list
to be true. Similarly, the new format
for lower <= var < upper:
would iterate over exactly those integer values of var that cause
the expression
lower <= var < upper
to be true.
Specification
We propose to extend the syntax of a for statement, currently
for_stmt: "for" target_list "in" expression_list ":" suite
          ["else" ":" suite]
as described below:
for_stmt:     "for" for_test ":" suite ["else" ":" suite]
for_test:     target_list "in" expression_list |
              or_expr less_comp or_expr less_comp or_expr |
              or_expr greater_comp or_expr greater_comp or_expr
less_comp:    "<" | "<="
greater_comp: ">" | ">="
Similarly, we propose to extend the syntax of list comprehensions,
currently
list_for: "for" expression_list "in" testlist [list_iter]
by replacing it with:
list_for: "for" for_test [list_iter]
In all cases the expression formed by for_test would be subject to
the same precedence rules as comparisons in expressions. The two
comparison operators in a for_test must be of the same kind (both
from less_comp or both from greater_comp), unlike chained
comparisons in expressions, which carry no such restriction.
We refer to the two or_expr's occurring on the left and right
sides of the for-loop syntax as the bounds of the loop, and the
middle or_expr as the variable of the loop. When a for-loop using
the new syntax is executed, the expressions for both bounds will
be evaluated, and an iterator object created that iterates through
all integers between the two bounds according to the comparison
operations used. The iterator will begin with an integer equal or
near to the left bound, and then step through the remaining
integers with a step size of +1 or -1 if the comparison operation
is in the set described by less_comp or greater_comp respectively.
The execution will then proceed as if the expression had been
for variable in iterator
where "variable" refers to the variable of the loop and "iterator"
refers to the iterator created for the given integer interval.
The values taken by the loop variable in an integer for-loop may
be either plain integers or long integers, according to the
magnitude of the bounds. Both bounds of an integer for-loop must
evaluate to a real numeric type (integer, long, or float). Any
other value will cause the for-loop statement to raise a TypeError
exception.
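The iterator semantics specified above can be sketched with a small helper. This is a hypothetical function for illustration only; the PEP proposes new syntax, not a library call, and the operator-string interface here is an assumption of this sketch.

```python
import math

def interval(left, left_op, right_op, right):
    """Sketch of the integer iterator PEP 284 describes.

    left_op and right_op are "<", "<=", ">" or ">=", and must agree
    in direction, mirroring the less_comp/greater_comp restriction.
    """
    if left_op in ("<", "<="):
        if right_op not in ("<", "<="):
            raise TypeError("comparisons must agree in direction")
        # Ascending: start at the smallest integer satisfying the
        # left comparison, stop at the largest satisfying the right.
        start = math.ceil(left) if left_op == "<=" else math.floor(left) + 1
        stop = math.floor(right) if right_op == "<=" else math.ceil(right) - 1
        return range(start, stop + 1)
    else:
        if right_op not in (">", ">="):
            raise TypeError("comparisons must agree in direction")
        # Descending: step -1 from the larger (left) bound.
        start = math.floor(left) if left_op == ">=" else math.ceil(left) - 1
        stop = math.ceil(right) if right_op == ">=" else math.floor(right) + 1
        return range(start, stop - 1, -1)

# for 1 <= var < 5:  iterates 1, 2, 3, 4
assert list(interval(1, "<=", "<", 5)) == [1, 2, 3, 4]
# for 5 >= var > 1:  iterates 5, 4, 3, 2 (swapped bounds, reversed)
assert list(interval(5, ">=", ">", 1)) == [5, 4, 3, 2]
```

Note how reversing an interval is just a matter of swapping the bounds and flipping the comparison direction, which is the symmetry argument the Rationale makes against range().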
Issues
The following issues were raised in discussion of this and related
proposals on the Python list.
- Should the right bound be evaluated once, or every time through
the loop? Clearly, it only makes sense to evaluate the left
bound once. For reasons of consistency and efficiency, we have
chosen the same convention for the right bound.
- Although the new syntax considerably simplifies integer
for-loops, list comprehensions using the new syntax are not as
simple. We feel that this is appropriate since for-loops are
more frequent than comprehensions.
- The proposal does not allow access to integer iterator objects
such as would be created by xrange. True, but we see this as a
shortcoming in the general list-comprehension syntax, beyond the
scope of this proposal. In addition, xrange() will still be
available.
- The proposal does not allow increments other than 1 and -1.
More general arithmetic progressions would need to be created by
range() or xrange(), or by a list comprehension syntax such as
[2*x for 0 <= x <= 100]
- The position of the loop variable in the middle of a three-way
comparison is not as apparent as the variable in the present
for item in list
syntax, leading to a possible loss of readability. We feel that
this loss is outweighed by the increase in readability from a
natural integer iteration syntax.
- To some extent, this PEP addresses the same issues as PEP 276
[4]. We feel that the two PEPs are not in conflict since PEP
276 is primarily concerned with half-open ranges starting in 0
(the easy case of range()) while this PEP is primarily concerned
with simplifying all other cases. However, if this PEP is
approved, its new simpler syntax for integer loops could to some
extent reduce the motivation for PEP 276.
- It is not clear whether it makes sense to allow floating point
bounds for an integer loop: if a float represents an inexact
value, how can it be used to determine an exact sequence of
integers? On the other hand, disallowing float bounds would
make it difficult to use floor() and ceiling() in integer
for-loops, as it is difficult to use them now with range(). We
have erred on the side of flexibility, but this may lead to some
implementation difficulties in determining the smallest and
largest integer values that would cause a given comparison to be
true.
- Should types other than int, long, and float be allowed as
bounds? Another choice would be to convert all bounds to
integers by int(), and allow as bounds anything that can be so
converted instead of just floats. However, this would change
the semantics: 0.3 <= x is not the same as int(0.3) <= x, and it
would be confusing for a loop with 0.3 as lower bound to start
at zero. Also, in general int(f) can be very far from f.
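The semantic trap described in the last bullet can be demonstrated directly in today's Python:

```python
# The written bound 0.3 excludes 0, but the truncated bound does not:
assert not (0.3 <= 0)    # 0 does not satisfy "0.3 <= x"
assert int(0.3) <= 0     # int(0.3) == 0, so 0 would slip in
# And int(f) truncates toward zero, which can be far from f:
assert int(-0.7) == 0    # whereas math.floor(-0.7) == -1
```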
Implementation
An implementation is not available at this time. Implementation
is not expected to pose any great difficulties: the new syntax
could, if necessary, be recognized by parsing a general expression
after each "for" keyword and testing whether the top level
operation of the expression is "in" or a three-way comparison.
The Python compiler would convert any instance of the new syntax
into a loop over the items in a special iterator object.
References
[1] PEP 204, Range Literals
http://www.python.org/dev/peps/pep-0204/
[2] PEP 212, Loop Counter Iteration
http://www.python.org/dev/peps/pep-0212/
[3] PEP 281, Loop Counter Iteration with range and xrange
http://www.python.org/dev/peps/pep-0281/
[4] PEP 276, Simple Iterator for ints
http://www.python.org/dev/peps/pep-0276/
Copyright
This document has been placed in the public domain.
pep-0285 Adding a bool type
| PEP: | 285 |
|---|---|
| Title: | Adding a bool type |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Guido van Rossum <guido at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 8-Mar-2002 |
| Python-Version: | 2.3 |
| Post-History: | 8-Mar-2002, 30-Mar-2002, 3-Apr-2002 |
Abstract
This PEP proposes the introduction of a new built-in type, bool,
with two constants, False and True. The bool type would be a
straightforward subtype (in C) of the int type, and the values
False and True would behave like 0 and 1 in most respects (for
example, False==0 and True==1 would be true) except repr() and
str(). All built-in operations that conceptually return a Boolean
result will be changed to return False or True instead of 0 or 1;
for example, comparisons, the "not" operator, and predicates like
isinstance().
Review
I've collected enough feedback to last me a lifetime, so I declare
the review period officially OVER. I had Chinese food today; my
fortune cookie said "Strong and bitter words indicate a weak
cause." It reminded me of some of the posts against this
PEP... :-)
Anyway, here are my BDFL pronouncements. (Executive summary: I'm
not changing a thing; all variants are rejected.)
1) Should this PEP be accepted?
=> Yes.
There have been many arguments against the PEP. Many of them
were based on misunderstandings. I've tried to clarify some of
the most common misunderstandings below in the main text of the
PEP. The only issue that weighs at all for me is the tendency
of newbies to write "if x == True" where "if x" would suffice.
More about that below too. I think this is not a sufficient
reason to reject the PEP.
2) Should str(True) return "True" or "1"? "1" might reduce
backwards compatibility problems, but looks strange.
(repr(True) would always return "True".)
=> "True".
Almost all reviewers agree with this.
3) Should the constants be called 'True' and 'False' (similar to
None) or 'true' and 'false' (as in C++, Java and C99)?
=> True and False.
Most reviewers agree that consistency within Python is more
important than consistency with other languages.
4) Should we strive to eliminate non-Boolean operations on bools
in the future, through suitable warnings, so that for example
True+1 would eventually (in Python 3000) be illegal?
=> No.
There's a small but vocal minority that would prefer to see
"textbook" bools that don't support arithmetic operations at
all, but most reviewers agree with me that bools should always
allow arithmetic operations.
5) Should operator.truth(x) return an int or a bool?
=> bool.
Tim Peters believes it should return an int, but almost all
other reviewers agree that it should return a bool. My
rationale: operator.truth() exists to force a Boolean context
on its argument (it calls the C API PyObject_IsTrue()).
Whether the outcome is reported as int or bool is secondary; if
bool exists there's no reason not to use it. (Under the PEP,
operator.truth() now becomes an alias for bool(); that's fine.)
6) Should bool inherit from int?
=> Yes.
In an ideal world, bool might be better implemented as a
separate integer type that knows how to perform mixed-mode
arithmetic. However, inheriting bool from int eases the
implementation enormously (in part since all C code that calls
PyInt_Check() will continue to work -- this returns true for
subclasses of int). Also, I believe this is right in terms of
substitutability: code that requires an int can be fed a bool
and it will behave the same as 0 or 1. Code that requires a
bool may not work when it is given an int; for example, 3 & 4
is 0, but both 3 and 4 are true when considered as truth
values.
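This substitutability asymmetry is easy to verify:

```python
# Both 3 and 4 are true when considered as truth values...
assert bool(3) and bool(4)
# ...but code expecting bools that uses bitwise & gets 0 (false):
assert (3 & 4) == 0
# Converting to bool first gives the expected Boolean AND:
assert (bool(3) & bool(4)) is True
```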
7) Should the name 'bool' be changed?
=> No.
Some reviewers have argued for boolean instead of bool, because
this would be easier to understand (novices may have heard of
Boolean algebra but may not make the connection with bool) or
because they hate abbreviations. My take: Python uses
abbreviations judiciously (like 'def', 'int', 'dict') and I
don't think these are a burden to understanding. To a newbie,
it doesn't matter whether it's called a waffle or a bool; it's
a new word, and they learn quickly what it means.
One reviewer has argued to make the name 'truth'. I find this
an unattractive name, and would actually prefer to reserve this
term (in documentation) for the more abstract concept of truth
values that already exists in Python. For example: "when a
container is interpreted as a truth value, an empty container
is considered false and a non-empty one is considered true."
8) Should we strive to require that Boolean operations (like "if",
"and", "not") have a bool as an argument in the future, so that
for example "if []:" would become illegal and would have to be
written as "if bool([]):" ???
=> No!!!
Some people believe that this is how a language with a textbook
Boolean type should behave. Because it was brought up, others
have worried that I might agree with this position. Let me
make my position on this quite clear. This is not part of the
PEP's motivation and I don't intend to make this change. (See
also the section "Clarification" below.)
Rationale
Most languages eventually grow a Boolean type; even C99 (the new
and improved C standard, not yet widely adopted) has one.
Many programmers apparently feel the need for a Boolean type; most
Python documentation contains a bit of an apology for the absence
of a Boolean type. I've seen lots of modules that defined
constants "False=0" and "True=1" (or similar) at the top and used
those. The problem with this is that everybody does it
differently. For example, should you use "FALSE", "false",
"False", "F" or even "f"? And should false be the value zero or
None, or perhaps a truth value of a different type that will print
as "true" or "false"? Adding a standard bool type to the language
resolves those issues.
Some external libraries (like databases and RPC packages) need to
be able to distinguish between Boolean and integral values, and
while it's usually possible to craft a solution, it would be
easier if the language offered a standard Boolean type. This also
applies to Jython: some Java classes have separately overloaded
methods or constructors for int and boolean arguments. The bool
type can be used to select the boolean variant. (The same is
apparently the case for some COM interfaces.)
The standard bool type can also serve as a way to force a value to
be interpreted as a Boolean, which can be used to normalize
Boolean values. When a Boolean value needs to be normalized to
one of two values, bool(x) is much clearer than "not not x" and
much more concise than
if x:
return 1
else:
return 0
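The equivalence of these normalization spellings can be checked directly:

```python
def normalize(x):
    # Three spellings of "force x to a Boolean"; bool(x) is the
    # clearest and most concise of them.
    via_bool = bool(x)
    via_not = not not x
    via_if = True if x else False
    assert via_bool == via_not == via_if
    return via_bool

assert normalize([]) is False
assert normalize("text") is True
```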
Here are some arguments derived from teaching Python. When
showing people comparison operators etc. in the interactive shell,
I think this is a bit ugly:
>>> a = 13
>>> b = 12
>>> a > b
1
>>>
If this was:
>>> a > b
True
>>>
it would require a millisecond less thinking each time a 0 or 1
was printed.
There's also the issue (which I've seen baffling even experienced
Pythonistas who had been away from the language for a while) that
if you see:
>>> cmp(a, b)
1
>>> cmp(a, a)
0
>>>
you might be tempted to believe that cmp() also returned a truth
value, whereas in reality it can return three different values
(-1, 0, 1). If ints were not (normally) used to represent
Boolean results, this would stand out much more clearly as
something completely different.
Specification
The following Python code specifies most of the properties of the
new type:
class bool(int):

    def __new__(cls, val=0):
        # This constructor always returns an existing instance
        if val:
            return True
        else:
            return False

    def __repr__(self):
        if self:
            return "True"
        else:
            return "False"

    __str__ = __repr__

    def __and__(self, other):
        if isinstance(other, bool):
            return bool(int(self) & int(other))
        else:
            return int.__and__(self, other)

    __rand__ = __and__

    def __or__(self, other):
        if isinstance(other, bool):
            return bool(int(self) | int(other))
        else:
            return int.__or__(self, other)

    __ror__ = __or__

    def __xor__(self, other):
        if isinstance(other, bool):
            return bool(int(self) ^ int(other))
        else:
            return int.__xor__(self, other)

    __rxor__ = __xor__

# Bootstrap truth values through sheer willpower
False = int.__new__(bool, 0)
True = int.__new__(bool, 1)
The values False and True will be singletons, like None. Because
the type has two values, perhaps these should be called
"doubletons"? The real implementation will not allow other
instances of bool to be created.
True and False will properly round-trip through pickling and
marshalling; for example pickle.loads(pickle.dumps(True)) will
return True, and so will marshal.loads(marshal.dumps(True)).
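Because the constants are singletons, the round-trip guarantee can be stated as identity, not just equality:

```python
import pickle
import marshal

# True and False are singletons, so serialization round-trips
# preserve identity, just like None:
assert pickle.loads(pickle.dumps(True)) is True
assert pickle.loads(pickle.dumps(False)) is False
assert marshal.loads(marshal.dumps(True)) is True
```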
All built-in operations that are defined to return a Boolean
result will be changed to return False or True instead of 0 or 1.
In particular, this affects comparisons (<, <=, ==, !=, >, >=, is,
is not, in, not in), the unary operator 'not', the built-in
functions callable(), hasattr(), isinstance() and issubclass(),
the dict method has_key(), the string and unicode methods
endswith(), isalnum(), isalpha(), isdigit(), islower(), isspace(),
istitle(), isupper(), and startswith(), the unicode methods
isdecimal() and isnumeric(), and the 'closed' attribute of file
objects. The predicates in the operator module are also changed
to return a bool, including operator.truth().
Because bool inherits from int, True+1 is valid and equals 2, and
so on. This is important for backwards compatibility: because
comparisons and so on currently return integer values, there's no
way of telling what uses existing applications make of these
values.
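The backwards-compatible arithmetic behavior looks like this in practice:

```python
# bool is a subtype of int, so existing arithmetic keeps working:
assert True + 1 == 2
assert False == 0 and True == 1
assert isinstance(True, int)
# One useful consequence: counting items satisfying a predicate
# by summing the Boolean results of comparisons.
assert sum(x > 2 for x in [1, 2, 3, 4]) == 2
```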
It is expected that over time, the standard library will be
updated to use False and True when appropriate (but not to require
a bool argument type where previously an int was allowed). This
change should not pose additional problems and is not specified in
detail by this PEP.
C API
The header file "boolobject.h" defines the C API for the bool
type. It is included by "Python.h" so there is no need to include
it directly.
The existing names Py_False and Py_True reference the unique bool
objects False and True (previously these referenced static int
objects with values 0 and 1, which were not unique amongst int
values).
A new API, PyObject *PyBool_FromLong(long), takes a C long int
argument and returns a new reference to either Py_False (when the
argument is zero) or Py_True (when it is nonzero).
To check whether an object is a bool, the macro PyBool_Check() can
be used.
The type of bool instances is PyBoolObject *.
The bool type object is available as PyBool_Type.
Clarification
This PEP does *not* change the fact that almost all object types
can be used as truth values. For example, when used in an if
statement, an empty list is false and a non-empty one is true;
this does not change and there is no plan to ever change this.
The only thing that changes is the preferred values to represent
truth values when returned or assigned explicitly. Previously,
these preferred truth values were 0 and 1; the PEP changes the
preferred values to False and True, and changes built-in
operations to return these preferred values.
Compatibility
Because of backwards compatibility, the bool type lacks many
properties that some would like to see. For example, arithmetic
operations with one or two bool arguments are allowed, treating
False as 0 and True as 1. Also, a bool may be used as a sequence
index.
I don't see this as a problem, and I don't want to evolve the
language in this direction either. I don't believe that a
stricter interpretation of "Booleanness" makes the language any
clearer.
Another consequence of the compatibility requirement is that the
expression "True and 6" has the value 6, and similarly the
expression "False or None" has the value None. The "and" and "or"
operators are usefully defined to return the first argument that
determines the outcome, and this won't change; in particular, they
don't force the outcome to be a bool. Of course, if both
arguments are bools, the outcome is always a bool. It can also
easily be coerced into being a bool by writing for example "bool(x
and y)".
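The unchanged short-circuit behavior, and the coercion idiom, can be confirmed directly:

```python
# "and" and "or" return the first argument that determines the
# outcome; they do not force the result to be a bool:
assert (True and 6) == 6
assert (False or None) is None
assert (0 or "fallback") == "fallback"
# When a bool result is wanted, coerce explicitly:
assert bool(True and 6) is True
```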
Resolved Issues
(See also the Review section above.)
- Because the repr() or str() of a bool value is different from an
int value, some code (for example doctest-based unit tests, and
possibly database code that relies on things like "%s" % truth)
may fail. It is easy to work around this (without explicitly
referencing the bool type), and it is expected that this only
affects a very small amount of code that can easily be fixed.
- Other languages (C99, C++, Java) name the constants "false" and
"true", in all lowercase. For Python, I prefer to stick with
the example set by the existing built-in constants, which all
use CapitalizedWords: None, Ellipsis, NotImplemented (as well as
all built-in exceptions). Python's built-in namespace uses all
lowercase for functions and types only.
- It has been suggested that, in order to satisfy user
expectations, for every x that is considered true in a Boolean
context, the expression x == True should be true, and likewise
if x is considered false, x == False should be true. In
particular newbies who have only just learned about Boolean
variables are likely to write
if x == True: ...
instead of the correct form,
if x: ...
There seem to be strong psychological and linguistic reasons why
many people are at first uncomfortable with the latter form, but
I believe that the solution should be in education rather than
in crippling the language. After all, == is generally seen as a
transitive operator, meaning that from a==b and b==c we can
deduce a==c. But if any comparison to True were to report
equality when the other operand was a true value of any type,
atrocities like 6==True==7 would hold true, from which one could
infer the falsehood 6==7. That's unacceptable. (In addition,
it would break backwards compatibility. But even if it didn't,
I'd still be against this, for the stated reasons.)
Newbies should also be reminded that there's never a reason to
write
if bool(x): ...
since the bool is implicit in the "if". Explicit is *not*
better than implicit here, since the added verbiage impairs
readability and there's no other interpretation possible. There
is, however, sometimes a reason to write
b = bool(x)
This is useful when it is unattractive to keep a reference to an
arbitrary object x, or when normalization is required for some
other reason. It is also sometimes appropriate to write
i = int(bool(x))
which converts the bool to an int with the value 0 or 1. This
conveys the intention to henceforth use the value as an int.
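These idioms are summarized below; the variable names are illustrative only:

```python
x = "nonempty"
# The bool is implicit in "if"; these two branches are equivalent,
# so "if bool(x):" adds only noise:
assert ("yes" if x else "no") == ("yes" if bool(x) else "no")
# Normalizing a truth value without keeping a reference to x:
b = bool(x)
assert b is True
# Converting to an int with value 0 or 1, signaling that the value
# will henceforth be used as an int:
i = int(bool(x))
assert i == 1 and type(i) is int
```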
Implementation
A complete implementation in C has been uploaded to the
SourceForge patch manager:
http://python.org/sf/528022
This will soon be checked into CVS for python 2.3a0.
Copyright
This document has been placed in the public domain.
pep-0286 Enhanced Argument Tuples
| PEP: | 286 |
|---|---|
| Title: | Enhanced Argument Tuples |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Martin von Löwis <martin at v.loewis.de> |
| Status: | Deferred |
| Type: | Standards Track |
| Created: | 3-Mar-2002 |
| Python-Version: | 2.3 |
| Post-History: |
Abstract
PyArg_ParseTuple is confronted with difficult memory management if
an argument converter creates new memory. To deal with these
cases, a specialized argument type is proposed.
PEP Deferral
Further exploration of the concepts covered in this PEP has been deferred
for lack of a current champion interested in promoting the goals of the
PEP and collecting and incorporating feedback, and with sufficient
available time to do so effectively.
The resolution of this PEP may also be affected by the resolution of
PEP 426, which proposes the use of a preprocessing step to generate
some aspects of C API interface code.
Problem description
Today, argument tuples keep references to the function arguments,
which are guaranteed to live as long as the argument tuple exists
which is at least as long as the function call is being executed.
In some cases, parsing an argument will allocate new memory, which
is then to be released by the caller. This has two problems:
1. In case of failure, the application cannot know what memory to
release; most callers don't even know that they have the
responsibility to release that memory. Example for this are
the N converter (bug #416288) and the es# converter (bug
#501716).
2. Even for successful argument parsing, it is still inconvenient
for the caller to be responsible for releasing the memory. In
some cases, this is unnecessarily inefficient. For example,
the es converter copies the conversion result into memory, even
though there already is a string object that has the right
contents.
Proposed solution
A new type 'argument tuple' is introduced. This type derives from
tuple, adding an __dict__ member (at tp_dictoffset -4). Instances
of this type might get the following attributes:
- 'failobjects', a list of objects which need to be deallocated
in case of failure
- 'okobjects', a list of objects which will be released when the
argument tuple is released
To manage this type, the following functions will be added, and
used appropriately in ceval.c and getargs.c:
- PyArgTuple_New(int);
- PyArgTuple_AddFailObject(PyObject*, PyObject*);
- PyArgTuple_AddFailMemory(PyObject*, void*);
- PyArgTuple_AddOkObject(PyObject*, PyObject*);
- PyArgTuple_AddOkMemory(PyObject*, void*);
- PyArgTuple_ClearFailed(PyObject*);
When argument parsing fails, all fail objects will be released
through Py_DECREF, and all fail memory will be released through
PyMem_Free. If parsing succeeds, the references to the fail
objects and fail memory are dropped, without releasing anything.
When the argument tuple is released, all ok objects and memory
will be released.
If those functions are called with an object of a different type,
a warning is issued and no further action is taken; usage of the
affected converters without using argument tuples is deprecated.
Affected converters
The following converters will add fail memory and fail objects: N,
es, et, es#, et# (unless memory is passed into the converter)
New converters
To simplify Unicode conversion, the e* converters are duplicated
as E* converters (Es, Et, Es#, Et#). The usage of the E*
converters is identical to that of the e* converters, except that
the application will not need to manage the resulting memory.
This will be implemented through registration of Ok objects with
the argument tuple. The e* converters are deprecated.
Copyright
This document has been placed in the public domain.
pep-0287 reStructuredText Docstring Format
| PEP: | 287 |
|---|---|
| Title: | reStructuredText Docstring Format |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | David Goodger <goodger at python.org> |
| Discussions-To: | <doc-sig at python.org> |
| Status: | Active |
| Type: | Informational |
| Content-Type: | text/x-rst |
| Created: | 25-Mar-2002 |
| Post-History: | 02-Apr-2002 |
| Replaces: | 216 |
Contents
Abstract
When plaintext hasn't been expressive enough for inline documentation, Python programmers have sought out a format for docstrings. This PEP proposes that the reStructuredText markup [5] be adopted as a standard markup format for structured plaintext documentation in Python docstrings, and for PEPs and ancillary documents as well. reStructuredText is a rich and extensible yet easy-to-read, what-you-see-is-what-you-get plaintext markup syntax.
Only the low-level syntax of docstrings is addressed here. This PEP is not concerned with docstring semantics or processing at all (see PEP 256 for a "Road Map to the Docstring PEPs"). Nor is it an attempt to deprecate pure plaintext docstrings, which are always going to be legitimate. The reStructuredText markup is an alternative for those who want more expressive docstrings.
Benefits
Programmers are by nature a lazy breed. We reuse code with functions, classes, modules, and subsystems. Through its docstring syntax, Python allows us to document our code from within. The "holy grail" of the Python Documentation Special Interest Group (Doc-SIG [6]) has been a markup syntax and toolset to allow auto-documentation, where the docstrings of Python systems can be extracted in context and processed into useful, high-quality documentation for multiple purposes.
Document markup languages have three groups of customers: the authors who write the documents, the software systems that process the data, and the readers, who are the final consumers and the most important group. Most markups are designed for the authors and software systems; readers are only meant to see the processed form, either on paper or via browser software. ReStructuredText is different: it is intended to be easily readable in source form, without prior knowledge of the markup. ReStructuredText is entirely readable in plaintext format, and many of the markup forms match common usage (e.g., *emphasis*), so it reads quite naturally. Yet it is rich enough to produce complex documents, and extensible so that there are few limits. Of course, to write reStructuredText documents some prior knowledge is required.
The markup offers functionality and expressivity, while maintaining easy readability in the source text. The processed form (HTML etc.) makes it all accessible to readers: inline live hyperlinks; live links to and from footnotes; automatic tables of contents (with live links!); tables; images for diagrams etc.; pleasant, readable styled text.
The reStructuredText parser is available now, part of the Docutils [24] project. Standalone reStructuredText documents and PEPs can be converted to HTML; other output format writers are being worked on and will become available over time. Work is progressing on a Python source "Reader" which will implement auto-documentation from docstrings. Authors of existing auto-documentation tools are encouraged to integrate the reStructuredText parser into their projects, or better yet, to join forces to produce a world-class toolset for the Python standard library.
Tools will become available in the near future, which will allow programmers to generate HTML for online help, XML for multiple purposes, and eventually PDF, DocBook, and LaTeX for printed documentation, essentially "for free" from the existing docstrings. The adoption of a standard will, at the very least, benefit docstring processing tools by preventing further "reinventing the wheel".
Eventually PyDoc, the one existing standard auto-documentation tool, could have reStructuredText support added. In the interim it will have no problem with reStructuredText markup, since it treats all docstrings as preformatted plaintext.
Goals
These are the generally accepted goals for a docstring format, as discussed in the Doc-SIG:
- It must be readable in source form by the casual observer.
- It must be easy to type with any standard text editor.
- It must not need to contain information which can be deduced from parsing the module.
- It must contain sufficient information (structure) so it can be converted to any reasonable markup format.
- It must be possible to write a module's entire documentation in docstrings, without feeling hampered by the markup language.
reStructuredText meets and exceeds all of these goals, and sets its own goals as well, even more stringent. See Docstring-Significant Features below.
The goals of this PEP are as follows:
To establish reStructuredText as a standard structured plaintext format for docstrings (inline documentation of Python modules and packages), PEPs, README-type files and other standalone documents. "Accepted" status will be sought through Python community consensus and eventual BDFL pronouncement.
Please note that reStructuredText is being proposed as a standard, not the only standard. Its use will be entirely optional. Those who don't want to use it need not.
To solicit and address any related concerns raised by the Python community.
To encourage community support. As long as multiple competing markups are out there, the development community remains fractured. Once a standard exists, people will start to use it, and momentum will inevitably gather.
To consolidate efforts from related auto-documentation projects. It is hoped that interested developers will join forces and work on a joint/merged/common implementation.
Once reStructuredText is a Python standard, effort can be focused on tools instead of arguing for a standard. Python needs a standard set of documentation tools.
With regard to PEPs, one or both of the following strategies may be applied:
- (a) Keep the existing PEP section structure constructs (one-line section headers, indented body text). Subsections can either be forbidden, or supported with reStructuredText-style underlined headers in the indented body text.
- (b) Replace the PEP section structure constructs with the reStructuredText syntax. Section headers will require underlines, subsections will be supported out of the box, and body text need not be indented (except for block quotes).
Strategy (b) is recommended, and its implementation is complete.
Support for RFC 2822 headers has been added to the reStructuredText parser for PEPs (unambiguous given a specific context: the first contiguous block of the document). It may be desired to concretely specify what over/underline styles are allowed for PEP section headers, for uniformity.
Rationale
The lack of a standard syntax for docstrings has hampered the development of standard tools for extracting and converting docstrings into documentation in standard formats (e.g., HTML, DocBook, TeX). There have been a number of proposed markup formats and variations, and many tools tied to these proposals, but without a standard docstring format they have failed to gain a strong following and/or floundered half-finished.
Throughout the existence of the Doc-SIG, consensus on a single standard docstring format has never been reached. A lightweight, implicit markup has been sought, for the following reasons (among others):
- Docstrings written within Python code are available from within the interactive interpreter, and can be "print"ed. Thus the use of plaintext for easy readability.
- Programmers want to add structure to their docstrings, without sacrificing raw docstring readability. Unadorned plaintext cannot be transformed ("up-translated") into useful structured formats.
- Explicit markup (like XML or TeX) is widely considered unreadable by the uninitiated.
- Implicit markup is aesthetically compatible with the clean and minimalist Python syntax.
Many alternative markups for docstrings have been proposed on the Doc-SIG over the years; a representative sample is listed below. Each is briefly analyzed in terms of the goals stated above. Please note that this is not intended to be an exclusive list of all existing markup systems; there are many other markups (Texinfo, Doxygen, TIM, YODL, AFT, ...) which are not mentioned.
XML [7], SGML [8], DocBook [9], HTML [10], XHTML [11]
XML and SGML are explicit, well-formed meta-languages suitable for all kinds of documentation. XML is a variant of SGML. They are best used behind the scenes, because to untrained eyes they are verbose, difficult to type, and too cluttered to read comfortably as source. DocBook, HTML, and XHTML are all applications of SGML and/or XML, and all share the same basic syntax and the same shortcomings.
TeX [12]
TeX is similar to XML/SGML in that it's explicit, but not very easy to write, and not easy for the uninitiated to read.
POD [13]
Most Perl modules are documented in a format called POD (Plain Old Documentation). This is an easy-to-type, very low level format with strong integration with the Perl parser. Many tools exist to turn POD documentation into other formats: info, HTML and man pages, among others. However, the POD syntax takes after Perl itself in terms of readability.
JavaDoc [14]
Special comments before Java classes and functions serve to document the code. A program to extract these, and turn them into HTML documentation is called javadoc, and is part of the standard Java distribution. However, JavaDoc has a very intimate relationship with HTML, using HTML tags for most markup. Thus it shares the readability problems of HTML.
Setext [15], StructuredText [16]
Early on, variants of Setext (Structure Enhanced Text), including Zope Corp's StructuredText, were proposed for Python docstring formatting. Hereafter these variants will collectively be called "STexts". STexts have the advantage of being easy to read without special knowledge, and relatively easy to write.
Although used by some (including in most existing Python auto-documentation tools), until now STexts have failed to become standard because:
- STexts have been incomplete. Lacking "essential" constructs that people want to use in their docstrings, STexts are rendered less than ideal. Note that these "essential" constructs are not universal; everyone has their own requirements.
- STexts have been sometimes surprising. Bits of text are unexpectedly interpreted as being marked up, leading to user frustration.
- SText implementations have been buggy.
- Most STexts have had no formal specification except for the implementation itself. A buggy implementation meant a buggy spec, and vice versa.
- There has been no mechanism to get around the SText markup rules when a markup character is used in a non-markup context. In other words, no way to escape markup.
Proponents of implicit STexts have vigorously opposed proposals for explicit markup (XML, HTML, TeX, POD, etc.), and the debates have continued off and on since 1996 or earlier.
reStructuredText is a complete revision and reinterpretation of the SText idea, addressing all of the problems listed above.
Specification
The specification and user documentation for reStructuredText are quite extensive. Rather than repeating or summarizing it all here, links to the originals are provided.
Please first take a look at A ReStructuredText Primer [17], a short and gentle introduction. The Quick reStructuredText [18] user reference quickly summarizes all of the markup constructs. For complete and extensive details, please refer to the following documents:
- An Introduction to reStructuredText [19]
- reStructuredText Markup Specification [20]
- reStructuredText Directives [21]
In addition, Problems With StructuredText [22] explains many markup decisions made with regards to StructuredText, and A Record of reStructuredText Syntax Alternatives [23] records markup decisions made independently.
Docstring-Significant Features
A markup escaping mechanism.
Backslashes (\) are used to escape markup characters when needed for non-markup purposes. However, the inline markup recognition rules have been constructed in order to minimize the need for backslash-escapes. For example, although asterisks are used for emphasis, in non-markup contexts such as "*" or "(*)" or "x * y", the asterisks are not interpreted as markup and are left unchanged. For many non-markup uses of backslashes (e.g., describing regular expressions), inline literals or literal blocks are applicable; see the next item.
Markup to include Python source code and Python interactive sessions: inline literals, literal blocks, and doctest blocks.
Inline literals use double-backquotes to indicate program I/O or code snippets. No markup interpretation (including backslash-escape [\] interpretation) is done within inline literals.
Literal blocks (block-level literal text, such as code excerpts or ASCII graphics) are indented, and indicated with a double-colon ("::") at the end of the preceding paragraph (right here -->):
    if literal_block:
        text = 'is left as-is'
        spaces_and_linebreaks = 'are preserved'
        markup_processing = None

Doctest blocks begin with ">>> " and end with a blank line. Neither indentation nor literal block double-colons are required. For example:
Here's a doctest block:

>>> print 'Python-specific usage examples; begun with ">>>"'
Python-specific usage examples; begun with ">>>"
>>> print '(cut and pasted from interactive sessions)'
(cut and pasted from interactive sessions)
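As an aside for today's readers, the stdlib doctest module consumes exactly this block format. A minimal sketch in modern Python 3 syntax (print is a function now; the shout() helper is invented purely for illustration):

```python
import doctest

def shout(text):
    """Upper-case a string.

    >>> shout('hello')
    'HELLO'
    >>> shout('PEP 287')
    'PEP 287'
    """
    return text.upper()

# Find and run the doctest blocks embedded in shout()'s docstring.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner()
for test in finder.find(shout):
    runner.run(test)
assert runner.failures == 0  # both examples above pass
```

Because doctest blocks are literally cut-and-pasted interactive sessions, the documentation doubles as a regression test.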
Markup that isolates a Python identifier: interpreted text.
Text enclosed in single backquotes is recognized as "interpreted text", whose interpretation is application-dependent. In the context of a Python docstring, the default interpretation of interpreted text is as Python identifiers. The text will be marked up with a hyperlink connected to the documentation for the identifier given. Lookup rules are the same as in Python itself: LGB namespace lookups (local, global, builtin). The "role" of the interpreted text (identifying a class, module, function, etc.) is determined implicitly from the namespace lookup. For example:
class Keeper(Storer):

    """
    Keep data fresher longer.

    Extend `Storer`.  Class attribute `instances` keeps track of
    the number of `Keeper` objects instantiated.
    """

    instances = 0
    """How many `Keeper` objects are there?"""

    def __init__(self):
        """
        Extend `Storer.__init__()` to keep track of instances.

        Keep count in `self.instances` and data in `self.data`.
        """
        Storer.__init__(self)
        self.instances += 1

        self.data = []
        """Store data in a list, most recent last."""

    def storedata(self, data):
        """
        Extend `Storer.storedata()`; append new `data` to a list
        (in `self.data`).
        """
        self.data = data

Each piece of interpreted text is looked up according to the local namespace of the block containing its docstring.
Markup that isolates a Python identifier and specifies its type: interpreted text with roles.
Although the Python source context reader is designed not to require explicit roles, they may be used. To classify identifiers explicitly, the role is given along with the identifier in either prefix or suffix form:
    Use :method:`Keeper.storedata` to store the object's data in
    `Keeper.data`:instance_attribute:.
The syntax chosen for roles is verbose, but necessarily so (if anyone has a better alternative, please post it to the Doc-SIG [6]). The intention of the markup is that there should be little need to use explicit roles; their use is to be kept to an absolute minimum.
Markup for "tagged lists" or "label lists": field lists.
Field lists represent a mapping from field name to field body. These are mostly used for extension syntax, such as "bibliographic field lists" (representing document metadata such as author, date, and version) and extension attributes for directives (see below). They may be used to implement methodologies (docstring semantics), such as identifying parameters, exceptions raised, etc.; such usage is beyond the scope of this PEP.
A modified RFC 2822 syntax is used, with a colon before as well as after the field name. Field bodies are more versatile as well; they may contain multiple body elements (even nested field lists). For example:
:Date: 2002-03-22
:Version: 1
:Authors: - Me
          - Myself
          - I

Standard RFC 2822 header syntax cannot be used for this construct because it is ambiguous. A word followed by a colon at the beginning of a line is common in written text.
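The modified syntax above is easy to recognize mechanically. This illustrative sketch (not part of Docutils; the parse_simple_fields() helper and its regex are assumptions of this example) handles only one-line field bodies, not the nested lists shown above:

```python
import re

# Matches lines like ":Name: body" -- a colon before as well as
# after the field name, which removes the RFC 2822 ambiguity.
FIELD_RE = re.compile(r'^:([^:]+):\s*(.*)$')

def parse_simple_fields(text):
    """Map field names to one-line field bodies."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2)
    return fields

doc = """\
:Date: 2002-03-22
:Version: 1
"""
print(parse_simple_fields(doc))  # {'Date': '2002-03-22', 'Version': '1'}
```

Note how the leading colon makes the field unambiguous: a plain "Date: 2002-03-22" line would not match, which is exactly why the modified syntax was chosen.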
Markup extensibility: directives and substitutions.
Directives are used as an extension mechanism for reStructuredText, a way of adding support for new block-level constructs without adding new syntax. Directives for images, admonitions (note, caution, etc.), and tables of contents generation (among others) have been implemented. For example, here's how to place an image:
    .. image:: mylogo.png
Substitution definitions allow the power and flexibility of block-level directives to be shared by inline text. For example:
The |biohazard| symbol must be used on containers used to dispose of medical waste.

.. |biohazard| image:: biohazard.png
Section structure markup.
Section headers in reStructuredText use adornment via underlines (and possibly overlines) rather than indentation. For example:
This is a Section Title
=======================

This is a Subsection Title
--------------------------

This paragraph is in the subsection.

This is Another Section Title
=============================

This paragraph is in the second section.
Questions & Answers
Is reStructuredText rich enough?
Yes, it is for most people. If it lacks some construct that is required for a specific application, it can be added via the directive mechanism. If a useful and common construct has been overlooked and a suitably readable syntax can be found, it can be added to the specification and parser.
Is reStructuredText too rich?
For specific applications or individuals, perhaps. In general, no.
Since the very beginning, whenever a docstring markup syntax has been proposed on the Doc-SIG [6], someone has complained about the lack of support for some construct or other. The reply was often something like, "These are docstrings we're talking about, and docstrings shouldn't have complex markup." The problem is that a construct that seems superfluous to one person may be absolutely essential to another.
reStructuredText takes the opposite approach: it provides a rich set of implicit markup constructs (plus a generic extension mechanism for explicit markup), allowing for all kinds of documents. If the set of constructs is too rich for a particular application, the unused constructs can either be removed from the parser (via application-specific overrides) or simply omitted by convention.
Why not use indentation for section structure, like StructuredText does? Isn't it more "Pythonic"?
Guido van Rossum wrote the following in a 2001-06-13 Doc-SIG post:
I still think that using indentation to indicate sectioning is wrong. If you look at how real books and other print publications are laid out, you'll notice that indentation is used frequently, but mostly at the intra-section level. Indentation can be used to offset lists, tables, quotations, examples, and the like. (The argument that docstrings are different because they are input for a text formatter is wrong: the whole point is that they are also readable without processing.)
I reject the argument that using indentation is Pythonic: text is not code, and different traditions and conventions hold. People have been presenting text for readability for over 30 centuries. Let's not innovate needlessly.
See Section Structure via Indentation [25] in Problems With StructuredText [22] for further elaboration.
Why use reStructuredText for PEPs? What's wrong with the existing standard?
The existing standard for PEPs is very limited in terms of general expressibility, and referencing is especially lacking for such a reference-rich document type. PEPs are currently converted into HTML, but the results (mostly monospaced text) are less than attractive, and most of the value-added potential of HTML (especially inline hyperlinks) is untapped.
Making reStructuredText a standard markup for PEPs will enable much richer expression, including support for section structure, inline markup, graphics, and tables. In several PEPs there are ASCII graphics diagrams, which are all that plaintext documents can support. Since PEPs are made available in HTML form, the ability to include proper diagrams would be immediately useful.
Current PEP practices allow for reference markers in the form "[1]" in the text, and the footnotes/references themselves are listed in a section toward the end of the document. There is currently no hyperlinking between the reference marker and the footnote/reference itself (it would be possible to add this to pep2html.py, but the "markup" as it stands is ambiguous and mistakes would be inevitable). A PEP with many references (such as this one ;-) requires a lot of flipping back and forth. When revising a PEP, often new references are added or unused references deleted. It is painful to renumber the references, since it has to be done in two places and can have a cascading effect (insert a single new reference 1, and every other reference has to be renumbered; always adding new references to the end is suboptimal). It is easy for references to go out of sync.
PEPs use references for two purposes: simple URL references and footnotes. reStructuredText differentiates between the two. A PEP might contain references like this:
Abstract

    This PEP proposes adding frungible doodads [1] to the core.
    It extends PEP 9876 [2] via the BCA [3] mechanism.

...

References and Footnotes

    [1] http://www.example.org/

    [2] PEP 9876, Let's Hope We Never Get Here
        http://www.python.org/dev/peps/pep-9876/

    [3] "Bogus Complexity Addition"

Reference 1 is a simple URL reference. Reference 2 is a footnote containing text and a URL. Reference 3 is a footnote containing text only. Rewritten using reStructuredText, this PEP could look like this:
Abstract
========

This PEP proposes adding `frungible doodads`_ to the core.  It
extends PEP 9876 [#pep9876]_ via the BCA [#]_ mechanism.

...

References & Footnotes
======================

.. _frungible doodads: http://www.example.org/

.. [#pep9876] PEP 9876, Let's Hope We Never Get Here

.. [#] "Bogus Complexity Addition"
URLs and footnotes can be defined close to their references if desired, making them easier to read in the source text, and making the PEPs easier to revise. The "References and Footnotes" section can be auto-generated with a document tree transform. Footnotes from throughout the PEP would be gathered and displayed under a standard header. If URL references should likewise be written out explicitly (in citation form), another tree transform could be used.
URL references can be named ("frungible doodads"), and can be referenced from multiple places in the document without additional definitions. When converted to HTML, references will be replaced with inline hyperlinks (HTML <a> tags). The two footnotes are automatically numbered, so they will always stay in sync. The first footnote also contains an internal reference name, "pep9876", so it's easier to see the connection between reference and footnote in the source text. Named footnotes can be referenced multiple times, maintaining consistent numbering.
The "#pep9876" footnote could also be written in the form of a citation:
It extends PEP 9876 [PEP9876]_ ...

.. [PEP9876] PEP 9876, Let's Hope We Never Get Here
Footnotes are numbered, whereas citations use text for their references.
Wouldn't it be better to keep the docstring and PEP proposals separate?
The PEP markup proposal may be removed if it is deemed that there is no need for PEP markup, or it could be made into a separate PEP. If accepted, PEP 1, PEP Purpose and Guidelines [1], and PEP 9, Sample PEP Template [2] will be updated.
It seems natural to adopt a single consistent markup standard for all uses of structured plaintext in Python, and to propose it all in one place.
The existing pep2html.py script converts the existing PEP format to HTML. How will the new-format PEPs be converted to HTML?
A new version of pep2html.py with integrated reStructuredText parsing has been completed. The Docutils project supports PEPs with a "PEP Reader" component, including all functionality currently in pep2html.py (auto-recognition of PEP & RFC references, email masking, etc.).
Who's going to convert the existing PEPs to reStructuredText?
PEP authors or volunteers may convert existing PEPs if they like, but there is no requirement to do so. The reStructuredText-based PEPs will coexist with the old PEP standard. The pep2html.py mentioned in answer 6 processes both old and new standards.
Why use reStructuredText for README and other ancillary files?
The reasoning given for PEPs in answer 4 above also applies to README and other ancillary files. By adopting a standard markup, these files can be converted to attractive cross-referenced HTML and put up on python.org. Developers of other projects can also take advantage of this facility for their own documentation.
Won't the superficial similarity to existing markup conventions cause problems, and result in people writing invalid markup (and not noticing, because the plaintext looks natural)? How forgiving is reStructuredText of "not quite right" markup?
There will be some mis-steps, as there would be when moving from one programming language to another. As with any language, proficiency grows with experience. Luckily, reStructuredText is a very little language indeed.
As with any syntax, there is the possibility of syntax errors. It is expected that a user will run the processing system over their input and check the output for correctness.
In a strict sense, the reStructuredText parser is very unforgiving (as it should be; "In the face of ambiguity, refuse the temptation to guess" [3] applies to parsing markup as well as computer languages). Here's design goal 3 from An Introduction to reStructuredText [19]:
Unambiguous. The rules for markup must not be open for interpretation. For any given input, there should be one and only one possible output (including error output).
While unforgiving, at the same time the parser does try to be helpful by producing useful diagnostic output ("system messages"). The parser reports problems, indicating their level of severity (from least to most: debug, info, warning, error, severe). The user or the client software can decide on reporting thresholds; they can ignore low-level problems or cause high-level problems to bring processing to an immediate halt. Problems are reported during the parse as well as included in the output, often with two-way links between the source of the problem and the system message explaining it.
Will the docstrings in the Python standard library modules be converted to reStructuredText?
No. Python's library reference documentation is maintained separately from the source. Docstrings in the Python standard library should not try to duplicate the library reference documentation. The current policy for docstrings in the Python standard library is that they should be no more than concise hints, simple and markup-free (although many do contain ad-hoc implicit markup).
I want to write all my strings in Unicode. Will anything break?
The parser fully supports Unicode. Docutils supports arbitrary input and output encodings.
Why does the community need a new structured text design?
The existing structured text designs are deficient, for the reasons given in "Rationale" above. reStructuredText aims to be a complete markup syntax, within the limitations of the "readable plaintext" medium.
What is wrong with existing documentation methodologies?
What existing methodologies? For Python docstrings, there is no official standard markup format, let alone a documentation methodology akin to JavaDoc. The question of methodology is at a much higher level than syntax (which this PEP addresses). It is potentially much more controversial and difficult to resolve, and is intentionally left out of this discussion.
References & Footnotes
| [1] | PEP 1, PEP Guidelines, Warsaw, Hylton (http://www.python.org/dev/peps/pep-0001/) |
| [2] | PEP 9, Sample PEP Template, Warsaw (http://www.python.org/dev/peps/pep-0009/) |
| [3] | From The Zen of Python (by Tim Peters) [26] (or just "import this" in Python) |
| [4] | PEP 216, Docstring Format, Zadka (http://www.python.org/dev/peps/pep-0216/) |
| [5] | http://docutils.sourceforge.net/rst.html |
| [6] | http://www.python.org/sigs/doc-sig/ |
| [7] | http://www.w3.org/XML/ |
| [8] | http://www.oasis-open.org/cover/general.html |
| [9] | http://docbook.org/tdg/en/html/docbook.html |
| [10] | http://www.w3.org/MarkUp/ |
| [11] | http://www.w3.org/MarkUp/#xhtml1 |
| [12] | http://www.tug.org/interest.html |
| [13] | http://perldoc.perl.org/perlpod.html |
| [14] | http://java.sun.com/j2se/javadoc/ |
| [15] | http://docutils.sourceforge.net/mirror/setext.html |
| [16] | http://www.zope.org/DevHome/Members/jim/StructuredTextWiki/FrontPage |
| [17] | http://docutils.sourceforge.net/docs/user/rst/quickstart.html |
| [18] | http://docutils.sourceforge.net/docs/user/rst/quickref.html |
| [19] | http://docutils.sourceforge.net/docs/ref/rst/introduction.html |
| [20] | http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html |
| [21] | http://docutils.sourceforge.net/docs/ref/rst/directives.html |
| [22] | http://docutils.sourceforge.net/docs/dev/rst/problems.html |
| [23] | http://docutils.sourceforge.net/docs/dev/rst/alternatives.html |
| [24] | http://docutils.sourceforge.net/ |
| [25] | http://docutils.sourceforge.net/docs/dev/rst/problems.html#section-structure-via-indentation |
| [26] | http://www.python.org/doc/Humor.html#zen |
Copyright
This document has been placed in the public domain.
Acknowledgements
Some text is borrowed from PEP 216, Docstring Format [4], by Moshe Zadka.
Special thanks to all members past & present of the Python Doc-SIG [6].
pep-0288 Generators Attributes and Exceptions
| PEP: | 288 |
|---|---|
| Title: | Generators Attributes and Exceptions |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Raymond Hettinger <python at rcn.com> |
| Status: | Withdrawn |
| Type: | Standards Track |
| Created: | 21-Mar-2002 |
| Python-Version: | 2.5 |
| Post-History: | |
Abstract
This PEP proposes to enhance generators by providing mechanisms for
raising exceptions and sharing data with running generators.
Status
This PEP is withdrawn. The exception raising mechanism was extended
and subsumed into PEP 343. The attribute passing capability
never built a following, did not have a clear implementation,
and did not have a clean way for the running generator to access
its own namespace.
Rationale
Currently, only class based iterators can provide attributes and
exception handling. However, class based iterators are harder to
write, less compact, less readable, and slower. A better solution
is to enable these capabilities for generators.
Enabling attribute assignments allows data to be passed to and from
running generators. The approach of sharing data using attributes
pervades Python. Other approaches exist but are somewhat hackish
in comparison.
Another evolutionary step is to add a generator method to allow
exceptions to be passed to a generator. Currently, there is no
clean method for triggering exceptions from outside the generator.
Also, generator exception passing helps mitigate the try/finally
prohibition for generators. The need is especially acute for
generators needing to flush buffers or close resources upon termination.
The two proposals are backwards compatible and require no new
keywords. They are being recommended for Python version 2.5.
Specification for Generator Attributes
Essentially, the proposal is to emulate attribute writing for classes.
The only wrinkle is that generators lack a way to refer to instances of
themselves. So, the proposal is to provide a function for discovering
the reference. For example:
    def mygen(filename):
        self = sys.get_generator()
        myfile = open(filename)
        for line in myfile:
            if len(line) < 10:
                continue
            self.pos = myfile.tell()
            yield line.upper()

    g = mygen('sample.txt')
    line1 = g.next()
    print 'Position', g.pos
Uses for generator attributes include:
1. Providing generator clients with extra information (as shown
above).
2. Externally setting control flags governing generator operation
(possibly telling a generator when to step in or step over
data groups).
3. Writing lazy consumers with complex execution states
(an arithmetic encoder output stream for example).
4. Writing co-routines (as demonstrated in Dr. Mertz's articles [1]).
The control flow of 'yield' and 'next' is unchanged by this
proposal. The only change is that data can be passed to and from the
generator. Most of the underlying machinery is already in place,
only the access function needs to be added.
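For today's readers: sys.get_generator() was never added. A class-based iterator (the very alternative this PEP calls harder to write) remains the standard way to expose attributes. A sketch of the mygen() example in that style, using an in-memory list of lines instead of a file so it is self-contained:

```python
class PositionedLines:
    """Class-based rework of the mygen() example above: the instance
    itself carries the .pos attribute, so no self-reference function
    is needed.  (The class name and line-based position are
    assumptions of this sketch, not part of the PEP.)"""

    def __init__(self, lines):
        self.lines = lines
        self.pos = 0  # index of the last line examined

    def __iter__(self):
        # Assigning to self.pos in the for-target keeps the
        # attribute up to date on every iteration step.
        for self.pos, line in enumerate(self.lines, start=1):
            if len(line) < 10:
                continue
            yield line.upper()

source = ['short', 'a much longer line', 'tiny', 'another long line here']
reader = PositionedLines(source)
upper = list(reader)
print(upper, reader.pos)
```

The cost the PEP objects to is visible here: the state that a generator would keep implicitly (the loop position) must be managed as explicit instance attributes.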
Specification for Generator Exception Passing
Add a .throw(exception) method to the generator interface:
    def logger():
        start = time.time()
        log = []
        try:
            while True:
                log.append(time.time() - start)
                yield log[-1]
        except WriteLog:
            writelog(log)

    g = logger()
    for i in [10,20,40,80,160]:
        testsuite(i)
        g.next()
    g.throw(WriteLog)
There is no existing work-around for triggering an exception
inside a generator. It is the only case in Python where active
code cannot be excepted to or through.
Generator exception passing also helps address an intrinsic
limitation on generators, the prohibition against their using
try/finally to trigger clean-up code [2].
Note A: The name of the throw method was selected for several
reasons. Raise is a keyword and so cannot be used as a method
name. Unlike raise which immediately raises an exception from the
current execution point, throw will first return to the generator
and then raise the exception. The word throw is suggestive of
putting the exception in another location. The word throw is
already associated with exceptions in other languages.
Alternative method names were considered: resolve(), signal(),
genraise(), raiseinto(), and flush(). None of these fit as well
as throw().
Note B: To keep the throw() syntax simple only the instance
version of the raise syntax would be supported (no variants for
"raise string" or "raise class, instance").
Calling "g.throw(instance)" would correspond to writing
"raise instance" immediately after the most recent yield.
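As the Status section notes, this part of the proposal was subsumed into PEP 343 (via PEP 342), and generator .throw() exists in every modern Python. The logger example can be rerun today with only small updates (next() as a builtin, and a module-level `written` list standing in for the PEP's writelog() helper):

```python
import time

class WriteLog(Exception):
    """Signal the generator to flush its accumulated log."""

written = []  # stand-in for the PEP's writelog() side effect

def logger():
    start = time.time()
    log = []
    try:
        while True:
            log.append(time.time() - start)
            yield log[-1]
    except WriteLog:
        written.append(list(log))

g = logger()
for _ in range(3):
    next(g)
try:
    g.throw(WriteLog)
except StopIteration:
    # The generator caught WriteLog, flushed, and then returned,
    # so the throw() call site sees StopIteration.
    pass
print(len(written[0]))
```

Exactly as Note A describes, throw() first resumes the generator at the most recent yield and raises the exception there, letting the except clause run cleanup code.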
References
[1] Dr. David Mertz's draft columns for Charming Python:
http://gnosis.cx/publish/programming/charming_python_b5.txt
http://gnosis.cx/publish/programming/charming_python_b7.txt
[2] PEP 255 Simple Generators:
http://www.python.org/dev/peps/pep-0255/
[3] Proof-of-concept recipe:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/164044
Copyright
This document has been placed in the public domain.
pep-0289 Generator Expressions
| PEP: | 289 |
|---|---|
| Title: | Generator Expressions |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | python at rcn.com (Raymond Hettinger) |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 30-Jan-2002 |
| Python-Version: | 2.4 |
| Post-History: | 22-Oct-2003 |
Contents
Abstract
This PEP introduces generator expressions as a high performance, memory efficient generalization of list comprehensions [1] and generators [2].
Rationale
Experience with list comprehensions has shown their wide-spread utility throughout Python. However, many of the use cases do not need to have a full list created in memory. Instead, they only need to iterate over the elements one at a time.
For instance, the following summation code will build a full list of squares in memory, iterate over those values, and, when the reference is no longer needed, delete the list:
sum([x*x for x in range(10)])
Memory is conserved by using a generator expression instead:
sum(x*x for x in range(10))
Similar benefits are conferred on constructors for container objects:
s = Set(word for line in page for word in line.split())
d = dict( (k, func(k)) for k in keylist)
Generator expressions are especially useful with functions like sum(), min(), and max() that reduce an iterable input to a single value:
max(len(line) for line in file if line.strip())
Generator expressions also address some examples of functionals coded with lambda:
reduce(lambda s, a: s + a.myattr, data, 0)
reduce(lambda s, a: s + a[3], data, 0)
These simplify to:
sum(a.myattr for a in data)
sum(a[3] for a in data)
List comprehensions greatly reduced the need for filter() and map(). Likewise, generator expressions are expected to minimize the need for itertools.ifilter() and itertools.imap(). In contrast, the utility of other itertools will be enhanced by generator expressions:
dotproduct = sum(x*y for x,y in itertools.izip(x_vector, y_vector))
Having a syntax similar to list comprehensions also makes it easy to convert existing code into a generator expression when scaling up an application.
Early timings showed that generators had a significant performance advantage over list comprehensions. However, the latter were highly optimized for Py2.4 and now the performance is roughly comparable for small to mid-sized data sets. As the data volumes grow larger, generator expressions tend to perform better because they do not exhaust cache memory and they allow Python to re-use objects between iterations.
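The laziness that underlies these benefits is easy to observe directly. In this sketch (the square() counter is invented for demonstration), no element is computed when the expression is built; work happens only as sum() pulls values:

```python
calls = []

def square(x):
    calls.append(x)  # record each evaluation for inspection
    return x * x

# Building the generator expression evaluates only the iterable;
# every square() call is deferred until iteration.
g = (square(x) for x in range(5))
assert calls == []            # nothing computed yet

total = sum(g)                # now the five calls happen, one at a time
assert calls == [0, 1, 2, 3, 4]
print(total)  # 30
```

Because only one element is alive at a time, peak memory stays constant no matter how long the input is, which is the scaling advantage described above.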
BDFL Pronouncements
This PEP is ACCEPTED for Py2.4.
The Details
(None of this is exact enough in the eye of a reader from Mars, but I hope the examples convey the intention well enough for a discussion in c.l.py. The Python Reference Manual should contain a 100% exact semantic and syntactic specification.)
The semantics of a generator expression are equivalent to creating an anonymous generator function and calling it. For example:
g = (x**2 for x in range(10))
print g.next()
is equivalent to:
def __gen(exp):
    for x in exp:
        yield x**2
g = __gen(iter(range(10)))
print g.next()

Only the outermost for-expression is evaluated immediately, the other expressions are deferred until the generator is run:
g = (tgtexp for var1 in exp1 if exp2 for var2 in exp3 if exp4)
is equivalent to:
def __gen(bound_exp):
    for var1 in bound_exp:
        if exp2:
            for var2 in exp3:
                if exp4:
                    yield tgtexp
g = __gen(iter(exp1))
del __gen

The syntax requires that a generator expression always be directly inside a set of parentheses and not have a comma on either side. With reference to the file Grammar/Grammar in CVS, two rules change:
The rule:
atom: '(' [testlist] ')'

changes to:
atom: '(' [testlist_gexp] ')'

where testlist_gexp is almost the same as listmaker, but only allows a single test after 'for' ... 'in':
testlist_gexp: test ( gen_for | (',' test)* [','] )

The rule for arglist needs similar changes.
This means that you can write:
sum(x**2 for x in range(10))
but you would have to write:
reduce(operator.add, (x**2 for x in range(10)))
and also:
g = (x**2 for x in range(10))
i.e. if a function call has a single positional argument, it can be a generator expression without extra parentheses, but in all other cases you have to parenthesize it.
The exact details were checked in to Grammar/Grammar version 1.49.
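These rules carried forward unchanged into modern Python; a quick runnable sketch in Python 3 syntax (so print is a function, reduce lives in functools, and next() replaces .next()):

```python
import functools
import operator

# Sole positional argument: no extra parentheses needed.
total = sum(x**2 for x in range(10))

# More than one argument: the generator expression must be parenthesized.
total2 = functools.reduce(operator.add, (x**2 for x in range(10)))

# Assignment also requires the parentheses.
g = (x**2 for x in range(10))
first = next(g)
```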
The loop variable (if it is a simple variable or a tuple of simple variables) is not exposed to the surrounding function. This facilitates the implementation and makes typical use cases more reliable. In some future version of Python, list comprehensions will also hide the induction variable from the surrounding code (and, in Py2.4, warnings will be issued for code accessing the induction variable).
For example:
x = "hello"
y = list(x for x in "abc")
print x          # prints "hello", not "c"
List comprehensions will remain unchanged. For example:
[x for x in S]    # This is a list comprehension.
[(x for x in S)]  # This is a list containing one generator
                  # expression.

Unfortunately, there is currently a slight syntactic difference. The expression:
[x for x in 1, 2, 3]
is legal, meaning:
[x for x in (1, 2, 3)]
But generator expressions will not allow the former version:
(x for x in 1, 2, 3)
is illegal.
The former list comprehension syntax will become illegal in Python 3.0, and should be deprecated in Python 2.4 and beyond.
List comprehensions also "leak" their loop variable into the surrounding scope. This will also change in Python 3.0, so that the semantic definition of a list comprehension in Python 3.0 will be equivalent to list(<generator expression>). Python 2.4 and beyond should issue a deprecation warning if a list comprehension's loop variable has the same name as a variable used in the immediately surrounding scope.
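The change forecast here did land in Python 3, where a list comprehension is compiled as if it were list(<generator expression>) and the loop variable no longer leaks; a minimal check:

```python
x = "hello"
squares = [x * 2 for x in range(3)]
# The comprehension's x is scoped to the comprehension itself in Python 3:
leaked = x
```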
Early Binding versus Late Binding
After much discussion, it was decided that the first (outermost) for-expression should be evaluated immediately and that the remaining expressions be evaluated when the generator is executed.
Asked to summarize the reasoning for binding the first expression, Guido offered [5]:
Consider sum(x for x in foo()). Now suppose there's a bug in foo() that raises an exception, and a bug in sum() that raises an exception before it starts iterating over its argument. Which exception would you expect to see? I'd be surprised if the one in sum() was raised rather than the one in foo(), since the call to foo() is part of the argument to sum(), and I expect arguments to be processed before the function is called. OTOH, in sum(bar(x) for x in foo()), where sum() and foo() are bugfree, but bar() raises an exception, we have no choice but to delay the call to bar() until sum() starts iterating -- that's part of the contract of generators. (They do nothing until their next() method is first called.)
Various use cases were proposed for binding all free variables when the generator is defined. And some proponents felt that the resulting expressions would be easier to understand and debug if bound immediately.
However, Python takes a late binding approach to lambda expressions and has no precedent for automatic, early binding. It was felt that introducing a new paradigm would unnecessarily introduce complexity.
After exploring many possibilities, a consensus emerged that binding issues were hard to understand and that users should be strongly encouraged to use generator expressions inside functions that consume their arguments immediately. For more complex applications, full generator definitions are always superior in terms of being obvious about scope, lifetime, and binding [6].
Reduction Functions
The utility of generator expressions is greatly enhanced when combined with reduction functions like sum(), min(), and max(). The heapq module in Python 2.4 includes two new reduction functions: nlargest() and nsmallest(). Both work well with generator expressions and keep no more than n items in memory at one time.
Acknowledgements
- Raymond Hettinger first proposed the idea of "generator comprehensions" in January 2002.
- Peter Norvig resurrected the discussion in his proposal for Accumulation Displays.
- Alex Martelli provided critical measurements that proved the performance benefits of generator expressions. He also provided strong arguments that they were a desirable thing to have.
- Phillip Eby suggested "iterator expressions" as the name.
- Subsequently, Tim Peters suggested the name "generator expressions".
- Armin Rigo, Tim Peters, Guido van Rossum, Samuele Pedroni, Hye-Shik Chang and Raymond Hettinger teased out the issues surrounding early versus late binding [5].
- Jiwon Seo single handedly implemented various versions of the proposal including the final version loaded into CVS. Along the way, there were periodic code reviews by Hye-Shik Chang and Raymond Hettinger. Guido van Rossum made the key design decisions after comments from Armin Rigo and newsgroup discussions. Raymond Hettinger provided the test suite, documentation, tutorial, and examples [6].
References
| [1] | PEP 202 List Comprehensions http://www.python.org/dev/peps/pep-0202/ |
| [2] | PEP 255 Simple Generators http://www.python.org/dev/peps/pep-0255/ |
| [3] | Peter Norvig's Accumulation Display Proposal http://www.norvig.com/pyacc.html |
| [4] | Jeff Epler had worked up a patch demonstrating the previously proposed bracket and yield syntax http://python.org/sf/795947 |
| [5] | (1, 2) Discussion over the relative merits of early versus late binding http://mail.python.org/pipermail/python-dev/2004-April/044555.html |
| [6] | (1, 2) Patch discussion and alternative patches on Source Forge http://www.python.org/sf/872326 |
Copyright
This document has been placed in the public domain.
pep-0290 Code Migration and Modernization
| PEP: | 290 |
|---|---|
| Title: | Code Migration and Modernization |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Raymond Hettinger <python at rcn.com> |
| Status: | Active |
| Type: | Informational |
| Content-Type: | text/x-rst |
| Created: | 6-Jun-2002 |
| Post-History: |
Contents
Abstract
This PEP is a collection of procedures and ideas for updating Python applications when newer versions of Python are installed.
The migration tips highlight possible areas of incompatibility and make suggestions on how to find and resolve those differences. The modernization procedures show how older code can be updated to take advantage of new language features.
Rationale
This repository of procedures serves as a catalog or checklist of known migration issues and procedures for addressing those issues.
Migration issues can arise for several reasons. Some obsolete features are slowly deprecated according to the guidelines in PEP 4 [1]. Also, some code relies on undocumented behaviors which are subject to change between versions. Some code may rely on behavior which was subsequently shown to be a bug and that behavior changes when the bug is fixed.
Modernization options arise when new versions of Python add features that allow improved clarity or higher performance than previously available.
Guidelines for New Entries
Developers with commit access may update this PEP directly. Others can send their ideas to a developer for possible inclusion.
While a consistent format makes the repository easier to use, feel free to add or subtract sections to improve clarity.
Grep patterns may be supplied as a tool to help maintainers locate code for possible updates. However, fully automated search/replace style regular expressions are not recommended. Instead, each code fragment should be evaluated individually.
The contra-indications section is the most important part of a new entry. It lists known situations where the update SHOULD NOT be applied.
Migration Issues
Comparison Operators Not a Shortcut for Producing 0 or 1
Prior to Python 2.3, comparison operations returned 0 or 1 rather than True or False. Some code may have used this as a shortcut for producing zero or one in places where their boolean counterparts are not appropriate. For example:
def identity(m=1):
    """Create an m-by-m identity matrix"""
    return [[i==j for i in range(m)] for j in range(m)]
In Python 2.2, a call to identity(2) would produce:
[[1, 0], [0, 1]]
In Python 2.3, the same call would produce:
[[True, False], [False, True]]
Since booleans are a subclass of integers, the matrix will continue to calculate normally, but it will not print as expected. The list comprehension should be changed to read:
return [[int(i==j) for i in range(m)] for j in range(m)]
There are similar concerns when storing data to be used by other applications which may expect a number instead of True or False.
Modernization Procedures
Procedures are grouped by the Python version required to be able to take advantage of the modernization.
Python 2.4 or Later
Inserting and Popping at the Beginning of Lists
Python's lists are implemented to perform best with appends and pops on the right. Use of pop(0) or insert(0, x) triggers O(n) data movement for the entire list. To address this need, Python 2.4 introduces a new container, collections.deque(), which has efficient append and pop operations on both the left and right (the trade-off is much slower getitem/setitem access). The new container is especially helpful for implementing data queues:
Pattern:
c = list(data)   --> c = collections.deque(data)
c.pop(0)         --> c.popleft()
c.insert(0, x)   --> c.appendleft(x)
Locating:
grep pop(0 or grep insert(0
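The pattern above, as runnable code (the deque API introduced in Py2.4 is unchanged in modern Python):

```python
from collections import deque

c = deque(["task1", "task2"])
c.appendleft("urgent")    # O(1); list.insert(0, x) would shift every element
first = c.popleft()       # O(1); list.pop(0) would shift every element
```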
Simplifying Custom Sorts
In Python 2.4, the sort method for lists and the new sorted built-in function both accept a key function for computing sort keys. Unlike the cmp function, which gets applied to every comparison, the key function gets applied only once to each record. It is much faster than cmp and typically more readable while using less code. The key function also maintains the stability of the sort (records with the same key are left in their original order).
Original code using a comparison function:
names.sort(lambda x,y: cmp(x.lower(), y.lower()))
Alternative original code with explicit decoration:
tempnames = [(n.lower(), n) for n in names]
tempnames.sort()
names = [original for decorated, original in tempnames]
Revised code using a key function:
names.sort(key=str.lower) # case-insensitive sort
Locating: grep sort *.py
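A small runnable sketch of the key= style, which works identically in modern Python (where the cmp argument is gone entirely):

```python
names = ["banana", "Apple", "cherry"]
names.sort(key=str.lower)        # case-insensitive, stable, one key call per item
shortest = min(names, key=len)   # key= also works with min(), max(), and sorted()
```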
Replacing Common Uses of Lambda
In Python 2.4, the operator module gained two new functions, itemgetter() and attrgetter(), that can replace common uses of the lambda keyword. The new functions run faster and are considered by some to improve readability.
Pattern:
lambda r: r[2] --> itemgetter(2)
lambda r: r.myattr --> attrgetter('myattr')
Typical contexts:
sort(studentrecords, key=attrgetter('gpa')) # set a sort field
map(attrgetter('lastname'), studentrecords) # extract a field
Locating: grep lambda *.py
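Both patterns in runnable form (the Student class is a stand-in invented for the sketch):

```python
from operator import attrgetter, itemgetter

rows = [("bob", 3.1), ("amy", 3.9)]
by_gpa = sorted(rows, key=itemgetter(1))     # replaces key=lambda r: r[1]

class Student:
    def __init__(self, name, gpa):
        self.name = name
        self.gpa = gpa

students = [Student("bob", 3.1), Student("amy", 3.9)]
top = max(students, key=attrgetter("gpa"))   # replaces key=lambda r: r.gpa
```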
Simplified Reverse Iteration
Python 2.4 introduced the reversed builtin function for reverse iteration. The existing approaches to reverse iteration suffered from wordiness, performance issues (speed and memory consumption), and/or lack of clarity. A preferred style is to express the sequence in a forwards direction, apply reversed to the result, and then loop over the resulting fast, memory friendly iterator.
Original code expressed with half-open intervals:
for i in range(n-1, -1, -1):
    print seqn[i]
Alternative original code reversed in multiple steps:
rseqn = list(seqn)
rseqn.reverse()
for value in rseqn:
    print value
Alternative original code expressed with extending slicing:
for value in seqn[::-1]:
    print value
Revised code using the reversed function:
for value in reversed(seqn):
    print value
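The same revised pattern in Python 3 syntax, collecting the values so the order is visible; reversed() still returns a lazy iterator, so no temporary copy of the sequence is built:

```python
seqn = [10, 20, 30]
out = []
for value in reversed(seqn):   # no copy of seqn, no negative-index arithmetic
    out.append(value)
```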
Python 2.3 or Later
Testing String Membership
In Python 2.3, for string2 in string1, the length restriction on string2 is lifted; it can now be a string of any length. When searching for a substring, where you don't care about the position of the substring in the original string, using the in operator makes the meaning clear.
Pattern:
string1.find(string2) >= 0   --> string2 in string1
string1.find(string2) != -1  --> string2 in string1
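The substring test in runnable form; the in operator returns a plain boolean:

```python
string1 = "the quick brown fox"
found = "quick" in string1        # replaces string1.find("quick") >= 0
missing = "lazy" not in string1   # replaces string1.find("lazy") == -1
```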
Replace apply() with a Direct Function Call
In Python 2.3, apply() was marked for Pending Deprecation because it was made obsolete by Python 1.6's introduction of * and ** in function calls. Using a direct function call was always a little faster than apply() because it saved the lookup for the builtin. Now, apply() is even slower due to its use of the warnings module.
Pattern:
apply(f, args, kwds) --> f(*args, **kwds)
Note: The Pending Deprecation was removed from apply() in Python 2.3.3 since it creates pain for people who need to maintain code that works with Python versions as far back as 1.5.2, where there was no alternative to apply(). The function remains deprecated, however.
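The replacement pattern as runnable code; in Python 3, apply() is gone and the * and ** call syntax is the only spelling:

```python
def f(a, b, scale=1):
    return (a + b) * scale

args = (2, 3)
kwds = {"scale": 10}
result = f(*args, **kwds)    # replaces apply(f, args, kwds)
```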
Python 2.2 or Later
Testing Dictionary Membership
For testing dictionary membership, use the 'in' keyword instead of the 'has_key()' method. The result is shorter and more readable. The style becomes consistent with tests for membership in lists. The result is slightly faster because has_key requires an attribute search and uses a relatively expensive function call.
Pattern:
if d.has_key(k): --> if k in d:
Contra-indications:
Some dictionary-like objects may not define a __contains__() method:
if dictlike.has_key(k)
Locating: grep has_key
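The update proved durable: has_key() was removed outright in Python 3, where the in keyword is the only spelling:

```python
d = {"a": 1}
present = "a" in d        # replaces d.has_key("a")
absent = "b" not in d     # replaces not d.has_key("b")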
Looping Over Dictionaries
Use the new iter methods for looping over dictionaries. The iter methods are faster because they do not have to create a new list object with a complete copy of all of the keys, values, or items. Selecting only keys, values, or items (key/value pairs) as needed saves the time for creating throwaway object references and, in the case of items, saves a second hash look-up of the key.
Pattern:
for key in d.keys(): --> for key in d:
for value in d.values(): --> for value in d.itervalues():
for key, value in d.items():
--> for key, value in d.iteritems():
Contra-indications:
If you need a list, do not change the return type:
def getids(): return d.keys()
Some dictionary-like objects may not define iter methods:
for k in dictlike.keys():
Iterators do not support slicing, sorting or other operations:
k = d.keys(); j = k[:]
Dictionary iterators prohibit modifying the dictionary:
for k in d.keys(): del d[k]
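In Python 3 the iter* methods are gone again (keys(), values() and items() now return lazy views), but the contra-indications above carry over: mutating a dictionary while iterating it still fails. A sketch of the snapshot idiom:

```python
d = {"a": 1, "b": 0, "c": 2}
for k in list(d):     # snapshot the keys before mutating the dictionary
    if d[k] == 0:
        del d[k]
```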
stat Methods
Replace stat constants or indices with new os.stat attributes and methods. The os.stat attributes and methods are not order-dependent and do not require an import of the stat module.
Pattern:
os.stat("foo")[stat.ST_MTIME] --> os.stat("foo").st_mtime
os.stat("foo")[stat.ST_MTIME] --> os.path.getmtime("foo")
Locating: grep os.stat or grep stat.S
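Both replacement patterns as runnable code (a temporary file is created so the sketch is self-contained):

```python
import os
import tempfile

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    path = tmp.name

mtime = os.stat(path).st_mtime           # attribute access; no stat.ST_MTIME index
same = (os.path.getmtime(path) == mtime) # convenience wrapper gives the same value
os.unlink(path)
```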
Reduce Dependency on types Module
The types module is likely to be deprecated in the future. Use built-in constructor functions instead. They may be slightly faster.
Pattern:
isinstance(v, types.IntType)     --> isinstance(v, int)
isinstance(s, types.StringTypes) --> isinstance(s, basestring)
Full use of this technique requires Python 2.3 or later (basestring was introduced in Python 2.3), but Python 2.2 is sufficient for most uses.
Locating: grep types *.py | grep import
Avoid Variable Names that Clash with the __builtins__ Module
In Python 2.2, new built-in types were added for dict and file. Scripts should avoid assigning variable names that mask those types. The same advice also applies to existing builtins like list.
Pattern:
file = open('myfile.txt') --> f = open('myfile.txt')
dict = obj.__dict__ --> d = obj.__dict__
Locating: grep 'file ' *.py
Python 2.1 or Later
whrandom Module Deprecated
All random-related methods have been collected in one place, the random module.
Pattern:
import whrandom --> import random
Locating: grep whrandom
Python 2.0 or Later
String Methods
The string module is likely to be deprecated in the future. Use string methods instead. They're faster too.
Pattern:
import string ; string.method(s, ...) --> s.method(...)
c in string.whitespace                --> c.isspace()
Locating: grep string *.py | grep import
startswith and endswith String Methods
Use these string methods instead of slicing. No slice has to be created and there's no risk of miscounting.
Pattern:
"foobar"[:3] == "foo" --> "foobar".startswith("foo")
"foobar"[-3:] == "bar" --> "foobar".endswith("bar")
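The two patterns in runnable form:

```python
name = "foobar"
starts = name.startswith("foo")   # replaces name[:3] == "foo"
ends = name.endswith("bar")       # replaces name[-3:] == "bar"
```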
The atexit Module
The atexit module allows multiple functions to be executed upon program termination and supports parameterized functions. Unfortunately, its implementation conflicts with the sys.exitfunc attribute, which only supports a single exit function. Code relying on sys.exitfunc may interfere with other modules (including library modules) that elect to use the newer and more versatile atexit module.
Pattern:
sys.exitfunc = myfunc --> atexit.register(myfunc)
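A minimal sketch of the registration pattern; unlike sys.exitfunc, register() accepts arguments for the handler:

```python
import atexit

def myfunc(msg):
    print(msg)

# register() replaces assignment to sys.exitfunc and accepts arguments:
atexit.register(myfunc, "shutting down")
```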
Python 1.5 or Later
Class-Based Exceptions
String exceptions are deprecated, so derive from the Exception base class. Unlike the obsolete string exceptions, class exceptions all derive from another exception or the Exception base class. This allows meaningful groupings of exceptions. It also allows an "except Exception" clause to catch all exceptions.
Pattern:
NewError = 'NewError' --> class NewError(Exception): pass
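The pattern as runnable code, showing that "except Exception" now catches the new class (string exceptions were removed entirely in Python 3):

```python
class NewError(Exception):
    """Class-based replacement for the string exception NewError = 'NewError'."""

try:
    raise NewError("something failed")
except Exception as err:       # a bare "except Exception" clause catches it
    caught = isinstance(err, NewError)
```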
All Python Versions
Testing for None
Since there is only one None object, equality can be tested with identity. Identity tests are slightly faster than equality tests. Also, some object types may overload comparison, so equality testing may be much slower.
Pattern:
if v == None --> if v is None:
if v != None --> if v is not None:
Locating: grep '== None' or grep '!= None'
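A quick check of both spellings; for the singleton None they agree, but only the identity test is immune to an overloaded __eq__:

```python
v = None
a = v is None   # preferred: there is only one None object, identity is exact
b = v == None   # legal, but routes through __eq__ and can be overloaded
```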
References
| [1] | PEP 4, Deprecation of Standard Modules, von Loewis (http://www.python.org/dev/peps/pep-0004/) |
| [2] | http://pychecker.sourceforge.net/ |
Copyright
This document has been placed in the public domain.
pep-0291 Backward Compatibility for the Python 2 Standard Library
| PEP: | 291 |
|---|---|
| Title: | Backward Compatibility for the Python 2 Standard Library |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Neal Norwitz <nnorwitz at gmail.com> |
| Status: | Final |
| Type: | Informational |
| Created: | 06-Jun-2002 |
| Python-Version: | 2.3 |
| Post-History: |
Abstract
This PEP describes the packages and modules in the Python 2
standard library which should remain backward compatible with
previous versions of Python. If a package is not listed here,
then it need only remain compatible with the version of Python it
is distributed with.
This PEP has no bearing on the Python 3 standard library.
Rationale
Authors have various reasons why packages and modules should
continue to work with previous versions of Python. In order to
maintain backward compatibility for these modules while moving the
rest of the standard library forward, it is necessary to know
which modules can be modified and which should use old and
possibly deprecated features.
Generally, authors should attempt to keep changes backward
compatible with the previous released version of Python in order
to make bug fixes easier to backport.
In addition to a package or module being listed in this PEP,
authors must add a comment at the top of each file documenting
the compatibility requirement.
When a major version of Python is released, a Subversion branch is
created for continued maintenance and bug fix releases. A package
version on a branch may have a different compatibility requirement
than the same package on the trunk (i.e. current bleeding-edge
development). Where appropriate, these branch compatibilities are
listed below.
Features to Avoid
The following list contains common features to avoid in order
to maintain backward compatibility with each version of Python.
This list is not complete! It is only meant as a general guide.
Note that the features below were implemented in the version
following the one listed. For example, features listed next to
1.5.2 were implemented in 2.0.
Version Features to Avoid
------- -----------------
1.5.2 string methods, Unicode, list comprehensions,
augmented assignment (eg, +=), zip(), import x as y,
dict.setdefault(), print >> f,
calling f(*args, **kw), plus all features below
2.0 nested scopes, rich comparisons,
function attributes, plus all features below
2.1 use of object or new-style classes, iterators,
using generators, nested scopes, or //
without from __future__ import ... statement,
isinstance(X, TYP) where TYP is a tuple of types,
plus all features below
2.2 bool, True, False, basestring, enumerate(),
{}.pop(), PendingDeprecationWarning,
Universal Newlines, plus all features below
2.3 generator expressions, multi-line imports,
decorators, int/long unification, set/frozenset,
reversed(), sorted(), "".rsplit(),
plus all features below
2.4 with statement, conditional expressions,
combined try/except/finally, relative imports,
yield expressions or generator.throw/send/close(),
plus all features below
2.5 with statement without from __future__ import,
io module, str.format(), except as,
bytes, b'' literals, property.setter/deleter
Backward Compatible Packages, Modules, and Tools
Package/Module Maintainer(s) Python Version Notes
-------------- ------------- -------------- -----
2to3 Benjamin Peterson 2.5
bsddb Greg Smith 2.1
Barry Warsaw
compiler Jeremy Hylton 2.1
ctypes Thomas Heller 2.3
decimal Raymond Hettinger 2.3 [2]
distutils Tarek Ziade 2.3
email Barry Warsaw 2.1 / 2.3 [1]
modulefinder Thomas Heller 2.2
Just van Rossum
pkgutil Phillip Eby 2.3
platform Marc-Andre Lemburg 1.5.2
pybench Marc-Andre Lemburg 1.5.2 [3]
sre Fredrik Lundh 2.1
subprocess Peter Astrand 2.2
wsgiref Phillip J. Eby 2.1
xml (PyXML) Martin v. Loewis 2.0
xmlrpclib Fredrik Lundh 2.1
Tool Maintainer(s) Python Version
---- ------------- --------------
None
Notes
-----
[1] The email package version 2 was distributed with Python up to
Python 2.3, and this must remain Python 2.1 compatible. email
package version 3 will be distributed with Python 2.4 and will
need to remain compatible only with Python 2.3.
[2] Specification updates will be treated as bugfixes and backported.
Python 2.3 compatibility will be kept for at least Python 2.4.
The decision will be revisited for Python 2.5 and not changed
unless compelling advantages arise.
[3] pybench lives under the Tools/ directory. Compatibility with
older Python versions is needed in order to be able to compare
performance between Python versions. New features may still
be used in new tests, which may then be configured to fail
gracefully on import by the tool in older Python versions.
Copyright
This document has been placed in the public domain.
pep-0292 Simpler String Substitutions
| PEP: | 292 |
|---|---|
| Title: | Simpler String Substitutions |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Barry Warsaw <barry at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 18-Jun-2002 |
| Python-Version: | 2.4 |
| Post-History: | 18-Jun-2002, 23-Mar-2004, 22-Aug-2004 |
Abstract
This PEP describes a simpler string substitution feature, also
known as string interpolation. This PEP is "simpler" in two
respects:
1. Python's current string substitution feature
(i.e. %-substitution) is complicated and error prone. This PEP
is simpler at the cost of some expressiveness.
2. PEP 215 proposed an alternative string interpolation feature,
introducing a new `$' string prefix. PEP 292 is simpler than
this because it involves no syntax changes and has much simpler
rules for what substitutions can occur in the string.
Rationale
Python currently supports a string substitution syntax based on
C's printf() '%' formatting character[1]. While quite rich,
%-formatting codes are also error prone, even for
experienced Python programmers. A common mistake is to leave off
the trailing format character, e.g. the `s' in "%(name)s".
In addition, the rules for what can follow a % sign are fairly
complex, while the usual application rarely needs such complexity.
Most scripts need to do some string interpolation, but most of
those use simple `stringification' formats, i.e. %s or %(name)s.
This form should be made simpler and less error prone.
A Simpler Proposal
We propose the addition of a new class, called 'Template', which
will live in the string module. The Template class supports new
rules for string substitution; its value contains placeholders,
introduced with the $ character. The following rules for
$-placeholders apply:
1. $$ is an escape; it is replaced with a single $
2. $identifier names a substitution placeholder matching a mapping
key of "identifier". By default, "identifier" must spell a
Python identifier as defined in [2]. The first non-identifier
character after the $ character terminates this placeholder
specification.
3. ${identifier} is equivalent to $identifier. It is required
when valid identifier characters follow the placeholder but are
not part of the placeholder, e.g. "${noun}ification".
If the $ character appears at the end of the line, or is followed
by any character other than those described above, a ValueError
will be raised at interpolation time. Values in the mapping are
converted automatically to strings.
No other characters have special meaning, however it is possible
to derive from the Template class to define different substitution
rules. For example, a derived class could allow for periods in
the placeholder (e.g. to support a kind of dynamic namespace and
attribute path lookup), or could define a delimiter character
other than '$'.
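The Template class as shipped exposes the delimiter as a class attribute, so the derivation described above is a two-line subclass; PercentTemplate is a hypothetical name for this sketch:

```python
from string import Template

class PercentTemplate(Template):
    delimiter = "%"     # use % instead of $ to introduce placeholders

s = PercentTemplate("%who likes %{what}")
result = s.substitute(who="tim", what="kung pao")
```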
Once the Template has been created, substitutions can be performed
by calling one of two methods:
- substitute(). This method returns a new string which results
when the values of a mapping are substituted for the
placeholders in the Template. If there are placeholders which
are not present in the mapping, a KeyError will be raised.
- safe_substitute(). This is similar to the substitute() method,
except that KeyErrors are never raised (due to placeholders
missing from the mapping). When a placeholder is missing, the
original placeholder will appear in the resulting string.
Here are some examples:
>>> from string import Template
>>> s = Template('${name} was born in ${country}')
>>> print s.substitute(name='Guido', country='the Netherlands')
Guido was born in the Netherlands
>>> print s.substitute(name='Guido')
Traceback (most recent call last):
[...]
KeyError: 'country'
>>> print s.safe_substitute(name='Guido')
Guido was born in ${country}
The signature of substitute() and safe_substitute() allows for
passing the mapping of placeholders to values, either as a single
dictionary-like object in the first positional argument, or as
keyword arguments as shown above. The exact details and
signatures of these two methods is reserved for the standard
library documentation.
Why `$' and Braces?
The BDFL said it best[4]: "The $ means "substitution" in so many
languages besides Perl that I wonder where you've been. [...]
We're copying this from the shell."
Thus the substitution rules are chosen because of the similarity
with so many other languages. This makes the substitution rules
easier to teach, learn, and remember.
Comparison to PEP 215
PEP 215 describes an alternate proposal for string interpolation.
Unlike that PEP, this one does not propose any new syntax for
Python. All the proposed new features are embodied in a new
library module. PEP 215 proposes a new string prefix
representation such as $"" which signal to Python that a new type
of string is present. $-strings would have to interact with the
existing r-prefixes and u-prefixes, essentially doubling the
number of string prefix combinations.
PEP 215 also allows for arbitrary Python expressions inside the
$-strings, so that you could do things like:
import sys
print $"sys = $sys, sys = $sys.modules['sys']"
which would return
sys = <module 'sys' (built-in)>, sys = <module 'sys' (built-in)>
It's generally accepted that the rules in PEP 215 are safe in the
sense that they introduce no new security issues (see PEP 215,
"Security Issues" for details). However, the rules are still
quite complex, and make it more difficult to see the substitution
placeholder in the original $-string.
The interesting thing is that the Template class defined in this
PEP is designed for inheritance and, with a little extra work,
it's possible to support PEP 215's functionality using existing
Python syntax.
For example, one could define subclasses of Template and dict that
allowed for a more complex placeholder syntax and a mapping that
evaluated those placeholders.
Internationalization
The implementation supports internationalization by recording the
original template string in the Template instance's 'template'
attribute. This attribute would serve as the lookup key in a
gettext-based catalog. It is up to the application to turn the
resulting string back into a Template for substitution.
However, the Template class was designed to work more intuitively
in an internationalized application, by supporting the mixing-in
of Template and unicode subclasses. Thus an internationalized
application could create an application-specific subclass,
multiply inheriting from Template and unicode, and using instances
of that subclass as the gettext catalog key. Further, the
subclass could alias the special __mod__() method to either
.substitute() or .safe_substitute() to provide a more traditional
string/unicode like %-operator substitution syntax.
Reference Implementation
The implementation has been committed to the Python 2.4 source tree.
References
[1] String Formatting Operations
http://docs.python.org/library/stdtypes.html#string-formatting-operations
[2] Identifiers and Keywords
http://docs.python.org/reference/lexical_analysis.html#identifiers-and-keywords
[3] Guido's python-dev posting from 21-Jul-2002
http://mail.python.org/pipermail/python-dev/2002-July/026397.html
[4] http://mail.python.org/pipermail/python-dev/2002-June/025652.html
[5] Reference Implementation
http://sourceforge.net/tracker/index.php?func=detail&aid=1014055&group_id=5470&atid=305470
Copyright
This document has been placed in the public domain.
pep-0293 Codec Error Handling Callbacks
| PEP: | 293 |
|---|---|
| Title: | Codec Error Handling Callbacks |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Walter Dörwald <walter at livinglogic.de> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 18-Jun-2002 |
| Python-Version: | 2.3 |
| Post-History: | 19-Jun-2002 |
Abstract
This PEP aims at extending Python's fixed codec error handling
schemes with a more flexible callback based approach.
Python currently uses a fixed error handling for codec error
handlers. This PEP describes a mechanism which allows Python to
use function callbacks as error handlers. With these more
flexible error handlers it is possible to add new functionality to
existing codecs by e.g. providing fallback solutions or different
encodings for cases where the standard codec mapping does not
apply.
Specification
Currently the set of codec error handling algorithms is fixed to
either "strict", "replace" or "ignore" and the semantics of these
algorithms is implemented separately for each codec.
The proposed patch will make the set of error handling algorithms
extensible through a codec error handler registry which maps
handler names to handler functions. This registry consists of the
following two C functions:
int PyCodec_RegisterError(const char *name, PyObject *error)
PyObject *PyCodec_LookupError(const char *name)
and their Python counterparts
codecs.register_error(name, error)
codecs.lookup_error(name)
PyCodec_LookupError raises a LookupError if no callback function
has been registered under this name.
Similar to the encoding name registry there is no way of
unregistering callback functions or iterating through the
available functions.
The callback functions will be used in the following way by the
codecs: when the codec encounters an encoding/decoding error, the
callback function is looked up by name, the information about the
error is stored in an exception object and the callback is called
with this object. The callback returns information about how to
proceed (or raises an exception).
For encoding, the exception object will look like this:
class UnicodeEncodeError(UnicodeError):
    def __init__(self, encoding, object, start, end, reason):
        UnicodeError.__init__(self,
            ("encoding '%s' can't encode characters "
             "in positions %d-%d: %s") % (encoding,
            start, end-1, reason))
        self.encoding = encoding
        self.object = object
        self.start = start
        self.end = end
        self.reason = reason
This type will be implemented in C with the appropriate setter and
getter methods for the attributes, which have the following
meaning:
* encoding: The name of the encoding;
* object: The original unicode object for which encode() has
been called;
* start: The position of the first unencodable character;
* end: (The position of the last unencodable character)+1 (or
the length of object, if all characters from start to the end
of object are unencodable);
* reason: The reason why object[start:end] couldn't be encoded.
If object has consecutive unencodable characters, the encoder
should collect those characters for one call to the callback,
provided they can't be encoded for the same reason. The encoder
is not required to implement this behaviour and may call the
callback for every single character, but the collecting method
is strongly recommended.
The callback must not modify the exception object. If the
callback does not raise an exception (either the one passed in, or
a different one), it must return a tuple:
(replacement, newpos)
replacement is a unicode object that the encoder will encode and
emit instead of the unencodable object[start:end] part, newpos
specifies a new position within object, where (after encoding the
replacement) the encoder will continue encoding.
Negative values for newpos are treated as being relative to the
end of object. If newpos is out of bounds the encoder will raise
an IndexError.
If the replacement string itself contains an unencodable character
the encoder raises the exception object (but may set a different
reason string before raising).
Should further encoding errors occur, the encoder is allowed to
reuse the exception object for the next call to the callback.
Furthermore the encoder is allowed to cache the result of
codecs.lookup_error.
If the callback does not know how to handle the exception, it must
raise a TypeError.
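As a sketch of the (replacement, newpos) protocol on the encoding side, in modern Python 3 syntax (the handler name "demo_xmlref" is made up for this example; the ASCII encoder collects consecutive unencodable characters as described above):

```python
import codecs

def demo_xmlref(exc):
    if isinstance(exc, UnicodeEncodeError):
        # Build one replacement string for the whole unencodable run,
        # then resume encoding right after it.
        s = "".join("&#%d;" % ord(c) for c in exc.object[exc.start:exc.end])
        return (s, exc.end)
    raise TypeError("can't handle %s" % exc.__class__.__name__)

codecs.register_error("demo_xmlref", demo_xmlref)
```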
Decoding works similarly to encoding, with the following
differences: the exception class is named UnicodeDecodeError, and
the attribute object is the original 8-bit string that the decoder
is currently decoding.
The decoder will call the callback with the bytes that constitute
one undecodable sequence, even if that sequence is directly
followed by further sequences that are undecodable for the same
reason. E.g. for the "unicode-escape" encoding, when decoding the
illegal string "\u00\u01x", the callback will be called twice
(once for "\u00" and once for "\u01"). This is done to be able to
generate the correct number of replacement characters.
The replacement returned from the callback is a unicode object
that will be emitted by the decoder as-is without further
processing instead of the undecodable object[start:end] part.
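A sketch of a decoding handler under the same protocol, in modern Python 3 syntax (the handler name "demo_marker" is illustrative); note that here exc.object is the bytes being decoded:

```python
import codecs

def demo_marker(exc):
    if isinstance(exc, UnicodeDecodeError):
        # Emit one replacement character per undecodable sequence
        # and continue decoding right after it.
        return ("\ufffd", exc.end)
    raise TypeError("can't handle %s" % exc.__class__.__name__)

codecs.register_error("demo_marker", demo_marker)
```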
There is a third API that uses the old strict/ignore/replace error
handling scheme:
PyUnicode_TranslateCharmap/unicode.translate
The proposed patch will enhance PyUnicode_TranslateCharmap, so
that it also supports the callback registry. This has the
additional side effect that PyUnicode_TranslateCharmap will
support multi-character replacement strings (see SF feature
request #403100 [1]).
For PyUnicode_TranslateCharmap the exception class will be named
UnicodeTranslateError. PyUnicode_TranslateCharmap will collect
all consecutive untranslatable characters (i.e. those that map to
None) and call the callback with them. The replacement returned
from the callback is a unicode object that will be put in the
translated result as-is, without further processing.
All encoders and decoders are allowed to implement the callback
functionality themselves, if they recognize the callback name
(i.e. if it is a system callback like "strict", "replace" and
"ignore"). The proposed patch will add two additional system
callback names: "backslashreplace" and "xmlcharrefreplace", which
can be used for encoding and translating and which will also be
implemented in-place for all encoders and
PyUnicode_TranslateCharmap.
The Python equivalent of these five callbacks will look like this:
def strict(exc):
    raise exc

def ignore(exc):
    if isinstance(exc, UnicodeError):
        return (u"", exc.end)
    else:
        raise TypeError("can't handle %s" % exc.__class__.__name__)

def replace(exc):
    if isinstance(exc, UnicodeEncodeError):
        return ((exc.end-exc.start)*u"?", exc.end)
    elif isinstance(exc, UnicodeDecodeError):
        return (u"\ufffd", exc.end)
    elif isinstance(exc, UnicodeTranslateError):
        return ((exc.end-exc.start)*u"\ufffd", exc.end)
    else:
        raise TypeError("can't handle %s" % exc.__class__.__name__)

def backslashreplace(exc):
    if isinstance(exc,
                  (UnicodeEncodeError, UnicodeTranslateError)):
        s = u""
        for c in exc.object[exc.start:exc.end]:
            if ord(c) <= 0xff:
                s += u"\\x%02x" % ord(c)
            elif ord(c) <= 0xffff:
                s += u"\\u%04x" % ord(c)
            else:
                s += u"\\U%08x" % ord(c)
        return (s, exc.end)
    else:
        raise TypeError("can't handle %s" % exc.__class__.__name__)

def xmlcharrefreplace(exc):
    if isinstance(exc,
                  (UnicodeEncodeError, UnicodeTranslateError)):
        s = u""
        for c in exc.object[exc.start:exc.end]:
            s += u"&#%d;" % ord(c)
        return (s, exc.end)
    else:
        raise TypeError("can't handle %s" % exc.__class__.__name__)
These five callback handlers will also be accessible to Python as
codecs.strict_error, codecs.ignore_error, codecs.replace_error,
codecs.backslashreplace_error and codecs.xmlcharrefreplace_error.
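All five handler names work today exactly as the errors argument of encode/decode, shown here in modern Python 3 syntax. (In the implementation that shipped, the module-level callables were named codecs.strict_errors, codecs.ignore_errors, and so on, with a trailing "s".)

```python
# The named handlers are passed directly as the errors argument.
assert "a\u20acb".encode("ascii", "ignore") == b"ab"
assert "a\u20acb".encode("ascii", "replace") == b"a?b"
assert "a\u20acb".encode("ascii", "xmlcharrefreplace") == b"a&#8364;b"
assert "a\u20acb".encode("ascii", "backslashreplace") == b"a\\u20acb"
```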
Rationale
Most legacy encodings do not support the full range of Unicode
characters. For these cases many high level protocols support a
way of escaping a Unicode character (e.g. Python itself supports
the \x, \u and \U conventions; XML supports character references
via &#xxx; etc.).
When implementing such an encoding algorithm, a problem with the
current implementation of the encode method of Unicode objects
becomes apparent: For determining which characters are unencodable
by a certain encoding, every single character has to be tried,
because encode does not provide any information about the location
of the error(s), so
# (1)
us = u"xxx"
s = us.encode(encoding)
has to be replaced by
# (2)
us = u"xxx"
v = []
for c in us:
    try:
        v.append(c.encode(encoding))
    except UnicodeError:
        v.append("&#%d;" % ord(c))
s = "".join(v)
This slows down encoding dramatically as now the loop through the
string is done in Python code and no longer in C code.
Furthermore this solution poses problems with stateful encodings.
For example, UTF-16 uses a Byte Order Mark at the start of the
encoded byte string to specify the byte order. Using (2) with
UTF-16 results in an 8-bit string with a BOM between every
character.
To work around this problem, a stream writer - which keeps state
between calls to the encoding function - has to be used:
# (3)
us = u"xxx"
import codecs, cStringIO as StringIO
writer = codecs.getwriter(encoding)
v = StringIO.StringIO()
uv = writer(v)
for c in us:
    try:
        uv.write(c)
    except UnicodeError:
        uv.write(u"&#%d;" % ord(c))
s = v.getvalue()
To compare the speed of (1) and (3) the following test script has
been used:
# (4)
import time
us = u"äa"*1000000
encoding = "ascii"
import codecs, cStringIO as StringIO

t1 = time.time()
s1 = us.encode(encoding, "replace")
t2 = time.time()

writer = codecs.getwriter(encoding)
v = StringIO.StringIO()
uv = writer(v)
for c in us:
    try:
        uv.write(c)
    except UnicodeError:
        uv.write(u"?")
s2 = v.getvalue()
t3 = time.time()

assert(s1==s2)
print "1:", t2-t1
print "2:", t3-t2
print "factor:", (t3-t2)/(t2-t1)
On Linux this gives the following output (with Python 2.3a0):
1: 0.274321913719
2: 51.1284689903
factor: 186.381278466
i.e. (3) is 180 times slower than (1).
Callbacks must be stateless, because as soon as a callback is
registered it is available globally and can be called by multiple
encode() calls. To be able to use stateful callbacks, the errors
parameter for encode/decode/translate would have to be changed
from char * to PyObject *, so that the callback could be used
directly, without the need to register the callback globally. As
this requires changes to lots of C prototypes, this approach was
rejected.
Currently all encoding/decoding functions have arguments
const Py_UNICODE *p, int size
or
const char *p, int size
to specify the Unicode/8-bit characters to be encoded/decoded.
So in case of an error the codec has to create a new unicode or
str object from these parameters and store it in the exception
object. The callers of these encoding/decoding functions extract
these parameters from str/unicode objects themselves most of the
time, so error handling could be sped up if these objects were
passed directly. As this again requires changes to many C
functions, this approach has been rejected.
For stream readers/writers the errors attribute must be changeable
to be able to switch between different error handling methods
during the lifetime of the stream reader/writer. This is currently
the case for codecs.StreamReader and codecs.StreamWriter and
all their subclasses. All core codecs and probably most of the
third party codecs (e.g. JapaneseCodecs) derive their stream
readers/writers from these classes so this already works,
but the attribute errors should be documented as a requirement.
Implementation Notes
A sample implementation is available as SourceForge patch #432401
[2] including a script for testing the speed of various
string/encoding/error combinations and a test script.
Currently the new exception classes are old-style Python classes.
This means that accessing attributes results in a dict lookup.
The C API is implemented in a way that makes it possible to
switch to new-style classes behind the scenes, if Exception
(and UnicodeError) are changed to new-style classes implemented
in C for improved performance.
The class codecs.StreamReaderWriter uses the errors parameter for
both reading and writing. To be more flexible this should
probably be changed to two separate parameters for reading and
writing.
The errors parameter of PyUnicode_TranslateCharmap is not
available from Python, which makes it impossible to test the new
functionality of PyUnicode_TranslateCharmap from Python scripts.
The patch should add an optional errors argument to
unicode.translate to expose the functionality and make testing
possible.
Codecs that do something other than encoding/decoding from/to
unicode and want to use the new machinery can define their own
exception classes, and the strict handler will automatically work
with them. The other predefined error handlers are unicode
specific and expect to get a Unicode(Encode|Decode|Translate)Error
exception object, so they won't work.
Backwards Compatibility
The semantics of unicode.encode with errors="replace" have
changed: the old version always stored a ? character in the
output string, even if no character was mapped to ? in the
mapping. With the proposed patch, the replacement string from the
callback will again be looked up in the mapping dictionary. But
as all supported encodings are ASCII based, and thus map ? to ?,
this should not be a problem in practice.
Illegal values for the errors argument raised ValueError before;
now they will raise LookupError.
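This behaviour is easy to observe in modern Python 3; note that the handler is only looked up once the codec actually hits an unencodable character, so a bad name passes silently for pure-ASCII input:

```python
# "abc".encode("ascii", "no-such-handler") succeeds: no error occurs,
# so the handler name is never resolved.
assert "abc".encode("ascii", "no-such-handler") == b"abc"

# With an unencodable character, the unknown name raises LookupError.
raised = False
try:
    "\u20ac".encode("ascii", "no-such-handler")
except LookupError:
    raised = True
```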
References
[1] SF feature request #403100
"Multicharacter replacements in PyUnicode_TranslateCharmap"
http://www.python.org/sf/403100
[2] SF patch #432401 "unicode encoding error callbacks"
http://www.python.org/sf/432401
Copyright
This document has been placed in the public domain.
pep-0294 Type Names in the types Module
| PEP: | 294 |
|---|---|
| Title: | Type Names in the types Module |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | oren at hishome.net (Oren Tirosh) |
| Status: | Rejected |
| Type: | Standards Track |
| Created: | 19-Jun-2002 |
| Python-Version: | 2.5 |
| Post-History: |
Abstract
This PEP proposes that symbols matching the type name should be
added to the types module for all basic Python types:
types.IntegerType -> types.int
types.FunctionType -> types.function
types.TracebackType -> types.traceback
...
The long capitalized names currently in the types module will be
deprecated.
With this change the types module can serve as a replacement for
the new module. The new module shall be deprecated and listed in
PEP 4.
Pronouncement
A centralized repository of type names was a mistake. Neither the
"types" nor "new" modules should be carried forward to Python 3.0.
In the meantime, it does not make sense to make the proposed updates
to the modules. This would cause disruption without any compensating
benefit.
Instead, the problem that some internal types (frames, functions,
etc.) don't live anywhere outside those modules may be addressed
by adding them to either __builtin__ or sys. This will provide a
smoother transition to Python 3.0.
Rationale
Using two sets of names for the same objects is redundant and
confusing.
In Python versions prior to 2.2 the symbols matching many type
names were taken by the factory functions for those types. Now
all basic types have been unified with their factory functions and
therefore the type names are available to be consistently used to
refer to the type object.
Most types are accessible as either builtins or in the new module,
but some types such as traceback and generator are only accessible
through the types module under names which do not match the type
name. This PEP provides a uniform way to access all basic types
under a single set of names.
Specification
The types module shall pass the following test:
import types
for t in vars(types).values():
    if type(t) is type:
        assert getattr(types, t.__name__) is t
The types 'class', 'instance method' and 'dict-proxy' have already
been renamed to the valid Python identifiers 'classobj',
'instancemethod' and 'dictproxy', making this possible.
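Since the proposal was rejected, the mismatch it targeted is still visible in modern CPython; a short sketch:

```python
import types

# The type's own __name__ is lowercase, but the types-module
# attribute that exposes it is the long capitalized name.
assert types.FunctionType.__name__ == "function"
assert types.TracebackType.__name__ == "traceback"

# The proposed lowercase aliases were never added.
assert not hasattr(types, "function")
```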
Backward compatibility
Because of their widespread use it is not planned to actually
remove the long names from the types module in some future
version. However, the long names should be changed in
documentation and library sources to discourage their use in new
code.
Reference Implementation
A reference implementation is available in SourceForge patch
#569328: http://www.python.org/sf/569328
Copyright
This document has been placed in the public domain.
pep-0295 Interpretation of multiline string constants
| PEP: | 295 |
|---|---|
| Title: | Interpretation of multiline string constants |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Stepan Koltsov <yozh at mx1.ru> |
| Status: | Rejected |
| Type: | Standards Track |
| Created: | 22-Jul-2002 |
| Python-Version: | 3.0 |
| Post-History: |
Abstract
This PEP describes an interpretation of multiline string constants
for Python. It suggests stripping spaces after newlines and
stripping a newline if it is the first character after the opening
quotation mark.
Rationale
This PEP proposes an interpretation of multiline string constants
in Python. Currently, the value of a string constant is all the
text between the quotation marks, possibly with escape sequences
substituted, e.g.:
def f():
	"""
	la-la-la
	limona, banana
	"""

def g():
	return "This is \
	string"

print repr(f.__doc__)
print repr(g())
prints:
'\n\tla-la-la\n\tlimona, banana\n\t'
'This is \tstring'
This PEP suggests two things:
- ignore the first character after the opening quotation, if it
is a newline
- ignore in string constants all spaces and tabs up to the
first non-whitespace character, but no more than the current
indentation.
After applying this, the previous program will print:
'la-la-la\nlimona, banana\n'
'This is string'
To get this result, the previous program could be rewritten for
current Python as follows (note that this gives the same result
under the new string semantics):
def f():
	"""\
	la-la-la
	limona, banana
	"""

def g():
	return "This is \
string"
Or stripping can be done with library routines at runtime (as
pydoc does), but this decreases program readability.
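The runtime-stripping route mentioned here now exists in the standard library: inspect.cleandoc (and textwrap.dedent) implement roughly the interpretation this PEP proposed, as a sketch shows:

```python
import inspect

def f():
    """
    la-la-la
    limona, banana
    """

# cleandoc drops the leading newline and the common indentation,
# which is essentially the transformation the PEP asked for.
cleaned = inspect.cleandoc(f.__doc__)
```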
Implementation
I'll say nothing about CPython, Jython or Python.NET.
In original Python, there is no info about the current indentation
(in spaces) at compile time, so space and tab stripping should be
done at parse time. Currently no flags can be passed to the
parser in the program text (like from __future__ import xxx). I
suggest enabling or disabling this feature at Python compile
time depending on the CPP flag Py_PARSE_MULTILINE_STRINGS.
Alternatives
New interpretation of string constants can be implemented with flags
'i' and 'o' to string constants, like
i"""
SELECT * FROM car
WHERE model = 'i525'
""" is in new style,
o"""SELECT * FROM employee
WHERE birth < 1982
""" is in old style, and
"""
SELECT employee.name, car.name, car.price FROM employee, car
WHERE employee.salary * 36 > car.price
""" is in new style after Python-x.y.z and in old style otherwise.
Also this feature can be disabled if string is raw, i.e. if flag 'r'
specified.
Copyright
This document has been placed in the Public Domain.
pep-0296 Adding a bytes Object Type
| PEP: | 296 |
|---|---|
| Title: | Adding a bytes Object Type |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | xscottg at yahoo.com (Scott Gilbert) |
| Status: | Withdrawn |
| Type: | Standards Track |
| Created: | 12-Jul-2002 |
| Python-Version: | 2.3 |
| Post-History: |
Notice
This PEP is withdrawn by the author (in favor of PEP 358).
Abstract
This PEP proposes the creation of a new standard type and builtin
constructor called 'bytes'. The bytes object is an efficiently
stored array of bytes with some additional characteristics that
set it apart from several implementations that are similar.
Rationale
Python currently has many objects that implement something akin to
the bytes object of this proposal. For instance the standard
string, buffer, array, and mmap objects are all very similar in
some regards to the bytes object. Additionally, several
significant third party extensions have created similar objects to
try to fill similar needs. Frustratingly, each of these objects
is too narrow in scope and is missing critical features that would
make it applicable to a wider category of problems.
Specification
The bytes object has the following important characteristics:
1. Efficient underlying array storage via the standard C type "unsigned
char". This allows fine grain control over how much memory is
allocated. With the alignment restrictions designated in the next
item, it is trivial for low level extensions to cast the pointer
to a different type as needed.
Also, since the object is implemented as an array of bytes, it is
possible to pass the bytes object to the extensive library of
routines already in the standard library that presently work with
strings. For instance, the bytes object in conjunction with the
struct module could be used to provide a complete replacement for
the array module using only Python script.
If an unusual platform comes to light, one where there isn't a
native unsigned 8 bit type, the object will do its best to
represent itself at the Python script level as though it were an
array of 8 bit unsigned values. It is doubtful whether many
extensions would handle this correctly, but Python script could be
portable in these cases.
2. Alignment of the allocated byte array is whatever is promised by the
platform implementation of malloc. A bytes object created from an
extension can be supplied that provides any arbitrary alignment as
the extension author sees fit.
This alignment restriction should allow the bytes object to be
used as storage for all standard C types - including PyComplex
objects or other structs of standard C type types. Further
alignment restrictions can be provided by extensions as necessary.
3. The bytes object implements a subset of the sequence operations
provided by string/array objects, but with slightly different
semantics in some cases. In particular, a slice always returns a
new bytes object, but the underlying memory is shared between the
two objects. This type of slice behavior has been called creating
a "view". Additionally, repetition and concatenation are
undefined for bytes objects and will raise an exception.
As these objects are likely to find use in high performance
applications, one motivation for the decision to use view slicing
is that copying between bytes objects should be very efficient and
not require the creation of temporary objects. The following code
illustrates this:
# create two 10 Meg bytes objects
b1 = bytes(10000000)
b2 = bytes(10000000)

# copy from part of one to another without creating a 1 Meg temporary
b1[2000000:3000000] = b2[4000000:5000000]
Slice assignment where the rvalue is not the same length as the
lvalue will raise an exception. However, slice assignment will
work correctly with overlapping slices (typically implemented with
memmove).
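The view-slicing behaviour described in item 3 later arrived in Python in the form of memoryview over a buffer object; a sketch of the copy shown above using today's types (sizes reduced for brevity):

```python
b1 = bytearray(1000)            # mutable, zero-filled
b2 = bytearray(b"\x01" * 1000)
m1, m2 = memoryview(b1), memoryview(b2)

# Slicing a memoryview creates a view, not a copy; assigning between
# two views of equal length copies directly, with no temporary object.
m1[200:300] = m2[400:500]
```

As with the proposed bytes object, assigning slices of mismatched length raises an exception.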
4. The bytes object will be recognized as a native type by the pickle and
cPickle modules for efficient serialization. (In truth, this is
the only requirement that can't be implemented via a third party
extension.)
Partial solutions to address the need to serialize the data stored
in a bytes-like object without creating a temporary copy of the
data into a string have been implemented in the past. The tofile
and fromfile methods of the array object are good examples of
this. The bytes object will support these methods too. However,
pickling is useful in other situations - such as in the shelve
module, or implementing RPC of Python objects, and requiring the
end user to use two different serialization mechanisms to get an
efficient transfer of data is undesirable.
XXX: Will try to implement pickling of the new bytes object in
such a way that previous versions of Python will unpickle it as a
string object.
When unpickling, the bytes object will be created from memory
allocated from Python (via malloc). As such, it will lose any
additional properties that an extension supplied pointer might
have provided (special alignment, or special types of memory).
XXX: Will try to make it so that C subclasses of bytes type can
supply the memory that will be unpickled into. For instance, a
derived class called PageAlignedBytes would unpickle to memory
that is also page aligned.
On any platform where an int is 32 bits (most of them), it is
currently impossible to create a string with a length larger than
can be represented in 31 bits. As such, pickling to a string will
raise an exception when the operation is not possible.
At least on platforms supporting large files (many of them),
pickling large bytes objects to files should be possible via
repeated calls to the file.write() method.
5. The bytes type supports the PyBufferProcs interface, but a bytes object
provides the additional guarantee that the pointer will not be
deallocated or reallocated as long as a reference to the bytes
object is held. This implies that a bytes object is not resizable
once it is created, but allows the global interpreter lock (GIL)
to be released while a separate thread manipulates the memory
pointed to if the PyBytes_Check(...) test passes.
This characteristic of the bytes object allows it to be used in
situations such as asynchronous file I/O or on multiprocessor
machines where the pointer obtained by PyBufferProcs will be used
independently of the global interpreter lock.
Knowing that the pointer can not be reallocated or freed after the
GIL is released gives extension authors the capability to get true
concurrency and make use of additional processors for long running
computations on the pointer.
6. In C/C++ extensions, the bytes object can be created from a supplied
pointer and destructor function to free the memory when the
reference count goes to zero.
The special implementation of slicing for the bytes object allows
multiple bytes objects to refer to the same pointer/destructor.
As such, a refcount will be kept on the actual
pointer/destructor. This refcount is separate from the refcount
typically associated with Python objects.
XXX: It may be desirable to expose the inner refcounted object as an
actual Python object. If a good use case arises, it should be possible
for this to be implemented later with no loss to backwards compatibility.
7. It is also possible to mark the bytes object as read-only; in
this case it isn't actually mutable, but it still provides the
other features of a bytes object.
8. The bytes object keeps track of the length of its data with a Python
LONG_LONG type. Even though the current definition for PyBufferProcs
restricts the length to be the size of an int, this PEP does not propose
to make any changes there. Instead, extensions can work around this limit
by making an explicit PyBytes_Check(...) call, and if that succeeds they
can make a PyBytes_GetReadBuffer(...) or PyBytes_GetWriteBuffer call to
get the pointer and full length of the object as a LONG_LONG.
The bytes object will raise an exception if the standard PyBufferProcs
mechanism is used and the size of the bytes object is greater than can be
represented by an integer.
From Python scripting, the bytes object will be subscriptable with longs
so the 32 bit int limit can be avoided.
There is still a problem with the len() function as it is PyObject_Size()
and this returns an int as well. As a workaround, the bytes object will
provide a .length() method that will return a long.
9. The bytes object can be constructed at the Python scripting level by
passing an int/long to the bytes constructor with the number of bytes to
allocate. For example:
b = bytes(100000) # alloc 100K bytes
The constructor can also take another bytes object. This will be useful
for the implementation of unpickling, and in converting a read-write bytes
object into a read-only one. An optional second argument will be used to
designate creation of a readonly bytes object.
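Python's later bytes/bytearray builtins adopted essentially this constructor signature, minus the read-only flag; instead, bytearray is the mutable variant and bytes the immutable one:

```python
b = bytearray(100000)   # allocates 100,000 zero bytes, mutable
ro = bytes(b)           # an immutable copy, taken before mutation

b[0] = 0xff             # mutation is allowed on the bytearray only
```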
10. From the C API, the bytes object can be allocated using any of the
following signatures:
PyObject* PyBytes_FromLength(LONG_LONG len, int readonly);
PyObject* PyBytes_FromPointer(void* ptr, LONG_LONG len, int readonly,
    void (*dest)(void *ptr, void *user), void* user);
In the PyBytes_FromPointer(...) function, if the dest function pointer is
passed in as NULL, it will not be called. This should only be used for
creating bytes objects from statically allocated space.
The user pointer has been called a closure in other places. It is a
pointer that the user can use for whatever purposes. It will be passed to
the destructor function on cleanup and can be useful for a number of
things. If the user pointer is not needed, NULL should be passed instead.
11. The bytes type will be a new style class as that seems to be where all
standard Python types are headed.
Contrast to existing types
The most common way to work around the lack of a bytes object has been to
simply use a string object in its place. Binary files, the struct/array
modules, and several other examples exist of this. Putting aside the
style issue that these uses typically have nothing to do with text
strings, there is the real problem that strings are not mutable, so direct
manipulation of the data returned in these cases is not possible. Also,
numerous optimizations in the string module (such as caching the hash
value or interning the pointers) mean that extension authors are on very
thin ice if they try to break the rules with the string object.
The buffer object seems like it was intended to address the purpose that
the bytes object is trying to fulfill, but several shortcomings in its
implementation [1] have made it less useful in many common cases. The
buffer object made a different choice for its slicing behavior (it returns
new strings instead of buffers for slicing and other operations), and it
doesn't make many of the promises on alignment or being able to release
the GIL that the bytes object does.
Also in regards to the buffer object, it is not possible to simply replace
the buffer object with the bytes object and maintain backwards
compatibility. The buffer object provides a mechanism to take the
PyBufferProcs supplied pointer of another object and present it as its
own. Since the behavior of the other object can not be guaranteed to
follow the same set of strict rules that a bytes object does, it can't be
used in places that a bytes object could.
The array module supports the creation of an array of bytes, but it does
not provide a C API for supplying pointers and destructors to extension
supplied memory. This makes it unusable for constructing objects out of
shared memory, or memory that has special alignment or locking for things
like DMA transfers. Also, the array object does not currently pickle.
Finally since the array object allows its contents to grow, via the extend
method, the pointer can be changed if the GIL is not held while using it.
Creating a buffer object from an array object has the same problem of
leaving an invalid pointer when the array object is resized.
The mmap object caters to its particular niche, but does not attempt to
solve a wider class of problems.
Finally, no third party extension can implement pickling without
creating a temporary object of a standard Python type. For example, in
the Numeric community it is unpleasant that a large array can't be
pickled without creating a large binary string to duplicate the array data.
Backward Compatibility
The only possibility for backwards compatibility problems that the author
is aware of are in previous versions of Python that try to unpickle data
containing the new bytes type.
Reference Implementation
XXX: Actual implementation is in progress, but changes are still possible
as this PEP gets further review.
The following new files will be added to the Python baseline:
Include/bytesobject.h # C interface
Objects/bytesobject.c # C implementation
Lib/test/test_bytes.py # unit testing
Doc/lib/libbytes.tex # documentation
The following files will also be modified:
Include/Python.h # adding bytesobject.h include file
Python/bltinmodule.c # adding the bytes type object
Modules/cPickle.c # adding bytes to the standard types
Lib/pickle.py # adding bytes to the standard types
It is possible that several other modules could be cleaned up and
implemented in terms of the bytes object. The mmap module comes to mind
first, but as noted above it would be possible to reimplement the array
module as a pure Python module. While it is attractive that this PEP
could actually reduce the amount of source code by some amount, the author
feels that this could cause unnecessary risk for breaking existing
applications and should be avoided at this time.
Additional Notes/Comments
- Guido van Rossum wondered whether it would make sense to be able
to create a bytes object from a mmap object. The mmap object
appears to support the requirements necessary to provide memory
for a bytes object. (It doesn't resize, and the pointer is valid
for the lifetime of the object.) As such, a method could be added
to the mmap module such that a bytes object could be created
directly from a mmap object. An initial stab at how this would be
implemented would be to use the PyBytes_FromPointer() function
described above and pass the mmap_object as the user pointer. The
destructor function would decref the mmap_object for cleanup.
- Todd Miller notes that it may be useful to have two new functions:
PyObject_AsLargeReadBuffer() and PyObject_AsLargeWriteBuffer that are
similar to PyObject_AsReadBuffer() and PyObject_AsWriteBuffer(), but
support getting a LONG_LONG length in addition to the void* pointer.
These functions would allow extension authors to work transparently with
bytes object (that support LONG_LONG lengths) and most other buffer like
objects (which only support int lengths). These functions could be in
lieu of, or in addition to, creating a specific PyByte_GetReadBuffer() and
PyBytes_GetWriteBuffer() functions.
XXX: The author thinks this is a very good idea as it paves the way for
other objects to eventually support large (64 bit) pointers, and it should
only affect abstract.c and abstract.h. Should this be added above?
- It was generally agreed that abusing the segment count of the
PyBufferProcs interface is not a good hack to work around the 31 bit
limitation of the length. If you don't know what this means, then you're
in good company. Most code in the Python baseline, and presumably in many
third party extensions, punt when the segment count is not 1.
References
[1] The buffer interface
http://mail.python.org/pipermail/python-dev/2000-October/009974.html
Copyright
This document has been placed in the public domain.
pep-0297 Support for System Upgrades
| PEP: | 297 |
|---|---|
| Title: | Support for System Upgrades |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Marc-André Lemburg <mal at lemburg.com> |
| Status: | Rejected |
| Type: | Standards Track |
| Created: | 19-Jul-2001 |
| Python-Version: | 2.6 |
| Post-History: |
Rejection Notice
This PEP is rejected for failure to generate significant interest.
Abstract
This PEP proposes strategies to allow the Python standard library
to be upgraded in parts without having to reinstall the complete
distribution or having to wait for a new patch level release.
Problem
Python currently does not allow overriding modules or packages in
the standard library by default. Even though this is possible by
defining a PYTHONPATH environment variable (the paths defined in
this variable are prepended to the Python standard library path),
there is no standard way of achieving this without changing the
configuration.
Since Python's standard library is starting to host packages which
are also available separately, e.g. the distutils, email and PyXML
packages, which can also be installed independently of the Python
distribution, it is desirable to have an option to upgrade these
packages without having to wait for a new patch level release of
the Python interpreter to bring along the changes.
On some occasions, it may also be desirable to update modules of
the standard library without going through the whole Python release
cycle, e.g. in order to provide hot-fixes for security problems.
Proposed Solutions
This PEP proposes two different but not necessarily conflicting
solutions:
1. Adding a new standard search path to sys.path:
$stdlibpath/system-packages just before the $stdlibpath
entry. This complements the already existing entry for site
add-ons $stdlibpath/site-packages which is appended to the
sys.path at interpreter startup time.
To make use of this new standard location, distutils will need
to grow support for installing certain packages in
$stdlibpath/system-packages rather than the standard location
for third-party packages $stdlibpath/site-packages.
2. Tweaking distutils to install directly into $stdlibpath for the
system upgrades rather than into $stdlibpath/site-packages.
The first solution has a few advantages over the second:
* upgrades can be easily identified (just look in
$stdlibpath/system-packages)
* upgrades can be de-installed without affecting the rest
of the interpreter installation
* modules can be virtually removed from packages; this is
due to the way Python imports packages: once it finds the
top-level package directory it stays in this directory for
all subsequent package submodule imports
* the approach has an overall much cleaner design than the
hackish install on top of an existing installation approach
The only advantages of the second approach are that the Python
interpreter does not have to be changed and that it works with
older Python versions.
Both solutions require changes to distutils. These changes can
also be implemented by package authors, but it would be better to
define a standard way of switching on the proposed behaviour.
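The search-order effect of solution 1 can be sketched in a few lines of Python. This is a toy model only; the real change would live in interpreter startup (and distutils), and the directory names below are purely illustrative.

```python
def insert_system_packages(path_entries, stdlibpath):
    """Sketch of solution 1: return a new module search path with
    $stdlibpath/system-packages inserted just before the $stdlibpath
    entry (falling back to the front if $stdlibpath is absent)."""
    entry = stdlibpath + "/system-packages"
    result = list(path_entries)
    idx = result.index(stdlibpath) if stdlibpath in result else 0
    result.insert(idx, entry)
    return result
```

Because the new entry sits ahead of the stdlib entry, an upgraded package in system-packages shadows the bundled copy, while site-packages (appended at the end) remains the home of ordinary third-party add-ons.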
Scope
Solution 1: Python 2.6 and up
Solution 2: all Python versions supported by distutils
Credits
None
References
None
Copyright
This document has been placed in the public domain.
pep-0298 The Locked Buffer Interface
| PEP: | 298 |
|---|---|
| Title: | The Locked Buffer Interface |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Thomas Heller <theller at python.net> |
| Status: | Withdrawn |
| Type: | Standards Track |
| Created: | 26-Jul-2002 |
| Python-Version: | 2.3 |
| Post-History: | 30-Jul-2002, 1-Aug-2002 |
Abstract
This PEP proposes an extension to the buffer interface called the
'locked buffer interface'.
The locked buffer interface avoids the flaws of the 'old' buffer
interface [1] as defined in Python versions up to and including
2.2, and has the following semantics:
The lifetime of the retrieved pointer is clearly defined and
controlled by the client.
The buffer size is returned as a 'size_t' data type, which
allows access to large buffers on platforms where sizeof(int)
!= sizeof(void *).
(Guido comments: This second sounds like a change we could also
make to the "old" buffer interface, if we introduce another flag
bit that's *not* part of the default flags.)
Specification
The locked buffer interface exposes new functions which return the
size and the pointer to the internal memory block of any Python
object which chooses to implement this interface.
Retrieving a buffer from an object puts this object in a locked
state during which the buffer may not be freed, resized, or
reallocated.
The object must be unlocked again once the buffer is no longer
used, by releasing it through another function in the locked buffer
interface. If the object never resizes or reallocates the buffer
during its lifetime, this function may be NULL. Failure to call
this function (if it is != NULL) is a programming error and may
have unexpected results.
The locked buffer interface omits the memory segment model which
is present in the old buffer interface - only a single memory
block can be exposed.
The memory blocks can be accessed without holding the global
interpreter lock.
Implementation
Define a new flag in Include/object.h:
/* PyBufferProcs contains bf_acquirelockedreadbuffer,
bf_acquirelockedwritebuffer, and bf_releaselockedbuffer */
#define Py_TPFLAGS_HAVE_LOCKEDBUFFER (1L<<15)
This flag would be included in Py_TPFLAGS_DEFAULT:
#define Py_TPFLAGS_DEFAULT ( \
....
Py_TPFLAGS_HAVE_LOCKEDBUFFER | \
....
0)
Extend the PyBufferProcs structure by new fields in
Include/object.h:
typedef size_t (*acquirelockedreadbufferproc)(PyObject *,
const void **);
typedef size_t (*acquirelockedwritebufferproc)(PyObject *,
void **);
typedef void (*releaselockedbufferproc)(PyObject *);
typedef struct {
getreadbufferproc bf_getreadbuffer;
getwritebufferproc bf_getwritebuffer;
getsegcountproc bf_getsegcount;
getcharbufferproc bf_getcharbuffer;
/* locked buffer interface functions */
acquirelockedreadbufferproc bf_acquirelockedreadbuffer;
acquirelockedwritebufferproc bf_acquirelockedwritebuffer;
releaselockedbufferproc bf_releaselockedbuffer;
} PyBufferProcs;
The new fields are present if the Py_TPFLAGS_HAVE_LOCKEDBUFFER
flag is set in the object's type.
The Py_TPFLAGS_HAVE_LOCKEDBUFFER flag implies the
Py_TPFLAGS_HAVE_GETCHARBUFFER flag.
The acquirelockedreadbufferproc and acquirelockedwritebufferproc
functions return the size in bytes of the memory block on success,
and fill in the passed void * pointer. If these
functions fail - either because an error occurs or no memory block
is exposed - they must set the void * pointer to NULL and raise an
exception. The return value is undefined in these cases and
should not be used.
If calls to these functions succeed, eventually the buffer must be
released by a call to the releaselockedbufferproc, supplying the
original object as argument. The releaselockedbufferproc cannot
fail. For objects that actually maintain an internal lock count
it would be a fatal error if the releaselockedbufferproc function
would be called too often, leading to a negative lock count.
Similar to the 'old' buffer interface, any of these functions may
be set to NULL, but it is strongly recommended to implement the
releaselockedbufferproc function (even if it does nothing) if any
of the acquireread/writelockedbufferproc functions are
implemented, to discourage extension writers from checking for a
NULL value and not calling it.
These functions aren't supposed to be called directly; they are
called through convenience functions declared in
Include/abstract.h:
int PyObject_AcquireLockedReadBuffer(PyObject *obj,
const void **buffer,
size_t *buffer_len);
int PyObject_AcquireLockedWriteBuffer(PyObject *obj,
void **buffer,
size_t *buffer_len);
void PyObject_ReleaseLockedBuffer(PyObject *obj);
The former two functions return 0 on success, set buffer to the
memory location and buffer_len to the length of the memory block
in bytes. On failure, or if the locked buffer interface is not
implemented by obj, they return -1 and set an exception.
The latter function doesn't return anything, and cannot fail.
Backward Compatibility
The size of the PyBufferProcs structure changes if this proposal
is implemented, but the type's tp_flags slot can be used to
determine if the additional fields are present.
Reference Implementation
An implementation has been uploaded to the SourceForge patch
manager as http://www.python.org/sf/652857.
Additional Notes/Comments
Python strings, unicode strings, mmap objects, and array objects
would expose the locked buffer interface.
mmap and array objects would actually enter a locked state while
the buffer is active; this is not needed for strings and unicode
objects. Resizing locked array objects is not allowed and will
raise an exception. Whether closing a locked mmap object is an
error or will only be deferred until the lock count reaches zero
is an implementation detail.
Guido recommends:
But I'm still very concerned that if most built-in types
(e.g. strings, bytes) don't implement the release
functionality, it's too easy for an extension to seem to work
while forgetting to release the buffer.
I recommend that at least some built-in types implement the
acquire/release functionality with a counter, and assert that
the counter is zero when the object is deleted -- if the
assert fails, someone DECREF'ed their reference to the object
without releasing it. (The rule should be that you must own a
reference to the object while you've acquired the object.)
For strings that might be impractical because the string
object would have to grow 4 bytes to hold the counter; but the
new bytes object (PEP 296) could easily implement the counter,
and the array object too -- that way there will be plenty of
opportunity to test proper use of the protocol.
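Guido's lock-count recommendation can be modelled in pure Python. This is a toy sketch only; the real protocol lives in the C-level PyBufferProcs slots, and the class and method names here are illustrative, not part of the proposal.

```python
class LockCountedBuffer:
    """Toy model of the lock-counting acquire/release protocol:
    acquiring raises the count, releasing lowers it, and resizing
    is forbidden while any lock is held."""

    def __init__(self, data):
        self._data = bytearray(data)
        self._locks = 0

    def acquire_read(self):
        self._locks += 1
        return bytes(self._data)  # stands in for the raw pointer

    def release(self):
        # Releasing more often than acquiring is the fatal error
        # the PEP describes (a negative lock count).
        if self._locks == 0:
            raise RuntimeError("release without matching acquire")
        self._locks -= 1

    def resize(self, size):
        if self._locks:
            raise ValueError("cannot resize a locked buffer")
        del self._data[size:]

    def __del__(self):
        # Guido's suggested assertion: a nonzero count at deletion
        # means someone DECREF'ed without releasing.
        assert self._locks == 0, "buffer deleted while locked"
```

Under this model, forgetting a release() trips the deletion-time assertion during testing, which is exactly the kind of early failure the counter is meant to provide.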
Community Feedback
Greg Ewing doubts the locked buffer interface is needed at all; he
thinks the normal buffer interface could be used if the pointer is
(re)fetched each time it's used. This seems to be dangerous,
because even innocent looking calls to the Python API like
Py_DECREF() may trigger execution of arbitrary Python code.
The first version of this proposal didn't have the release
function, but it turned out that this would have been too
restrictive: mmap and array objects wouldn't have been able to
implement it, because mmap objects can be closed anytime if not
locked, and array objects could resize or reallocate the buffer.
This PEP will probably be rejected because nobody except the
author needs it.
References
[1] The buffer interface
http://mail.python.org/pipermail/python-dev/2000-October/009974.html
[2] The Buffer Problem
http://www.python.org/dev/peps/pep-0296/
Copyright
This document has been placed in the public domain.
pep-0299 Special __main__() function in modules
| PEP: | 299 |
|---|---|
| Title: | Special __main__() function in modules |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Jeff Epler <jepler at unpythonic.net> |
| Status: | Rejected |
| Type: | Standards Track |
| Created: | 12-Aug-2002 |
| Python-Version: | 2.3 |
| Post-History: | 29-Mar-2006 |
Abstract
Many Python modules are also intended to be callable as standalone
scripts. This PEP proposes that a special function called
__main__() should serve this purpose.
Motivation
There should be one simple and universal idiom for invoking a
module as a standalone script.
The semi-standard idiom
if __name__ == '__main__':
perform "standalone" functionality
is unclear to programmers of languages like C and C++. It also
does not permit invocation of the standalone function when the
module is imported. The variant
if __name__ == '__main__':
main_function()
is sometimes seen, but there exists no standard name for the
function, and because arguments are taken from sys.argv it is not
possible to pass specific arguments without changing the argument
list seen by all other modules. (Imagine a threaded Python
program, with two threads wishing to invoke the standalone
functionality of different modules with different argument lists)
Proposal
The standard name of the 'main function' should be '__main__'.
When a module is invoked on the command line, such as
python mymodule.py
then the module behaves as though the following lines existed at
the end of the module (except that the attribute __sys may not be
used or assumed to exist elsewhere in the script):
if globals().has_key("__main__"):
import sys as __sys
__sys.exit(__main__(__sys.argv))
Other modules may execute
import mymodule
mymodule.__main__(['mymodule', ...])
It is up to mymodule to document thread-safety issues or other
issues which might restrict use of __main__. (Other issues might
include use of mutually exclusive GUI modules, non-sharable
resources like hardware devices, reassignment of sys.stdin/stdout,
etc)
Implementation
In Modules/main.c, the block near line 385 (after the
PyRun_AnyFileExFlags call) will be changed so that the above code
(or its C equivalent) is executed.
Open Issues
- Should the return value from __main__ be treated as the exit value?
Yes. Many __main__ functions will naturally return None, which
sys.exit translates into a "success" return code. For those that
return a numeric result, it behaves just like the argument to
sys.exit() or the return value from C's main().
- Should the argument list to __main__ include argv[0], or just the
"real" arguments argv[1:]?
argv[0] is included for symmetry with sys.argv and easy
transition to the new standard idiom.
Rejection
In a short discussion on python-dev [1], two major backwards
compatibility problems were brought up and Guido pronounced that he
doesn't like the idea anyway as it's "not worth the change (in docs,
user habits, etc.) and there's nothing particularly broken."
References
[1] Georg Brandl, "What about PEP 299",
http://mail.python.org/pipermail/python-dev/2006-March/062951.html
Copyright
This document has been placed in the public domain.
pep-0301 Package Index and Metadata for Distutils
| PEP: | 301 |
|---|---|
| Title: | Package Index and Metadata for Distutils |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Richard Jones <richard at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 24-Oct-2002 |
| Python-Version: | 2.3 |
| Post-History: | 8-Nov-2002 |
Contents
Abstract
This PEP proposes several extensions to the Distutils packaging system [1]. These enhancements include a central package index server, tools for submitting package information to the index, and extensions to the package metadata to include Trove [2] information.
This PEP does not address issues of package dependency. It also does not address storage and download of packages as described in PEP 243 [6]. Nor is it proposing a local database of packages as described in PEP 262 [7].
Existing package repositories such as the Vaults of Parnassus [3], CPAN [4] and PAUSE [5] will be investigated as prior art in this field.
Rationale
Python programmers have long needed a simple method of discovering existing modules and systems available for their use. It is arguable that the existence of these systems for other languages has been a significant contribution to their popularity. The existence of the Catalog-SIG, and the many discussions there, indicates that there is a large population of users who recognise this need.
The introduction of the Distutils packaging system to Python simplified the process of distributing shareable code, and included mechanisms for the capture of package metadata, but did little with the metadata save ship it with the package.
An interface to the index should be hosted in the python.org domain, giving it an air of legitimacy that existing catalog efforts do not have.
The interface for submitting information to the catalog should be as simple as possible - hopefully just a one-line command for most users.
Issues of package dependency are not addressed due to the complexity of such a system. PEP 262 proposes such a system, but as of this writing the PEP is still unfinished.
Issues of package dissemination (storage on a central server) are not addressed because they require assumptions about availability of storage and bandwidth that I am not in a position to make. PEP 243, which is still being developed, is tackling these issues and many more. This proposal is considered compatible with, and adjunct to the proposal in PEP 243.
Specification
The specification takes three parts, the web interface, the Distutils register command and the Distutils Trove classification.
Web Interface
A web interface is implemented over a simple store. The interface is available through the python.org domain, either directly or as packages.python.org.
The store has columns for all metadata fields. The (name, version) double is used as a uniqueness key. Additional submissions for an existing (name, version) will result in an update operation.
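The create-or-update semantics of the (name, version) uniqueness key can be sketched with a dictionary. This is a toy model of the store, not the actual implementation; the field names are the metadata fields from the specification.

```python
# Toy model of the index store, keyed on the (name, version) double.
index = {}

def submit(metadata):
    """Create a new entry, or update the existing one if this
    (name, version) pair has been submitted before."""
    key = (metadata["name"], metadata["version"])
    action = "create" if key not in index else "update"
    index[key] = dict(metadata)
    return action
```

Submitting the same (name, version) twice thus never produces a duplicate row; the second submission simply replaces the stored metadata.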
The web interface implements the following commands/interfaces:
- index
- Lists known packages, optionally filtered. An additional HTML page, search, presents a form to the user which is used to customise the index view. The index will include a browsing interface like that presented in the Trove interface design section 4.3. The results will be paginated, sorted alphabetically and only showing the most recent version. The most recent version information will be determined using the Distutils LooseVersion class.
- display
- Displays information about the package. All fields are displayed as plain text. The "url" (or "home_page") field is hyperlinked.
- submit
Accepts a POST submission of metadata about a package. The "name" and "version" fields are mandatory, as they uniquely identify an entry in the index. Submit will automatically determine whether to create a new entry or update an existing entry. The metadata is checked for correctness where appropriate - specifically the Trove discriminators are compared with the allowed set. An update will update all information about the package based on the new submitted information.
There will also be a submit/edit form that will allow manual submission and updating for those who do not use Distutils.
- submit_pkg_info
- Accepts a POST submission of a PKG-INFO file and performs the same function as the submit interface.
- user
Registers a new user with the index. Requires username, password and email address. Passwords will be stored in the index database as SHA hashes. If the username already exists in the database:
- If valid HTTP Basic authentication is provided, the password and email address are updated with the submission information, or
- If no valid authentication is provided, the user is informed that the login is already taken.
Registration will be a three-step process, involving:
- User submission of details via the Distutils register command or through the web,
- Index server sending email to the user's email address with a URL to visit to confirm registration with a random one-time key, and
- User visits URL with the key and confirms registration.
- roles
- An interface for changing user Role assignments.
- password_reset
- Using a supplied email address as the key, this resets a user's password and sends an email with the new password to the user.
The submit command will require HTTP Basic authentication, preferably over an HTTPS connection.
The server interface will indicate success or failure of the commands through a subset of the standard HTTP response codes:
| Code | Meaning | Register command implications |
|---|---|---|
| 200 | OK | Everything worked just fine |
| 400 | Bad request | Data provided for submission was malformed |
| 401 | Unauthorised | The username or password supplied were incorrect |
| 403 | Forbidden | User does not have permission to update the package information (not Owner or Maintainer) |
User Roles
Three user Roles will be assignable to users:
- Owner
- Owns a package name, may assign Maintainer Role for that name. The first user to register information about a package is deemed Owner of the package name. The Admin user may change this if necessary. May submit updates for the package name.
- Maintainer
- Can submit and update info for a particular package name.
- Admin
- Can assign Owner Role and edit user details. Not specific to a package name.
Index Storage (Schema)
The index is stored in a set of relational database tables:
- packages
- Lists package names and holds package-level metadata (currently just the stable release version)
- releases
- Each package has an entry in releases for each version of the package that is released. A row holds the bulk of the information given in the package's PKG-INFO file. There is one row for each package (name, version).
- trove_discriminators
- Lists the Trove discriminator text and assigns each one a unique ID.
- release_discriminators
- Each entry maps a package (name, version) to a discriminator_id. We map to releases instead of packages because the set of discriminators may change between releases.
- journals
- Holds information about changes to package information in the index. Changes to the packages, releases, roles, and release_discriminators tables are listed here by package name and version if the change is release-specific.
- users
- Holds our user database - user name, email address and password.
- roles
- Maps user_name and role_name to a package_name.
An additional table, rego_otk holds the One Time Keys generated during registration and is not interesting in the scope of the index itself.
Distutils register Command
An additional Distutils command, register, is implemented which posts the package metadata to the central index. The register command automatically handles user registration; the user is presented with three options:
- login and submit package information
- register as a new packager
- send password reminder email
On systems where the $HOME environment variable is set, the user will be prompted at exit to save their username/password to the file .pypirc in their $HOME directory.
Notification of changes to a package entry will be sent to all users who have submitted information about the package. That is, the original submitter and any subsequent updaters.
The register command will include a --verify option which performs a test submission to the index without actually committing the data. The index will perform its submission verification checks as usual and report any errors it would have reported during a normal submission. This is useful for verifying correctness of Trove discriminators.
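The classifier check that --verify triggers can be sketched as a simple membership test. This is a hypothetical helper for illustration; the real check is performed server-side against the index's allowed set.

```python
def bad_classifiers(submitted, allowed):
    """Return the submitted classifiers that are not in the
    server's allowed list, as a --verify run would report them."""
    allowed_set = set(allowed)
    return [c for c in submitted if c not in allowed_set]
```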
Distutils Trove Classification
The Trove concept of discrimination will be added to the metadata set available to package authors through the new attribute "classifiers". The list of classifiers will be available through the web, and added to the package like so:
setup(
name = "roundup",
version = __version__,
classifiers = [
'Development Status :: 4 - Beta',
'Environment :: Console',
'Environment :: Web Environment',
'Intended Audience :: End Users/Desktop',
'Intended Audience :: Developers',
'Intended Audience :: System Administrators',
'License :: OSI Approved :: Python Software Foundation License',
'Operating System :: MacOS :: MacOS X',
'Operating System :: Microsoft :: Windows',
'Operating System :: POSIX',
'Programming Language :: Python',
'Topic :: Communications :: Email',
'Topic :: Office/Business',
'Topic :: Software Development :: Bug Tracking',
],
url = 'http://sourceforge.net/projects/roundup/',
...
)
It was decided that strings would be used for the classification entries due to the deep nesting that would be involved in a more formal Python structure.
The original Trove specification, which requires that classification namespaces be separated by slashes ("/"), unfortunately collides with many of the names having slashes in them (e.g. "OS/2"). The double-colon solution (" :: ") implemented by SourceForge and FreshMeat gets around this limitation.
The list of classification values on the module index has been merged from FreshMeat and SourceForge (with their permission). This list will be made available both through the web interface and through the register command's --list-classifiers option as a text list which may then be copied to the setup.py file. The register command's --verify option will check classifiers values against the server's list.
Unfortunately, the addition of the "classifiers" property is not backwards-compatible. A setup.py file using it will not work under Python 2.1.3. It is hoped that a bug-fix release of Python 2.2 (most likely 2.2.3) will relax the argument checking of the setup() command to allow new keywords, even if they're not actually used. It is preferable that a warning be produced rather than a show-stopping error. The use of the new keyword should be discouraged in situations where the package is advertised as being compatible with Python versions earlier than 2.2.3 or 2.3.
In the PKG-INFO, the classifiers list items will appear as individual Classifier: entries:
Name: roundup
Version: 0.5.2
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console (Text Based)
.
.
Classifier: Topic :: Software Development :: Bug Tracking
Url: http://sourceforge.net/projects/roundup/
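Collecting the repeated Classifier: entries back out of a PKG-INFO body is straightforward. The sketch below is illustrative; real metadata parsing treats PKG-INFO as RFC 822-style headers.

```python
def parse_classifiers(pkg_info):
    """Extract the repeated Classifier: header values from a
    PKG-INFO body, in order of appearance."""
    prefix = "Classifier:"
    return [line[len(prefix):].strip()
            for line in pkg_info.splitlines()
            if line.startswith(prefix)]
```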
Implementation
The server is available at:
http://www.python.org/pypi
The code is available from the SourceForge project:
http://sourceforge.net/projects/pypi/
The register command has been integrated into Python 2.3.
Rejected Proposals
Originally, the index server was to return custom headers (inspired by PEP 243):
- X-Pypi-Status
- Either "success" or "fail".
- X-Pypi-Reason
- A description of the reason for failure, or additional information in the case of a success.
However, it has been pointed out [8] that this is a bad scheme to use.
References
| [1] | Distutils packaging system (http://docs.python.org/library/distutils.html) |
| [2] | Trove (http://www.catb.org/~esr/trove/) |
| [3] | Vaults of Parnassus (http://www.vex.net/parnassus/) |
| [4] | CPAN (http://www.cpan.org/) |
| [5] | PAUSE (http://pause.cpan.org/) |
| [6] | PEP 243, Module Repository Upload Mechanism (http://www.python.org/dev/peps/pep-0243/) |
| [7] | PEP 262, A Database of Installed Python Packages (http://www.python.org/dev/peps/pep-0262/) |
| [8] | [PEP243] upload status is bogus (http://mail.python.org/pipermail/distutils-sig/2001-March/002262.html) |
Copyright
This document has been placed in the public domain.
Acknowledgements
Anthony Baxter, Martin v. Loewis and David Goodger for encouragement and feedback during initial drafting.
A.M. Kuchling for support including hosting the second prototype.
Greg Stein for recommending that the register command interpret the HTTP response codes rather than custom X-PyPI-* headers.
The many participants of the Distutils and Catalog SIGs for their ideas over the years.
pep-0302 New Import Hooks
| PEP: | 302 |
|---|---|
| Title: | New Import Hooks |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Just van Rossum <just at letterror.com>, Paul Moore <p.f.moore at gmail.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 19-Dec-2002 |
| Python-Version: | 2.3 |
| Post-History: | 19-Dec-2002 |
Contents
- Abstract
- Motivation
- Use cases
- Rationale
- Specification part 1: The Importer Protocol
- Specification part 2: Registering Hooks
- Packages and the role of __path__
- Optional Extensions to the Importer Protocol
- Integration with the 'imp' module
- Forward Compatibility
- Open Issues
- Implementation
- References and Footnotes
- Copyright
Warning
The language reference for import [10] and importlib documentation [11] now supersede this PEP. This document is no longer updated and is provided for historical purposes only.
Abstract
This PEP proposes to add a new set of import hooks that offer better customization of the Python import mechanism. Contrary to the current __import__ hook, a new-style hook can be injected into the existing scheme, allowing for a finer grained control of how modules are found and how they are loaded.
Motivation
The only way to customize the import mechanism is currently to override the built-in __import__ function. However, overriding __import__ has many problems. To begin with:
- An __import__ replacement needs to fully reimplement the entire import mechanism, or call the original __import__ before or after the custom code.
- It has very complex semantics and responsibilities.
- __import__ gets called even for modules that are already in sys.modules, which is almost never what you want, unless you're writing some sort of monitoring tool.
The situation gets worse when you need to extend the import mechanism from C: it's currently impossible, apart from hacking Python's import.c or reimplementing much of import.c from scratch.
There is a fairly long history of tools written in Python that allow extending the import mechanism in various ways, based on the __import__ hook. The Standard Library includes two such tools: ihooks.py (by GvR) and imputil.py [1] (Greg Stein), but perhaps the most famous is iu.py by Gordon McMillan, available as part of his Installer package. Their usefulness is somewhat limited because they are written in Python; bootstrapping issues need to be worked around as you can't load the module containing the hook with the hook itself. So if you want the entire Standard Library to be loadable from an import hook, the hook must be written in C.
Use cases
This section lists several existing applications that depend on import hooks. Among these, a lot of duplicate work was done that could have been saved if there had been a more flexible import hook at the time. This PEP should make life a lot easier for similar projects in the future.
Extending the import mechanism is needed when you want to load modules that are stored in a non-standard way. Examples include modules that are bundled together in an archive; byte code that is not stored in a pyc formatted file; modules that are loaded from a database over a network.
The work on this PEP was partly triggered by the implementation of PEP 273, which adds imports from Zip archives as a built-in feature to Python. While the PEP itself was widely accepted as a must-have feature, the implementation left a few things to be desired. For one thing, it went to great lengths to integrate itself with import.c, adding lots of code that was either specific to Zip file imports, or not specific to Zip imports yet not generally useful (or even desirable) either. Yet the PEP 273 implementation can hardly be blamed for this: it is simply extremely hard to do, given the current state of import.c.
Packaging applications for end users is a typical use case for import hooks, if not the typical use case. Distributing lots of source or pyc files around is not always appropriate (let alone a separate Python installation), so there is a frequent desire to package all needed modules in a single file. So frequent in fact that multiple solutions have been implemented over the years.
The oldest one is included with the Python source code: Freeze [2]. It puts marshalled byte code into static objects in C source code. Freeze's "import hook" is hard wired into import.c, and has a couple of issues. Later solutions include Fredrik Lundh's Squeeze, Gordon McMillan's Installer, and Thomas Heller's py2exe [3]. MacPython ships with a tool called BuildApplication.
Squeeze, Installer and py2exe use an __import__ based scheme (py2exe currently uses Installer's iu.py, Squeeze used ihooks.py); MacPython has two Mac-specific import hooks hard wired into import.c that are similar to the Freeze hook. The hooks proposed in this PEP enable us (at least in theory; it's not a short term goal) to get rid of the hard coded hooks in import.c, and would allow the __import__-based tools to get rid of most of their import.c emulation code.
Before work on the design and implementation of this PEP was started, a new BuildApplication-like tool for Mac OS X prompted one of the authors of this PEP (JvR) to expose the table of frozen modules to Python, in the imp module. The main reason was to be able to use the freeze import hook (avoiding fancy __import__ support), yet to also be able to supply a set of modules at runtime. This resulted in issue #642578 [4], which was mysteriously accepted (mostly because nobody seemed to care either way ;-). Yet it is completely superfluous when this PEP gets accepted, as it offers a much nicer and general way to do the same thing.
Rationale
While experimenting with alternative implementation ideas to get built-in Zip import, it was discovered that achieving this is possible with only a fairly small amount of changes to import.c. This made it possible to factor out the Zip-specific code into a new source file, while at the same time creating a general new import hook scheme: the one you're reading about now.
An earlier design allowed non-string objects on sys.path. Such an object would have the necessary methods to handle an import. This has two disadvantages: 1) it breaks code that assumes all items on sys.path are strings; 2) it is not compatible with the PYTHONPATH environment variable. The latter is directly needed for Zip imports. A compromise came from Jython: allow string subclasses on sys.path, which would then act as importer objects. This avoids some breakage, and seems to work well for Jython (where it is used to load modules from .jar files), but it was perceived as an "ugly hack".
This led to a more elaborate scheme (mostly copied from McMillan's iu.py), in which each importer in a list of candidates is asked whether it can handle the sys.path item, until one is found that can. This list of candidates is a new object in the sys module: sys.path_hooks.
Traversing sys.path_hooks for each path item for each new import can be expensive, so the results are cached in another new object in the sys module: sys.path_importer_cache. It maps sys.path entries to importer objects.
To minimize the impact on import.c as well as to avoid adding extra overhead, it was chosen not to add an explicit hook and importer object for the existing file system import logic (as iu.py has), but to simply fall back to the built-in logic if no hook on sys.path_hooks could handle the path item. If this is the case, a None value is stored in sys.path_importer_cache, again to avoid repeated lookups. (Later we can go further and add a real importer object for the built-in mechanism; for now, the None fallback scheme should suffice.)
A question was raised: what about importers that don't need any entry on sys.path? (Built-in and frozen modules fall into that category.) Again, Gordon McMillan to the rescue: iu.py contains a thing he calls the metapath. In this PEP's implementation, it's a list of importer objects that is traversed before sys.path. This list is yet another new object in the sys module: sys.meta_path. Currently, this list is empty by default, and frozen and built-in module imports are done after traversing sys.meta_path, but still before sys.path.
Specification part 1: The Importer Protocol
This PEP introduces a new protocol: the "Importer Protocol". It is important to understand the context in which the protocol operates, so here is a brief overview of the outer shells of the import mechanism.
When an import statement is encountered, the interpreter looks up the __import__ function in the built-in name space. __import__ is then called with four arguments, amongst which are the name of the module being imported (may be a dotted name) and a reference to the current global namespace.
The built-in __import__ function (known as PyImport_ImportModuleEx() in import.c) will then check to see whether the module doing the import is a package or a submodule of a package. If it is indeed a (submodule of a) package, it first tries to do the import relative to the package (the parent package for a submodule). For example if a package named "spam" does "import eggs", it will first look for a module named "spam.eggs". If that fails, the import continues as an absolute import: it will look for a module named "eggs". Dotted name imports work pretty much the same: if package "spam" does "import eggs.bacon" (and "spam.eggs" exists and is itself a package), "spam.eggs.bacon" is tried. If that fails "eggs.bacon" is tried. (There are more subtleties that are not described here, but these are not relevant for implementers of the Importer Protocol.)
Deeper down in the mechanism, a dotted name import is split up by its components. For "import spam.ham", first an "import spam" is done, and only when that succeeds is "ham" imported as a submodule of "spam".
The Importer Protocol operates at this level of individual imports. By the time an importer gets a request for "spam.ham", module "spam" has already been imported.
The protocol involves two objects: a finder and a loader. A finder object has a single method:
finder.find_module(fullname, path=None)
This method will be called with the fully qualified name of the module. If the finder is installed on sys.meta_path, it will receive a second argument, which is None for a top-level module, or package.__path__ for submodules or subpackages [5]. It should return a loader object if the module was found, or None if it wasn't. If find_module() raises an exception, it will be propagated to the caller, aborting the import.
A loader object also has one method:
loader.load_module(fullname)
This method returns the loaded module or raises an exception, preferably ImportError if an existing exception is not being propagated. If load_module() is asked to load a module that it cannot, ImportError is to be raised.
In many cases the finder and loader can be one and the same object: finder.find_module() would just return self.
The fullname argument of both methods is the fully qualified module name, for example "spam.eggs.ham". As explained above, when finder.find_module("spam.eggs.ham") is called, "spam.eggs" has already been imported and added to sys.modules. However, the find_module() method isn't necessarily always called during an actual import: meta tools that analyze import dependencies (such as freeze, Installer or py2exe) don't actually load modules, so a finder shouldn't depend on the parent package being available in sys.modules.
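To make the two-object protocol concrete, here is a minimal sketch of a combined finder/loader. The class name and module name are invented for the example, and types.ModuleType stands in for imp.new_module(); a real importer would of course consult its own storage backend in find_module():

```python
import sys
import types

class DemoImporter:
    """Hypothetical minimal combined finder/loader."""

    def find_module(self, fullname, path=None):
        # Act as our own loader for exactly one module name.
        if fullname == "demo_mod":
            return self
        return None  # let the rest of the import machinery handle it

    def load_module(self, fullname):
        if fullname != "demo_mod":
            raise ImportError("cannot load %s" % fullname)
        # Reuse an existing module object if one is already in sys.modules.
        mod = sys.modules.setdefault(fullname, types.ModuleType(fullname))
        mod.__file__ = "<demo>"
        mod.__loader__ = self
        mod.__package__ = ""
        return mod
```

Note how find_module() returns self, so the same object serves as both finder and loader, as described above.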
The load_module() method has a few responsibilities that it must fulfill before it runs any code:
- If there is an existing module object named 'fullname' in sys.modules, the loader must use that existing module. (Otherwise, the reload() builtin will not work correctly.) If a module named 'fullname' does not exist in sys.modules, the loader must create a new module object and add it to sys.modules.

  Note that the module object must be in sys.modules before the loader executes the module code. This is crucial because the module code may (directly or indirectly) import itself; adding it to sys.modules beforehand prevents unbounded recursion in the worst case and multiple loading in the best.

  If the load fails, the loader needs to remove any module it may have inserted into sys.modules. If the module was already in sys.modules then the loader should leave it alone.

- The __file__ attribute must be set. This must be a string, but it may be a dummy value, for example "<frozen>". The privilege of not having a __file__ attribute at all is reserved for built-in modules.

- The __name__ attribute must be set. If one uses imp.new_module() then the attribute is set automatically.

- If it's a package, the __path__ variable must be set. This must be a list, but may be empty if __path__ has no further significance to the importer (more on this later).

- The __loader__ attribute must be set to the loader object. This is mostly for introspection and reloading, but can be used for importer-specific extras, for example getting data associated with an importer.

- The __package__ attribute [8] must be set.

- If the module is a Python module (as opposed to a built-in module or a dynamically loaded extension), it should execute the module's code in the module's global name space (module.__dict__).
Here is a minimal pattern for a load_module() method:
    # Consider using importlib.util.module_for_loader() to handle
    # most of these details for you.
    def load_module(self, fullname):
        code = self.get_code(fullname)
        ispkg = self.is_package(fullname)
        mod = sys.modules.setdefault(fullname, imp.new_module(fullname))
        mod.__file__ = "<%s>" % self.__class__.__name__
        mod.__loader__ = self
        if ispkg:
            mod.__path__ = []
            mod.__package__ = fullname
        else:
            mod.__package__ = fullname.rpartition('.')[0]
        exec(code, mod.__dict__)
        return mod
Specification part 2: Registering Hooks
There are two types of import hooks: Meta hooks and Path hooks. Meta hooks are called at the start of import processing, before any other import processing (so that meta hooks can override sys.path processing, frozen modules, or even built-in modules). To register a meta hook, simply add the finder object to sys.meta_path (the list of registered meta hooks).
Path hooks are called as part of sys.path (or package.__path__) processing, at the point where their associated path item is encountered. A path hook is registered by adding an importer factory to sys.path_hooks.
sys.path_hooks is a list of callables, which will be checked in sequence to determine if they can handle a given path item. The callable is called with one argument, the path item. The callable must raise ImportError if it is unable to handle the path item, and return an importer object if it can handle the path item. Note that if the callable returns an importer object for a specific sys.path entry, the builtin import machinery will not be invoked to handle that entry any longer, even if the importer object later fails to find a specific module. The callable is typically the class of the import hook, and hence the class __init__() method is called. (This is also the reason why it should raise ImportError: an __init__() method can't return anything. This would be possible with a __new__() method in a new style class, but we don't want to require anything about how a hook is implemented.)
The results of path hook checks are cached in sys.path_importer_cache, which is a dictionary mapping path entries to importer objects. The cache is checked before sys.path_hooks is scanned. If it is necessary to force a rescan of sys.path_hooks, it is possible to manually clear all or part of sys.path_importer_cache.
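A path hook callable following these rules might look like this sketch. The cookie string and class name are invented for illustration, and registration is shown in a comment rather than performed:

```python
COOKIE = "<demo-cookie>"  # invented marker path entry

class CookieImporter:
    """Path hook example: accepts only the cookie entry."""

    def __init__(self, path_item):
        if path_item != COOKIE:
            # Required by the protocol: decline entries we don't handle.
            raise ImportError("CookieImporter cannot handle %r" % path_item)

    def find_module(self, fullname, path=None):
        return None  # this demo importer never actually finds anything

# Registration would be:
#     sys.path_hooks.append(CookieImporter)
#     sys.path_importer_cache.clear()  # force a rescan of sys.path_hooks
```

Since the class itself is the registered callable, the ImportError is raised from __init__(), exactly as the text above describes.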
Just like sys.path itself, the new sys variables must have specific types:
- sys.meta_path and sys.path_hooks must be Python lists.
- sys.path_importer_cache must be a Python dict.
Modifying these variables in place is allowed, as is replacing them with new objects.
Packages and the role of __path__
If a module has a __path__ attribute, the import mechanism will treat it as a package. The __path__ variable is used instead of sys.path when importing submodules of the package. The rules for sys.path therefore also apply to pkg.__path__: sys.path_hooks is consulted when pkg.__path__ is traversed. Meta importers don't necessarily use sys.path at all to do their work and may therefore ignore the value of pkg.__path__. In this case it is still advised to set it to a list, which may be empty.
Optional Extensions to the Importer Protocol
The Importer Protocol defines three optional extensions. One is to retrieve data files, the second is to support module packaging tools and/or tools that analyze module dependencies (for example Freeze), while the last is to support execution of modules as scripts. The latter two categories of tools usually don't actually load modules, they only need to know if and where they are available. All three extensions are highly recommended for general purpose importers, but may safely be left out if those features aren't needed.
To retrieve the data for arbitrary "files" from the underlying storage backend, loader objects may supply a method named get_data():
loader.get_data(path)
This method returns the data as a string, or raises IOError if the "file" wasn't found. The data is always returned as if "binary" mode was used - there is no CRLF translation of text files, for example. It is meant for importers that have some file-system-like properties. The 'path' argument is a path that can be constructed by munging module.__file__ (or pkg.__path__ items) with the os.path.* functions, for example:
    d = os.path.dirname(__file__)
    data = __loader__.get_data(os.path.join(d, "logo.gif"))
The following set of methods may be implemented if support for (for example) Freeze-like tools is desirable. It consists of three additional methods; to make things easier for the caller, either all of them should be implemented, or none at all:
    loader.is_package(fullname)
    loader.get_code(fullname)
    loader.get_source(fullname)
All three methods should raise ImportError if the module wasn't found.
The loader.is_package(fullname) method should return True if the module specified by 'fullname' is a package and False if it isn't.
The loader.get_code(fullname) method should return the code object associated with the module, or None if it's a built-in or extension module. If the loader doesn't have the code object but it does have the source code, it should return the compiled source code. (This is so that our caller doesn't also need to check get_source() if all it needs is the code object.)
The loader.get_source(fullname) method should return the source code for the module as a string (using newline characters for line endings) or None if the source is not available (yet it should still raise ImportError if the module can't be found by the importer at all).
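A sketch of these three methods implemented together, backed by a hypothetical in-memory source table (the table, names, and class are invented for the example; a real importer would consult its storage backend):

```python
# Invented in-memory "storage backend" for the example.
SOURCES = {
    "pkgdemo": "",            # a package with an empty __init__
    "pkgdemo.mod": "X = 42",  # a plain module
}
PACKAGES = {"pkgdemo"}

class SourceImporter:
    """Implements the three optional introspection methods together."""

    def is_package(self, fullname):
        if fullname not in SOURCES:
            raise ImportError(fullname)
        return fullname in PACKAGES

    def get_source(self, fullname):
        if fullname not in SOURCES:
            raise ImportError(fullname)
        return SOURCES[fullname]

    def get_code(self, fullname):
        # No stored byte code, so compile the source, as the text asks.
        return compile(self.get_source(fullname), "<%s>" % fullname, "exec")
```

All three methods raise ImportError for unknown modules, and get_code() compiles the source since this importer stores no byte code.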
To support execution of modules as scripts [6], the above three methods for finding the code associated with a module must be implemented. In addition to those methods, the following method may be provided in order to allow the runpy module to correctly set the __file__ attribute:
loader.get_filename(fullname)
This method should return the value that __file__ would be set to if the named module was loaded. If the module is not found, then ImportError should be raised.
Integration with the 'imp' module
The new import hooks are not easily integrated in the existing imp.find_module() and imp.load_module() calls. It's questionable whether it's possible at all without breaking code; it is better to simply add a new function to the imp module. The meaning of the existing imp.find_module() and imp.load_module() calls changes from: "they expose the built-in import mechanism" to "they expose the basic unhooked built-in import mechanism". They simply won't invoke any import hooks. A new imp module function is proposed (but not yet implemented) under the name get_loader(), which is used as in the following pattern:
    loader = imp.get_loader(fullname, path)
    if loader is not None:
        loader.load_module(fullname)
In the case of a "basic" import, one that the imp.find_module() function would handle, the loader object would be a wrapper for the current output of imp.find_module(), and loader.load_module() would call imp.load_module() with that output.
Note that this wrapper is currently not yet implemented, although a Python prototype exists in the test_importhooks.py script (the ImpWrapper class) included with the patch.
Forward Compatibility
Existing __import__ hooks will not invoke new-style hooks by magic, unless they call the original __import__ function as a fallback. For example, ihooks.py, iu.py and imputil.py are in this sense not forward compatible with this PEP.
Open Issues
Modules often need supporting data files to do their job, particularly in the case of complex packages or full applications. Current practice is generally to locate such files via sys.path (or a package.__path__ attribute). This approach will not work, in general, for modules loaded via an import hook.
There are a number of possible ways to address this problem:
- "Don't do that". If a package needs to locate data files via its __path__, it is not suitable for loading via an import hook. The package can still be located on a directory in sys.path, as at present, so this should not be seen as a major issue.
- Locate data files from a standard location, rather than relative to the module file. A relatively simple approach (which is supported by distutils) would be to locate data files based on sys.prefix (or sys.exec_prefix). For example, looking in os.path.join(sys.prefix, "data", package_name).
- Import hooks could offer a standard way of getting at data files relative to the module file. The standard zipimport object provides a method get_data(name) which returns the content of the "file" called name, as a string. To allow modules to get at the importer object, zipimport also adds an attribute __loader__ to the module, containing the zipimport object used to load the module. If such an approach is used, it is important that client code takes care not to break if the get_data() method is not available, so it is not clear that this approach offers a general answer to the problem.
It was suggested on python-dev that it would be useful to be able to receive a list of available modules from an importer and/or a list of available data files for use with the get_data() method. The protocol could grow two additional extensions, say list_modules() and list_files(). The latter makes sense on loader objects with a get_data() method. However, it's a bit unclear which object should implement list_modules(): the importer or the loader or both?
This PEP is biased towards loading modules from alternative places: it currently doesn't offer dedicated solutions for loading modules from alternative file formats or with alternative compilers. In contrast, the ihooks module from the standard library does have a fairly straightforward way to do this. The Quixote project [7] uses this technique to import PTL files as if they are ordinary Python modules. To do the same with the new hooks would either mean to add a new module implementing a subset of ihooks as a new-style importer, or add a hookable built-in path importer object.
There is no specific support within this PEP for "stacking" hooks. For example, it is not obvious how to write a hook to load modules from tar.gz files by combining separate hooks to load modules from .tar and .gz files. However, there is no support for such stacking in the existing hook mechanisms (either the basic "replace __import__" method, or any of the existing import hook modules) and so this functionality is not an obvious requirement of the new mechanism. It may be worth considering as a future enhancement, however.
It is possible (via sys.meta_path) to add hooks which run before sys.path is processed. However, there is no equivalent way of adding hooks to run after sys.path is processed. For now, if a hook is required after sys.path has been processed, it can be simulated by adding an arbitrary "cookie" string at the end of sys.path, and having the required hook associated with this cookie, via the normal sys.path_hooks processing. In the longer term, the path handling code will become a "real" hook on sys.meta_path, and at that stage it will be possible to insert user-defined hooks either before or after it.
Implementation
The PEP 302 implementation has been integrated with Python as of 2.3a1. An earlier version is available as patch #652586 [9], but more interestingly, the issue contains a fairly detailed history of the development and design.
References and Footnotes
| [1] | imputil module http://docs.python.org/library/imputil.html |
| [2] | The Freeze tool. See also the Tools/freeze/ directory in a Python source distribution |
| [3] | py2exe by Thomas Heller http://www.py2exe.org/ |
| [4] | imp.set_frozenmodules() patch http://bugs.python.org/issue642578 |
| [5] | The path argument to finder.find_module() is there because the pkg.__path__ variable may be needed at this point. It may either come from the actual parent module or be supplied by imp.find_module() or the proposed imp.get_loader() function. |
| [6] | PEP 338: Executing modules as scripts http://www.python.org/dev/peps/pep-0338/ |
| [7] | Quixote, a framework for developing Web applications http://www.mems-exchange.org/software/quixote/ |
| [8] | PEP 366: Main module explicit relative imports http://www.python.org/dev/peps/pep-0366/ |
| [9] | New import hooks + Import from Zip files http://bugs.python.org/issue652586 |
| [10] | Language reference for imports http://docs.python.org/3/reference/import.html |
| [11] | importlib documentation http://docs.python.org/3/library/importlib.html#module-importlib |
Copyright
This document has been placed in the public domain.
pep-0303 Extend divmod() for Multiple Divisors
| PEP: | 303 |
|---|---|
| Title: | Extend divmod() for Multiple Divisors |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Thomas Bellman <bellman+pep-divmod at lysator.liu.se> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 31-Dec-2002 |
| Python-Version: | 2.3 |
| Post-History: |
Abstract
This PEP describes an extension to the built-in divmod() function,
allowing it to take multiple divisors, chaining several calls to
divmod() into one.
Pronouncement
This PEP is rejected. Most uses for chained divmod() involve a
constant modulus (in radix conversions for example) and are more
properly coded as a loop. The example of splitting seconds
into days/hours/minutes/seconds does not generalize to months
and years; rather, the whole use case is handled more flexibly and
robustly by date and time modules. The other use cases mentioned
in the PEP are somewhat rare in real code. The proposal is also
problematic in terms of clarity and obviousness. In the examples,
it is not immediately clear that the argument order is correct or
that the target tuple is of the right length. Users from other
languages are more likely to understand the standard two argument
form without having to re-read the documentation. See python-dev
discussion on 17 June 2005.
Specification
The built-in divmod() function would be changed to accept multiple
divisors, changing its signature from divmod(dividend, divisor) to
divmod(dividend, *divisors). The dividend is divided by the last
divisor, giving a quotient and a remainder. The quotient is then
divided by the second to last divisor, giving a new quotient and
remainder. This is repeated until all divisors have been used,
and divmod() then returns a tuple consisting of the quotient from
the last step, and the remainders from all the steps.
A Python implementation of the new divmod() behaviour could look
like:
    def divmod(dividend, *divisors):
        modulos = ()
        q = dividend
        while divisors:
            q, r = q.__divmod__(divisors[-1])
            modulos = (r,) + modulos
            divisors = divisors[:-1]
        return (q,) + modulos
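The proposed semantics can be exercised with a standalone version of the
function, renamed divmod_chain here so the builtin divmod() is not
shadowed:

```python
def divmod_chain(dividend, *divisors):
    # Same algorithm as the PEP's reference implementation, but
    # calling the builtin divmod() and using a non-clashing name.
    modulos = ()
    q = dividend
    while divisors:
        q, r = divmod(q, divisors[-1])
        modulos = (r,) + modulos
        divisors = divisors[:-1]
    return (q,) + modulos

# 1000000 seconds as (weeks, days, hours, minutes, seconds):
print(divmod_chain(1000000, 7, 24, 60, 60))  # (1, 4, 13, 46, 40)
```

With a single divisor the result matches the two-argument builtin, and
with no divisors it returns a one-element tuple, as the Rationale below
specifies.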
Motivation
Occasionally one wants to perform a chain of divmod() operations,
calling divmod() on the quotient from the previous step, with
varying divisors. The most common case is probably converting a
number of seconds into weeks, days, hours, minutes and seconds.
This would today be written as:
    def secs_to_wdhms(seconds):
        m, s = divmod(seconds, 60)
        h, m = divmod(m, 60)
        d, h = divmod(h, 24)
        w, d = divmod(d, 7)
        return (w, d, h, m, s)
This is tedious and easy to get wrong each time you need it.
If instead the divmod() built-in is changed according the proposal,
the code for converting seconds to weeks, days, hours, minutes and
seconds then become
    def secs_to_wdhms(seconds):
        w, d, h, m, s = divmod(seconds, 7, 24, 60, 60)
        return (w, d, h, m, s)
which is easier to type, easier to type correctly, and easier to
read.
Other applications are:
- Astronomical angles (declination is measured in degrees, minutes
and seconds, right ascension is measured in hours, minutes and
seconds).
- Old British currency (1 pound = 20 shilling, 1 shilling = 12 pence)
- Anglo-Saxon length units: 1 mile = 1760 yards, 1 yard = 3 feet,
1 foot = 12 inches.
- Anglo-Saxon weight units: 1 long ton = 160 stone, 1 stone = 14
pounds, 1 pound = 16 ounce, 1 ounce = 16 dram
- British volumes: 1 gallon = 4 quart, 1 quart = 2 pint, 1 pint
= 20 fluid ounces
Rationale
The idea comes from APL, which has an operator that does this. (I
don't remember what the operator looks like, and it would probably
be impossible to render in ASCII anyway.)
The APL operator takes a list as its second operand, while this
PEP proposes that each divisor should be a separate argument to
the divmod() function. This is mainly because it is expected that
the most common uses will have the divisors as constants right in
the call (as the 7, 24, 60, 60 above), and adding a set of
parentheses or brackets would just clutter the call.
Requiring an explicit sequence as the second argument to divmod()
would seriously break backwards compatibility. Making divmod()
check its second argument for being a sequence is deemed to be too
ugly to contemplate. And in the case where one *does* have a
sequence that is computed other-where, it is easy enough to write
divmod(x, *divs) instead.
Requiring at least one divisor, i.e. rejecting divmod(x), has been
considered, but no good reason to do so has come to mind, so it is
allowed in the name of generality.
Calling divmod() with no divisors should still return a tuple (of
one element). Code that calls divmod() with a varying number of
divisors, and thus gets a return value with an "unknown" number of
elements, would otherwise have to special-case that situation. Code
that *knows* it is calling divmod() with no divisors is considered
to be too silly to warrant a special case.
Processing the divisors in the other direction, i.e. dividing with
the first divisor first, instead of dividing with the last divisor
first, has been considered. However, the result comes with the
most significant part first and the least significant part last
(think of the chained divmod as a way of splitting a number into
"digits", with varying weights), and it is reasonable to specify
the divisors (weights) in the same order as the result.
The inverse operation:
    def inverse_divmod(seq, *factors):
        product = seq[0]
        for x, y in zip(factors, seq[1:]):
            product = product * x + y
        return product
could also be useful. However, writing
seconds = (((((w * 7) + d) * 24 + h) * 60 + m) * 60 + s)
is less cumbersome both to write and to read than the chained
divmods. It is therefore deemed to be less important, and its
introduction can be deferred to its own PEP. Also, such a
function needs a good name, and the PEP author has not managed to
come up with one yet.
Calling divmod("spam") does not raise an error, despite strings
supporting neither division nor modulo. However, unless we know
the other object too, we can't determine whether divmod() would
work or not, and thus it seems silly to forbid it.
Backwards Compatibility
Any module that replaces the divmod() function in the __builtin__
module may cause other modules using the new syntax to break. It
is expected that this is very uncommon.
Code that expects a TypeError exception when calling divmod() with
anything but two arguments will break. This is also expected to
be very uncommon.
No other issues regarding backwards compatibility are known.
Reference Implementation
Not finished yet, but it seems a rather straightforward
new implementation of the function builtin_divmod() in
Python/bltinmodule.c
Copyright
This document has been placed in the public domain.
pep-0304 Controlling Generation of Bytecode Files
| PEP: | 304 |
|---|---|
| Title: | Controlling Generation of Bytecode Files |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Skip Montanaro |
| Status: | Withdrawn |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 22-Jan-2003 |
| Post-History: | 27-Jan-2003, 31-Jan-2003, 17-Jun-2005 |
Contents
Abstract
This PEP outlines a mechanism for controlling the generation and location of compiled Python bytecode files. This idea originally arose as a patch request [1] and evolved into a discussion thread on the python-dev mailing list [2]. The introduction of an environment variable will allow people installing Python or Python-based third-party packages to control whether or not bytecode files should be generated at installation time, and if so, where they should be written. It will also allow users to control whether or not bytecode files should be generated at application run-time, and if so, where they should be written.
Proposal
Add a new environment variable, PYTHONBYTECODEBASE, to the mix of environment variables which Python understands. PYTHONBYTECODEBASE is interpreted as follows:
If not defined, Python bytecode is generated in exactly the same way as is currently done. sys.bytecodebase is set to the root directory (either / on Unix and Mac OSX or the root directory of the startup (installation???) drive -- typically C:\ -- on Windows).
If defined and it refers to an existing directory to which the user has write permission, sys.bytecodebase is set to that directory and bytecode files are written into a directory structure rooted at that location.
If defined but empty, sys.bytecodebase is set to None and generation of bytecode files is suppressed altogether.
If defined and one of the following is true:
- it does not refer to a directory,
- it refers to a directory, but not one for which the user has write permission
a warning is displayed, sys.bytecodebase is set to None and generation of bytecode files is suppressed altogether.
After startup initialization, all runtime references are to sys.bytecodebase, not the PYTHONBYTECODEBASE environment variable. sys.path is not modified.
From the above, we see sys.bytecodebase can only take on two valid types of values: None or a string referring to a valid directory on the system.
During import, this extension works as follows:
- The normal search for a module is conducted. The search order is roughly: dynamically loaded extension module, Python source file, Python bytecode file. The only time this mechanism comes into play is if a Python source file is found.
- Once we've found a source module, an attempt to read a byte-compiled file in the same directory is made. (This is the same as before.)
- If no byte-compiled file is found, an attempt to read a byte-compiled file from the augmented directory is made.
- If bytecode generation is required, the generated bytecode is written to the augmented directory if possible.
Note that this PEP is explicitly not about providing module-by-module or directory-by-directory control over the disposition of bytecode files.
Glossary
- "bytecode base" refers to the current setting of sys.bytecodebase.
- "augmented directory" refers to the directory formed from the bytecode base and the directory name of the source file.
- PYTHONBYTECODEBASE refers to the environment variable when necessary to distinguish it from "bytecode base".
Locating bytecode files
When the interpreter is searching for a module, it will use sys.path as usual. However, when a possible bytecode file is considered, an extra probe for a bytecode file may be made. First, a check is made for the bytecode file using the directory in sys.path which holds the source file (the current behavior). If a valid bytecode file is not found there (either one does not exist or exists but is out-of-date) and the bytecode base is not None, a second probe is made using the directory in sys.path prefixed appropriately by the bytecode base.
Writing bytecode files
When the bytecode base is not None, a new bytecode file is written to the appropriate augmented directory, never directly to a directory in sys.path.
Defining augmented directories
Conceptually, the augmented directory for a bytecode file is the directory in which the source file exists prefixed by the bytecode base. In a Unix environment this would be:
    pcb = os.path.abspath(sys.bytecodebase)
    if sourcefile[0] == os.sep:
        sourcefile = sourcefile[1:]
    augdir = os.path.join(pcb, os.path.dirname(sourcefile))
On Windows, which does not have a single-rooted directory tree, the drive letter of the directory containing the source file is treated as a directory component after removing the trailing colon. The augmented directory is thus derived as
pcb = os.path.abspath(sys.bytecodebase)
drive, base = os.path.splitdrive(os.path.dirname(sourcefile))
drive = drive[:-1]
if base[0] == "\\":
    base = base[1:]
augdir = os.path.join(pcb, drive, base)
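For illustration, the two derivations can be folded into one cross-platform helper (a sketch, not part of the proposal: on POSIX systems os.path.splitdrive() returns an empty drive, so only the leading separator is stripped):

```python
import os

def augmented_dir(bytecode_base, sourcefile):
    # On Windows the drive letter (minus its colon) becomes an extra
    # path component; on Unix the drive is empty and os.path.join
    # simply skips it.
    pcb = os.path.abspath(bytecode_base)
    drive, base = os.path.splitdrive(os.path.dirname(sourcefile))
    if drive:
        drive = drive[:-1]          # "C:" -> "C"
    if base[:1] in (os.sep, "\\"):
        base = base[1:]
    return os.path.join(pcb, drive, base)
```

On Unix, augmented_dir("/tmp", "/usr/lib/python2.3/urllib.py") yields "/tmp/usr/lib/python2.3", matching the examples later in this PEP.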
Fixing the location of the bytecode base
During program startup, the value of the PYTHONBYTECODEBASE environment variable is made absolute, checked for validity and added to the sys module, effectively:
pcb = os.path.abspath(os.environ["PYTHONBYTECODEBASE"])
probe = os.path.join(pcb, "foo")
try:
    open(probe, "w")
except IOError:
    sys.bytecodebase = None
else:
    os.unlink(probe)
    sys.bytecodebase = pcb
This allows the user to specify the bytecode base as a relative path, but not have it subject to changes to the current working directory during program execution. (I can't imagine you'd want it to move around during program execution.)
There is nothing special about sys.bytecodebase. The user may change it at runtime if desired, but normally it will not be modified.
Rationale
In many environments it is not possible for non-root users to write into directories containing Python source files. Most of the time, this is not a problem as Python source is generally byte compiled during installation. However, there are situations where bytecode files are either missing or need to be updated. If the directory containing the source file is not writable by the current user a performance penalty is incurred each time a program importing the module is run. [3] Warning messages may also be generated in certain circumstances. If the directory is writable, nearly simultaneous attempts to write the bytecode file by two separate processes may occur, resulting in file corruption. [4]
In environments with RAM disks available, it may be desirable for performance reasons to write bytecode files to a directory on such a disk. Similarly, in environments where Python source code resides on network file systems, it may be desirable to cache bytecode files on local disks.
Alternatives
The only other alternative proposed so far [1] seems to be to add a -R flag to the interpreter to disable writing bytecode files altogether. This proposal subsumes that. Adding a command-line option is certainly possible, but is probably not sufficient, as the interpreter's command line is not readily available during installation (early during program startup).
Issues
- Interpretation of a module's __file__ attribute. I believe the __file__ attribute of a module should reflect the true location of the bytecode file. If people want to locate a module's source code, they should use imp.find_module(module).
- Security - What if root has PYTHONBYTECODEBASE set? Yes, this can present a security risk, but so can many other things the root user does. The root user should probably not set PYTHONBYTECODEBASE except possibly during installation. Still, perhaps this problem can be minimized. When running as root the interpreter should check to see if PYTHONBYTECODEBASE refers to a directory which is writable by anyone other than root. If so, it could raise an exception or warning and set sys.bytecodebase to None. Or, see the next item.
- More security - What if PYTHONBYTECODEBASE refers to a general directory (say, /tmp)? In this case, perhaps loading of a preexisting bytecode file should occur only if the file is owned by the current user or root. (Does this matter on Windows?)
- The interaction of this PEP with import hooks has not been considered yet. In fact, the best way to implement this idea might be as an import hook. See PEP 302. [5]
- In the current (pre-PEP 304) environment, it is safe to delete a source file after the corresponding bytecode file has been created, since they reside in the same directory. With PEP 304 as currently defined, this is not the case. A bytecode file in the augmented directory is only considered when the source file is present and is thus never considered when looking for module files ending in ".pyc". I think this behavior may have to change.
Examples
In the examples which follow, the urllib source code resides in /usr/lib/python2.3/urllib.py and /usr/lib/python2.3 is in sys.path but is not writable by the current user.
- The bytecode base is /tmp. /usr/lib/python2.3/urllib.pyc exists and is valid. When urllib is imported, the contents of /usr/lib/python2.3/urllib.pyc are used. The augmented directory is not consulted. No other bytecode file is generated.
- The bytecode base is /tmp. /usr/lib/python2.3/urllib.pyc exists, but is out-of-date. When urllib is imported, the generated bytecode file is written to urllib.pyc in the augmented directory which has the value /tmp/usr/lib/python2.3. Intermediate directories will be created as needed.
- The bytecode base is None. No urllib.pyc file is found. When urllib is imported, no bytecode file is written.
- The bytecode base is /tmp. No urllib.pyc file is found. When urllib is imported, the generated bytecode file is written to the augmented directory which has the value /tmp/usr/lib/python2.3. Intermediate directories will be created as needed.
- At startup, PYTHONBYTECODEBASE is /tmp/foobar, which does not exist. A warning is emitted, sys.bytecodebase is set to None and no bytecode files are written during program execution unless sys.bytecodebase is later changed to refer to a valid, writable directory.
- At startup, PYTHONBYTECODEBASE is set to /, which exists, but is not writable by the current user. A warning is emitted, sys.bytecodebase is set to None and no bytecode files are written during program execution unless sys.bytecodebase is later changed to refer to a valid, writable directory. Note that even though the augmented directory constructed for a particular bytecode file may be writable by the current user, what counts is that the bytecode base directory itself is writable.
- At startup PYTHONBYTECODEBASE is set to the empty string. sys.bytecodebase is set to None. No warning is generated, however. If no urllib.pyc file is found when urllib is imported, no bytecode file is written.
In the Windows examples which follow, the urllib source code resides in C:\PYTHON22\urllib.py. C:\PYTHON22 is in sys.path but is not writable by the current user.
- The bytecode base is set to C:\TEMP. C:\PYTHON22\urllib.pyc exists and is valid. When urllib is imported, the contents of C:\PYTHON22\urllib.pyc are used. The augmented directory is not consulted.
- The bytecode base is set to C:\TEMP. C:\PYTHON22\urllib.pyc exists, but is out-of-date. When urllib is imported, a new bytecode file is written to the augmented directory which has the value C:\TEMP\C\PYTHON22. Intermediate directories will be created as needed.
- At startup PYTHONBYTECODEBASE is set to TEMP and the current working directory at application startup is H:\NET. The potential bytecode base is thus H:\NET\TEMP. If this directory exists and is writable by the current user, sys.bytecodebase will be set to that value. If not, a warning will be emitted and sys.bytecodebase will be set to None.
- The bytecode base is C:\TEMP. No urllib.pyc file is found. When urllib is imported, the generated bytecode file is written to the augmented directory which has the value C:\TEMP\C\PYTHON22. Intermediate directories will be created as needed.
Implementation
See the patch on Sourceforge. [6]
References
| [1] | (1, 2) patch 602345, Option for not writing py.[co] files, Klose (http://www.python.org/sf/602345) |
| [2] | python-dev thread, Disable writing .py[co], Norwitz (http://mail.python.org/pipermail/python-dev/2003-January/032270.html) |
| [3] | Debian bug report, Mailman is writing to /usr in cron, Wegner (http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=96111) |
| [4] | python-dev thread, Parallel pyc construction, Dubois (http://mail.python.org/pipermail/python-dev/2003-January/032060.html) |
| [5] | PEP 302, New Import Hooks, van Rossum and Moore (http://www.python.org/dev/peps/pep-0302) |
| [6] | patch 677103, PYTHONBYTECODEBASE patch (PEP 304), Montanaro (http://www.python.org/sf/677103) |
Copyright
This document has been placed in the public domain.
pep-0305 CSV File API
| PEP: | 305 |
|---|---|
| Title: | CSV File API |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Kevin Altis <altis at semi-retired.com>, Dave Cole <djc at object-craft.com.au>, Andrew McNamara <andrewm at object-craft.com.au>, Skip Montanaro <skip at pobox.com>, Cliff Wells <LogiplexSoftware at earthlink.net> |
| Discussions-To: | <csv at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 26-Jan-2003 |
| Post-History: | 31-Jan-2003, 13-Feb-2003 |
Contents
Abstract
The Comma Separated Values (CSV) file format is the most common import and export format for spreadsheets and databases. Although many CSV files are simple to parse, the format is not formally defined by a stable specification and is subtle enough that parsing lines of a CSV file with something like line.split(",") is eventually bound to fail. This PEP defines an API for reading and writing CSV files. It is accompanied by a corresponding module which implements the API.
To Do (Notes for the Interested and Ambitious)
- Better motivation for the choice of passing a file object to the constructors. See http://mail.python.org/pipermail/csv/2003-January/000179.html
- Unicode. ugh.
Application Domain
This PEP is about doing one thing well: parsing tabular data which may use a variety of field separators, quoting characters, quote escape mechanisms and line endings. The authors intend the proposed module to solve this one parsing problem efficiently. The authors do not intend to address any of these related topics:
- data interpretation (is a field containing the string "10" supposed to be a string, a float or an int? is it a number in base 10, base 16 or base 2? is a number in quotes a number or a string?)
- locale-specific data representation (should the number 1.23 be written as "1.23" or "1,23" or "1 23"?) -- this may eventually be addressed.
- fixed width tabular data - can already be parsed reliably.
Rationale
Often, CSV files are formatted simply enough that you can get by reading them line-by-line and splitting on the commas which delimit the fields. This is especially true if all the data being read is numeric. This approach may work for a while, then come back to bite you in the butt when somebody puts something unexpected in the data like a comma. As you dig into the problem you may eventually come to the conclusion that you can solve the problem using regular expressions. This will work for a while, then break mysteriously one day. The problem grows, so you dig deeper and eventually realize that you need a purpose-built parser for the format.
CSV formats are not well-defined and different implementations have a number of subtle corner cases. It has been suggested that the "V" in the acronym stands for "Vague" instead of "Values". Different delimiters and quoting characters are just the start. Some programs generate whitespace after each delimiter which is not part of the following field. Others quote embedded quoting characters by doubling them, others by prefixing them with an escape character. The list of weird ways to do things can seem endless.
All this variability means it is difficult for programmers to reliably parse CSV files from many sources or generate CSV files designed to be fed to specific external programs without a thorough understanding of those sources and programs. This PEP and the software which accompany it attempt to make the process less fragile.
Existing Modules
This problem has been tackled before. At least three modules currently available in the Python community enable programmers to read and write CSV files:
Each has a different API, making it somewhat difficult for programmers to switch between them. More of a problem may be that they interpret some of the CSV corner cases differently, so even after surmounting the differences between the different module APIs, the programmer has to also deal with semantic differences between the packages.
Module Interface
This PEP supports three basic APIs, one to read and parse CSV files, one to write them, and one to identify different CSV dialects to the readers and writers.
Reading CSV Files
CSV readers are created with the reader factory function:
obj = reader(iterable [, dialect='excel']
[optional keyword args])
A reader object is an iterator which takes an iterable object returning lines as the sole required parameter. If it supports a binary mode (file objects do), the iterable argument to the reader function must have been opened in binary mode. This gives the reader object full control over the interpretation of the file's contents. The optional dialect parameter is discussed below. The reader function also accepts several optional keyword arguments which define specific format settings for the parser (see the section "Formatting Parameters"). Readers are typically used as follows:
csvreader = csv.reader(file("some.csv"))
for row in csvreader:
    process(row)
Each row returned by a reader object is a list of strings or Unicode objects.
When both a dialect parameter and individual formatting parameters are passed to the constructor, first the dialect is queried for formatting parameters, then individual formatting parameters are examined.
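The precedence rule can be modeled as a simple two-step merge (a toy sketch, not the constructor's actual logic; csv.excel from the shipped module is used here only as a convenient dialect object):

```python
import csv

_FORMAT_ATTRS = ("delimiter", "quotechar", "escapechar", "doublequote",
                 "skipinitialspace", "lineterminator", "quoting")

def resolve_format(dialect, **overrides):
    # Dialect values are read first; individually supplied keyword
    # arguments then override them.
    params = {name: getattr(dialect, name) for name in _FORMAT_ATTRS}
    params.update(overrides)
    return params
```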
Writing CSV Files
Creating writers is similar:
obj = writer(fileobj [, dialect='excel'],
[optional keyword args])
A writer object is a wrapper around a file-like object opened for writing in binary mode (if such a distinction is made). It accepts the same optional keyword parameters as the reader constructor.
Writers are typically used as follows:
csvwriter = csv.writer(file("some.csv", "w"))
for row in someiterable:
    csvwriter.writerow(row)
To generate a set of field names as the first row of the CSV file, the programmer must explicitly write it, e.g.:
csvwriter = csv.writer(file("some.csv", "w"), fieldnames=names)
csvwriter.writerow(names)
for row in someiterable:
    csvwriter.writerow(row)
or arrange for it to be the first row in the iterable being written.
Managing Different Dialects
Because CSV is a somewhat ill-defined format, there are plenty of ways one CSV file can differ from another, yet contain exactly the same data. Many tools which can import or export tabular data allow the user to indicate the field delimiter, quote character, line terminator, and other characteristics of the file. These can be fairly easily determined, but are still mildly annoying to figure out, and make for fairly long function calls when specified individually.
To try and minimize the difficulty of figuring out and specifying a bunch of formatting parameters, reader and writer objects support a dialect argument which is just a convenient handle on a group of these lower level parameters. When a dialect is given as a string it identifies one of the dialects known to the module via its registration functions, otherwise it must be an instance of the Dialect class as described below.
Dialects will generally be named after applications or organizations which define specific sets of format constraints. Two dialects are defined in the module as of this writing, "excel", which describes the default format constraints for CSV file export by Excel 97 and Excel 2000, and "excel-tab", which is the same as "excel" but specifies an ASCII TAB character as the field delimiter.
Dialects are implemented as attribute only classes to enable users to construct variant dialects by subclassing. The "excel" dialect is a subclass of Dialect and is defined as follows:
class Dialect:
    # placeholders
    delimiter = None
    quotechar = None
    escapechar = None
    doublequote = None
    skipinitialspace = None
    lineterminator = None
    quoting = None

class excel(Dialect):
    delimiter = ','
    quotechar = '"'
    doublequote = True
    skipinitialspace = False
    lineterminator = '\r\n'
    quoting = QUOTE_MINIMAL
The "excel-tab" dialect is defined as:
class exceltsv(excel):
    delimiter = '\t'
(For a description of the individual formatting parameters see the section "Formatting Parameters".)
To enable string references to specific dialects, the module defines several functions:
dialect = get_dialect(name)
names = list_dialects()
register_dialect(name, dialect)
unregister_dialect(name)
get_dialect() returns the dialect instance associated with the given name. list_dialects() returns a list of all registered dialect names. register_dialect() associates a string name with a dialect class. unregister_dialect() deletes a name/dialect association.
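Semantically the registry is nothing more than a name-to-dialect mapping; a minimal model of the four functions (not the module's actual implementation) is:

```python
# Toy model of the dialect registry described above.
_dialects = {}

def register_dialect(name, dialect):
    _dialects[name] = dialect

def unregister_dialect(name):
    del _dialects[name]

def get_dialect(name):
    return _dialects[name]

def list_dialects():
    return list(_dialects)
```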
Formatting Parameters
In addition to the dialect argument, both the reader and writer constructors take several specific formatting parameters, specified as keyword parameters. The formatting parameters understood are:
- quotechar specifies a one-character string to use as the quoting character. It defaults to '"'. Setting this to None has the same effect as setting quoting to csv.QUOTE_NONE.
- delimiter specifies a one-character string to use as the field separator. It defaults to ','.
- escapechar specifies a one-character string used to escape the delimiter when quotechar is set to None.
- skipinitialspace specifies how to interpret whitespace which immediately follows a delimiter. It defaults to False, which means that whitespace immediately following a delimiter is part of the following field.
- lineterminator specifies the character sequence which should terminate rows.
- quoting controls when quotes should be generated by the writer.
It can take on any of the following module constants:
- csv.QUOTE_MINIMAL means only when required, for example, when a field contains either the quotechar or the delimiter
- csv.QUOTE_ALL means that quotes are always placed around fields.
- csv.QUOTE_NONNUMERIC means that quotes are always placed around nonnumeric fields.
- csv.QUOTE_NONE means that quotes are never placed around fields.
- doublequote controls the handling of quotes inside fields. When True two consecutive quotes are interpreted as one during read, and when writing, each quote is written as two quotes.
When processing a dialect setting and one or more of the other optional parameters, the dialect parameter is processed before the individual formatting parameters. This makes it easy to choose a dialect, then override one or more of the settings without defining a new dialect class. For example, if a CSV file was generated by Excel 2000 using single quotes as the quote character and a colon as the delimiter, you could create a reader like:
csvreader = csv.reader(file("some.csv"), dialect="excel",
quotechar="'", delimiter=':')
Other details of how Excel generates CSV files would be handled automatically because of the reference to the "excel" dialect.
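This override pattern survives unchanged in the csv module as it eventually shipped; a self-contained Python 3 rendering, with io.StringIO standing in for a real file:

```python
import csv
import io

# A colon-delimited, single-quoted variant of the "excel" dialect.
data = io.StringIO("a:'b:c':d\r\n")
reader = csv.reader(data, dialect="excel", quotechar="'", delimiter=":")
rows = list(reader)
assert rows == [["a", "b:c", "d"]]   # the quoted colon is not a delimiter
```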
Reader Objects
Reader objects are iterables whose next() method returns a sequence of strings, one string per field in the row.
Writer Objects
Writer objects have two methods, writerow() and writerows(). The former accepts an iterable (typically a list) of fields which are to be written to the output. The latter accepts a list of iterables and calls writerow() for each.
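A short usage sketch of both methods against an in-memory buffer (the default "excel" dialect terminates rows with '\r\n'):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)              # "excel" dialect by default
writer.writerow(["x", "y"])           # one row
writer.writerows([[1, 2], [3, 4]])    # writerow() applied to each item
assert buf.getvalue() == "x,y\r\n1,2\r\n3,4\r\n"
```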
Implementation
There is a sample implementation available. [1] The goal is for it to efficiently implement the API described in the PEP. It is heavily based on the Object Craft csv module. [2]
Issues
Should a parameter control how consecutive delimiters are interpreted? Our thought is "no". Consecutive delimiters should always denote an empty field.
What about Unicode? Is it sufficient to pass a file object gotten from codecs.open()? For example:
csvreader = csv.reader(codecs.open("some.csv", "r", "cp1252"))
csvwriter = csv.writer(codecs.open("some.csv", "w", "utf-8"))

In the first example, text would be assumed to be encoded as cp1252. Should the system be aggressive in converting to Unicode or should Unicode strings only be returned if necessary?
In the second example, the file will take care of automatically encoding Unicode strings as utf-8 before writing to disk.
Note: As of this writing, the csv module doesn't handle Unicode data.
What about alternate escape conventions? If the dialect in use includes an escapechar parameter which is not None and the quoting parameter is set to QUOTE_NONE, delimiters appearing within fields will be prefixed by the escape character when writing and are expected to be prefixed by the escape character when reading.
Should there be a "fully quoted" mode for writing? What about "fully quoted except for numeric values"? Both are implemented (QUOTE_ALL and QUOTE_NONNUMERIC, respectively).
What about end-of-line? If I generate a CSV file on a Unix system, will Excel properly recognize the LF-only line terminators? Files must be opened for reading or writing as appropriate using binary mode. Specify the lineterminator sequence as '\r\n'. The resulting file will be written correctly.
What about an option to generate dicts from the reader and accept dicts by the writer? See the DictReader and DictWriter classes in csv.py.
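For completeness, the dict-based interface as it ended up in the standard library takes its field names from the first row by default:

```python
import csv
import io

data = io.StringIO("name,age\r\nAlice,30\r\nBob,25\r\n")
rows = list(csv.DictReader(data))     # header row supplies the keys
assert rows[0]["name"] == "Alice"
assert rows[1]["age"] == "25"         # values are returned as strings
```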
Are quote character and delimiters limited to single characters? For the time being, yes.
How should rows of different lengths be handled? Interpretation of the data is the application's job. There is no such thing as a "short row" or a "long row" at this level.
References
| [1] | (1, 2) csv module, Python Sandbox (http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/csv/) |
| [2] | (1, 2) csv module, Object Craft (http://www.object-craft.com.au/projects/csv) |
| [3] | Python-DSV module, Wells (http://sourceforge.net/projects/python-dsv/) |
| [4] | ASV module, Tratt (http://tratt.net/laurie/python/asv/) |
There are many references to other CSV-related projects on the Web. A few are included here.
Copyright
This document has been placed in the public domain.
pep-0306 How to Change Python's Grammar
| PEP: | 306 |
|---|---|
| Title: | How to Change Python's Grammar |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Michael Hudson <mwh at python.net>, Jack Diederich <jackdied at gmail.com>, Nick Coghlan <ncoghlan at gmail.com>, Benjamin Peterson <benjamin at python.org> |
| Status: | Withdrawn |
| Type: | Informational |
| Content-Type: | text/plain |
| Created: | 29-Jan-2003 |
| Post-History: | 30-Jan-2003 |
Note
This PEP has been moved to the Python dev guide.
Abstract
There's more to changing Python's grammar than editing
Grammar/Grammar and Python/compile.c. This PEP aims to be a
checklist of places that must also be fixed.
It is probably incomplete. If you see omissions, just add them if
you can -- you are not going to offend the author's sense of
ownership. Otherwise submit a bug or patch and assign it to mwh.
This PEP is not intended to be an instruction manual on Python
grammar hacking, for several reasons.
Rationale
People are getting this wrong all the time; it took well over a
year before someone noticed[1] that adding the floor division
operator (//) broke the parser module.
Checklist
__ Grammar/Grammar: OK, you'd probably worked this one out :)
__ Parser/Python.asdl may need changes to match the Grammar. Run
make to regenerate Include/Python-ast.h and
Python/Python-ast.c.
__ Python/ast.c will need changes to create the AST objects
involved with the Grammar change. Lib/compiler/ast.py will
need matching changes to the pure-python AST objects.
__ Parser/pgen needs to be rerun to regenerate Include/graminit.h
and Python/graminit.c. (make should handle this for you.)
__ Python/symtable.c: This handles the symbol collection pass
   that happens immediately before the compilation pass.
__ Python/compile.c: You will need to create or modify the
compiler_* functions to generate opcodes for your productions.
__ You may need to regenerate Lib/symbol.py and/or Lib/token.py
and/or Lib/keyword.py.
__ The parser module. Add some of your new syntax to test_parser,
bang on Modules/parsermodule.c until it passes.
__ Add some usage of your new syntax to test_grammar.py
__ The compiler package. A good test is to compile the standard
library and test suite with the compiler package and then check
it runs. Note that this only needs to be done in Python 2.x.
__ If you've gone so far as to change the token structure of
   Python, then the Lib/tokenize.py library module will need to
   be changed.
__ Certain changes may require tweaks to the library module
pyclbr.
__ Documentation must be written!
__ After everything's been checked in, you're likely to see a new
change to Python/Python-ast.c. This is because this
(generated) file contains the SVN version of the source from
which it was generated. There's no way to avoid this; you just
have to submit this file separately.
References
[1] SF Bug #676521, parser module validation failure
http://www.python.org/sf/676521
Copyright
This document has been placed in the public domain.
pep-0307 Extensions to the pickle protocol
| PEP: | 307 |
|---|---|
| Title: | Extensions to the pickle protocol |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Guido van Rossum, Tim Peters |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 31-Jan-2003 |
| Post-History: | 7-Feb-2003 |
Introduction
Pickling new-style objects in Python 2.2 is done somewhat clumsily
and causes pickle size to bloat compared to classic class
instances. This PEP documents a new pickle protocol in Python 2.3
that takes care of this and many other pickle issues.
There are two sides to specifying a new pickle protocol: the byte
stream constituting pickled data must be specified, and the
interface between objects and the pickling and unpickling engines
must be specified. This PEP focuses on API issues, although it
may occasionally touch on byte stream format details to motivate a
choice. The pickle byte stream format is documented formally by
the standard library module pickletools.py (already checked into
CVS for Python 2.3).
This PEP attempts to fully document the interface between pickled
objects and the pickling process, highlighting additions by
specifying "new in this PEP". (The interface to invoke pickling
or unpickling is not covered fully, except for the changes to the
API for specifying the pickling protocol to picklers.)
Motivation
Pickling new-style objects causes serious pickle bloat. For
example,
import pickle

class C(object): # Omit "(object)" for classic class
    pass
x = C()
x.foo = 42
print len(pickle.dumps(x, 1))
The binary pickle for the classic object consumed 33 bytes, and for
the new-style object 86 bytes.
The reasons for the bloat are complex, but are mostly caused by
the fact that new-style objects use __reduce__ in order to be
picklable at all. After ample consideration we've concluded that
the only way to reduce pickle sizes for new-style objects is to
add new opcodes to the pickle protocol. The net result is that
with the new protocol, the pickle size in the above example is 35
(two extra bytes are used at the start to indicate the protocol
version, although this isn't strictly necessary).
Protocol versions
Previously, pickling (but not unpickling) distinguished between
text mode and binary mode. By design, binary mode is a
superset of text mode, and unpicklers don't need to know in
advance whether an incoming pickle uses text mode or binary mode.
The virtual machine used for unpickling is the same regardless of
the mode; certain opcodes simply aren't used in text mode.
Retroactively, text mode is now called protocol 0, and binary mode
protocol 1. The new protocol is called protocol 2. In the
tradition of pickling protocols, protocol 2 is a superset of
protocol 1. But just so that future pickling protocols aren't
required to be supersets of the oldest protocols, a new opcode is
inserted at the start of a protocol 2 pickle indicating that it is
using protocol 2. To date, each release of Python has been able to
read pickles written by all previous releases. Of course pickles
written under protocol N can't be read by versions of Python
earlier than the one that introduced protocol N.
Several functions, methods and constructors used for pickling used
to take a positional argument named 'bin' which was a flag,
defaulting to 0, indicating binary mode. This argument is renamed
to 'protocol' and now gives the protocol number, still defaulting
to 0.
It so happens that passing 2 for the 'bin' argument in previous
Python versions had the same effect as passing 1. Nevertheless, a
special case is added here: passing a negative number selects the
highest protocol version supported by a particular implementation.
This works in previous Python versions, too, and so can be used to
select the highest protocol available in a way that's both backward
and forward compatible. In addition, a new module constant
HIGHEST_PROTOCOL is supplied by both pickle and cPickle, equal to
the highest protocol number the module can read. This is cleaner
than passing -1, but cannot be used before Python 2.3.
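Both spellings can be exercised directly (on modern interpreters the
protocol numbers are higher than 2, but -1 still selects the highest
available):

```python
import pickle

# A negative protocol number means "highest supported".
data = pickle.dumps([1, 2, 3], protocol=-1)
assert pickle.loads(data) == [1, 2, 3]
assert pickle.HIGHEST_PROTOCOL >= 2   # protocol 2 dates from Python 2.3
```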
The pickle.py module has supported passing the 'bin' value as a
keyword argument rather than a positional argument. (This is not
recommended, since cPickle only accepts positional arguments, but
it works...) Passing 'bin' as a keyword argument is deprecated,
and a PendingDeprecationWarning is issued in this case. You have
to invoke the Python interpreter with -Wa or a variation on that
to see PendingDeprecationWarning messages. In Python 2.4, the
warning class may be upgraded to DeprecationWarning.
Security issues
In previous versions of Python, unpickling would do a "safety
check" on certain operations, refusing to call functions or
constructors that weren't marked as "safe for unpickling" by
either having an attribute __safe_for_unpickling__ set to 1, or by
being registered in a global registry, copy_reg.safe_constructors.
This feature gives a false sense of security: nobody has ever done
the necessary, extensive, code audit to prove that unpickling
untrusted pickles cannot invoke unwanted code, and in fact bugs in
the Python 2.2 pickle.py module make it easy to circumvent these
security measures.
We firmly believe that, on the Internet, it is better to know that
you are using an insecure protocol than to trust a protocol to be
secure whose implementation hasn't been thoroughly checked. Even
high quality implementations of widely used protocols are
routinely found flawed; Python's pickle implementation simply
cannot make such guarantees without a much larger time investment.
Therefore, as of Python 2.3, all safety checks on unpickling are
officially removed, and replaced with this warning:
*** Do not unpickle data received from an untrusted or
unauthenticated source ***
The same warning applies to previous Python versions, despite the
presence of safety checks there.
Extended __reduce__ API
There are several APIs that a class can use to control pickling.
Perhaps the most popular of these are __getstate__ and
__setstate__; but the most powerful one is __reduce__. (There's
also __getinitargs__, and we're adding __getnewargs__ below.)
There are several ways to provide __reduce__ functionality: a
class can implement a __reduce__ method or a __reduce_ex__ method
(see next section), or a reduce function can be declared in
copy_reg (copy_reg.dispatch_table maps classes to functions). The
return values are interpreted exactly the same, though, and we'll
refer to these collectively as __reduce__.
IMPORTANT: pickling of classic class instances does not look for a
__reduce__ or __reduce_ex__ method or a reduce function in the
copy_reg dispatch table, so that a classic class cannot provide
__reduce__ functionality in the sense intended here. A classic
class must use __getinitargs__ and/or __getstate__ to customize
pickling. These are described below.
__reduce__ must return either a string or a tuple. If it returns
a string, this is an object whose state is not to be pickled, but
instead a reference to an equivalent object referenced by name.
Surprisingly, the string returned by __reduce__ should be the
object's local name (relative to its module); the pickle module
searches the module namespace to determine the object's module.
The rest of this section is concerned with the tuple returned by
__reduce__.  It is a variable-size tuple, of length 2 through 5.
The first two items (function and arguments) are required.  The
remaining items are optional and may be omitted from the end;
giving None for the value of an optional item acts the same as
omitting it.  The last two items are new in this PEP.  The items
are, in order:
function Required.
A callable object (not necessarily a function) called
to create the initial version of the object; state
may be added to the object later to fully reconstruct
the pickled state. This function must itself be
picklable. See the section about __newobj__ for a
special case (new in this PEP) here.
arguments Required.
A tuple giving the argument list for the function.
As a special case, designed for Zope 2's
ExtensionClass, this may be None; in that case,
function should be a class or type, and
function.__basicnew__() is called to create the
initial version of the object. This exception is
deprecated.
Unpickling invokes function(*arguments) to create an initial object,
called obj below. If the remaining items are left off, that's the end
of unpickling for this object and obj is the result. Else obj is
modified at unpickling time by each item specified, as follows.
state Optional.
Additional state. If this is not None, the state is
pickled, and obj.__setstate__(state) will be called
when unpickling. If no __setstate__ method is
defined, a default implementation is provided, which
assumes that state is a dictionary mapping instance
variable names to their values. The default
implementation calls
obj.__dict__.update(state)
or, if the update() call fails,
for k, v in state.items():
setattr(obj, k, v)
listitems Optional, and new in this PEP.
If this is not None, it should be an iterator (not a
sequence!) yielding successive list items. These list
items will be pickled, and appended to the object using
either obj.append(item) or obj.extend(list_of_items).
This is primarily used for list subclasses, but may
be used by other classes as long as they have append()
and extend() methods with the appropriate signature.
(Whether append() or extend() is used depends on which
pickle protocol version is used as well as the number
of items to append, so both must be supported.)
dictitems Optional, and new in this PEP.
If this is not None, it should be an iterator (not a
sequence!) yielding successive dictionary items, which
should be tuples of the form (key, value). These items
will be pickled, and stored to the object using
obj[key] = value. This is primarily used for dict
subclasses, but may be used by other classes as long
as they implement __setitem__.
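To make the tuple's five items concrete, here is a runnable sketch (the
FancyList class is invented for illustration, not part of the PEP) of a
list subclass whose __reduce__ returns function, arguments, state, and
listitems:

```python
import pickle

class FancyList(list):
    """Invented list subclass carrying one extra attribute."""
    def __init__(self, items=(), label=None):
        super().__init__(items)
        self.label = label

    def __reduce__(self):
        # (function, arguments, state, listitems, dictitems)
        return (FancyList,               # callable creating the initial object
                (),                      # argument tuple for that callable
                {'label': self.label},   # state; default __setstate__ updates __dict__
                iter(list(self)),        # iterator of list items, re-appended on unpickling
                None)                    # no dict items

src = FancyList([1, 2, 3], label='demo')
clone = pickle.loads(pickle.dumps(src, 2))
```

Since no __setstate__ is defined, the default implementation applies the
state dictionary to clone.__dict__, and the listitems iterator refills
the list part via append()/extend().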
Note: in Python 2.2 and before, when using cPickle, state would be
pickled if present even if it is None; the only safe way to avoid
the __setstate__ call was to return a two-tuple from __reduce__.
(But pickle.py would not pickle state if it was None.) In Python
2.3, __setstate__ will never be called at unpickling time when
__reduce__ returns a state with value None at pickling time.
A __reduce__ implementation that needs to work both under Python
2.2 and under Python 2.3 could check the variable
pickle.format_version to determine whether to use the listitems
and dictitems features. If this value is >= "2.0" then they are
supported. If not, any list or dict items should be incorporated
somehow in the 'state' return value, and the __setstate__ method
should be prepared to accept list or dict items as part of the
state (how this is done is up to the application).
The __reduce_ex__ API
It is sometimes useful to know the protocol version when
implementing __reduce__. This can be done by implementing a
method named __reduce_ex__ instead of __reduce__. __reduce_ex__,
when it exists, is called in preference over __reduce__ (you may
still provide __reduce__ for backwards compatibility). The
__reduce_ex__ method will be called with a single integer
argument, the protocol version.
The 'object' class implements both __reduce__ and __reduce_ex__;
however, if a subclass overrides __reduce__ but not __reduce_ex__,
the __reduce_ex__ implementation detects this and calls
__reduce__.
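A runnable sketch (the Probe class is invented) showing that
__reduce_ex__ receives the protocol version as its single argument and
takes precedence over __reduce__:

```python
import pickle

class Probe:
    """Invented class that records which protocol the pickler used."""
    seen = None

    def __init__(self, value):
        self.value = value

    def __reduce__(self):
        # kept for backwards compatibility; not called when
        # __reduce_ex__ exists
        return (Probe, (self.value,))

    def __reduce_ex__(self, protocol):
        Probe.seen = protocol            # the single integer argument
        return (Probe, (self.value,))

obj = Probe(42)
restored = pickle.loads(pickle.dumps(obj, 2))
```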
Customizing pickling absent a __reduce__ implementation
If no __reduce__ implementation is available for a particular
class, there are three cases that need to be considered
separately, because they are handled differently:
1. classic class instances, all protocols
2. new-style class instances, protocols 0 and 1
3. new-style class instances, protocol 2
Types implemented in C are considered new-style classes. However,
except for the common built-in types, these need to provide a
__reduce__ implementation in order to be picklable with protocols
0 or 1. Protocol 2 supports built-in types providing
__getnewargs__, __getstate__ and __setstate__ as well.
Case 1: pickling classic class instances
This case is the same for all protocols, and is unchanged from
Python 2.1.
For classic classes, __reduce__ is not used. Instead, classic
classes can customize their pickling by providing methods named
__getstate__, __setstate__ and __getinitargs__. Absent these, a
default pickling strategy for classic class instances is
implemented that works as long as all instance variables are
picklable. This default strategy is documented in terms of
default implementations of __getstate__ and __setstate__.
The primary way to customize pickling of classic class instances
is by specifying __getstate__ and/or __setstate__ methods.  It is
fine if a class implements one of these but not the other, as long
as it is compatible with the default version.
The __getstate__ method
The __getstate__ method should return a picklable value
representing the object's state without referencing the object
itself. If no __getstate__ method exists, a default
implementation is used that returns self.__dict__.
The __setstate__ method
The __setstate__ method should take one argument; it will be
called with the value returned by __getstate__ (or its default
implementation).
If no __setstate__ method exists, a default implementation is
provided that assumes the state is a dictionary mapping instance
variable names to values. The default implementation tries two
things:
- First, it tries to call self.__dict__.update(state).
- If the update() call fails with a RuntimeError exception, it
calls setattr(self, key, value) for each (key, value) pair in
the state dictionary. This only happens when unpickling in
restricted execution mode (see the rexec standard library
module).
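Classic classes exist only in Python 2, but the __getstate__/__setstate__
pair behaves the same way on new-style classes, so the pattern can be
sketched in runnable form (the Counter class is invented); it drops a
transient cache from the pickled state and rebuilds it on unpickling:

```python
import pickle

class Counter:
    """Invented example: excludes transient state from the pickle."""
    def __init__(self):
        self.count = 0
        self._cache = {}            # transient; not worth pickling

    def __getstate__(self):
        state = self.__dict__.copy()
        del state['_cache']         # exclude the transient attribute
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._cache = {}            # rebuild transient state

c = Counter()
c.count = 5
c2 = pickle.loads(pickle.dumps(c))
```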
The __getinitargs__ method
The __setstate__ method (or its default implementation) requires
that a new object already exists so that its __setstate__ method
can be called. The point is to create a new object that isn't
fully initialized; in particular, the class's __init__ method
should not be called if possible.
These are the possibilities:
- Normally, the following trick is used: create an instance of a
trivial classic class (one without any methods or instance
variables) and then use __class__ assignment to change its
class to the desired class. This creates an instance of the
desired class with an empty __dict__ whose __init__ has not
been called.
- However, if the class has a method named __getinitargs__, the
above trick is not used, and a class instance is created by
using the tuple returned by __getinitargs__ as an argument
list to the class constructor. This is done even if
__getinitargs__ returns an empty tuple -- a __getinitargs__
method that returns () is not equivalent to not having
__getinitargs__ at all. __getinitargs__ *must* return a
tuple.
- In restricted execution mode, the trick from the first bullet
doesn't work; in this case, the class constructor is called
with an empty argument list if no __getinitargs__ method
exists. This means that in order for a classic class to be
unpicklable in restricted execution mode, it must either
implement __getinitargs__ or its constructor (i.e., its
__init__ method) must be callable without arguments.
Case 2: pickling new-style class instances using protocols 0 or 1
This case is unchanged from Python 2.2. For better pickling of
new-style class instances when backwards compatibility is not an
issue, protocol 2 should be used; see case 3 below.
New-style classes, whether implemented in C or in Python, inherit
a default __reduce__ implementation from the universal base class
'object'.
This default __reduce__ implementation is not used for those
built-in types for which the pickle module has built-in support.
Here's a full list of those types:
- Concrete built-in types: NoneType, bool, int, float, complex,
str, unicode, tuple, list, dict. (Complex is supported by
virtue of a __reduce__ implementation registered in copy_reg.)
In Jython, PyStringMap is also included in this list.
- Classic instances.
- Classic class objects, Python function objects, built-in
function and method objects, and new-style type objects (==
new-style class objects). These are pickled by name, not by
value: at unpickling time, a reference to an object with the
same name (the fully qualified module name plus the variable
name in that module) is substituted.
The default __reduce__ implementation will fail at pickling time
for built-in types not mentioned above, and for new-style classes
implemented in C: if they want to be picklable, they must supply
a custom __reduce__ implementation under protocols 0 and 1.
For new-style classes implemented in Python, the default
__reduce__ implementation (copy_reg._reduce) works as follows:
Let D be the class of the object to be pickled.  Find the
nearest base class of D that is implemented in C (either as a
built-in type or as a type defined by an extension class), and
call this base class B.
Unless B is the class 'object', instances of class B must be
picklable, either by having built-in support (as defined in the
above three bullet points), or by having a non-default
__reduce__ implementation. B must not be the same class as D
(if it were, it would mean that D is not implemented in Python).
The callable produced by the default __reduce__ is
copy_reg._reconstructor, and its arguments tuple is
(D, B, basestate), where basestate is None if B is the builtin
object class, and basestate is
basestate = B(obj)
if B is not the builtin object class. This is geared toward
pickling subclasses of builtin types, where, for example,
list(some_list_subclass_instance) produces "the list part" of
the list subclass instance.
The object is recreated at unpickling time by
copy_reg._reconstructor, like so:
obj = B.__new__(D, basestate)
B.__init__(obj, basestate)
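On modern Python the same helper survives under the private name
copyreg._reconstructor (private, so this is illustration only, assuming
the Python 3 spelling of the module); the reconstruction of a builtin
subclass can be observed directly:

```python
import copyreg

class MyList(list):
    """Invented subclass of a builtin, with no custom pickling."""
    pass

# equivalent to: obj = list.__new__(MyList, [1, 2, 3]);
#                list.__init__(obj, [1, 2, 3])
obj = copyreg._reconstructor(MyList, list, [1, 2, 3])
```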
Objects using the default __reduce__ implementation can customize
it by defining __getstate__ and/or __setstate__ methods. These
work almost the same as described for classic classes above, except
that if __getstate__ returns an object (of any type) whose value is
considered false (e.g. None, or a number that is zero, or an empty
sequence or mapping), this state is not pickled and __setstate__
will not be called at all. If __getstate__ exists and returns a
true value, that value becomes the third element of the tuple
returned by the default __reduce__, and at unpickling time the
value is passed to __setstate__. If __getstate__ does not exist,
but obj.__dict__ exists, then obj.__dict__ becomes the third
element of the tuple returned by __reduce__, and again at
unpickling time the value is passed to obj.__setstate__. The
default __setstate__ is the same as that for classic classes,
described above.
Note that this strategy ignores slots. Instances of new-style
classes that have slots but no __getstate__ method cannot be
pickled by protocols 0 and 1; the code explicitly checks for
this condition.
Note that pickling new-style class instances ignores __getinitargs__
if it exists (and under all protocols). __getinitargs__ is
useful only for classic classes.
Case 3: pickling new-style class instances using protocol 2
Under protocol 2, the default __reduce__ implementation inherited
from the 'object' base class is *ignored*. Instead, a different
default implementation is used, which allows more efficient
pickling of new-style class instances than possible with protocols
0 or 1, at the cost of backward incompatibility with Python 2.2
(meaning no more than that a protocol 2 pickle cannot be unpickled
before Python 2.3).
The customization uses three special methods: __getstate__,
__setstate__ and __getnewargs__ (note that __getinitargs__ is again
ignored). It is fine if a class implements one or more but not all
of these, as long as it is compatible with the default
implementations.
The __getstate__ method
The __getstate__ method should return a picklable value
representing the object's state without referencing the object
itself. If no __getstate__ method exists, a default
implementation is used which is described below.
There's a subtle difference between classic and new-style
classes here: if a classic class's __getstate__ returns None,
self.__setstate__(None) will be called as part of unpickling.
But if a new-style class's __getstate__ returns None, its
__setstate__ won't be called at all as part of unpickling.
If no __getstate__ method exists, a default state is computed.
There are several cases:
- For a new-style class that has no instance __dict__ and no
__slots__, the default state is None.
- For a new-style class that has an instance __dict__ and no
__slots__, the default state is self.__dict__.
- For a new-style class that has an instance __dict__ and
__slots__, the default state is a tuple consisting of two
dictionaries: self.__dict__, and a dictionary mapping slot
names to slot values. Only slots that have a value are
included in the latter.
- For a new-style class that has __slots__ and no instance
__dict__, the default state is a tuple whose first item is
None and whose second item is a dictionary mapping slot names
to slot values described in the previous bullet.
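The two-dictionary default state can be exercised by round-tripping a
class that has both __slots__ and an instance __dict__ (the Mixed class
is invented; this is a sketch run on modern Python, where protocol 2
default state has the shape described above):

```python
import pickle

class Mixed:
    """Invented class with one real slot plus a normal __dict__."""
    __slots__ = ('x', '__dict__')

m = Mixed()
m.x = 1        # stored in the slot
m.y = 2        # stored in the instance __dict__
m2 = pickle.loads(pickle.dumps(m, 2))
```

The default state here is the two-tuple (self.__dict__, slot-dict), and
the default __setstate__ restores both halves.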
The __setstate__ method
The __setstate__ method should take one argument; it will be
called with the value returned by __getstate__ or with the
default state described above if no __getstate__ method is
defined.
If no __setstate__ method exists, a default implementation is
provided that can handle the state returned by the default
__getstate__, described above.
The __getnewargs__ method
Like for classic classes, the __setstate__ method (or its
default implementation) requires that a new object already
exists so that its __setstate__ method can be called.
In protocol 2, a new pickling opcode is used that causes a new
object to be created as follows:
obj = C.__new__(C, *args)
where C is the class of the pickled object, and args is either
the empty tuple, or the tuple returned by the __getnewargs__
method, if defined. __getnewargs__ must return a tuple. The
absence of a __getnewargs__ method is equivalent to the existence
of one that returns ().
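A runnable sketch (the Point class is invented) of an immutable tuple
subclass whose __new__ takes mandatory arguments; __getnewargs__
supplies them at unpickling time:

```python
import pickle

class Point(tuple):
    """Invented immutable point; __new__ requires x and y."""
    def __new__(cls, x, y):
        return super().__new__(cls, (x, y))

    def __getnewargs__(self):
        # the tuple passed to C.__new__(C, *args) at unpickling time
        return (self[0], self[1])

p = Point(3, 4)
q = pickle.loads(pickle.dumps(p, 2))
```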
The __newobj__ unpickling function
When the unpickling function returned by __reduce__ (the first
item of the returned tuple) has the name __newobj__, something
special happens for pickle protocol 2. An unpickling function
named __newobj__ is assumed to have the following semantics:
def __newobj__(cls, *args):
return cls.__new__(cls, *args)
Pickle protocol 2 special-cases an unpickling function with this
name, and emits a pickling opcode that, given 'cls' and 'args',
will return cls.__new__(cls, *args) without also pickling a
reference to __newobj__ (this is the same pickling opcode used by
protocol 2 for a new-style class instance when no __reduce__
implementation exists). This is the main reason why protocol 2
pickles are much smaller than classic pickles. Of course, the
pickling code cannot verify that a function named __newobj__
actually has the expected semantics. If you use an unpickling
function named __newobj__ that returns something different, you
deserve what you get.
It is safe to use this feature under Python 2.2; there's nothing
in the recommended implementation of __newobj__ that depends on
Python 2.3.
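The copy_reg module (spelled copyreg on modern Python) defines such a
function, so a __reduce__ implementation can opt into the compact opcode
explicitly (the Leaf class is invented for illustration):

```python
import copyreg
import pickle

class Leaf:
    """Invented class opting into the __newobj__ opcode."""
    def __init__(self, v):
        self.v = v

    def __reduce__(self):
        # protocol 2 recognizes the *name* __newobj__ and emits the
        # compact opcode instead of pickling the function by reference
        return (copyreg.__newobj__, (Leaf,), {'v': self.v})

x = Leaf(5)
y = pickle.loads(pickle.dumps(x, 2))
```

Unpickling calls Leaf.__new__(Leaf) and then applies the state
dictionary via the default __setstate__ behavior.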
The extension registry
Protocol 2 supports a new mechanism to reduce the size of pickles.
When class instances (classic or new-style) are pickled, the full
name of the class (module name including package name, and class
name) is included in the pickle. Especially for applications that
generate many small pickles, this is a lot of overhead that has to
be repeated in each pickle. For large pickles, when using
protocol 1, repeated references to the same class name are
compressed using the "memo" feature; but each class name must be
spelled in full at least once per pickle, and this causes a lot of
overhead for small pickles.
The extension registry allows one to represent the most frequently
used names by small integers, which are pickled very efficiently:
an extension code in the range 1-255 requires only two bytes
including the opcode, one in the range 256-65535 requires only
three bytes including the opcode.
One of the design goals of the pickle protocol is to make pickles
"context-free": as long as you have installed the modules
containing the classes referenced by a pickle, you can unpickle
it, without needing to import any of those classes ahead of time.
Unbridled use of extension codes could jeopardize this desirable
property of pickles. Therefore, the main use of extension codes
is reserved for a set of codes to be standardized by some
standard-setting body. This being Python, the standard-setting
body is the PSF. From time to time, the PSF will decide on a
table mapping extension codes to class names (or occasionally
names of other global objects; functions are also eligible). This
table will be incorporated in the next Python release(s).
However, for some applications, like Zope, context-free pickles
are not a requirement, and waiting for the PSF to standardize
some codes may not be practical. Two solutions are offered for
such applications.
First, a few ranges of extension codes are reserved for private
use. Any application can register codes in these ranges.
Two applications exchanging pickles using codes in these ranges
need to have some out-of-band mechanism to agree on the mapping
between extension codes and names.
Second, some large Python projects (e.g. Zope) can be assigned a
range of extension codes outside the "private use" range that they
can assign as they see fit.
The extension registry is defined as a mapping between extension
codes and names.  When an extension code is unpickled, it ends up
producing an object, but this object is obtained by interpreting the
name as a module name followed by a class (or function) name. The
mapping from names to objects is cached. It is quite possible
that certain names cannot be imported; that should not be a
problem as long as no pickle containing a reference to such names
has to be unpickled. (The same issue already exists for direct
references to such names in pickles that use protocols 0 or 1.)
Here is the proposed initial assignment of extension code ranges:
First  Last   Count  Purpose
    0     0       1  Reserved -- will never be used
    1   127     127  Reserved for Python standard library
  128   191      64  Reserved for Zope
  192   239      48  Reserved for 3rd parties
  240   255      16  Reserved for private use (will never be assigned)
  256   MAX     MAX  Reserved for future assignment
MAX stands for 2147483647, or 2**31-1. This is a hard limitation
of the protocol as currently defined.
No specific extension codes have been assigned yet.
Extension registry API
The extension registry is maintained as private global variables
in the copy_reg module. The following three functions are defined
in this module to manipulate the registry:
add_extension(module, name, code)
Register an extension code. The module and name arguments
must be strings; code must be an int in the inclusive range 1
through MAX. This must either register a new (module, name)
pair to a new code, or be a redundant repeat of a previous
call that was not canceled by a remove_extension() call; a
(module, name) pair may not be mapped to more than one code,
nor may a code be mapped to more than one (module, name)
pair. (XXX Aliasing may actually cause a problem for this
requirement; we'll see as we go.)
remove_extension(module, name, code)
Arguments are as for add_extension(). Remove a previously
registered mapping between (module, name) and code.
clear_extension_cache()
The implementation of extension codes may use a cache to speed
up loading objects that are named frequently. This cache can
be emptied (removing references to cached objects) by calling
this method.
Note that the API does not enforce the standard range assignments.
It is up to applications to respect these.
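A sketch of the three functions, using the modern module name copyreg
(copy_reg in Python 2); the code 240 is from the private-use range and
is chosen arbitrarily here:

```python
import copyreg

# register a private-use code (240-255 need no central assignment)
copyreg.add_extension('collections', 'OrderedDict', 240)

# a redundant repeat of the same registration is allowed...
copyreg.add_extension('collections', 'OrderedDict', 240)

# ...but mapping the same code to a different name is rejected
conflict = False
try:
    copyreg.add_extension('collections', 'deque', 240)
except ValueError:
    conflict = True

# undo the registration and drop any cached name-to-object lookups
copyreg.remove_extension('collections', 'OrderedDict', 240)
copyreg.clear_extension_cache()
```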
The copy module
Traditionally, the copy module has supported an extended subset of
the pickling APIs for customizing the copy() and deepcopy()
operations.
In particular, besides checking for a __copy__ or __deepcopy__
method, copy() and deepcopy() have always looked for __reduce__,
and for classic classes, have looked for __getinitargs__,
__getstate__ and __setstate__.
In Python 2.2, the default __reduce__ inherited from 'object' made
copying simple new-style classes possible, but slots and various
other special cases were not covered.
In Python 2.3, several changes are made to the copy module:
- __reduce_ex__ is supported (and always called with 2 as the
protocol version argument).
- The four- and five-argument return values of __reduce__ are
supported.
- Before looking for a __reduce__ method, the
copy_reg.dispatch_table is consulted, just like for pickling.
- When the __reduce__ method is inherited from object, it is
(unconditionally) replaced by a better one that uses the same
APIs as pickle protocol 2: __getnewargs__, __getstate__, and
__setstate__, handling list and dict subclasses, and handling
slots.
As a consequence of the latter change, certain new-style classes
that were copyable under Python 2.2 are not copyable under Python
2.3. (These classes are also not picklable using pickle protocol
2.) A minimal example of such a class:
class C(object):
def __new__(cls, a):
return object.__new__(cls)
The problem only occurs when __new__ is overridden and has at
least one mandatory argument in addition to the class argument.
To fix this, a __getnewargs__ method should be added that returns
the appropriate argument tuple (excluding the class).
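A runnable sketch of that fix applied to the minimal example above (the
__init__ method is added here only so the instance carries some state to
copy):

```python
import copy

class C(object):
    def __new__(cls, a):
        return object.__new__(cls)

    def __init__(self, a):
        self.a = a

    def __getnewargs__(self):
        # supply the mandatory __new__ argument (class excluded)
        return (self.a,)

orig = C(1)
dup = copy.copy(orig)
```

Without __getnewargs__, copy.copy() would try C.__new__(C) with no
arguments and fail on the mandatory parameter.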
Pickling Python longs
Pickling and unpickling Python longs takes time quadratic in
the number of digits, in protocols 0 and 1. Under protocol 2,
new opcodes support linear-time pickling and unpickling of longs.
Pickling bools
Protocol 2 introduces new opcodes for pickling True and False
directly. Under protocols 0 and 1, bools are pickled as integers,
using a trick in the representation of the integer in the pickle
so that an unpickler can recognize that a bool was intended. That
trick consumed 4 bytes per bool pickled. The new bool opcodes
consume 1 byte per bool.
Pickling small tuples
Protocol 2 introduces new opcodes for more-compact pickling of
tuples of lengths 1, 2 and 3. Protocol 1 previously introduced
an opcode for more-compact pickling of empty tuples.
Protocol identification
Protocol 2 introduces a new opcode, with which all protocol 2
pickles begin, identifying that the pickle is protocol 2.
Attempting to unpickle a protocol 2 pickle under older versions
of Python will therefore raise an "unknown opcode" exception
immediately.
Pickling of large lists and dicts
Protocol 1 pickles large lists and dicts "in one piece", which
minimizes pickle size, but requires that unpickling create a temp
object as large as the object being unpickled. Part of the
protocol 2 changes break large lists and dicts into pieces of no
more than 1000 elements each, so that unpickling needn't create
a temp object larger than needed to hold 1000 elements. This
isn't part of protocol 2, however: the opcodes produced are still
part of protocol 1. __reduce__ implementations that return the
optional new listitems or dictitems iterators also benefit from
this unpickling temp-space optimization.
Copyright
This document has been placed in the public domain.
pep-0308 Conditional Expressions
| PEP: | 308 |
|---|---|
| Title: | Conditional Expressions |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Guido van Rossum, Raymond Hettinger |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 7-Feb-2003 |
| Post-History: | 7-Feb-2003, 11-Feb-2003 |
Adding a conditional expression
On 9/29/2005, Guido decided to add conditional expressions in the
form of "X if C else Y". [1]
The motivating use case was the prevalence of error-prone attempts
to achieve the same effect using "and" and "or". [2]
Previous community efforts to add a conditional expression were
stymied by a lack of consensus on the best syntax. That issue was
resolved by simply deferring to a BDFL best judgment call.
The decision was validated by reviewing how the syntax fared when
applied throughout the standard library (this review approximates a
sampling of real-world use cases, across a variety of applications,
written by a number of programmers with diverse backgrounds). [3]
The following change will be made to the grammar.  (The or_test
symbol is new; the others are modified.)
test: or_test ['if' or_test 'else' test] | lambdef
or_test: and_test ('or' and_test)*
...
testlist_safe: or_test [(',' or_test)+ [',']]
...
gen_for: 'for' exprlist 'in' or_test [gen_iter]
The new syntax nearly introduced a minor syntactical backwards
incompatibility. In previous Python versions, the following is
legal:
[f for f in lambda x: x, lambda x: x**2 if f(1) == 1]
(I.e. a list comprehension where the sequence following 'in' is an
unparenthesized series of lambdas -- or just one lambda, even.)
In Python 3.0, the series of lambdas will have to be
parenthesized, e.g.:
[f for f in (lambda x: x, lambda x: x**2) if f(1) == 1]
This is because lambda binds less tightly than the if-else
expression, but in this context, the lambda could already be
followed by an 'if' keyword that binds less tightly still (for
details, consider the grammar changes shown above).
However, in Python 2.5, a slightly different grammar is used that
is more backwards compatible, but constrains the grammar of a
lambda used in this position by forbidding the lambda's body to
contain an unparenthesized condition expression. Examples:
[f for f in (1, lambda x: x if x >= 0 else -1)] # OK
[f for f in 1, (lambda x: x if x >= 0 else -1)] # OK
[f for f in 1, lambda x: (x if x >= 0 else -1)] # OK
[f for f in 1, lambda x: x if x >= 0 else -1] # INVALID
References
[1] Pronouncement
http://mail.python.org/pipermail/python-dev/2005-September/056846.html
[2] Motivating use case:
http://mail.python.org/pipermail/python-dev/2005-September/056546.html
http://mail.python.org/pipermail/python-dev/2005-September/056510.html
[3] Review in the context of real-world code fragments:
http://mail.python.org/pipermail/python-dev/2005-September/056803.html
Introduction to earlier draft of the PEP (kept for historical purposes)
Requests for an if-then-else ("ternary") expression keep coming up
on comp.lang.python. This PEP contains a concrete proposal of a
fairly Pythonic syntax. This is the community's one chance: if
this PEP is approved with a clear majority, it will be implemented
in Python 2.4. If not, the PEP will be augmented with a summary
of the reasons for rejection and the subject better not come up
again. While the BDFL is co-author of this PEP, he is neither in
favor nor against this proposal; it is up to the community to
decide. If the community can't decide, the BDFL will reject the
PEP.
After unprecedented community response (very good arguments were
made both pro and con) this PEP has been revised with the help of
Raymond Hettinger. Without going through a complete revision
history, the main changes are a different proposed syntax, an
overview of proposed alternatives, the state of the current
discussion, and a discussion of short-circuit behavior.
Following the discussion, a vote was held. While there was an overall
interest in having some form of if-then-else expressions, no one
format was able to draw majority support. Accordingly, the PEP was
rejected due to the lack of an overwhelming majority for change.
Also, a Python design principle has been to prefer the status quo
whenever there are doubts about which path to take.
Proposal
The proposed syntax is as follows:
(if <condition>: <expression1> else: <expression2>)
Note that the enclosing parentheses are not optional.
The resulting expression is evaluated like this:
- First, <condition> is evaluated.
- If <condition> is true, <expression1> is evaluated and is the
result of the whole thing.
- If <condition> is false, <expression2> is evaluated and is the
result of the whole thing.
A natural extension of this syntax is to allow one or more 'elif'
parts:
(if <cond1>: <expr1> elif <cond2>: <expr2> ... else: <exprN>)
This will be implemented if the proposal is accepted.
The downsides to the proposal are:
* the required parentheses
* confusability with statement syntax
* additional semantic loading of colons
Note that at most one of <expression1> and <expression2> is
evaluated. This is called a "short-circuit expression"; it is
similar to the way the second operand of 'and' / 'or' is only
evaluated if the first operand is true / false.
A common way to emulate an if-then-else expression is:
<condition> and <expression1> or <expression2>
However, this doesn't work the same way: it returns <expression2>
when <expression1> is false! See FAQ 4.16 for alternatives that
work -- however, they are pretty ugly and require much more effort
to understand.
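The pitfall, and the eventually-accepted "X if C else Y" form that
avoids it, can be seen directly:

```python
cond = True
expr1 = ''            # a false value that we actually want returned
expr2 = 'fallback'

# and/or idiom: '' is false, so the 'or' falls through to expr2
broken = cond and expr1 or expr2       # yields 'fallback', not ''

# accepted conditional-expression syntax: returns expr1 as intended
correct = expr1 if cond else expr2     # yields ''
```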
Alternatives
Holger Krekel proposed a new, minimally invasive variant:
<condition> and <expression1> else <expression2>
The concept behind it is that a nearly complete ternary operator
already exists with and/or and this proposal is the least invasive
change that makes it complete.  Many respondents on the
newsgroup found this to be the most pleasing alternative.
However, a couple of respondents were able to post examples
that were mentally difficult to parse. Later it was pointed
out that this construct works by having the "else" change the
existing meaning of "and".
As a result, there is increasing support for Christian Tismer's
proposed variant of the same idea:
<condition> then <expression1> else <expression2>
The advantages are simple visual parsing, no required
parentheses, no change in the semantics of existing keywords, a
lower likelihood than the proposal of being confused with
statement syntax, and no further overloading of the colon.  The
disadvantage is the
implementation costs of introducing a new keyword. However,
unlike other new keywords, the word "then" seems unlikely to
have been used as a name in existing programs.
---
Many C-derived languages use this syntax:
<condition> ? <expression1> : <expression2>
Eric Raymond even implemented this. The BDFL rejected this for
several reasons: the colon already has many uses in Python (even
though it would actually not be ambiguous, because the question
mark requires a matching colon); for people not used to C-derived
languages, it is hard to understand.
---
The original version of this PEP proposed the following syntax:
<expression1> if <condition> else <expression2>
The out-of-order arrangement was found to be too uncomfortable
for many participants in the discussion; especially when
<expression1> is long, it's easy to miss the conditional while
skimming.
---
Some have suggested adding a new builtin instead of extending the
syntax of the language. For example:
cond(<condition>, <expression1>, <expression2>)
This won't work the way a syntax extension will because both
expression1 and expression2 must be evaluated before the function
is called. There's no way to short-circuit the expression
evaluation. It could work if 'cond' (or some other name) were
made a keyword, but that has all the disadvantages of adding a new
keyword, plus confusing syntax: it *looks* like a function call so
a casual reader might expect both <expression1> and <expression2>
to be evaluated.
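The evaluation-order point can be demonstrated directly. The sketch below uses an illustrative cond() written with a plain if/else (not a real builtin) and records which branch expressions actually run:

```python
# Sketch of why a cond() *function* cannot short-circuit: Python evaluates
# both argument expressions before the call is made, regardless of the
# condition. This cond() is illustrative only; no such builtin exists.
def cond(condition, true_value, false_value):
    if condition:
        return true_value
    return false_value

evaluated = []

def branch(name):
    evaluated.append(name)
    return name

result = cond(True, branch('yes'), branch('no'))
assert result == 'yes'
assert evaluated == ['yes', 'no']   # both branches ran despite the True condition
```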
Summary of the Current State of the Discussion
Groups are falling into one of three camps:
1. Adopt a ternary operator built using punctuation characters:
<condition> ? <expression1> : <expression2>
2. Adopt a ternary operator built using new or existing keywords.
The leading examples are:
<condition> then <expression1> else <expression2>
(if <condition>: <expression1> else: <expression2>)
3. Do nothing.
The first two positions are relatively similar.
Some find that any form of punctuation makes the language more
cryptic. Others find that punctuation style is appropriate for
expressions rather than statements and helps avoid a COBOL style:
3 plus 4 times 5.
Adapting existing keywords attempts to improve on punctuation
through explicit meaning and a more tidy appearance. The downside
is some loss of the economy-of-expression provided by punctuation
operators. The other downside is that it creates some degree of
confusion between the two meanings and two usages of the keywords.
Those difficulties are overcome by options which introduce new
keywords which take more effort to implement.
The last position is doing nothing. Arguments in favor include
keeping the language simple and concise; maintaining backwards
compatibility; and that every use case can already be
expressed in terms of "if" and "else". Lambda expressions are an
exception as they require the conditional to be factored out into
a separate function definition.
The arguments against doing nothing are that the other choices
allow greater economy of expression and that current practices
show a propensity for erroneous uses of "and", "or", or one of
their more complex, less visually appealing workarounds.
Short-Circuit Behavior
The principal difference between the ternary operator and the
cond() function is that the latter provides an expression form but
does not provide short-circuit evaluation.
Short-circuit evaluation is desirable on three occasions:
1. When an expression has side-effects
2. When one or both of the expressions are resource intensive
3. When the condition serves as a guard for the validity of the
expression.
# Example where all three reasons apply
    data = isinstance(source, file) ? source.readlines()
                                    : source.split()
1. readlines() moves the file pointer
2. for long sources, both alternatives take time
3. split() is only valid for strings and readlines() is only
valid for file objects.
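For comparison, the same example can be written with the short-circuiting and/or idiom that was the workaround of the day, wrapping each branch in a one-element list so that a false-valued result cannot derail the trick. (The PEP's Python 2 `file` type is replaced here with io.IOBase purely for illustration.)

```python
import io

# The and/or workaround: only the branch selected by the condition is
# evaluated, because `and` and `or` short-circuit.
def read_data(source):
    is_file = isinstance(source, io.IOBase)
    return (is_file and [source.readlines()] or [source.split()])[0]

assert read_data("a b c") == ['a', 'b', 'c']
assert read_data(io.StringIO("x\ny\n")) == ['x\n', 'y\n']
```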
Supporters of a cond() function point out that the need for
short-circuit evaluation is rare. Scanning through existing code
directories, they found that if/else did not occur often; and of
those only a few contained expressions that could be helped by
cond() or a ternary operator; and that most of those had no need
for short-circuit evaluation. Hence, cond() would suffice for
most needs and would spare efforts to alter the syntax of the
language.
More supporting evidence comes from scans of C code bases which
show that its ternary operator is used very rarely (as a
percentage of lines of code).
A counter point to that analysis is that the availability of a
ternary operator helped the programmer in every case because it
spared the need to search for side-effects. Further, it would
preclude errors arising from distant modifications which introduce
side-effects. The latter case has become more of a reality with
the advent of properties where even attribute access can be given
side-effects.
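A hypothetical illustration of that last point: once an attribute becomes a property, merely reading it can have a side-effect, introduced far from the code that does the reading.

```python
# Hypothetical example: a "distant modification" turns plain attribute
# access into something with a side-effect, so short-circuit evaluation
# matters even for innocent-looking branch expressions.
class Source(object):
    reads = 0

    @property
    def data(self):
        Source.reads += 1          # side-effect added by a distant change
        return "payload"

s = Source()
value = s.data                     # looks like plain attribute access
assert value == "payload"
assert Source.reads == 1
```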
The BDFL's position is that short-circuit behavior is essential
for an if-then-else construct to be added to the language.
Detailed Results of Voting
Votes rejecting all options: 82
Votes with rank ordering: 436
---
Total votes received: 518
ACCEPT REJECT TOTAL
--------------------- --------------------- -----
Rank1 Rank2 Rank3 Rank1 Rank2 Rank3
Letter
A 51 33 19 18 20 20 161
B 45 46 21 9 24 23 168
C 94 54 29 20 20 18 235
D 71 40 31 5 28 31 206
E 7 7 10 3 5 32
F 14 19 10 7 17 67
G 7 6 10 1 2 4 30
H 20 22 17 4 10 25 98
I 16 20 9 5 5 20 75
J 6 17 5 1 10 39
K 1 6 4 13 24
L 1 2 3 3 9
M 7 3 4 2 5 11 32
N 2 3 4 2 11
O 1 6 5 1 4 9 26
P 5 3 6 1 5 7 27
Q 18 7 15 6 5 11 62
Z 1 1
--- --- --- --- --- --- ----
Total 363 286 202 73 149 230 1303
RejectAll 82 82 82 246
--- --- --- --- --- --- ----
Total 363 286 202 155 231 312 1549
CHOICE KEY
----------
A. x if C else y
B. if C then x else y
C. (if C: x else: y)
D. C ? x : y
E. C ? x ! y
F. cond(C, x, y)
G. C ?? x || y
H. C then x else y
I. x when C else y
J. C ? x else y
K. C -> x else y
L. C -> (x, y)
M. [x if C else y]
N. ifelse C: x else y
O. <if C then x else y>
P. C and x else y
Q. any write-in vote
Detail for write-in votes and their ranking:
--------------------------------------------
3: Q reject y x C elsethenif
2: Q accept (C ? x ! y)
3: Q reject ...
3: Q accept ? C : x : y
3: Q accept (x if C, y otherwise)
3: Q reject ...
3: Q reject NONE
1: Q accept select : (<c1> : <val1>; [<cx> : <valx>; ]* elseval)
2: Q reject if C: t else: f
3: Q accept C selects x else y
2: Q accept iff(C, x, y) # "if-function"
1: Q accept (y, x)[C]
1: Q accept C true: x false: y
3: Q accept C then: x else: y
3: Q reject
3: Q accept (if C: x elif C2: y else: z)
3: Q accept C -> x : y
1: Q accept x (if C), y
1: Q accept if c: x else: y
3: Q accept (c).{True:1, False:2}
2: Q accept if c: x else: y
3: Q accept (c).{True:1, False:2}
3: Q accept if C: x else y
1: Q accept (x if C else y)
1: Q accept ifelse(C, x, y)
2: Q reject x or y <- C
1: Q accept (C ? x : y) required parens
1: Q accept iif(C, x, y)
1: Q accept ?(C, x, y)
1: Q accept switch-case
2: Q accept multi-line if/else
1: Q accept C: x else: y
2: Q accept (C): x else: y
3: Q accept if C: x else: y
1: Q accept x if C, else y
1: Q reject choice: c1->a; c2->b; ...; z
3: Q accept [if C then x else y]
3: Q reject no other choice has x as the first element
1: Q accept (x,y) ? C
3: Q accept x if C else y (The "else y" being optional)
1: Q accept (C ? x , y)
1: Q accept any outcome (i.e form or plain rejection) from a usability study
1: Q reject (x if C else y)
1: Q accept (x if C else y)
2: Q reject NONE
3: Q reject NONE
3: Q accept (C ? x else y)
3: Q accept x when C else y
2: Q accept (x if C else y)
2: Q accept cond(C1, x1, C2, x2, C3, x3,...)
1: Q accept (if C1: x elif C2: y else: z)
1: Q reject cond(C, :x, :y)
3: Q accept (C and [x] or [y])[0]
2: Q reject
3: Q reject
3: Q reject all else
1: Q reject no-change
3: Q reject deliberately omitted as I have no interest in any other proposal
2: Q reject (C then x else Y)
1: Q accept if C: x else: y
1: Q reject (if C then x else y)
3: Q reject C?(x, y)
Copyright
This document has been placed in the public domain.
pep-0309 Partial Function Application
| PEP: | 309 |
|---|---|
| Title: | Partial Function Application |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Peter Harris <scav at blueyonder.co.uk> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 08-Feb-2003 |
| Python-Version: | 2.5 |
| Post-History: | 10-Feb-2003, 27-Feb-2003, 22-Feb-2004, 28-Apr-2006 |
Contents
Note
Following the acceptance of this PEP, further discussion on python-dev and comp.lang.python revealed a desire for several tools that operated on function objects, but were not related to functional programming. Rather than create a new module for these tools, it was agreed [1] that the "functional" module be renamed to "functools" to reflect its newly-widened focus.
References in this PEP to a "functional" module have been left in for historical reasons.
Abstract
This proposal is for a function or callable class that allows a new callable to be constructed from a callable and a partial argument list (including positional and keyword arguments).
I propose a standard library module called "functional", to hold useful higher-order functions, including the implementation of partial().
An implementation has been submitted to SourceForge [2].
Acceptance
Patch #941881 was accepted and applied in 2005 for Py2.5. It is essentially as outlined here, a partial() type constructor binding leftmost positional arguments and any keywords. The partial object has three read-only attributes func, args, and keywords. Calls to the partial object can specify keywords that override those in the object itself.
There is a separate and continuing discussion of whether to modify the partial implementation with a __get__ method to more closely emulate the behavior of an equivalent function.
Motivation
In functional programming, function currying is a way of implementing multi-argument functions in terms of single-argument functions. A function with N arguments is really a function with 1 argument that returns another function taking (N-1) arguments. Function application in languages like Haskell and ML works such that a function call:
f x y z
actually means:
(((f x) y) z)
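The curried application above can be imitated in Python with nested single-argument functions, so that f(1)(2)(3) plays the role of Haskell's f x y z. This is only an illustrative sketch:

```python
# A hand-curried three-argument function: each call consumes one argument
# and returns a function awaiting the rest.
def f(x):
    def g(y):
        def h(z):
            return x + y + z
        return h
    return g

assert f(1)(2)(3) == 6
partially_applied = f(1)(2)    # a function of one argument, awaiting z
assert partially_applied(3) == 6
```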
This would be only an obscure theoretical issue except that in actual programming it turns out to be very useful. Expressing a function in terms of partial application of arguments to another function can be both elegant and powerful, and in functional languages it is heavily used.
In some functional languages, (e.g. Miranda) you can use an expression such as (+1) to mean the equivalent of Python's (lambda x: x + 1).
In general, languages like that are strongly typed, so the compiler always knows the number of arguments expected and can do the right thing when presented with a functor and less arguments than expected.
Python does not implement multi-argument functions by currying, so if you want a function with partially-applied arguments you would probably use a lambda as above, or define a named function for each instance.
However, lambda syntax is not to everyone's taste, to say the least. Furthermore, Python's flexible parameter passing using both positional and keyword arguments presents an opportunity to generalise the idea of partial application and do things that lambda cannot.
Example Implementation
Here is one way to create a callable with partially-applied arguments in Python. The implementation below is based on improvements provided by Scott David Daniels:
class partial(object):

    def __init__(*args, **kw):
        self = args[0]
        self.fn, self.args, self.kw = (args[1], args[2:], kw)

    def __call__(self, *args, **kw):
        if kw and self.kw:
            d = self.kw.copy()
            d.update(kw)
        else:
            d = kw or self.kw
        return self.fn(*(self.args + args), **d)
(A recipe similar to this has been in the Python Cookbook for some time [3].)
Note that when the object is called as though it were a function, positional arguments are appended to those provided to the constructor, and keyword arguments override and augment those provided to the constructor.
Positional arguments, keyword arguments or both can be supplied when creating the object and when calling it.
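The combination rules can be demonstrated with the implementation that was ultimately accepted into the standard library (under the functools name, per the Note above):

```python
from functools import partial   # the "functional" module became "functools"

# Positionals given at call time are appended after those stored in the
# partial object; keywords given at call time override the stored ones.
def show(a, b, c='default-c', d='default-d'):
    return (a, b, c, d)

f = partial(show, 1, c='from-partial')
assert f(2) == (1, 2, 'from-partial', 'default-d')            # positional appended
assert f(2, c='override') == (1, 2, 'override', 'default-d')  # call-site keyword wins
```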
Examples of Use
So partial(operator.add, 1) is a bit like (lambda x: 1 + x). Not an example where you see the benefits, of course.
Note too, that you could wrap a class in the same way, since classes themselves are callable factories for objects. So in some cases, rather than defining a subclass, you can specialise classes by partial application of the arguments to the constructor.
For example, partial(Tkinter.Label, fg='blue') makes Tkinter Labels that have a blue foreground by default.
Here's a simple example that uses partial application to construct callbacks for Tkinter widgets on the fly:
from Tkinter import Tk, Canvas, Button
import sys
from functional import partial

win = Tk()
c = Canvas(win, width=200, height=50)
c.pack()

for colour in sys.argv[1:]:
    b = Button(win, text=colour,
               command=partial(c.config, bg=colour))
    b.pack(side='left')

win.mainloop()
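One practical reason to prefer partial for callbacks built in a loop (not spelled out in the PEP itself, but a real pitfall the pattern above avoids): a lambda closes over the loop variable and sees only its final value, while partial binds the current value at creation time. A Tkinter-free sketch:

```python
from functools import partial

# Late binding: every lambda shares the same `colour` variable and sees
# its final value; each partial captures the value it was built with.
colours = ['red', 'green', 'blue']
lambdas = [lambda: colour.upper() for colour in colours]
partials = [partial(str.upper, colour) for colour in colours]

assert [fn() for fn in lambdas] == ['BLUE', 'BLUE', 'BLUE']
assert [fn() for fn in partials] == ['RED', 'GREEN', 'BLUE']
```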
Abandoned Syntax Proposal
I originally suggested the syntax fn@(*args, **kw), meaning the same as partial(fn, *args, **kw).
The @ sign is used in some assembly languages to imply register indirection, and the use here is also a kind of indirection. f@(x) is not f(x), but a thing that becomes f(x) when you call it.
It was not well-received, so I have withdrawn this part of the proposal. In any case, @ has been taken for the new decorator syntax.
Feedback from comp.lang.python and python-dev
Among the opinions voiced were the following (which I summarise):
- Lambda is good enough.
- The @ syntax is ugly (unanimous).
- It's really a curry rather than a closure. There is an almost identical implementation of a curry class on ActiveState's Python Cookbook.
- A curry class would indeed be a useful addition to the standard library.
- It isn't function currying, but partial application. Hence the name is now proposed to be partial().
- It maybe isn't useful enough to be in the built-ins.
- The idea of a module called functional was well received, and there are other things that belong there (for example function composition).
- For completeness, another object that appends partial arguments after those supplied in the function call (maybe called rightcurry) has been suggested.
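The last suggestion might look like the sketch below. The name rightcurry and its behaviour are hypothetical; nothing like this was adopted.

```python
import operator

# A sketch of the suggested "rightcurry": like partial, but appending the
# stored positional arguments *after* those supplied at call time.
class rightcurry(object):
    def __init__(self, fn, *args, **kw):
        self.fn, self.args, self.kw = fn, args, kw

    def __call__(self, *args, **kw):
        d = self.kw.copy()
        d.update(kw)
        return self.fn(*(args + self.args), **d)

halve = rightcurry(operator.truediv, 2.0)   # binds the *second* argument
assert halve(10) == 5.0
```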
I agree that lambda is usually good enough, just not always. And I want the possibility of useful introspection and subclassing.
I disagree that @ is particularly ugly, but it may be that I'm just weird. We have dictionary, list and tuple literals neatly differentiated by special punctuation -- a way of directly expressing partially-applied function literals is not such a stretch. However, not one single person has said they like it, so as far as I'm concerned it's a dead parrot.
I concur with calling the class partial rather than curry or closure, so I have amended the proposal in this PEP accordingly. But not throughout: some incorrect references to 'curry' have been left in since that's where the discussion was at the time.
Partially applying arguments from the right, or inserting arguments at arbitrary positions creates its own problems, but pending discovery of a good implementation and non-confusing semantics, I don't think it should be ruled out.
Carl Banks posted an implementation as a real functional closure:
def curry(fn, *cargs, **ckwargs):
    def call_fn(*fargs, **fkwargs):
        d = ckwargs.copy()
        d.update(fkwargs)
        return fn(*(cargs + fargs), **d)
    return call_fn
which he assures me is more efficient.
I also coded the class in Pyrex, to estimate how the performance might be improved by coding it in C:
cdef class curry:

    cdef object fn, args, kw

    def __init__(self, fn, *args, **kw):
        self.fn = fn
        self.args = args
        self.kw = kw

    def __call__(self, *args, **kw):
        if self.kw:          # from Python Cookbook version
            d = self.kw.copy()
            d.update(kw)
        else:
            d = kw
        return self.fn(*(self.args + args), **d)
The performance gain in Pyrex is less than 100% over the nested function implementation, since to be fully general it has to operate by Python API calls. For the same reason, a C implementation will be unlikely to be much faster, so the case for a built-in coded in C is not very strong.
Summary
I prefer that some means to partially-apply functions and other callables should be present in the standard library.
A standard library module functional should contain an implementation of partial, and any other higher-order functions the community wants. Other functions that might belong there fall outside the scope of this PEP though.
Patches for the implementation, documentation and unit tests (SF patches 931005 [4], 931007 [5], and 931010 [6] respectively) have been submitted but not yet checked in.
A C implementation by Hye-Shik Chang has also been submitted, although it is not expected to be included until after the Python implementation has proven itself useful enough to be worth optimising.
References
| [1] | http://mail.python.org/pipermail/python-dev/2006-March/062290.html |
| [2] | Patches 931005 [4], 931007 [5], and 931010 [6]. |
| [3] | http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52549 |
| [4] | (1, 2) http://www.python.org/sf/931005 |
| [5] | (1, 2) http://www.python.org/sf/931007 |
| [6] | (1, 2) http://www.python.org/sf/931010 |
Copyright
This document has been placed in the public domain.
pep-0310 Reliable Acquisition/Release Pairs
| PEP: | 310 |
|---|---|
| Title: | Reliable Acquisition/Release Pairs |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Michael Hudson <mwh at python.net>, Paul Moore <p.f.moore at gmail.com> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 18-Dec-2002 |
| Python-Version: | 2.4 |
| Post-History: |
Abstract
It would be nice to have a less typing-intensive way of writing:
    the_lock.acquire()
    try:
        ....
    finally:
        the_lock.release()
This PEP proposes a piece of syntax (a 'with' block) and a
"small-i" interface that generalizes the above.
Pronouncement
This PEP is rejected in favor of PEP 343.
Rationale
One of the advantages of Python's exception handling philosophy is
that it makes it harder to do the "wrong" thing (e.g. failing to
check the return value of some system call). Currently, this does
not apply to resource cleanup. The current syntax for acquisition
and release of a resource (for example, a lock) is
    the_lock.acquire()
    try:
        ....
    finally:
        the_lock.release()
This syntax separates the acquisition and release by a (possibly
large) block of code, which makes it difficult to confirm "at a
glance" that the code manages the resource correctly. Another
common error is to code the "acquire" call within the try block,
which incorrectly releases the lock if the acquire fails.
Basic Syntax and Semantics
The syntax of a 'with' statement is as follows::
    'with' [ var '=' ] expr ':'
        suite
This statement is defined as being equivalent to the following
sequence of statements:
    var = expr
    if hasattr(var, "__enter__"):
        var.__enter__()
    try:
        suite
    finally:
        var.__exit__()
(The presence of an __exit__ method is *not* checked like that of
__enter__ to ensure that using inappropriate objects in with:
statements gives an error).
If the variable is omitted, an unnamed object is allocated on the
stack. In that case, the suite has no access to the unnamed object.
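A lock-like object implementing the "small-i" interface described above might look like the following sketch. Note that this is PEP 310's protocol (optional __enter__, no-argument __exit__), not the PEP 343 protocol Python eventually adopted, so the equivalent sequence is spelled out by hand rather than written as a real 'with' statement:

```python
# A toy lock exposing PEP 310's proposed protocol. In this proposal,
# __exit__ takes no exception information, so `__exit__ = release` works.
class Lock(object):
    def __init__(self):
        self.held = False

    def acquire(self):
        self.held = True

    def release(self):
        self.held = False

    __enter__ = acquire
    __exit__ = release

the_lock = Lock()

# The PEP's equivalent sequence, expanded by hand:
var = the_lock
if hasattr(var, "__enter__"):
    var.__enter__()
try:
    assert var.held        # the suite runs with the lock held
finally:
    var.__exit__()
assert not the_lock.held   # released even if the suite had raised
```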
Possible Extensions
A number of potential extensions to the basic syntax have been
discussed on the Python Developers list. None of these extensions
are included in the solution proposed by this PEP. In many cases,
the arguments are nearly equally strong in both directions. In
such cases, the PEP has always chosen simplicity, simply because
where extra power is needed, the existing try block is available.
Multiple expressions
One proposal was for allowing multiple expressions within one
'with' statement. The __enter__ methods would be called left to
right, and the __exit__ methods right to left. The advantage of
doing so is that where more than one resource is being managed,
nested 'with' statements will result in code drifting towards the
right margin. The solution to this problem is the same as for any
other deep nesting - factor out some of the code into a separate
function. Furthermore, the question of what happens if one of the
__exit__ methods raises an exception (should the other __exit__
methods be called?) needs to be addressed.
Exception handling
An extension to the protocol to include an optional __except__
handler, which is called when an exception is raised, and which
can handle or re-raise the exception, has been suggested. It is
not at all clear that the semantics of this extension can be made
precise and understandable. For example, should the equivalent
code be try ... except ... else if an exception handler is
defined, and try ... finally if not? How can this be determined
at compile time, in general? The alternative is to define the
code as expanding to a try ... except inside a try ... finally.
But this may not do the right thing in real life.
The only use case identified for exception handling is with
transactional processing (commit on a clean finish, and rollback
on an exception). This is probably just as easy to handle with a
conventional try ... except ... else block, and so the PEP does
not include any support for exception handlers.
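The conventional block the PEP refers to might look like this. Connection here is a hypothetical stand-in for any object with commit() and rollback() methods:

```python
# Commit on a clean finish, roll back on an exception, using an ordinary
# try ... except ... else block rather than an __except__ extension.
class Connection(object):
    def __init__(self):
        self.log = []

    def commit(self):
        self.log.append('commit')

    def rollback(self):
        self.log.append('rollback')

def run(conn, work):
    try:
        work()
    except Exception:
        conn.rollback()
        raise
    else:
        conn.commit()

conn = Connection()
run(conn, lambda: None)
assert conn.log == ['commit']
try:
    run(conn, lambda: 1 / 0)
except ZeroDivisionError:
    pass
assert conn.log == ['commit', 'rollback']
```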
Implementation Notes
There is a potential race condition in the code specified as
equivalent to the with statement. For example, if a
KeyboardInterrupt exception is raised between the completion of
the __enter__ method call and the start of the try block, the
__exit__ method will not be called. This can lead to resource
leaks, or to deadlocks. [XXX Guido has stated that he cares about
this sort of race condition, and intends to write some C magic to
handle them. The implementation of the 'with' statement should
copy this.]
Open Issues
Should existing classes (for example, file-like objects and locks)
gain appropriate __enter__ and __exit__ methods? The obvious
reason in favour is convenience (no adapter needed). The argument
against is that if built-in files have this but (say) StringIO
does not, then code that uses "with" on a file object can't be
reused with a StringIO object. So __exit__ = close becomes a part
of the "file-like object" protocol, which user-defined classes may
need to support.
The __enter__ hook may be unnecessary - for many use cases, an
adapter class is needed and in that case, the work done by the
__enter__ hook can just as easily be done in the __init__ hook.
If a way of controlling object lifetimes explicitly was available,
the function of the __exit__ hook could be taken over by the
existing __del__ hook. An email exchange[1] with a proponent of
this approach left one of the authors even more convinced that
it isn't the right idea...
It has been suggested[2] that the "__exit__" method be called
"close", or that a "close" method should be considered if no
__exit__ method is found, to increase the "out-of-the-box utility"
of the "with ..." construct.
There are some similarities in concept between 'with ...' blocks
and generators, which have led to proposals that for loops could
implement the with block functionality[3]. While neat on some
levels, we think that for loops should stick to being loops.
Alternative Ideas
IEXEC: Holger Krekel -- generalised approach with XML-like syntax
(no URL found...)
Holger has much more far-reaching ideas about "execution monitors"
that are informed about details of control flow in the monitored
block. While interesting, these ideas could change the language
in deep and subtle ways and as such belong to a different PEP.
Any Smalltalk/Ruby anonymous block style extension obviously
subsumes this one.
PEP 319 is in the same area, but did not win support when aired on
python-dev.
Backwards Compatibility
This PEP proposes a new keyword, so the __future__ game will need
to be played.
Cost of Adoption
Those who claim the language is getting larger and more
complicated have something else to complain about. It's something
else to teach.
For the proposal to be useful, many file-like and lock-like
classes in the standard library and other code will have to have
__exit__ = close
or similar added.
Cost of Non-Adoption
Writing correct code continues to be more effort than writing
incorrect code.
References
There are various python-list and python-dev discussions that
could be mentioned here.
[1] Off-list conversation between Michael Hudson and Bill Soudan
(made public with permission)
http://starship.python.net/crew/mwh/pep310/
[2] Samuele Pedroni on python-dev
http://mail.python.org/pipermail/python-dev/2003-August/037795.html
[3] Thread on python-dev with subject
[Python-Dev] pre-PEP: Resource-Release Support for Generators
starting at
http://mail.python.org/pipermail/python-dev/2003-August/037803.html
Copyright
This document has been placed in the public domain.
pep-0311 Simplified Global Interpreter Lock Acquisition for Extensions
| PEP: | 311 |
|---|---|
| Title: | Simplified Global Interpreter Lock Acquisition for Extensions |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Mark Hammond <mhammond at skippinet.com.au> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 05-Feb-2003 |
| Post-History: | 05-Feb-2003 14-Feb-2003 19-Apr-2003 |
Abstract
This PEP proposes a simplified API for access to the Global
Interpreter Lock (GIL) for Python extension modules.
Specifically, it provides a solution for authors of complex
multi-threaded extensions, where the current state of Python
(i.e., the state of the GIL) is unknown.
This PEP proposes a new API, for platforms built with threading
support, to manage the Python thread state. An implementation
strategy is proposed, along with an initial, platform independent
implementation.
Rationale
The current Python interpreter state API is suitable for simple,
single-threaded extensions, but quickly becomes incredibly complex
for non-trivial, multi-threaded extensions.
Currently Python provides two mechanisms for dealing with the GIL:
- Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS macros.
These macros are provided primarily to allow a simple Python
extension that already owns the GIL to temporarily release it
while making an "external" (ie, non-Python), generally
expensive, call. Any existing Python threads that are blocked
waiting for the GIL are then free to run. While this is fine
for extensions making calls from Python into the outside world,
it is no help for extensions that need to make calls into Python
when the thread state is unknown.
- PyThreadState and PyInterpreterState APIs.
These API functions allow an extension/embedded application to
acquire the GIL, but suffer from a serious boot-strapping
problem - they require you to know the state of the Python
interpreter and of the GIL before they can be used. One
particular problem is for extension authors that need to deal
with threads never before seen by Python, but need to call
Python from this thread. It is very difficult, delicate and
error prone to author an extension where these "new" threads
always know the exact state of the GIL, and therefore can
reliably interact with this API.
For these reasons, the question of how such extensions should
interact with Python is quickly becoming a FAQ. The main impetus
for this PEP, a thread on python-dev [1], immediately identified
the following projects with this exact issue:
- The win32all extensions
- Boost
- ctypes
- Python-GTK bindings
- Uno
- PyObjC
- Mac toolbox
- PyXPCOM
Currently, there is no reasonable, portable solution to this
problem, forcing each extension author to implement their own
hand-rolled version. Further, the problem is complex, meaning
many implementations are likely to be incorrect, leading to a
variety of problems that will often manifest simply as "Python has
hung".
While the biggest problem in the existing thread-state API is the
lack of the ability to query the current state of the lock, it is
felt that a more complete, simplified solution should be offered
to extension authors. Such a solution should encourage authors to
provide error-free, complex extension modules that take full
advantage of Python's threading mechanisms.
Limitations and Exclusions
This proposal identifies a solution for extension authors with
complex multi-threaded requirements, but that only require a
single "PyInterpreterState". There is no attempt to cater for
extensions that require multiple interpreter states. At the time
of writing, no extension has been identified that requires
multiple PyInterpreterStates, and indeed it is not clear if that
facility works correctly in Python itself.
This API will not perform automatic initialization of Python, or
initialize Python for multi-threaded operation. Extension authors
must continue to call Py_Initialize(), and for multi-threaded
applications, PyEval_InitThreads(). The reason for this is that
the first thread to call PyEval_InitThreads() is nominated as the
"main thread" by Python, and so forcing the extension author to
specify the main thread (by forcing her to make this first call)
removes ambiguity. As Py_Initialize() must be called before
PyEval_InitThreads(), and as both of these functions currently
support being called multiple times, the burden this places on
extension authors is considered reasonable.
It is intended that this API be all that is necessary to acquire
the Python GIL. Apart from the existing, standard
Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS macros, it is
assumed that no additional thread state API functions will be used
by the extension. Extensions with such complicated requirements
are free to continue to use the existing thread state API.
Proposal
This proposal recommends a new API be added to Python to simplify
the management of the GIL. This API will be available on all
platforms built with WITH_THREAD defined.
The intent is that assuming Python has correctly been initialized,
an extension author be able to use a small, well-defined "prologue
dance", at any time and on any thread, which will ensure Python
is ready to be used on that thread. After the extension has
finished with Python, it must also perform an "epilogue dance" to
release any resources previously acquired. Ideally, these dances
can be expressed in a single line.
Specifically, the following new APIs are proposed:
/* Ensure that the current thread is ready to call the Python
C API, regardless of the current state of Python, or of its
thread lock. This may be called as many times as desired
by a thread so long as each call is matched with a call to
PyGILState_Release(). In general, other thread-state APIs may
be used between _Ensure() and _Release() calls, so long as the
thread-state is restored to its previous state before the Release().
For example, normal use of the Py_BEGIN_ALLOW_THREADS/
Py_END_ALLOW_THREADS macros is acceptable.
The return value is an opaque "handle" to the thread state when
PyGILState_Ensure() was called, and must be passed to
PyGILState_Release() to ensure Python is left in the same state. Even
though recursive calls are allowed, these handles can *not* be
shared - each unique call to PyGILState_Ensure must save the handle
for its call to PyGILState_Release.
When the function returns, the current thread will hold the GIL.
Failure is a fatal error.
*/
PyAPI_FUNC(PyGILState_STATE) PyGILState_Ensure(void);
/* Release any resources previously acquired. After this call, Python's
state will be the same as it was prior to the corresponding
PyGILState_Ensure() call (but generally this state will be unknown to
the caller, hence the use of the GILState API.)
Every call to PyGILState_Ensure must be matched by a call to
PyGILState_Release on the same thread.
*/
PyAPI_FUNC(void) PyGILState_Release(PyGILState_STATE);
Common usage will be:
    void SomeCFunction(void)
    {
        /* ensure we hold the lock */
        PyGILState_STATE state = PyGILState_Ensure();
        /* Use the Python API */
        ...
        /* Restore the state of Python */
        PyGILState_Release(state);
    }
Design and Implementation
The general operation of PyGILState_Ensure() will be:
- assert Python is initialized.
- Get a PyThreadState for the current thread, creating and saving
if necessary.
- remember the current state of the lock (owned/not owned)
- If the current state does not own the GIL, acquire it.
- Increment a counter for how many calls to
PyGILState_Ensure have been made on the current thread.
- return
The general operation of PyGILState_Release() will be:
- assert our thread currently holds the lock.
- If old state indicates lock was previously unlocked, release GIL.
- Decrement the PyGILState_Ensure counter for the thread.
- If counter == 0:
- release and delete the PyThreadState.
- forget the ThreadState as being owned by the thread.
- return
It is assumed that it is an error if two discrete PyThreadStates
are used for a single thread. Comments in pystate.h ("State
unique per thread") support this view, although it is never
directly stated. Thus, this will require some implementation of
Thread Local Storage. Fortunately, a platform independent
implementation of Thread Local Storage already exists in the
Python source tree, in the SGI threading port. This code will be
integrated into the platform independent Python core, but in such
a way that platforms can provide a more optimal implementation if
desired.
Implementation
An implementation of this proposal can be found at
http://www.python.org/sf/684256
References
[1] http://mail.python.org/pipermail/python-dev/2002-December/031424.html
Copyright
This document has been placed in the public domain.
pep-0312 Simple Implicit Lambda
| PEP: | 312 |
|---|---|
| Title: | Simple Implicit Lambda |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Roman Suzi <rnd at onego.ru>, Alex Martelli <aleaxit at gmail.com> |
| Status: | Deferred |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 11-Feb-2003 |
| Python-Version: | 2.4 |
| Post-History: |
Abstract
This PEP proposes to make the argumentless "lambda" keyword
optional in some cases where it is not grammatically ambiguous.
Deferral
The BDFL hates the unary colon syntax. This PEP needs to go back
to the drawing board and find a more Pythonic syntax (perhaps an
alternative unary operator). See python-dev discussion on
17 June 2005.
Also, it is probably a good idea to eliminate the alternative
propositions which have no chance at all. The examples section
is good and highlights the readability improvements. It would
carry more weight with additional examples and with real-world
referents (instead of the abstracted dummy calls to :A and :B).
Motivation
Lambdas are useful for defining anonymous functions, e.g. for use
as callbacks or (pseudo)-lazy evaluation schemes. Often, lambdas
are not used when they would be appropriate, just because the
keyword "lambda" makes code look complex. Omitting lambda in some
special cases is possible, with small and backwards compatible
changes to the grammar, and provides a cheap cure against such
"lambdaphobia".
Rationale
Sometimes people do not use lambdas because they fear to introduce
a term with a theory behind it. This proposal makes introducing
argumentless lambdas easier, by allowing the "lambda" keyword
itself to be omitted. Implementation can be done simply by changing
the grammar so it lets the "lambda" keyword be implied in a few
well-known cases.
In particular, adding surrounding brackets lets you specify
nullary lambda anywhere.
Syntax
An argumentless "lambda" keyword can be omitted in the following
cases:
* immediately after "=" in named parameter assignment or default
value assignment;
* immediately after "(" in any expression;
* immediately after a "," in a function argument list;
* immediately after a ":" in a dictionary literal; (not
implemented)
* in an assignment statement; (not implemented)
Examples of Use
1) Inline "if":
def ifelse(cond, true_part, false_part):
if cond:
return true_part()
else:
return false_part()
# old syntax:
print ifelse(a < b, lambda:A, lambda:B)
# new syntax:
print ifelse(a < b, :A, :B)
# parts A and B may require extensive processing, as in:
print ifelse(a < b, :ext_proc1(A), :ext_proc2(B))
2) Locking:
def with(alock, acallable):
alock.acquire()
try:
acallable()
finally:
alock.release()
with(mylock, :x(y(), 23, z(), 'foo'))
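Since the implicit form was never adopted, the examples above correspond to the following explicit-lambda spelling in today's Python (illustrative values substituted for the PEP's abstract A and B):

```python
def ifelse(cond, true_part, false_part):
    # the PEP's helper: only the chosen branch is ever evaluated
    if cond:
        return true_part()
    else:
        return false_part()

a, b = 1, 2
# explicit-lambda spelling of the PEP's "ifelse(a < b, :A, :B)"
result = ifelse(a < b, lambda: "A wins", lambda: "B wins")
print(result)
```

The proposal's ":expr" form would simply drop the word "lambda" from each callback; the semantics are unchanged.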
Implementation
Implementation requires some tweaking of the Grammar/Grammar file
in the Python sources, and some adjustment of
Modules/parsermodule.c to make syntactic and pragmatic changes.
(Some grammar/parser guru is needed to make a full
implementation.)
Here are the changes needed to Grammar to allow implicit lambda:
varargslist: (fpdef ['=' imptest] ',')* ('*' NAME [',' '**'
NAME] | '**' NAME) | fpdef ['=' imptest] (',' fpdef ['='
imptest])* [',']
imptest: test | implambdef
atom: '(' [imptestlist] ')' | '[' [listmaker] ']' |
'{' [dictmaker] '}' | '`' testlist1 '`' | NAME | NUMBER | STRING+
implambdef: ':' test
imptestlist: imptest (',' imptest)* [',']
argument: [test '='] imptest
Three new non-terminals are needed: imptest for the place where
implicit lambda may occur, implambdef for the implicit lambda
definition itself, imptestlist for a place where imptest's may
occur.
This implementation is not complete. First, some files in the
Parser module need to be updated. Second, some additional places
aren't implemented; see the Syntax section above.
Discussion
This feature is not a high-visibility one (the only novel part is
the absence of lambda). The feature is intended to make nullary
lambdas more appealing syntactically, to provide lazy evaluation
of expressions in some simple cases. This proposal is not targeted
at more advanced cases (demanding arguments for the lambda).
There is an alternative proposition for implicit lambda: implicit
lambda with unused arguments. In this case the function defined by
such lambda can accept any parameters, i.e. be equivalent to:
lambda *args: expr. This form would be more powerful. Grep in the
standard library revealed that such lambdas are indeed in use.
One more extension can provide a way to have a list of parameters
passed to a function defined by implicit lambda. However, such
parameters need some special name to be accessed and are unlikely
to be included in the language. Possible local names for such
parameters are: _, __args__, __. For example:
reduce(:_[0] + _[1], [1,2,3], 0)
reduce(:__[0] + __[1], [1,2,3], 0)
reduce(:__args__[0] + __args__[1], [1,2,3], 0)
These forms do not look very nice, and in the PEP author's opinion
do not justify the removal of the lambda keyword in such cases.
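For comparison, the `lambda *args:` variant discussed above is already expressible explicitly in current Python:

```python
from functools import reduce

# explicit spelling of the PEP's hypothetical ":_[0] + _[1]" forms:
# reduce passes (accumulator, item), so args[0] + args[1] sums the list
total = reduce(lambda *args: args[0] + args[1], [1, 2, 3], 0)
print(total)  # 6
```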
Credits
The idea of dropping lambda was first coined by Paul Rubin on 08
Feb 2003 16:39:30 -0800 in comp.lang.python while discussing the
thread "For review: PEP 308 - If-then-else expression".
Copyright
This document has been placed in the public domain.
pep-0313 Adding Roman Numeral Literals to Python
| PEP: | 313 |
|---|---|
| Title: | Adding Roman Numeral Literals to Python |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Mike Meyer <mwm at mired.org> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 01-Apr-2003 |
| Python-Version: | 2.4 |
| Post-History: |
Abstract
This PEP (also known as PEP CCCXIII) proposes adding Roman
numerals as a literal type. It also proposes the new built-in
function "roman", which converts an object to an integer, then
converts the integer to a string that is the Roman numeral literal
equivalent to the integer.
BDFL Pronouncement
This PEP is rejected. While the majority of Python users deemed this
to be a nice-to-have feature, the community was unable to reach a
consensus on whether nine should be represented as IX, the modern
form, or VIIII, the classic form. Likewise, no agreement was
reached on whether MXM or MCMXC would be considered a well-formed
representation of 1990. A vocal minority of users has also requested
support for lower-cased numerals for use in (i) powerpoint slides,
(ii) academic work, and (iii) Perl documentation.
Rationale
Roman numerals are used in a number of areas, and adding them to
Python as literals would make computations in those areas easier.
For instance, Superbowls are counted with Roman numerals, and many
older movies have copyright dates in Roman numerals. Further,
LISP provides a Roman numerals literal package, so adding Roman
numerals to Python will help ease the LISP-envy sometimes seen in
comp.lang.python. Besides, the author thinks this is the easiest
way to get his name on a PEP.
Syntax for Roman literals
Roman numeral literals will consist of the characters M, D, C, L,
X, V and I, and only those characters. They must be in upper
case, and represent an integer with the following rules:
1. Except as noted below, they must appear in the order M, D, C,
L, X, V then I. Each occurrence of each character adds 1000, 500,
100, 50, 10, 5 and 1 to the value of the literal, respectively.
2. Only one D, V or L may appear in any given literal.
3. At most three each of Is, Xs and Cs may appear consecutively
in any given literal.
4. A single I may appear immediately to the left of the single V,
followed by no Is, and adds 4 to the value of the literal.
5. A single I may likewise appear before the last X, followed by
no Is or Vs, and adds 9 to the value.
6. X is to L and C as I is to V and X, except the values are 40
and 90, respectively.
7. C is to D and M as I is to V and X, except the values are 400
and 900, respectively.
Any literal composed entirely of M, D, C, L, X, V and I characters
that does not follow this format will raise a syntax error,
because explicit is better than implicit.
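A minimal sketch of a validator and evaluator for the rules above (not part of the PEP; the strict form is expressed here as a regular expression, and malformed literals raise SyntaxError as the PEP requires):

```python
import re

# one M-group, then hundreds, tens, units per rules 1-7 above
ROMAN_RE = re.compile(r"^M*(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")
VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(literal):
    """Evaluate a well-formed Roman numeral literal, else raise SyntaxError."""
    if not literal or not ROMAN_RE.match(literal):
        raise SyntaxError("malformed Roman numeral literal: %r" % literal)
    total = 0
    # a character subtracts its value when a larger one follows (rules 4-7)
    for ch, nxt in zip(literal, literal[1:] + " "):
        value = VALUES[ch]
        total += -value if VALUES.get(nxt, 0) > value else value
    return total
```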
Built-In "roman" Function
The new built-in function "roman" will aid the translation from
integers to Roman numeral literals. It will accept a single
object as an argument, and return a string containing the literal
of the same value. If the argument is not an integer or a
rational (see PEP 239 [1]) it will be passed through the existing
built-in "int" to obtain the value. This may cause a loss of
information if the object was a float. If the object is a
rational, then the result will be formatted as a rational literal
(see PEP 240 [2]) with the integers in the string being Roman
numeral literals.
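A sketch of the proposed "roman" function for plain integers, using the modern subtractive forms (the rational-literal case from PEP 240 is omitted):

```python
def roman(obj):
    """Convert obj to an integer, then to its Roman numeral literal."""
    n = int(obj)  # may lose information for floats, as noted above
    if n <= 0:
        raise ValueError("Roman numerals represent positive integers only")
    # greedy subtraction over the modern-form value table
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
             (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
             (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in pairs:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)
```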
Compatibility Issues
No new keywords are introduced by this proposal. Programs that
use variable names that are all upper case and contain only the
characters M, D, C, L, X, V and I will be affected by the new
literals. These programs will now have syntax errors when those
variables are assigned, and either syntax errors or subtle bugs
when those variables are referenced in expressions. Since such
variable names violate PEP 8 [3], the code is already broken; it
just wasn't generating exceptions. This proposal corrects that
oversight in the language.
References
[1] PEP 239, Adding a Rational Type to Python
http://www.python.org/dev/peps/pep-0239/
[2] PEP 240, Adding a Rational Literal to Python
http://www.python.org/dev/peps/pep-0240/
[3] PEP 8, Style Guide for Python Code
http://www.python.org/dev/peps/pep-0008/
Copyright
This document has been placed in the public domain.
pep-0314 Metadata for Python Software Packages v1.1
| PEP: | 314 |
|---|---|
| Title: | Metadata for Python Software Packages v1.1 |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | A.M. Kuchling, Richard Jones |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 12-Apr-2003 |
| Python-Version: | 2.5 |
| Post-History: | 29-Apr-2003 |
| Replaces: | 241 |
Introduction
This PEP describes a mechanism for adding metadata to Python packages. It includes specifics of the field names, and their semantics and usage. This document specifies version 1.1 of the metadata format. Version 1.0 is specified in PEP 241.
Including Metadata in Packages
The Distutils 'sdist' command will extract the metadata fields
from the arguments and write them to a file in the generated
zipfile or tarball. This file will be named PKG-INFO and will be
placed in the top directory of the source distribution (where the
README, INSTALL, and other files usually go).
Developers may not provide their own PKG-INFO file. The "sdist"
command will, if it detects an existing PKG-INFO file, terminate
with an appropriate error message. This should prevent confusion
caused by the PKG-INFO and setup.py files being out of sync.
The PKG-INFO file format is a single set of RFC-822 headers
parseable by the rfc822.py module. The field names listed in the
following section are used as the header names.
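The Python 2 rfc822 module named here no longer exists, but the same header syntax can be parsed in modern Python with email.parser. A small sketch using example field values from this PEP:

```python
from email.parser import HeaderParser

PKG_INFO = """\
Metadata-Version: 1.1
Name: BeagleVote
Version: 1.0a2
Summary: A module for collecting votes from beagles.
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console (Text Based)
"""

headers = HeaderParser().parsestr(PKG_INFO)
print(headers["Name"])                # single-use field
print(headers.get_all("Classifier"))  # multiple-use field -> list
```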
Fields
This section specifies the names and semantics of each of the
supported metadata fields.
Fields marked with "(Multiple use)" may be specified multiple
times in a single PKG-INFO file. Other fields may only occur
once in a PKG-INFO file. Fields marked with "(optional)" are
not required to appear in a valid PKG-INFO file; all other
fields must be present.
Metadata-Version
Version of the file format; currently "1.0" and "1.1" are the
only legal values here.
Example:
Metadata-Version: 1.1
Name
The name of the package.
Example:
Name: BeagleVote
Version
A string containing the package's version number. This
field should be parseable by one of the Version classes
(StrictVersion or LooseVersion) in the distutils.version
module.
Example:
Version: 1.0a2
Platform (multiple use)
A comma-separated list of platform specifications, summarizing
the operating systems supported by the package which are not
listed in the "Operating System" Trove classifiers. See
"Classifier" below.
Example:
Platform: ObscureUnix, RareDOS
Supported-Platform (multiple use)
Binary distributions containing a PKG-INFO file will use the
Supported-Platform field in their metadata to specify the OS and
CPU for which the binary package was compiled. The semantics of
the Supported-Platform field are not specified in this PEP.
Example:
Supported-Platform: RedHat 7.2
Supported-Platform: i386-win32-2791
Summary
A one-line summary of what the package does.
Example:
Summary: A module for collecting votes from beagles.
Description (optional)
A longer description of the package that can run to several
paragraphs. Software that deals with metadata should not assume
any maximum size for this field, though people shouldn't include
their instruction manual as the description.
The contents of this field can be written using reStructuredText
markup [1]. For programs that work with the metadata,
supporting markup is optional; programs can also display the
contents of the field as-is. This means that authors should be
conservative in the markup they use.
Example:
Description: This module collects votes from beagles
in order to determine their electoral wishes.
Do *not* try to use this module with basset hounds;
it makes them grumpy.
Keywords (optional)
A list of additional keywords to be used to assist searching
for the package in a larger catalog.
Example:
Keywords: dog puppy voting election
Home-page (optional)
A string containing the URL for the package's home page.
Example:
Home-page: http://www.example.com/~cschultz/bvote/
Download-URL
A string containing the URL from which this version of the package
can be downloaded. (This means that the URL can't be something like
".../package-latest.tgz", but instead must be ".../package-0.45.tgz".)
Author (optional)
A string containing the author's name at a minimum; additional
contact information may be provided.
Example:
Author: C. Schultz, Universal Features Syndicate,
Los Angeles, CA <cschultz@peanuts.example.com>
Author-email
A string containing the author's e-mail address. It can contain
a name and e-mail address in the legal forms for a RFC-822
'From:' header. It's not optional because cataloging systems
can use the e-mail portion of this field as a unique key
representing the author. A catalog might provide authors the
ability to store their GPG key, personal home page, and other
additional metadata *about the author*, and optionally the
ability to associate several e-mail addresses with the same
person. Author-related metadata fields are not covered by this
PEP.
Example:
Author-email: "C. Schultz" <cschultz@example.com>
License
Text indicating the license covering the package where the license
is not a selection from the "License" Trove classifiers. See
"Classifier" below.
Example:
License: This software may only be obtained by sending the
author a postcard, and then the user promises not
to redistribute it.
Classifier (multiple use)
Each entry is a string giving a single classification value
for the package. Classifiers are described in PEP 301 [2].
Examples:
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console (Text Based)
Requires (multiple use)
Each entry contains a string describing some other module or
package required by this package.
The format of a requirement string is identical to that of a
module or package name usable with the 'import' statement,
optionally followed by a version declaration within parentheses.
A version declaration is a series of conditional operators and
version numbers, separated by commas. Conditional operators
must be one of "<", ">", "<=", ">=", "==", and "!=". Version
numbers must be in the format accepted by the
distutils.version.StrictVersion class: two or three
dot-separated numeric components, with an optional "pre-release"
tag on the end consisting of the letter 'a' or 'b' followed by a
number. Example version numbers are "1.0", "2.3a2", and "1.3.99".
Any number of conditional operators can be specified; e.g.
the string ">1.0, !=1.3.4, <2.0" is a legal version declaration.
All of the following are possible requirement strings: "rfc822",
"zlib (>=1.1.4)", "zope".
There's no canonical list of what strings should be used; the
Python community is left to choose its own standards.
Example:
Requires: re
Requires: sys
Requires: zlib
Requires: xml.parsers.expat (>1.0)
Requires: psycopg
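A sketch of splitting a requirement string into a name plus (operator, version) pairs, per the grammar described above (distutils.version is deliberately avoided here, since distutils was later removed from the standard library):

```python
import re

def parse_requirement(req):
    """Split e.g. 'zlib (>=1.1.4, <2.0)' into ('zlib', [('>=', '1.1.4'), ...])."""
    m = re.match(r"^\s*([\w.]+)\s*(?:\(([^)]*)\))?\s*$", req)
    if m is None:
        raise ValueError("bad requirement string: %r" % req)
    name, decl = m.group(1), m.group(2)
    conditions = []
    if decl:
        for clause in decl.split(","):
            # two-character operators must be tried before "<" and ">"
            cm = re.match(r"^\s*(<=|>=|==|!=|<|>)\s*([\w.]+)\s*$", clause)
            if cm is None:
                raise ValueError("bad version clause: %r" % clause)
            conditions.append((cm.group(1), cm.group(2)))
    return name, conditions
```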
Provides (multiple use)
Each entry contains a string describing a package or module that
will be provided by this package once it is installed. These
strings should match the ones used in "Requires" fields. A
version declaration may be supplied (without a comparison
operator); the package's version number will be implied if none
is specified.
Example:
Provides: xml
Provides: xml.utils
Provides: xml.utils.iso8601
Provides: xml.dom
Provides: xmltools (1.3)
Obsoletes (multiple use)
Each entry contains a string describing a package or module
that this package renders obsolete, meaning that the two packages
should not be installed at the same time. Version declarations
can be supplied.
The most common use of this field will be in case a package name
changes, e.g. Gorgon 2.3 gets subsumed into Torqued Python 1.0.
When you install Torqued Python, the Gorgon package should be
removed.
Example:
Obsoletes: Gorgon
Summary of Differences From PEP 241
* Metadata-Version is now 1.1.
* Added the Classifier field from PEP 301.
* The License and Platform fields should now only be used if the
platform or license can't be handled by an appropriate Classifier
value.
* Added fields: Download-URL, Requires, Provides, Obsoletes.
Open issues
None.
Acknowledgements
None.
References
[1] reStructuredText
http://docutils.sourceforge.net/
[2] PEP 301
http://www.python.org/dev/peps/pep-0301/
Copyright
This document has been placed in the public domain.
pep-0315 Enhanced While Loop
| PEP: | 315 |
|---|---|
| Title: | Enhanced While Loop |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Raymond Hettinger <python at rcn.com> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 25-Apr-2003 |
| Python-Version: | 2.5 |
| Post-History: |
Abstract
This PEP proposes adding an optional "do" clause to the beginning
of the while loop to make loop code clearer and reduce errors
caused by code duplication.
Notice
Rejected; see
http://mail.python.org/pipermail/python-ideas/2013-June/021610.html
This PEP has been deferred since 2006; see
http://mail.python.org/pipermail/python-dev/2006-February/060718.html
Subsequent efforts to revive the PEP in April 2009 did not
meet with success because no syntax emerged that could
compete with the following form:
while True:
<setup code>
if not <condition>:
break
<loop body>
A syntax alternative to the one proposed in the PEP was found for
a basic do-while loop but it gained little support because the
condition was at the top:
do ... while <cond>:
<loop body>
Users of the language are advised to use the while-True form with
an inner if-break when a do-while loop would have been appropriate.
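The recommended while-True form, shown with concrete setup and condition code (a hypothetical example, not from the PEP):

```python
def read_words(tokens):
    """Collect tokens up to (excluding) the first empty string,
    using the while-True / if-break idiom recommended above."""
    it = iter(tokens)
    words = []
    while True:
        word = next(it, "")        # <setup code>: runs before every test
        if not word:               # if not <condition>: break
            break
        words.append(word)         # <loop body>
    return words
```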
Motivation
It is often necessary for some code to be executed before each
evaluation of the while loop condition. This code is often
duplicated outside the loop, as setup code that executes once
before entering the loop:
<setup code>
while <condition>:
<loop body>
<setup code>
The problem is that duplicated code can be a source of errors if
one instance is changed but the other is not. Also, the purpose
of the second instance of the setup code is not clear because it
comes at the end of the loop.
It is possible to prevent code duplication by moving the loop
condition into a helper function, or an if statement in the loop
body. However, separating the loop condition from the while
keyword makes the behavior of the loop less clear:
def helper(args):
<setup code>
return <condition>
while helper(args):
<loop body>
The form using an if statement in the loop body has the additional
drawback of requiring the loop's else clause to be added to the body
of the if statement, further obscuring the loop's behavior:
while True:
<setup code>
if not <condition>: break
<loop body>
This PEP proposes to solve these problems by adding an optional
clause to the while loop, which allows the setup code to be
expressed in a natural way:
do:
<setup code>
while <condition>:
<loop body>
This keeps the loop condition with the while keyword where it
belongs, and does not require code to be duplicated.
Syntax
The syntax of the while statement
while_stmt : "while" expression ":" suite
["else" ":" suite]
is extended as follows:
while_stmt : ["do" ":" suite]
"while" expression ":" suite
["else" ":" suite]
Semantics of break and continue
In the do-while loop the break statement will behave the same as
in the standard while loop: It will immediately terminate the loop
without evaluating the loop condition or executing the else
clause.
A continue statement in the do-while loop jumps to the while
condition check.
In general, when the while suite is empty (a pass statement),
the do-while loop and break and continue statements should match
the semantics of do-while in other languages.
Likewise, when the do suite is empty, the do-while loop and
break and continue statements should match behavior found
in regular while loops.
Future Statement
Because of the new keyword "do", the statement
from __future__ import do_while
will initially be required to use the do-while form.
Implementation
The first implementation of this PEP can compile the do-while loop
as an infinite loop with a test that exits the loop.
Copyright
This document is placed in the public domain.
pep-0316 Programming by Contract for Python
| PEP: | 316 |
|---|---|
| Title: | Programming by Contract for Python |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Terence Way <terry at wayforward.net> |
| Status: | Deferred |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 02-May-2003 |
| Python-Version: | |
| Post-History: |
Contents
Abstract
This submission describes programming by contract for Python. Eiffel's Design By Contract(tm) is perhaps the most popular use of programming contracts [2].
Programming contracts extend the language to include invariant expressions for classes and modules, and pre- and post-condition expressions for functions and methods.
These expressions (contracts) are similar to assertions: they must be true or the program is stopped, and run-time checking of the contracts is typically only enabled while debugging. Contracts are higher-level than straight assertions and are typically included in documentation.
Motivation
Python already has assertions, why add extra stuff to the language to support something like contracts? The two best reasons are 1) better, more accurate documentation, and 2) easier testing.
Complex modules and classes never seem to be documented quite right. The documentation provided may be enough to convince a programmer to use a particular module or class over another, but the programmer almost always has to read the source code when the real debugging starts.
Contracts extend the excellent example provided by the doctest module [4]. Documentation is readable by programmers, yet has executable tests embedded in it.
Testing code with contracts is easier too. Comprehensive contracts are equivalent to unit tests [8]. Tests exercise the full range of pre-conditions, and fail if the post-conditions are triggered. Theoretically, a correctly specified function can be tested completely randomly.
So why add this to the language? Why not have several different implementations, or let programmers implement their own assertions? The answer is the behavior of contracts under inheritance.
Suppose Alice and Bob use different assertions packages. If Alice produces a class library protected by assertions, Bob cannot derive classes from Alice's library and expect proper checking of post-conditions and invariants. If they both use the same assertions package, then Bob can override Alice's methods yet still test against Alice's contract assertions. The natural place to find this assertions system is in the language's run-time library.
Specification
The docstring of any module or class can include invariant contracts marked off with a line that starts with the keyword inv followed by a colon (:). Whitespace at the start of the line and around the colon is ignored. The colon is either immediately followed by a single expression on the same line, or by a series of expressions on following lines indented past the inv keyword. The normal Python rules about implicit and explicit line continuations are followed here. Any number of invariant contracts can be in a docstring.
Some examples:
# state enumeration
START, CONNECTING, CONNECTED, CLOSING, CLOSED = range(5)
class conn:
"""A network connection
inv: self.state in [START, CLOSED, # closed states
CONNECTING, CLOSING, # transition states
CONNECTED]
inv: 0 <= self.seqno < 256
"""
class circbuf:
"""A circular buffer.
inv:
# there can be from 0 to max items on the buffer
0 <= self.len <= len(self.buf)
# g is a valid index into buf
0 <= self.g < len(self.buf)
# p is also a valid index into buf
0 <= self.p < len(self.buf)
# there are len items between get and put
(self.p - self.g) % len(self.buf) == \
self.len % len(self.buf)
"""
Module invariants must be true after the module is loaded, and at the entry and exit of every public function within the module.
Class invariants must be true after the __init__ function returns, at the entry of the __del__ function, and at the entry and exit of every other public method of the class. Class invariants must use the self variable to access instance variables.
A method or function is public if its name doesn't start with an underscore (_), unless it starts and ends with '__' (two underscores).
The docstring of any function or method can have pre-conditions documented with the keyword pre following the same rules above. Post-conditions are documented with the keyword post optionally followed by a list of variables. The variables are in the same scope as the body of the function or method. This list declares the variables that the function/method is allowed to modify.
An example:
class circbuf:
def __init__(self, leng):
"""Construct an empty circular buffer.
pre: leng > 0
post[self]:
self.is_empty()
len(self.buf) == leng
"""
A double-colon (::) can be used instead of a single colon (:) to support docstrings written using reStructuredText [7]. For example, the following two docstrings describe the same contract:
"""pre: leng > 0""" """pre:: leng > 0"""
Expressions in pre- and post-conditions are defined in the module namespace -- they have access to nearly all the variables that the function can access, except closure variables.
The contract expressions in post-conditions have access to two additional variables: __old__ which is filled with shallow copies of values declared in the variable list immediately following the post keyword, and __return__ which is bound to the return value of the function or method.
An example:
class circbuf:
def get(self):
"""Pull an entry from a non-empty circular buffer.
pre: not self.is_empty()
post[self.g, self.len]:
__return__ == self.buf[__old__.self.g]
self.len == __old__.self.len - 1
"""
All contract expressions have access to some additional convenience functions. To make evaluating the truth of sequences easier, two functions forall and exists are defined as:
def forall(a, fn = bool):
"""Return True only if all elements in a are true.
>>> forall([])
1
>>> even = lambda x: x % 2 == 0
>>> forall([2, 4, 6, 8], even)
1
>>> forall('this is a test'.split(), lambda x: len(x) == 4)
0
"""
def exists(a, fn = bool):
"""Returns True if there is at least one true value in a.
>>> exists([])
0
>>> exists('this is a test'.split(), lambda x: len(x) == 4)
1
"""
An example:
def sort(a):
"""Sort a list.
pre: isinstance(a, list)
post[a]:
# array size is unchanged
len(a) == len(__old__.a)
# array is ordered
forall([a[i] >= a[i-1] for i in range(1, len(a))])
# all the old elements are still in the array
forall(__old__.a, lambda e: __old__.a.count(e) == a.count(e))
"""
To make evaluating conditions easier, the function implies is defined. With two arguments, this is similar to the logical implies (=>) operator. With three arguments, this is similar to C's conditional expression (x?a:b). This is defined as:
implies(False, a) => True
implies(True, a) => a
implies(False, a, b) => b
implies(True, a, b) => a
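Minimal runnable definitions matching the behavior specified for forall, exists, and implies (a sketch; the PEP itself gives only docstring doctests):

```python
def forall(a, fn=bool):
    """True only if fn is true for every element of a (vacuously true for [])."""
    return all(fn(x) for x in a)

def exists(a, fn=bool):
    """True if fn is true for at least one element of a."""
    return any(fn(x) for x in a)

def implies(test, then_value, else_value=True):
    """Two arguments: logical implication; three: a conditional expression."""
    return then_value if test else else_value
```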
On entry to a function, the function's pre-conditions are checked. An assertion error is raised if any pre-condition is false. If the function is public, then the class or module's invariants are also checked. Copies of variables declared in the post are saved, the function is called, and if the function exits without raising an exception, the post-conditions are checked.
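The checking sequence just described can be sketched as a decorator (a hypothetical API: the PEP itself extracts contracts from docstrings, and invariant checks and the __old__ snapshot are omitted here for brevity):

```python
import functools

def contract(pre=None, post=None):
    """Check pre(*args) on entry and post(result, *args) on normal exit.
    Failed checks raise AssertionError, as all contract violations
    are AssertionError subclasses in the PEP's hierarchy."""
    def wrap(fn):
        @functools.wraps(fn)
        def checked(*args, **kwargs):
            if pre is not None and not pre(*args, **kwargs):
                raise AssertionError("pre-condition failed: %s" % fn.__name__)
            result = fn(*args, **kwargs)
            # post-conditions are only checked when no exception was raised
            if post is not None and not post(result, *args, **kwargs):
                raise AssertionError("post-condition failed: %s" % fn.__name__)
            return result
        return checked
    return wrap

@contract(pre=lambda a: len(a) > 0, post=lambda r, a: r in a)
def first(a):
    return a[0]
```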
Exceptions
Class/module invariants are checked even if a function or method exits by signalling an exception (post-conditions are not).
All failed contracts raise exceptions which are subclasses of the ContractViolationError exception, which is in turn a subclass of the AssertionError exception. Failed pre-conditions raise a PreconditionViolationError exception. Failed post-conditions raise a PostconditionViolationError exception, and failed invariants raise an InvariantViolationError exception.
The class hierarchy:
AssertionError
ContractViolationError
PreconditionViolationError
PostconditionViolationError
InvariantViolationError
InvalidPreconditionError
The InvalidPreconditionError is raised when pre-conditions are illegally strengthened, see the next section on Inheritance.
Example:
try:
some_func()
except contract.PreconditionViolationError:
# failed pre-condition, ok
pass
Inheritance
A class's invariants include all the invariants for all super-classes (class invariants are ANDed with super-class invariants). These invariants are checked in method-resolution order.
A method's post-conditions also include all overridden post-conditions (method post-conditions are ANDed with all overridden method post-conditions).
An overridden method's pre-conditions can be ignored if the overriding method's pre-conditions are met. However, if the overriding method's pre-conditions fail, all of the overridden method's pre-conditions must also fail. If not, a separate exception is raised, the InvalidPreconditionError. This supports weakening pre-conditions.
A somewhat contrived example:
class SimpleMailClient:
def send(self, msg, dest):
"""Sends a message to a destination:
pre: self.is_open() # we must have an open connection
"""
def recv(self):
"""Gets the next unread mail message.
Returns None if no message is available.
pre: self.is_open() # we must have an open connection
post: __return__ == None or isinstance(__return__, Message)
"""
class ComplexMailClient(SimpleMailClient):
def send(self, msg, dest):
"""Sends a message to a destination.
The message is sent immediately if currently connected.
Otherwise, the message is queued locally until a
connection is made.
pre: True # weakens the pre-condition from SimpleMailClient
"""
def recv(self):
"""Gets the next unread mail message.
Waits until a message is available.
pre: True # can always be called
post: isinstance(__return__, Message)
"""
Because pre-conditions can only be weakened, a ComplexMailClient can replace a SimpleMailClient with no fear of breaking existing code.
Rationale
Except for the following differences, programming-by-contract for Python mirrors the Eiffel DBC specification [3].
Embedding contracts in docstrings is patterned after the doctest module. It removes the need for extra syntax, ensures that programs with contracts are backwards-compatible, and no further work is necessary to have the contracts included in the docs.
The keywords pre, post, and inv were chosen instead of the Eiffel-style REQUIRE, ENSURE, and INVARIANT because they're shorter, more in line with mathematical notation, and for a more subtle reason: the word 'require' implies caller responsibilities, while 'ensure' implies provider guarantees. Yet pre-conditions can fail through no fault of the caller when using multiple inheritance, and post-conditions can fail through no fault of the function when using multiple threads.
Loop invariants as used in Eiffel are unsupported. They're a pain to implement, and not part of the documentation anyway.
The variable names __old__ and __return__ were picked to avoid conflicts with the return keyword and to stay consistent with Python naming conventions: they're public and provided by the Python implementation.
Having variable declarations after a post keyword describes exactly what the function or method is allowed to modify. This removes the need for the NoChange syntax in Eiffel, and makes the implementation of __old__ much easier. It also is more in line with Z schemas [9], which are divided into two parts: declaring what changes followed by limiting the changes.
Shallow copies of variables for the __old__ value prevent an implementation of contract programming from slowing down a system too much. If a function changes values that wouldn't be caught by a shallow copy, it can declare the changes like so:
post[self, self.obj, self.obj.p]
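Why shallow copies need these explicit declarations can be demonstrated directly. The sketch below is not pycontract itself; Stack is a made-up class, and the two snapshots stand in for what __old__ would capture:

```python
import copy

class Stack:
    def __init__(self):
        self.items = []

s = Stack()
# An __old__-style snapshot via a shallow copy of the object: the
# copied instance shares the *same* list object as the original.
old_shallow = copy.copy(s)
# Declaring the deeper location (as post[self.items] would) copies
# the list itself:
old_items = copy.copy(s.items)

s.items.append('spam')

# The shallow object copy cannot see the pre-call state of the list...
assert len(old_shallow.items) == 1
# ...but the explicitly declared copy preserves it:
assert len(old_items) == 0
```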
The forall, exists, and implies functions were added after spending some time documenting existing functions with contracts. These capture a majority of common specification idioms. It might seem that defining implies as a function might not work (the arguments are evaluated whether needed or not, in contrast with other boolean operators), but it works for contracts since there should be no side-effects for any expression in a contract.
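Plausible definitions of these three helpers follow; the signatures are assumptions for illustration, and the reference implementation's exact API may differ:

```python
def forall(seq, pred=bool):
    # Holds when pred is true for every element; vacuously true when empty.
    return all(pred(x) for x in seq)

def exists(seq, pred=bool):
    # Holds when pred is true for at least one element.
    return any(pred(x) for x in seq)

def implies(p, q):
    # Material implication.  As a function, q is evaluated eagerly,
    # which is harmless here because contract expressions are assumed
    # to be side-effect free.
    return (not p) or bool(q)
```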
Reference Implementation
A reference implementation is available [1]. It replaces existing functions with new functions that do contract checking, by directly changing the class' or module's namespace.
Other implementations exist that either hack __getattr__ [5] or use __metaclass__ [6].
References
| [1] | Implementation described in this document. (http://www.wayforward.net/pycontract/) |
| [2] | Design By Contract is a registered trademark of Eiffel Software Inc. (http://archive.eiffel.com/doc/manuals/technology/contract/) |
| [3] | Object-oriented Software Construction, Bertrand Meyer, ISBN 0-13-629031-0 |
| [4] | doctest -- Test docstrings represent reality (http://docs.python.org/library/doctest.html) |
| [5] | Design by Contract for Python, R. Plosch IEEE Proceedings of the Joint Asia Pacific Software Engineering Conference (APSEC97/ICSC97), Hong Kong, December 2-5, 1997 (http://www.swe.uni-linz.ac.at/publications/abstract/TR-SE-97.24.html) |
| [6] | PyDBC -- Design by Contract for Python 2.2+, Daniel Arbuckle (http://www.nongnu.org/pydbc/) |
| [7] | ReStructuredText (http://docutils.sourceforge.net/rst.html) |
| [8] | Extreme Programming Explained, Kent Beck, ISBN 0-201-61641-6 |
| [9] | The Z Notation, Second Edition, J.M. Spivey ISBN 0-13-978529-9 |
Copyright
This document has been placed in the public domain.
pep-0317 Eliminate Implicit Exception Instantiation
| PEP: | 317 |
|---|---|
| Title: | Eliminate Implicit Exception Instantiation |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Steven Taschuk <staschuk at telusplanet.net> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 06-May-2003 |
| Python-Version: | 2.4 |
| Post-History: | 09-Jun-2003 |
Contents
Abstract
"For clarity in new code, the form raise class(argument, ...) is recommended (i.e. make an explicit call to the constructor)."
—Guido van Rossum, in 1997 [1]
This PEP proposes the formal deprecation and eventual elimination of forms of the raise statement which implicitly instantiate an exception. For example, statements such as
raise HullBreachError
raise KitchenError, 'all out of baked beans'
must under this proposal be replaced with their synonyms
raise HullBreachError()
raise KitchenError('all out of baked beans')
Note that these latter statements are already legal, and that this PEP does not change their meaning.
Eliminating these forms of raise makes it impossible to use string exceptions; accordingly, this PEP also proposes the formal deprecation and eventual elimination of string exceptions.
Adoption of this proposal breaks backwards compatibility. Under the proposed implementation schedule, Python 2.4 will introduce warnings about uses of raise which will eventually become incorrect, and Python 3.0 will eliminate them entirely. (It is assumed that this transition period -- 2.4 to 3.0 -- will be at least one year long, to comply with the guidelines of PEP 5 [2].)
Motivation
String Exceptions
It is assumed that removing string exceptions will be uncontroversial, since it has been intended since at least Python 1.5, when the standard exception types were changed to classes [1].
For the record: string exceptions should be removed because the presence of two kinds of exception complicates the language without any compensation. Instance exceptions are superior because, for example,
- the class-instance relationship more naturally expresses the relationship between the exception type and value,
- they can be organized naturally using superclass-subclass relationships, and
- they can encapsulate error-reporting behaviour (for example).
Implicit Instantiation
Guido's 1997 essay [1] on changing the standard exceptions into classes makes clear why raise can instantiate implicitly:
"The raise statement has been extended to allow raising a class exception without explicit instantiation. The following forms, called the "compatibility forms" of the raise statement [...] The motivation for introducing the compatibility forms was to allow backward compatibility with old code that raised a standard exception."
For example, it was desired that pre-1.5 code which used string exception syntax such as
raise TypeError, 'not an int'
would work both on versions of Python in which TypeError was a string, and on versions in which it was a class.
When no such consideration obtains -- that is, when the desired exception type is not a string in any version of the software which the code must support -- there is no good reason to instantiate implicitly, and it is clearer not to. For example:
In the code
try:
    raise MyError, raised
except MyError, caught:
    pass

the syntactic parallel between the raise and except statements strongly suggests that raised and caught refer to the same object. For string exceptions this actually is the case, but for instance exceptions it is not.
When instantiation is implicit, it is not obvious when it occurs, for example, whether it occurs when the exception is raised or when it is caught. Since it actually happens at the raise, the code should say so.
(Note that at the level of the C API, an exception can be "raised" and "caught" without being instantiated; this is used as an optimization by, for example, PyIter_Next. But in Python, no such optimization is or should be available.)
An implicitly instantiating raise statement with no arguments, such as
raise MyError
simply does not do what it says: it does not raise the named object.
The equivalence of
raise MyError
raise MyError()
conflates classes and instances, creating a possible source of confusion for beginners. (Moreover, it is not clear that the interpreter could distinguish between a new-style class and an instance of such a class, so implicit instantiation may be an obstacle to any future plan to let exceptions be new-style objects.)
In short, implicit instantiation has no advantages other than backwards compatibility, and so should be phased out along with what it exists to ensure compatibility with, namely, string exceptions.
Specification
The syntax of raise_stmt [3] is to be changed from
raise_stmt ::= "raise" [expression ["," expression ["," expression]]]
to
raise_stmt ::= "raise" [expression ["," expression]]
If no expressions are present, the raise statement behaves as it does presently: it re-raises the last exception that was active in the current scope, and if no exception has been active in the current scope, a TypeError is raised indicating that this is the problem.
Otherwise, the first expression is evaluated, producing the raised object. Then the second expression is evaluated, if present, producing the substituted traceback. If no second expression is present, the substituted traceback is None.
The raised object must be an instance. The class of the instance is the exception type, and the instance itself is the exception value. If the raised object is not an instance -- for example, if it is a class or string -- a TypeError is raised.
If the substituted traceback is not None, it must be a traceback object, and it is substituted instead of the current location as the place where the exception occurred. If it is neither a traceback object nor None, a TypeError is raised.
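The proposed run-time checks can be modeled as follows. This is an illustrative sketch, not interpreter code, and the function name is made up:

```python
import types

def validate_raise(raised, traceback=None):
    # The raised object must be an instance, not a class or a string.
    if isinstance(raised, type) or isinstance(raised, str):
        raise TypeError("exceptions must be instances, not classes or strings")
    # The optional second argument must be a traceback object or None.
    if traceback is not None and not isinstance(traceback, types.TracebackType):
        raise TypeError("second argument must be a traceback or None")
    return True
```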
Backwards Compatibility
Migration Plan
Future Statement
Under the future statement [4]
from __future__ import raise_with_two_args
the syntax and semantics of the raise statement will be as described above. This future feature is to appear in Python 2.4; its effect is to become standard in Python 3.0.
As the examples below illustrate, this future statement is only needed for code which uses the substituted traceback argument to raise; simple exception raising does not require it.
Warnings
Three new warnings [5], all of category DeprecationWarning, are to be issued to point out uses of raise which will become incorrect under the proposed changes.
The first warning is issued when a raise statement is executed in which the first expression evaluates to a string. The message for this warning is:
raising strings will be impossible in the future
The second warning is issued when a raise statement is executed in which the first expression evaluates to a class. The message for this warning is:
raising classes will be impossible in the future
The third warning is issued when a raise statement with three expressions is compiled. (Not, note, when it is executed; this is important because the SyntaxError which this warning presages will occur at compile-time.) The message for this warning is:
raising with three arguments will be impossible in the future
These warnings are to appear in Python 2.4, and disappear in Python 3.0, when the conditions which cause them are simply errors.
Examples
Code Using Implicit Instantiation
Code such as
class MyError(Exception):
pass
raise MyError, 'spam'
will issue a warning when the raise statement is executed. The raise statement should be changed to instantiate explicitly:
raise MyError('spam')
Code Using String Exceptions
Code such as
MyError = 'spam'
raise MyError, 'eggs'
will issue a warning when the raise statement is executed. The exception type should be changed to a class:
class MyError(Exception):
pass
and, as in the previous example, the raise statement should be changed to instantiate explicitly
raise MyError('eggs')
Code Supplying a Traceback Object
Code such as
raise MyError, 'spam', mytraceback
will issue a warning when compiled. The statement should be changed to
raise MyError('spam'), mytraceback
and the future statement
from __future__ import raise_with_two_args
should be added at the top of the module. Note that adding this future statement also turns the other two warnings into errors, so the changes described in the previous examples must also be applied.
The special case
raise sys.exc_type, sys.exc_value, sys.exc_traceback
(which is intended to re-raise a previous exception) should be changed simply to
raise
A Failure of the Plan
It may occur that a raise statement which raises a string or implicitly instantiates is not executed in production or testing during the phase-in period for this PEP. In that case, it will not issue any warnings, but will instead suddenly fail one day in Python 3.0 or a subsequent version. (The failure is that the wrong exception gets raised, namely a TypeError complaining about the arguments to raise, instead of the exception intended.)
Such cases can be made rarer by prolonging the phase-in period; they cannot be made impossible short of issuing at compile-time a warning for every raise statement.
Rejection
If this PEP were accepted, nearly all existing Python code would need to be reviewed and probably revised; even if all the above arguments in favour of explicit instantiation are accepted, the improvement in clarity is too minor to justify the cost of doing the revision and the risk of new bugs introduced thereby.
This proposal has therefore been rejected [6].
Note that string exceptions are slated for removal independently of this proposal; what is rejected is the removal of implicit exception instantiation.
Summary of Discussion
A small minority of respondents were in favour of the proposal, but the dominant response was that any such migration would be costly out of proportion to the putative benefit. As noted above, this point is sufficient in itself to reject the PEP.
New-Style Exceptions
Implicit instantiation might conflict with future plans to allow instances of new-style classes to be used as exceptions. In order to decide whether to instantiate implicitly, the raise machinery must determine whether the first argument is a class or an instance -- but with new-style classes there is no clear and strong distinction.
Under this proposal, the problem would be avoided because the exception would already have been instantiated. However, there are two plausible alternative solutions:
- Require exception types to be subclasses of Exception, and instantiate implicitly if and only if issubclass(firstarg, Exception)
- Instantiate implicitly if and only if isinstance(firstarg, type)
Thus eliminating implicit instantiation entirely is not necessary to solve this problem.
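The two alternative tests just described can be sketched directly; the function names are made up for illustration:

```python
def should_instantiate_v1(firstarg):
    # Alternative 1: instantiate implicitly only for Exception subclasses.
    return isinstance(firstarg, type) and issubclass(firstarg, Exception)

def should_instantiate_v2(firstarg):
    # Alternative 2: instantiate implicitly for any class at all.
    return isinstance(firstarg, type)
```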
Ugliness of Explicit Instantiation
Some respondents felt that the explicitly instantiating syntax is uglier, especially in cases when no arguments are supplied to the exception constructor:
raise TypeError()
The problem is particularly acute when the exception instance itself is not of interest, that is, when the only relevant point is the exception type:
try:
# ... deeply nested search loop ...
raise Found
except Found:
# ...
In such cases the symmetry between raise and except can be more expressive of the intent of the code.
Guido opined that the implicitly instantiating syntax is "a tad prettier" even for cases with a single argument, since it has less punctuation.
Performance Penalty of Warnings
Experience with deprecating apply() shows that use of the warning framework can incur a significant performance penalty.
Code which instantiates explicitly would not be affected, since the run-time checks necessary to determine whether to issue a warning are exactly those which are needed to determine whether to instantiate implicitly in the first place. That is, such statements are already incurring the cost of these checks.
Code which instantiates implicitly would incur a large cost: timing trials indicate that issuing a warning (whether it is suppressed or not) takes about five times more time than simply instantiating, raising, and catching an exception.
This penalty is mitigated by the fact that raise statements are rarely on performance-critical execution paths.
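A comparison of this kind can be reproduced with a micro-benchmark along these lines. The snippet is illustrative: absolute numbers, and the roughly 5x ratio reported above, depend on the Python version and machine:

```python
import timeit

# Raise, catch, and discard an exception.
raise_stmt = """
try:
    raise ValueError('x')
except ValueError:
    pass
"""

# Issue (and suppress) a DeprecationWarning.
warn_stmt = """
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    warnings.warn('deprecated', DeprecationWarning)
"""

t_raise = timeit.timeit(raise_stmt, number=10000)
t_warn = timeit.timeit(warn_stmt, setup='import warnings', number=10000)
```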
Traceback Argument
As the proposal stands, it would be impossible to use the traceback argument to raise conveniently with all 2.x versions of Python.
For compatibility with versions < 2.4, the three-argument form must be used; but this form would produce warnings with versions >= 2.4. Those warnings could be suppressed, but doing so is awkward because the relevant type of warning is issued at compile-time.
If this PEP were still under consideration, this objection would be met by extending the phase-in period. For example, warnings could first be issued in 3.0, and become errors in some later release.
References
| [1] | (1, 2, 3) "Standard Exception Classes in Python 1.5", Guido van Rossum. http://www.python.org/doc/essays/stdexceptions.html |
| [2] | "Guidelines for Language Evolution", Paul Prescod. http://www.python.org/dev/peps/pep-0005/ |
| [3] | "Python Language Reference", Guido van Rossum. http://docs.python.org/reference/simple_stmts.html#raise |
| [4] | PEP 236 "Back to the __future__", Tim Peters. http://www.python.org/dev/peps/pep-0236/ |
| [5] | PEP 230 "Warning Framework", Guido van Rossum. http://www.python.org/dev/peps/pep-0230/ |
| [6] | Guido van Rossum, 11 June 2003 post to python-dev. http://mail.python.org/pipermail/python-dev/2003-June/036176.html |
Copyright
This document has been placed in the public domain.
pep-0318 Decorators for Functions and Methods
| PEP: | 318 |
|---|---|
| Title: | Decorators for Functions and Methods |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Kevin D. Smith <Kevin.Smith at theMorgue.org>, Jim J. Jewett, Skip Montanaro, Anthony Baxter |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 05-Jun-2003 |
| Python-Version: | 2.4 |
| Post-History: | 09-Jun-2003, 10-Jun-2003, 27-Feb-2004, 23-Mar-2004, 30-Aug-2004, 2-Sep-2004 |
Contents
WarningWarningWarning
This document is meant to describe the decorator syntax and the process that resulted in the decisions that were made. It does not attempt to cover the huge number of potential alternative syntaxes, nor is it an attempt to exhaustively list all the positives and negatives of each form.
Abstract
The current method for transforming functions and methods (for instance, declaring them as a class or static method) is awkward and can lead to code that is difficult to understand. Ideally, these transformations should be made at the same point in the code where the declaration itself is made. This PEP introduces new syntax for transformations of a function or method declaration.
Motivation
The current method of applying a transformation to a function or method places the actual transformation after the function body. For large functions this separates a key component of the function's behavior from the definition of the rest of the function's external interface. For example:
def foo(self):
perform method operation
foo = classmethod(foo)
This becomes less readable with longer methods. It also seems less than pythonic to name the function three times for what is conceptually a single declaration. A solution to this problem is to move the transformation of the method closer to the method's own declaration. The intent of the new syntax is to replace
def foo(cls):
pass
foo = synchronized(lock)(foo)
foo = classmethod(foo)
with an alternative that places the decoration in the function's declaration:
@classmethod
@synchronized(lock)
def foo(cls):
pass
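classmethod() is a builtin, but synchronized(lock) is not; a minimal sketch of such a wrapper (the name and behavior are assumptions, implied but not defined by the example above) might be:

```python
import functools
import threading

def synchronized(lock):
    # Hypothetical decorator factory: serializes calls to the wrapped
    # function under the given lock.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with lock:
                return func(*args, **kwargs)
        return wrapper
    return decorator

lock = threading.Lock()

@synchronized(lock)
def bump(state):
    state['n'] += 1
    return state['n']
```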
Modifying classes in this fashion is also possible, though the benefits are not as immediately apparent. Almost certainly, anything which could be done with class decorators could be done using metaclasses, but using metaclasses is sufficiently obscure that there is some attraction to having an easier way to make simple modifications to classes. For Python 2.4, only function/method decorators are being added.
PEP 3129 proposes to add class decorators as of Python 2.6.
Why Is This So Hard?
Two decorators (classmethod() and staticmethod()) have been available in Python since version 2.2. It's been assumed since approximately that time that some syntactic support for them would eventually be added to the language. Given this assumption, one might wonder why it's been so difficult to arrive at a consensus. Discussions have raged off-and-on at times in both comp.lang.python and the python-dev mailing list about how best to implement function decorators. There is no one clear reason why this should be so, but a few problems seem to be most divisive.
- Disagreement about where the "declaration of intent" belongs. Almost everyone agrees that decorating/transforming a function at the end of its definition is suboptimal. Beyond that there seems to be no clear consensus where to place this information.
- Syntactic constraints. Python is a syntactically simple language with fairly strong constraints on what can and can't be done without "messing things up" (both visually and with regards to the language parser). There's no obvious way to structure this information so that people new to the concept will think, "Oh yeah, I know what you're doing." The best that seems possible is to keep new users from creating a wildly incorrect mental model of what the syntax means.
- Overall unfamiliarity with the concept. For people who have a passing acquaintance with algebra (or even basic arithmetic) or have used at least one other programming language, much of Python is intuitive. Very few people will have had any experience with the decorator concept before encountering it in Python. There's just no strong preexisting meme that captures the concept.
- Syntax discussions in general appear to cause more contention than almost anything else. Readers are pointed to the ternary operator discussions that were associated with PEP 308 for another example of this.
Background
There is general agreement that syntactic support is preferable to the current state of affairs. Guido mentioned syntactic support for decorators [2] in his DevDay keynote presentation at the 10th Python Conference [3], though he later said [5] it was only one of several extensions he proposed there "semi-jokingly". Michael Hudson raised the topic [4] on python-dev shortly after the conference, attributing the initial bracketed syntax to an earlier proposal on comp.lang.python by Gareth McCaughan [6].
Class decorations seem like an obvious next step because class definition and function definition are syntactically similar, however Guido remains unconvinced, and class decorators will almost certainly not be in Python 2.4.
The discussion continued on and off on python-dev from February 2002 through July 2004. Hundreds and hundreds of posts were made, with people proposing many possible syntax variations. Guido took a list of proposals to EuroPython 2004 [7], where a discussion took place. Subsequent to this, he decided that we'd have the Java-style [10] @decorator syntax, and this appeared for the first time in 2.4a2. Barry Warsaw named this the 'pie-decorator' syntax, in honor of the Pie-thon Parrot shootout which occurred around the same time as the decorator syntax, and because the @ looks a little like a pie. Guido outlined his case [8] on Python-dev, including this piece [9] on some of the (many) rejected forms.
On the name 'Decorator'
There's been a number of complaints about the choice of the name 'decorator' for this feature. The major one is that the name is not consistent with its use in the GoF book [11]. The name 'decorator' probably owes more to its use in the compiler area -- a syntax tree is walked and annotated. It's quite possible that a better name may turn up.
Design Goals
The new syntax should
- work for arbitrary wrappers, including user-defined callables and the existing builtins classmethod() and staticmethod(). This requirement also means that a decorator syntax must support passing arguments to the wrapper constructor
- work with multiple wrappers per definition
- make it obvious what is happening; at the very least it should be obvious that new users can safely ignore it when writing their own code
- be a syntax "that ... [is] easy to remember once explained"
- not make future extensions more difficult
- be easy to type; programs that use it are expected to use it very frequently
- not make it more difficult to scan through code quickly. It should still be easy to search for all definitions, a particular definition, or the arguments that a function accepts
- not needlessly complicate secondary support tools such as language-sensitive editors and other "toy parser tools out there [12]"
- allow future compilers to optimize for decorators. With the hope of a JIT compiler for Python coming into existence at some point this tends to require the syntax for decorators to come before the function definition
- move from the end of the function, where it's currently hidden, to the front where it is more in your face [13]
Andrew Kuchling has links to a bunch of the discussions about motivations and use cases in his blog [14]. Particularly notable is Jim Huginin's list of use cases [15].
Current Syntax
The current syntax for function decorators as implemented in Python 2.4a2 is:
@dec2
@dec1
def func(arg1, arg2, ...):
pass
This is equivalent to:
def func(arg1, arg2, ...):
pass
func = dec2(dec1(func))
without the intermediate assignment to the variable func. The decorators are near the function declaration. The @ sign makes it clear that something new is going on here.
The rationale for the order of application [16] (bottom to top) is that it matches the usual order for function application. In mathematics, composition of functions (g o f)(x) translates to g(f(x)). In Python, @g @f def foo() translates to foo = g(f(foo)).
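The bottom-to-top order is easy to verify directly; dec1 and dec2 here are toy string-wrapping decorators, not from the PEP:

```python
def dec1(f):
    def inner():
        return 'dec1(' + f() + ')'
    return inner

def dec2(f):
    def inner():
        return 'dec2(' + f() + ')'
    return inner

@dec2
@dec1
def greet():
    return 'hello'

# dec1 is applied first, then dec2 wraps the result.
assert greet() == 'dec2(dec1(hello))'
```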
The decorator statement is limited in what it can accept -- arbitrary expressions will not work. Guido preferred this because of a gut feeling [17].
The current syntax also allows decorator declarations to call a function that returns a decorator:
@decomaker(argA, argB, ...)
def func(arg1, arg2, ...):
pass
This is equivalent to:
func = decomaker(argA, argB, ...)(func)
The rationale for having a function that returns a decorator is that the part after the @ sign can be considered to be an expression (though syntactically restricted to just a function), and whatever that expression returns is called. See declaration arguments [16].
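A runnable instance of the decomaker pattern, with a made-up factory named tagged:

```python
def tagged(label):
    # Calling tagged(...) returns the actual decorator, exactly the
    # decomaker(argA, argB, ...) shape described above.
    def decorator(func):
        func.tag = label
        return func
    return decorator

@tagged('math')
def add(a, b):
    return a + b
```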
Syntax Alternatives
There have been a large number [18] of different syntaxes proposed -- rather than attempting to work through these individual syntaxes, it's worthwhile to break the syntax discussion down into a number of areas. Attempting to discuss each possible syntax [19] individually would be an act of madness, and produce a completely unwieldy PEP.
Decorator Location
The first syntax point is the location of the decorators. For the following examples, we use the @syntax used in 2.4a2.
Decorators before the def statement are the first alternative, and the syntax used in 2.4a2:
@classmethod
def foo(arg1,arg2):
pass
@accepts(int,int)
@returns(float)
def bar(low,high):
pass
There have been a number of objections raised to this location -- the primary one is that it's the first real Python case where a line of code has an effect on a following line. The syntax available in 2.4a3 requires one decorator per line (in a2, multiple decorators could be specified on the same line), and the final decision for 2.4 final stayed one decorator per line.
People also complained that the syntax quickly got unwieldy when multiple decorators were used. The point was made, though, that the chances of a large number of decorators being used on a single function were small and thus this was not a large worry.
Some of the advantages of this form are that the decorators live outside the method body -- they are obviously executed at the time the function is defined.
Another advantage is that a prefix to the function definition fits the idea of knowing about a change to the semantics of the code before the code itself, thus you know how to interpret the code's semantics properly without having to go back and change your initial perceptions if the syntax did not come before the function definition.
Guido decided he preferred [20] having the decorators on the line before the 'def', because it was felt that a long argument list would mean that the decorators would be 'hidden'.
The second form is the decorators between the def and the function name, or the function name and the argument list:
def @classmethod foo(arg1,arg2):
pass
def @accepts(int,int),@returns(float) bar(low,high):
pass
def foo @classmethod (arg1,arg2):
pass
def bar @accepts(int,int),@returns(float) (low,high):
pass
There are a couple of objections to this form. The first is that it breaks the easy 'greppability' of the source -- you can no longer search for 'def foo(' and find the definition of the function. The second, more serious, objection is that in the case of multiple decorators, the syntax would be extremely unwieldy.
The next form, which has had a number of strong proponents, is to have the decorators between the argument list and the trailing : in the 'def' line:
def foo(arg1,arg2) @classmethod:
pass
def bar(low,high) @accepts(int,int),@returns(float):
pass
Guido summarized the arguments [13] against this form (many of which also apply to the previous form) as:
- it hides crucial information (e.g. that it is a static method) after the signature, where it is easily missed
- it's easy to miss the transition between a long argument list and a long decorator list
- it's cumbersome to cut and paste a decorator list for reuse, because it starts and ends in the middle of a line
The next form is that the decorator syntax goes inside the method body at the start, in the same place that docstrings currently live:
def foo(arg1,arg2):
    @classmethod
    pass

def bar(low,high):
    @accepts(int,int)
    @returns(float)
    pass
The primary objection to this form is that it requires "peeking inside" the method body to determine the decorators. In addition, even though the code is inside the method body, it is not executed when the method is run. Guido felt that docstrings were not a good counter-example, and that it was quite possible that a 'docstring' decorator could help move the docstring to outside the function body.
The final form is a new block that encloses the method's code. For this example, we'll use a 'decorate' keyword, as it makes no sense with the @syntax.
decorate:
classmethod
def foo(arg1,arg2):
pass
decorate:
accepts(int,int)
returns(float)
def bar(low,high):
pass
This form would result in inconsistent indentation for decorated and undecorated methods. In addition, a decorated method's body would start three indent levels in.
Syntax forms
@decorator:
@classmethod
def foo(arg1,arg2):
    pass

@accepts(int,int)
@returns(float)
def bar(low,high):
    pass

The major objections against this syntax are that the @ symbol is not currently used in Python (and is used in both IPython and Leo), and that the @ symbol is not meaningful. Another objection is that this "wastes" a currently unused character (from a limited set) on something that is not perceived as a major use.
|decorator:
|classmethod
def foo(arg1,arg2):
    pass

|accepts(int,int)
|returns(float)
def bar(low,high):
    pass

This is a variant on the @decorator syntax -- it has the advantage that it does not break IPython and Leo. Its major disadvantage compared to the @syntax is that the | symbol looks like both a capital I and a lowercase l.
list syntax:
[classmethod]
def foo(arg1,arg2):
    pass

[accepts(int,int), returns(float)]
def bar(low,high):
    pass

The major objection to the list syntax is that it's currently meaningful (when used in the form before the method). It's also lacking any indication that the expression is a decorator.
list syntax using other brackets (<...>, [[...]], ...):
<classmethod>
def foo(arg1,arg2):
    pass

<accepts(int,int), returns(float)>
def bar(low,high):
    pass

None of these alternatives gained much traction. The alternatives which involve square brackets only serve to make it obvious that the decorator construct is not a list. They do nothing to make parsing any easier. The '<...>' alternative presents parsing problems because '<' and '>' already parse as un-paired. They present a further parsing ambiguity because a right angle bracket might be a greater than symbol instead of a closer for the decorators.
decorate()
The decorate() proposal was that no new syntax be implemented -- instead a magic function that used introspection to manipulate the following function. Both Jp Calderone and Philip Eby produced implementations of functions that did this. Guido was pretty firmly against this -- with no new syntax, the magicness of a function like this is extremely high:
Using functions with "action-at-a-distance" through sys.settraceback may be okay for an obscure feature that can't be had any other way yet doesn't merit changes to the language, but that's not the situation for decorators. The widely held view here is that decorators need to be added as a syntactic feature to avoid the problems with the postfix notation used in 2.2 and 2.3. Decorators are slated to be an important new language feature and their design needs to be forward-looking, not constrained by what can be implemented in 2.3.
new keyword (and block)
This idea was the consensus alternate from comp.lang.python (more on this in Community Consensus below.) Robert Brewer wrote up a detailed J2 proposal [21] document outlining the arguments in favor of this form. The initial issues with this form are:
- It requires a new keyword, and therefore a from __future__ import decorators statement.
- The choice of keyword is contentious. However, `using' emerged as the consensus choice, and is used in the proposal and implementation.
- The keyword/block form produces something that looks like a normal code block, but isn't. Attempts to use statements in this block will cause a syntax error, which may confuse users.
A few days later, Guido rejected the proposal [22] on two main grounds, firstly:
... the syntactic form of an indented block strongly suggests that its contents should be a sequence of statements, but in fact it is not -- only expressions are allowed, and there is an implicit "collecting" of these expressions going on until they can be applied to the subsequent function definition. ...
and secondly:
... the keyword starting the line that heads a block draws a lot of attention to it. This is true for "if", "while", "for", "try", "def" and "class". But the "using" keyword (or any other keyword in its place) doesn't deserve that attention; the emphasis should be on the decorator or decorators inside the suite, since those are the important modifiers to the function definition that follows. ...
Readers are invited to read the full response [22].
Other forms
There are plenty of other variants and proposals on the wiki page [18].
Why @?
There is some history in Java using @ initially as a marker in Javadoc comments [23] and later in Java 1.5 for annotations [10], which are similar to Python decorators. The fact that @ was previously unused as a token in Python also means it's clear there is no possibility of such code being parsed by an earlier version of Python, leading to possibly subtle semantic bugs. It also means that ambiguity of what is a decorator and what isn't is removed. That said, @ is still a fairly arbitrary choice. Some have suggested using | instead.
For syntax options which use a list-like syntax (no matter where it appears) to specify the decorators a few alternatives were proposed: [|...|], *[...]*, and <...>.
Current Implementation, History
Guido asked for a volunteer to implement his preferred syntax, and Mark Russell stepped up and posted a patch [24] to SF. This new syntax was available in 2.4a2.
@dec2
@dec1
def func(arg1, arg2, ...):
    pass
This is equivalent to:
def func(arg1, arg2, ...):
    pass
func = dec2(dec1(func))
though without the intermediate creation of a variable named func.
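The equivalence is easy to verify with ordinary functions standing in for the decorators (the tuple-tagging wrappers below are illustrative only):

```python
def dec1(f):
    def wrapper(*args, **kwds):
        return ("dec1", f(*args, **kwds))
    return wrapper

def dec2(f):
    def wrapper(*args, **kwds):
        return ("dec2", f(*args, **kwds))
    return wrapper

@dec2
@dec1
def func(arg1, arg2):
    return arg1 + arg2

# The same result, written out by hand in the 2.2/2.3 postfix style:
def plain(arg1, arg2):
    return arg1 + arg2
plain = dec2(dec1(plain))

print(func(1, 2))   # ('dec2', ('dec1', 3))
print(plain(1, 2))  # ('dec2', ('dec1', 3))
```

Note that the decorator listed closest to the def is applied first, so dec2 ends up outermost.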
The version implemented in 2.4a2 allowed multiple @decorator clauses on a single line. In 2.4a3, this was tightened up to only allowing one decorator per line.
A previous patch [25] from Michael Hudson which implements the list-after-def syntax is also still kicking around.
After 2.4a2 was released, in response to community reaction, Guido stated that he'd re-examine a community proposal, if the community could come up with a community consensus, a decent proposal, and an implementation. After an amazing number of posts, collecting a vast number of alternatives in the Python wiki [18], a community consensus emerged (below). Guido subsequently rejected [22] this alternate form, but added:
In Python 2.4a3 (to be released this Thursday), everything remains as currently in CVS. For 2.4b1, I will consider a change of @ to some other single character, even though I think that @ has the advantage of being the same character used by a similar feature in Java. It's been argued that it's not quite the same, since @ in Java is used for attributes that don't change semantics. But Python's dynamic nature makes that its syntactic elements never mean quite the same thing as similar constructs in other languages, and there is definitely significant overlap. Regarding the impact on 3rd party tools: IPython's author doesn't think there's going to be much impact; Leo's author has said that Leo will survive (although it will cause him and his users some transitional pain). I actually expect that picking a character that's already used elsewhere in Python's syntax might be harder for external tools to adapt to, since parsing will have to be more subtle in that case. But I'm frankly undecided, so there's some wiggle room here. I don't want to consider further syntactic alternatives at this point: the buck has to stop at some point, everyone has had their say, and the show must go on.
Community Consensus
This section documents the rejected J2 syntax, and is included for historical completeness.
The consensus that emerged on comp.lang.python was the proposed J2 syntax (the "J2" was how it was referenced on the PythonDecorators wiki page): the new keyword using prefixing a block of decorators before the def statement. For example:
using:
classmethod
synchronized(lock)
def func(cls):
pass
The main arguments for this syntax fall under the "readability counts" doctrine. In brief, they are:
- A suite is better than multiple @lines. The using keyword and block transforms the single-block def statement into a multiple-block compound construct, akin to try/finally and others.
- A keyword is better than punctuation for a new token. A keyword matches the existing use of tokens. No new token category is necessary. A keyword distinguishes Python decorators from Java annotations and .Net attributes, which are significantly different beasts.
Robert Brewer wrote a detailed proposal [21] for this form, and Michael Sparks produced a patch [26].
As noted previously, Guido rejected this form, outlining his problems with it in a message [22] to python-dev and comp.lang.python.
Examples
Much of the discussion on comp.lang.python and the python-dev mailing list focuses on the use of decorators as a cleaner way to use the staticmethod() and classmethod() builtins. This capability is much more powerful than that. This section presents some examples of use.
Define a function to be executed at exit. Note that the function isn't actually "wrapped" in the usual sense.
def onexit(f):
    import atexit
    atexit.register(f)
    return f

@onexit
def func():
    ...

Note that this example is probably not suitable for real usage, but is for example purposes only.
Define a class with a singleton instance. Note that once the class disappears enterprising programmers would have to be more creative to create more instances. (From Shane Hathaway on python-dev.)
def singleton(cls):
    instances = {}
    def getinstance():
        if cls not in instances:
            instances[cls] = cls()
        return instances[cls]
    return getinstance

@singleton
class MyClass:
    ...

Add attributes to a function. (Based on an example posted by Anders Munch on python-dev.)
def attrs(**kwds):
    def decorate(f):
        for k in kwds:
            setattr(f, k, kwds[k])
        return f
    return decorate

@attrs(versionadded="2.2",
       author="Guido van Rossum")
def mymethod(f):
    ...

Enforce function argument and return types. Note that this copies the func_name attribute from the old to the new function. func_name was made writable in Python 2.4a3:
def accepts(*types):
    def check_accepts(f):
        assert len(types) == f.func_code.co_argcount
        def new_f(*args, **kwds):
            for (a, t) in zip(args, types):
                assert isinstance(a, t), \
                       "arg %r does not match %s" % (a, t)
            return f(*args, **kwds)
        new_f.func_name = f.func_name
        return new_f
    return check_accepts

def returns(rtype):
    def check_returns(f):
        def new_f(*args, **kwds):
            result = f(*args, **kwds)
            assert isinstance(result, rtype), \
                   "return value %r does not match %s" % (result, rtype)
            return result
        new_f.func_name = f.func_name
        return new_f
    return check_returns

@accepts(int, (int,float))
@returns((int,float))
def func(arg1, arg2):
    return arg1 * arg2

Declare that a class implements a particular (set of) interface(s). This is from a posting by Bob Ippolito on python-dev based on experience with PyProtocols [27].
def provides(*interfaces):
    """
    An actual, working, implementation of provides for
    the current implementation of PyProtocols.  Not
    particularly important for the PEP text.
    """
    def provides(typ):
        declareImplementation(typ, instancesProvide=interfaces)
        return typ
    return provides

class IBar(Interface):
    """Declare something about IBar here"""

@provides(IBar)
class Foo(object):
    """Implement something here..."""
Of course, all these examples are possible today, though without syntactic support.
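For readers trying the accepts/returns example on current Python: func_name and func_code are spelled __name__ and __code__ in Python 3, and the co_argcount assertion only holds if @accepts is applied innermost, where it still sees the original function. A sketch with those adjustments:

```python
def accepts(*types):
    def check_accepts(f):
        # f must be the original function for co_argcount to match
        assert len(types) == f.__code__.co_argcount
        def new_f(*args, **kwds):
            for (a, t) in zip(args, types):
                assert isinstance(a, t), \
                       "arg %r does not match %s" % (a, t)
            return f(*args, **kwds)
        new_f.__name__ = f.__name__  # func_name in Python 2
        return new_f
    return check_accepts

def returns(rtype):
    def check_returns(f):
        def new_f(*args, **kwds):
            result = f(*args, **kwds)
            assert isinstance(result, rtype), \
                   "return value %r does not match %s" % (result, rtype)
            return result
        new_f.__name__ = f.__name__
        return new_f
    return check_returns

@returns((int, float))
@accepts(int, (int, float))   # innermost, so it wraps func directly
def func(arg1, arg2):
    return arg1 * arg2

print(func(3, 2.5))  # 7.5
```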
(No longer) Open Issues
It's not yet certain that class decorators will be incorporated into the language at a future point. Guido expressed skepticism about the concept, but various people have made some strong arguments [28] (search for PEP 318 -- posting draft) on their behalf in python-dev. It's exceedingly unlikely that class decorators will be in Python 2.4.
PEP 3129 [1] proposes to add class decorators as of Python 2.6.
The choice of the @ character will be re-examined before Python 2.4b1.
In the end, the @ character was kept.
References
| [1] | PEP 3129, "Class Decorators", Winter http://www.python.org/dev/peps/pep-3129 |
| [2] | http://www.python.org/doc/essays/ppt/python10/py10keynote.pdf |
| [3] | http://www.python.org/workshops/2002-02/ |
| [4] | http://mail.python.org/pipermail/python-dev/2002-February/020005.html |
| [5] | http://mail.python.org/pipermail/python-dev/2002-February/020017.html |
| [6] | http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=slrna40k88.2h9o.Gareth.McCaughan%40g.local |
| [7] | http://www.python.org/doc/essays/ppt/euro2004/euro2004.pdf |
| [8] | http://mail.python.org/pipermail/python-dev/2004-August/author.html |
| [9] | http://mail.python.org/pipermail/python-dev/2004-August/046672.html |
| [10] | (1, 2) http://java.sun.com/j2se/1.5.0/docs/guide/language/annotations.html |
| [11] | http://patterndigest.com/patterns/Decorator.html |
| [12] | http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=mailman.1010809396.32158.python-list%40python.org |
| [13] | (1, 2) http://mail.python.org/pipermail/python-dev/2004-August/047112.html |
| [14] | http://www.amk.ca/diary/archives/cat_python.html#003255 |
| [15] | http://mail.python.org/pipermail/python-dev/2004-April/044132.html |
| [16] | (1, 2) http://mail.python.org/pipermail/python-dev/2004-September/048874.html |
| [17] | http://mail.python.org/pipermail/python-dev/2004-August/046711.html |
| [18] | (1, 2, 3) http://www.python.org/moin/PythonDecorators |
| [19] | http://ucsu.colorado.edu/~bethard/py/decorators-output.py |
| [20] | http://mail.python.org/pipermail/python-dev/2004-March/043756.html |
| [21] | (1, 2) http://www.aminus.org/rbre/python/pydec.html |
| [22] | (1, 2, 3, 4) http://mail.python.org/pipermail/python-dev/2004-September/048518.html |
| [23] | http://java.sun.com/j2se/javadoc/writingdoccomments/ |
| [24] | http://www.python.org/sf/979728 |
| [25] | http://starship.python.net/crew/mwh/hacks/meth-syntax-sugar-3.diff |
| [26] | http://www.python.org/sf/1013835 |
| [27] | http://peak.telecommunity.com/PyProtocols.html |
| [28] | http://mail.python.org/pipermail/python-dev/2004-March/thread.html |
Copyright
This document has been placed in the public domain.
pep-0319 Python Synchronize/Asynchronize Block
| PEP: | 319 |
|---|---|
| Title: | Python Synchronize/Asynchronize Block |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Michel Pelletier <michel at users.sourceforge.net> |
| Status: | Rejected |
| Type: | Standards Track |
| Created: | 24-Feb-2003 |
| Python-Version: | 2.4? |
| Post-History: |
Abstract
This PEP proposes adding two new keywords to Python, `synchronize'
and `asynchronize'.
Pronouncement
This PEP is rejected in favor of PEP 343.
The `synchronize' Keyword
The concept of code synchronization in Python is too low-level.
To synchronize code a programmer must be aware of the details of
the following pseudo-code pattern:
initialize_lock()

...

acquire_lock()
try:
    change_shared_data()
finally:
    release_lock()
This synchronized block pattern is not the only pattern (more
discussed below) but it is very common. This PEP proposes
replacing the above code with the following equivalent:
synchronize:
    change_shared_data()
The advantages of this scheme are simpler syntax and less room for
user error. Currently users are required to write code about
acquiring and releasing thread locks in 'try/finally' blocks;
errors in this code can cause notoriously difficult concurrent
thread locking issues.
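The try/finally pattern above is exactly what lock objects in today's threading module automate when used as context managers (this is what PEP 343, which superseded this proposal, standardized); a minimal runnable sketch:

```python
import threading

lock = threading.Lock()
shared = []

def change_shared_data():
    # Acquired on entry, released on exit -- even on error --
    # the behaviour a bare 'synchronize:' block would automate.
    with lock:
        shared.append(len(shared))

threads = [threading.Thread(target=change_shared_data) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared)  # [0, 1, 2, 3]
```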
The `asynchronize' Keyword
While executing a `synchronize' block of code a programmer may
want to "drop back" to running asynchronously momentarily to run
blocking input/output routines or something else that might take an
indeterminate amount of time and does not require synchronization.
This code usually follows the pattern:
initialize_lock()

...

acquire_lock()
try:
    change_shared_data()
    release_lock()    # become async
    do_blocking_io()
    acquire_lock()    # sync again
    change_shared_data2()
finally:
    release_lock()
The asynchronous section of the code is not very obvious visually,
so it is marked up with comments. Using the proposed
'asynchronize' keyword this code becomes much cleaner, easier to
understand, and less prone to error:
synchronize:
    change_shared_data()
    asynchronize:
        do_blocking_io()
    change_shared_data2()
Encountering an `asynchronize' keyword inside a non-synchronized
block can raise either an error or issue a warning (as all code
blocks are implicitly asynchronous anyway). It is important to
note that the above example is *not* the same as:
synchronize:
    change_shared_data()

do_blocking_io()

synchronize:
    change_shared_data2()
The difference matters because both synchronized blocks of code may be running inside the same iteration of a loop. Consider:
while in_main_loop():
    synchronize:
        change_shared_data()
        asynchronize:
            do_blocking_io()
        change_shared_data2()
Many threads may be looping through this code. Without the
'asynchronize' keyword one thread cannot stay in the loop and
release the lock at the same time while blocking IO is going on.
This pattern of releasing locks inside a main loop to do blocking
IO is used extensively inside the CPython interpreter itself.
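Written out with an explicit lock, the loop body reads as follows (a runnable single-threaded sketch; do_blocking_io and the log list are illustrative stand-ins):

```python
import threading
import time

lock = threading.Lock()
log = []

def do_blocking_io():
    time.sleep(0.01)  # stand-in for any blocking call

def worker():
    for _ in range(2):
        lock.acquire()
        try:
            log.append("change_shared_data")
            lock.release()          # become async
            try:
                do_blocking_io()    # other threads may hold the lock now
            finally:
                lock.acquire()      # sync again
            log.append("change_shared_data2")
        finally:
            lock.release()

worker()
print(log)
```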
Synchronization Targets
As proposed the `synchronize' and `asynchronize' keywords
synchronize a block of code. However programmers may want to
specify a target object that threads synchronize on. Any object
can be a synchronization target.
Consider a two-way queue object: two different objects are used by
the same `synchronize' code block to synchronize both queues
separately in the 'get' method:
class TwoWayQueue:

    def __init__(self):
        self.front = []
        self.rear = []

    def putFront(self, item):
        self.put(item, self.front)

    def getFront(self):
        item = self.get(self.front)
        return item

    def putRear(self, item):
        self.put(item, self.rear)

    def getRear(self):
        item = self.get(self.rear)
        return item

    def put(self, item, queue):
        synchronize queue:
            queue.append(item)

    def get(self, queue):
        synchronize queue:
            item = queue[0]
            del queue[0]
            return item
Here is the equivalent code in Python as it is now without a
`synchronize' keyword:
import thread

class LockableQueue:

    def __init__(self):
        self.queue = []
        self.lock = thread.allocate_lock()

class TwoWayQueue:

    def __init__(self):
        self.front = LockableQueue()
        self.rear = LockableQueue()

    def putFront(self, item):
        self.put(item, self.front)

    def getFront(self):
        item = self.get(self.front)
        return item

    def putRear(self, item):
        self.put(item, self.rear)

    def getRear(self):
        item = self.get(self.rear)
        return item

    def put(self, item, queue):
        queue.lock.acquire()
        try:
            queue.queue.append(item)
        finally:
            queue.lock.release()

    def get(self, queue):
        queue.lock.acquire()
        try:
            item = queue.queue[0]
            del queue.queue[0]
            return item
        finally:
            queue.lock.release()
The last example had to define an extra class to associate a lock with the queue, whereas in the first example the `synchronize' keyword does this association internally and transparently.
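With today's threading module the same association is a lock attribute per queue; a runnable sketch of the explicit-lock version (modernized: threading instead of the old thread module, with statements instead of try/finally):

```python
import threading

class LockableQueue:
    def __init__(self):
        self.queue = []
        self.lock = threading.Lock()

class TwoWayQueue:
    def __init__(self):
        self.front = LockableQueue()
        self.rear = LockableQueue()

    def putFront(self, item):
        self._put(item, self.front)

    def getFront(self):
        return self._get(self.front)

    def putRear(self, item):
        self._put(item, self.rear)

    def getRear(self):
        return self._get(self.rear)

    def _put(self, item, q):
        with q.lock:              # one lock per queue
            q.queue.append(item)

    def _get(self, q):
        with q.lock:
            return q.queue.pop(0)

q = TwoWayQueue()
q.putFront("a")
q.putRear("b")
print(q.getFront(), q.getRear())  # a b
```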
Other Patterns that Synchronize
There are some situations where the `synchronize' and
`asynchronize' keywords cannot entirely replace the use of lock
methods like `acquire' and `release'. Some examples are if the
programmer wants to provide arguments for `acquire' or if a lock
is acquired in one code block but released in another, as shown
below.
Here is a class from Zope modified to use both the `synchronize'
and `asynchronize' keywords and also uses a pool of explicit locks
that are acquired and released in different code blocks and thus
don't use `synchronize':
import thread
from ZServerPublisher import ZServerPublisher

class ZRendevous:

    def __init__(self, n=1):
        pool=[]
        self._lists=pool, [], []
        synchronize:
            while n > 0:
                l=thread.allocate_lock()
                l.acquire()
                pool.append(l)
                thread.start_new_thread(ZServerPublisher,
                                        (self.accept,))
                n=n-1

    def accept(self):
        synchronize:
            pool, requests, ready = self._lists
            while not requests:
                l=pool[-1]
                del pool[-1]
                ready.append(l)
                asynchronize:
                    l.acquire()
                pool.append(l)
            r=requests[0]
            del requests[0]
            return r

    def handle(self, name, request, response):
        synchronize:
            pool, requests, ready = self._lists
            requests.append((name, request, response))
            if ready:
                l=ready[-1]
                del ready[-1]
                l.release()
Here is the original class as found in the 'Zope/ZServer/PubCore/ZRendevous.py' module. The "convenience" of the '_a' and '_r' shortcut names obscures the code:
import thread
from ZServerPublisher import ZServerPublisher

class ZRendevous:

    def __init__(self, n=1):
        sync=thread.allocate_lock()
        self._a=sync.acquire
        self._r=sync.release
        pool=[]
        self._lists=pool, [], []
        self._a()
        try:
            while n > 0:
                l=thread.allocate_lock()
                l.acquire()
                pool.append(l)
                thread.start_new_thread(ZServerPublisher,
                                        (self.accept,))
                n=n-1
        finally: self._r()

    def accept(self):
        self._a()
        try:
            pool, requests, ready = self._lists
            while not requests:
                l=pool[-1]
                del pool[-1]
                ready.append(l)
                self._r()
                l.acquire()
                self._a()
                pool.append(l)
            r=requests[0]
            del requests[0]
            return r
        finally: self._r()

    def handle(self, name, request, response):
        self._a()
        try:
            pool, requests, ready = self._lists
            requests.append((name, request, response))
            if ready:
                l=ready[-1]
                del ready[-1]
                l.release()
        finally: self._r()
In particular the asynchronize section of the `accept' method is
not very obvious. To beginner programmers, `synchronize' and
`asynchronize' remove many of the problems encountered when
juggling multiple `acquire' and `release' methods on different
locks in different `try/finally' blocks.
Formal Syntax
Python syntax is defined in a modified BNF grammar notation
described in the Python Language Reference [1]. This section
describes the proposed synchronization syntax using this grammar:
synchronize_stmt:  'synchronize' [test] ':' suite
asynchronize_stmt: 'asynchronize' [test] ':' suite
compound_stmt: ... | synchronize_stmt | asynchronize_stmt
(The '...' indicates other compound statements elided).
Proposed Implementation
The author of this PEP has not explored an implementation yet.
There are several implementation issues that must be resolved.
The main implementation issue is what exactly gets locked and
unlocked during a synchronized block.
During an unqualified synchronized block (the use of the
`synchronize' keyword without a target argument) a lock could be
created and associated with the synchronized code block object.
Any threads that are to execute the block must first acquire the
code block lock.
When an `asynchronize' keyword is encountered in a `synchronize'
block the code block lock is unlocked before the inner block is
executed and re-locked when the inner block terminates.
When a synchronized block target is specified the object is
associated with a lock.  How this is implemented cleanly is
probably the highest risk of this proposal.  Java Virtual Machines
typically associate a special hidden lock object with the target
object and use it to synchronize the block around the target
only.
Backward Compatibility
Backward compatibility is solved with the new `from __future__'
Python syntax [2], and the new warning framework [3] to evolve the
Python language into phasing out any conflicting names that use
the new keywords `synchronize' and `asynchronize'. To use the
syntax now, a developer could use the statement:
from __future__ import threadsync # or whatever
In addition, any code that uses the keyword `synchronize' or
`asynchronize' as an identifier will be issued a warning from
Python. After the appropriate period of time, the syntax would
become standard, the above import statement would do nothing, and
any identifiers named `synchronize' or `asynchronize' would raise
an exception.
PEP 310 Reliable Acquisition/Release Pairs
PEP 310 [4] proposes the 'with' keyword that can serve the same
function as 'synchronize' (but no facility for 'asynchronize').
The pattern:
initialize_lock()

with the_lock:
    change_shared_data()
is equivalent to the proposed:
synchronize the_lock:
    change_shared_data()
PEP 310 must synchronize on an existing lock, while this PEP
proposes that unqualified 'synchronize' statements synchronize on
a global, internal, transparent lock in addition to qualified
'synchronize' statements.  The 'with' statement also requires lock
initialization, while the 'synchronize' statement can synchronize
on any target object *including* locks.
While limited in this fashion, the 'with' statement is more
abstract and serves more purposes than synchronization.  For
example, transactions could be used with the 'with' keyword:
initialize_transaction()

with my_transaction:
    do_in_transaction()

# when the block terminates, the transaction is committed.
The 'synchronize' and 'asynchronize' keywords cannot serve this or
any other general acquire/release pattern other than thread
synchronization.
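A sketch of that generality, using a hypothetical Transaction class as a context manager (commit on clean exit, roll back on exception; the class is illustrative, not a real library API):

```python
class Transaction:
    # Hypothetical illustration of the commit-on-exit pattern.
    def __init__(self):
        self.state = "open"

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Commit only if the block finished without an exception.
        self.state = "committed" if exc_type is None else "rolled back"
        return False  # do not swallow exceptions

my_transaction = Transaction()
with my_transaction:
    pass  # do_in_transaction()
print(my_transaction.state)  # committed
```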
How Java Does It
Java defines a 'synchronized' keyword (note the grammatical tense
difference between the Java keyword and this PEP's 'synchronize')
which must be qualified on any object.  The syntax is:
synchronized (Expression) Block
Expression must yield a valid object (null raises an error, and
exceptions during 'Expression' terminate the 'synchronized' block
for the same reason) upon which 'Block' is synchronized.
How Jython Does It
Jython uses a 'synchronize' class with the static method
'make_synchronized' that accepts one callable argument and returns
a newly created, synchronized, callable "wrapper" around the
argument.
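The equivalent effect can be sketched in plain Python as a wrapper guarding each call with a per-callable lock (an illustration of the idea, not Jython's actual implementation):

```python
import threading

def make_synchronized(func):
    lock = threading.Lock()  # one lock per wrapped callable
    def wrapper(*args, **kwds):
        with lock:
            return func(*args, **kwds)
    return wrapper

counter = {"n": 0}

@make_synchronized
def bump():
    counter["n"] += 1

threads = [threading.Thread(target=bump) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter["n"])  # 8
```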
Summary of Proposed Changes to Python
Adding new `synchronize' and `asynchronize' keywords to the
language.
Risks
This PEP proposes adding two keywords to the Python language. This
may break code.
There is no implementation to test.
It's not the most important problem facing Python programmers
today (although it is a fairly notorious one).
The equivalent Java keyword is the past participle 'synchronized'.
This PEP proposes the present tense, 'synchronize' as being more
in spirit with Python (there being less distinction between
compile-time and run-time in Python than Java).
Dissenting Opinion
This PEP has not been discussed on python-dev.
References
[1] The Python Language Reference
http://docs.python.org/reference/
[2] PEP 236, Back to the __future__, Peters
http://www.python.org/dev/peps/pep-0236/
[3] PEP 230, Warning Framework, van Rossum
http://www.python.org/dev/peps/pep-0230/
[4] PEP 310, Reliable Acquisition/Release Pairs, Hudson, Moore
http://www.python.org/dev/peps/pep-0310/
Copyright
This document has been placed in the public domain.
pep-0320 Python 2.4 Release Schedule
| PEP: | 320 |
|---|---|
| Title: | Python 2.4 Release Schedule |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Barry Warsaw, Raymond Hettinger, Anthony Baxter |
| Status: | Final |
| Type: | Informational |
| Created: | 29-Jul-2003 |
| Python-Version: | 2.4 |
| Post-History: | 1-Dec-2004 |
Abstract
This document describes the development and release schedule for
Python 2.4. The schedule primarily concerns itself with PEP-sized
items. Small features may be added up to and including the first
beta release. Bugs may be fixed until the final release.
There will be at least two alpha releases, two beta releases, and
one release candidate. The release date was 30th November, 2004.
Release Manager
Anthony Baxter
Martin von Lowis is building the Windows installers, Fred the
doc packages, Sean the RPMs.
Release Schedule
July 9: alpha 1 [completed]
August 5/6: alpha 2 [completed]
Sept 3: alpha 3 [completed]
October 15: beta 1 [completed]
November 3: beta 2 [completed]
November 18: release candidate 1 [completed]
November 30: final [completed]
Completed features for 2.4
PEP 218 Builtin Set Objects.
PEP 289 Generator expressions.
PEP 292 Simpler String Substitutions to be implemented as a module.
PEP 318: Function/method decorator syntax, using the @ syntax
PEP 322 Reverse Iteration.
PEP 327: A Decimal package for fixed precision arithmetic.
PEP 328: Multi-line Imports
Encapsulate the decorate-sort-undecorate pattern in a keyword for
list.sort().
Added a builtin called sorted() which may be used in expressions.
The itertools module has two new functions, tee() and groupby().
Add a collections module with a deque() object.
Add two statistical/reduction functions, nlargest() and nsmallest()
to the heapq module.
Python's Windows installer now uses MSI
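Several of the library items above can be exercised directly in any modern Python:

```python
from itertools import groupby
from collections import deque
from heapq import nlargest, nsmallest

# sorted() builtin and the key= keyword for list.sort()
words = ["banana", "apple", "cherry"]
print(sorted(words, key=len))  # ['apple', 'banana', 'cherry']

# itertools.groupby collapses runs of equal elements
data = [1, 1, 2, 2, 2, 3]
print([(k, len(list(g))) for k, g in groupby(data)])  # [(1, 2), (2, 3), (3, 1)]

# collections.deque supports cheap appends at both ends
d = deque([1, 2, 3])
d.appendleft(0)
print(d)  # deque([0, 1, 2, 3])

# heapq.nlargest / nsmallest
print(nlargest(2, data), nsmallest(2, data))  # [3, 2] [1, 1]
```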
Deferred until 2.5:
- Deprecate and/or remove the modules listed in PEP 4 (posixfile,
gopherlib, pre, others)
- Remove support for platforms as described in PEP 11.
- Finish implementing the Distutils bdist_dpkg command. (AMK)
- Add support for reading shadow passwords (www.python.org/sf/579435)
- It would be nice if the built-in SSL socket type could be used
for non-blocking SSL I/O. Currently packages such as Twisted
which implement async servers using SSL have to require third-party
packages such as pyopenssl.
- AST-based compiler: this branch was not completed in time for
2.4, but will land on the trunk some time after 2.4 final is
out, for inclusion in 2.5.
- reST is going to be used a lot in Zope3. Maybe it could become
a standard library module? (Since reST's author thinks it's too
unstable, I'm inclined not to do this.)
Ongoing tasks
The following are ongoing TO-DO items which we should attempt to
work on without hoping for completion by any particular date.
- Documentation: complete the distribution and installation
manuals.
- Documentation: complete the documentation for new-style
classes.
- Look over the Demos/ directory and update where required (Andrew
Kuchling has done a lot of this)
- New tests.
- Fix doc bugs on SF.
- Remove use of deprecated features in the core.
- Document deprecated features appropriately.
- Mark deprecated C APIs with Py_DEPRECATED.
- Deprecate modules which are unmaintained, or perhaps make a new
category for modules 'Unmaintained'
- In general, lots of cleanup so it is easier to move forward.
Open issues
None at this time.
Carryover features from Python 2.3
- The import lock could use some redesign. (SF 683658.)
- A nicer API to open text files, replacing the ugly (in some
people's eyes) "U" mode flag. There's a proposal out there to
have a new built-in type textfile(filename, mode, encoding).
(Shouldn't it have a bufsize argument too?)
- New widgets for Tkinter???
Has anyone gotten the time for this? *Are* there any new
widgets in Tk 8.4? Note that we've got better Tix support
already (though not on Windows yet).
- PEP 304 (Controlling Generation of Bytecode Files by Montanaro)
seems to have lost steam.
- For a class defined inside another class, the __name__ should be
"outer.inner", and pickling should work. (SF 633930. I'm no
longer certain this is easy or even right.)
- Decide on a clearer deprecation policy (especially for modules)
and act on it. For a start, see this message from Neal Norwitz:
http://mail.python.org/pipermail/python-dev/2002-April/023165.html
There seems insufficient interest in moving this further in an
organized fashion, and it's not particularly important.
- Provide alternatives for common uses of the types module;
Skip Montanaro has posted a proto-PEP for this idea:
http://mail.python.org/pipermail/python-dev/2002-May/024346.html
There hasn't been any progress on this, AFAICT.
- Use pending deprecation for the types and string modules. This
requires providing alternatives for the parts that aren't
covered yet (e.g. string.whitespace and types.TracebackType).
It seems we can't get consensus on this.
- PEP 262 Database of Installed Python Packages Kuchling
This turns out to be useful for Jack Jansen's Python installer,
so the database is worth implementing. Code will go in
sandbox/pep262.
- PEP 269 Pgen Module for Python Riehl
(Some necessary changes are in; the pgen module itself needs to
mature more.)
- PEP 266 Optimizing Global Variable/Attribute Access Montanaro
PEP 267 Optimized Access to Module Namespaces Hylton
PEP 280 Optimizing access to globals van Rossum
These are basically three friendly competing proposals. Jeremy
has made a little progress with a new compiler, but it's going
slowly and the compiler is only the first step. Maybe we'll be
able to refactor the compiler in this release. I'm tempted to
say we won't hold our breath.
- Lazily tracking tuples?
http://mail.python.org/pipermail/python-dev/2002-May/023926.html
http://www.python.org/sf/558745
Not much enthusiasm I believe.
- PEP 286 Enhanced Argument Tuples von Loewis
I haven't had the time to review this thoroughly. It seems a
deep optimization hack (also makes better correctness guarantees
though).
- Make 'as' a keyword. It has been a pseudo-keyword long enough.
Too much effort to bother.
Copyright
This document has been placed in the public domain.
pep-0321 Date/Time Parsing and Formatting
| PEP: | 321 |
|---|---|
| Title: | Date/Time Parsing and Formatting |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | A.M. Kuchling <amk at amk.ca> |
| Status: | Withdrawn |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 16-Sep-2003 |
| Python-Version: | 2.4 |
| Post-History: |
Abstract
Python 2.3 added a number of simple date and time types in the datetime module. There's no support for parsing strings in various formats and returning a corresponding instance of one of the types. This PEP proposes adding a family of predefined parsing functions for several commonly used date and time formats, and a facility for generic parsing.
The types provided by the datetime module all have .isoformat() and .ctime() methods that return string representations of a time, and the .strftime() method can be used to construct new formats. There are a number of additional commonly-used formats that would be useful to have as part of the standard library; this PEP also suggests how to add them.
Input Formats
Useful formats to support include:
- ISO8601 [2]
- ARPA/RFC2822 [1]
- ctime [4]
- Formats commonly written by humans such as the American "MM/DD/YYYY", the European "YYYY/MM/DD", and variants such as "DD-Month-YYYY".
- CVS-style or tar-style dates ("tomorrow", "12 hours ago", etc.)
XXX The Perl ParseDate.pm [3] module supports many different input formats, both absolute and relative. Should we try to support them all?
Options:
Add functions to the datetime module:
import datetime
d = datetime.parse_iso8601("2003-09-15T10:34:54")

Add class methods to the various types. There are already various class methods such as .now(), so this would be pretty natural:

import datetime
d = datetime.date.parse_iso8601("2003-09-15T10:34:54")

Add a separate module (possible names: date, date_parse, parse_date) or subpackage (possible names: datetime.parser) containing parsing functions:

import datetime
d = datetime.parser.parse_iso8601("2003-09-15T10:34:54")
Unresolved questions:
- Naming convention to use.
- What exception to raise on errors? ValueError, or a specialized exception?
- Should you know what type you're expecting, or should the parsing figure it out? (e.g. parse_iso8601("yyyy-mm-dd") returns a date instance, but parsing "yyyy-mm-ddThh:mm:ss" returns a datetime.) Should there be an option to signal an error if a time is provided where none is expected, or if no time is provided?
- Anything special required for I18N? For time zones?
Generic Input Parsing
Is a strptime() implementation that returns datetime types sufficient?
XXX if yes, describe strptime here. Can the existing pure-Python implementation be easily retargeted?
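As a point of comparison, the strptime-style approach discussed here did eventually land: datetime.datetime.strptime (added in Python 2.5) parses a string against an explicit format and returns a datetime instance directly. A minimal illustration in modern Python:

```python
from datetime import datetime

# Parse an ISO 8601-style timestamp against an explicit format string;
# strptime returns a datetime instance rather than a time tuple.
d = datetime.strptime("2003-09-15T10:34:54", "%Y-%m-%dT%H:%M:%S")
print(d.year, d.month, d.day)  # 2003 9 15
```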
Output Formats
Not all input formats need to be supported as output formats, because it's pretty trivial to get the strftime() argument right for simple things such as YYYY/MM/DD. Only complicated formats need to be supported; RFC2822 is currently the only one I can think of.
Options:
Provide predefined format strings, so you could write this:
import datetime
d = datetime.datetime(...)
print d.strftime(d.RFC2822_FORMAT) # or datetime.RFC2822_FORMAT?
Provide new methods on all the objects:
d = datetime.datetime(...)
print d.rfc822_time()
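A sketch of the predefined-format idea in modern Python. RFC2822_FORMAT here is a hypothetical constant (it never entered the stdlib); the standard library eventually addressed this use case with email.utils.format_datetime instead:

```python
import datetime
from email.utils import format_datetime

# Hypothetical predefined format string, in the spirit of this proposal.
RFC2822_FORMAT = "%a, %d %b %Y %H:%M:%S %z"

d = datetime.datetime(2003, 9, 15, 10, 34, 54,
                      tzinfo=datetime.timezone.utc)
print(d.strftime(RFC2822_FORMAT))  # day/month names are locale-dependent
print(format_datetime(d))          # the stdlib's eventual answer
```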
Relevant functionality in other languages includes the PHP date [5] function (Python implementation by Simon Willison at http://simon.incutio.com/archive/2003/10/07/dateInPython)
References
Other useful links:
http://www.egenix.com/files/python/mxDateTime.html
http://ringmaster.arc.nasa.gov/tools/time_formats.html
http://www.thinkage.ca/english/gcos/expl/b/lib/0tosec.html
https://moin.conectiva.com.br/DateUtil
| [1] | http://rfc2822.x42.com |
| [2] | http://www.cl.cam.ac.uk/~mgk25/iso-time.html |
| [3] | http://search.cpan.org/author/MUIR/Time-modules-2003.0211/lib/Time/ParseDate.pm |
| [4] | http://www.opengroup.org/onlinepubs/007908799/xsh/asctime.html |
| [5] | http://www.php.net/date |
Copyright
This document has been placed in the public domain.
pep-0322 Reverse Iteration
| PEP: | 322 |
|---|---|
| Title: | Reverse Iteration |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Raymond Hettinger <python at rcn.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 24-Sep-2003 |
| Python-Version: | 2.4 |
| Post-History: | 24-Sep-2003 |
Contents
Abstract
This proposal is to add a builtin function to support reverse iteration over sequences.
Motivation
For indexable objects, current approaches for reverse iteration are error prone, unnatural, and not especially readable:
for i in xrange(n-1, -1, -1):
print seqn[i]
One other current approach involves reversing a list before iterating over it. That technique wastes computer cycles, memory, and lines of code:
rseqn = list(seqn)
rseqn.reverse()
for value in rseqn:
print value
Extended slicing is a third approach that minimizes the code overhead but does nothing for memory efficiency, beauty, or clarity.
Reverse iteration is much less common than forward iteration, but it does arise regularly in practice. See Real World Use Cases below.
Proposal
Add a builtin function called reversed() that makes a reverse iterator over sequence objects that support __getitem__() and __len__().
The above examples then simplify to:
for i in reversed(xrange(n)):
print seqn[i]
for elem in reversed(seqn):
print elem
The core idea is that the clearest, least error-prone way of specifying reverse iteration is to specify it in a forward direction and then say reversed.
The implementation could be as simple as:
def reversed(x):
if hasattr(x, 'keys'):
raise ValueError("mappings do not support reverse iteration")
i = len(x)
while i > 0:
i -= 1
yield x[i]
No language syntax changes are needed. The proposal is fully backwards compatible.
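The generator sketch above behaves like the eventual builtin for simple sequences (a minimal check; the real reversed() returns a dedicated iterator object rather than a generator):

```python
def reversed_sketch(x):
    # Mapping check mirrors the sketch in the PEP.
    if hasattr(x, 'keys'):
        raise ValueError("mappings do not support reverse iteration")
    i = len(x)
    while i > 0:
        i -= 1
        yield x[i]

print(list(reversed_sketch([1, 2, 3])))  # [3, 2, 1]
print(list(reversed('abc')))             # ['c', 'b', 'a']
```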
A C implementation and unit tests are at: http://www.python.org/sf/834422
BDFL Pronouncement
This PEP has been conditionally accepted for Py2.4. The condition means that if the function is found to be useless, it can be removed before Py2.4b1.
Alternative Method Names
- reviter -- Jeremy Fincher's suggestion matches use of iter()
- ireverse -- uses the itertools naming convention
- inreverse -- no one seems to like this one except me
The name reverse is not a candidate because it duplicates the name of list.reverse(), which mutates the underlying list.
Discussion
The case against adoption of the PEP is a desire to keep the number of builtin functions small. This needs to be weighed against the simplicity and convenience of having it as a builtin instead of being tucked away in some other namespace.
Real World Use Cases
Here are some instances of reverse iteration taken from the standard library and comments on why reverse iteration was necessary:
atexit.exit_handlers() uses:
while _exithandlers:
    func, targs, kargs = _exithandlers.pop()
    ...

In this application popping is required, so the new function would not help.
heapq.heapify() uses for i in xrange(n//2 - 1, -1, -1) because higher-level orderings are more easily formed from pairs of lower-level orderings. A forward version of this algorithm is possible; however, that would complicate the rest of the heap code which iterates over the underlying list in the opposite direction. The replacement code for i in reversed(xrange(n//2)) makes clear the range covered and how many iterations it takes.
mhlib.test() uses:
testfolders.reverse()
for t in testfolders:
    do('mh.deletefolder(%s)' % `t`)

The need for reverse iteration arises because the tail of the underlying list is altered during iteration.
platform._dist_try_harder() uses for n in range(len(verfiles)-1,-1,-1) because the loop deletes selected elements from verfiles but needs to leave the rest of the list intact for further iteration.
random.shuffle() uses for i in xrange(len(x)-1, 0, -1) because the algorithm is most easily understood as randomly selecting elements from an ever diminishing pool. In fact, the algorithm can be run in a forward direction but is less intuitive and rarely presented that way in literature. The replacement code for i in reversed(xrange(1, len(x))) is much easier to verify visually.
rfc822.Message.__delitem__() uses:
list.reverse()
for i in list:
    del self.headers[i]

The need for reverse iteration arises because the tail of the underlying list is altered during iteration.
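The shuffle-style rewrite cited above can be checked mechanically: both spellings visit exactly the same indices, in the same order (a small verification sketch using Python 3's range in place of xrange):

```python
x = list('abcdefgh')

# The two loop headers compared in the random.shuffle() use case.
old_order = list(range(len(x) - 1, 0, -1))    # xrange(len(x)-1, 0, -1)
new_order = list(reversed(range(1, len(x))))  # reversed(xrange(1, len(x)))

print(old_order == new_order)  # True
print(new_order)               # [7, 6, 5, 4, 3, 2, 1]
```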
Rejected Alternatives
Several variants were submitted that attempted to apply reversed() to all iterables by running the iterable to completion, saving the results, and then returning a reverse iterator over the results. While satisfying some notions of full generality, running the input to the end is contrary to the purpose of using iterators in the first place. Also, a small disaster ensues if the underlying iterator is infinite.
Putting the function in another module or attaching it to a type object is not being considered. Like its cousins, zip() and enumerate(), the function needs to be directly accessible in daily programming. Each solves a basic looping problem: lock-step iteration, loop counting, and reverse iteration. Requiring some form of dotted access would interfere with their simplicity, daily utility, and accessibility. They are core looping constructs, independent of any one application domain.
Copyright
This document has been placed in the public domain.
pep-0323 Copyable Iterators
| PEP: | 323 |
|---|---|
| Title: | Copyable Iterators |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Alex Martelli <aleaxit at gmail.com> |
| Status: | Deferred |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 25-Oct-2003 |
| Python-Version: | 2.5 |
| Post-History: | 29-Oct-2003 |
Deferral
This PEP has been deferred. Copyable iterators are a nice idea, but after four years, no implementation or widespread interest has emerged.
Abstract
This PEP suggests that some iterator types should support shallow
copies of their instances by exposing a __copy__ method which meets
some specific requirements, and indicates how code using an iterator
might exploit such a __copy__ method when present.
Update and Comments
Support for __copy__ was included in Py2.4's itertools.tee().
Adding __copy__ methods to existing iterators will change the
behavior under tee(). Currently, the copied iterators remain
tied to the original iterator. If the original advances, then
so do all of the copies. Good practice is to overwrite the
original so that anomalies don't result: a,b=tee(a).
Code that doesn't follow that practice may observe a semantic
change if a __copy__ method is added to an iterator.
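The rebinding idiom described in the comment above, in runnable form (Python 3 spelling):

```python
from itertools import tee

a = iter(range(5))
a, b = tee(a)  # overwrite the original so anomalies don't result

# The teed iterators advance independently.
print(next(a), next(a))  # 0 1 -- advancing a ...
print(next(b))           # 0   -- ... does not advance b
```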
Motivation
In Python up to 2.3, most built-in iterator types don't let the user
copy their instances. User-coded iterators that do let their clients
call copy.copy on their instances may, or may not, happen to return,
as a result of the copy, a separate iterator object that may be
iterated upon independently from the original.
Currently, "support" for copy.copy in a user-coded iterator type is
almost invariably "accidental" -- i.e., the standard machinery of the
copy method in Python's standard library's copy module does build and
return a copy. However, the copy will be independently iterable with
respect to the original only if calling .next() on an instance of that
class happens to change instance state solely by rebinding some
attributes to new values, and not by mutating some attributes'
existing values.
For example, an iterator whose "index" state is held as an integer
attribute will probably give usable copies, since (integers being
immutable) .next() presumably just rebinds that attribute. On the
other hand, another iterator whose "index" state is held as a list
attribute will probably mutate the same list object when .next()
executes, and therefore copies of such an iterator will not be
iterable separately and independently from the original.
Given this existing situation, copy.copy(it) on some iterator object
isn't very useful, nor, therefore, is it at all widely used. However,
there are many cases in which being able to get a "snapshot" of an
iterator, as a "bookmark", so as to be able to keep iterating along
the sequence but later iterate again on the same sequence from the
bookmark onwards, is useful. To support such "bookmarking", module
itertools, in 2.4, has grown a 'tee' function, to be used as:
it, bookmark = itertools.tee(it)
The previous value of 'it' must not be used again, which is why this
typical usage idiom rebinds the name. After this call, 'it' and
'bookmark' are independently-iterable iterators on the same underlying
sequence as the original value of 'it': this satisfies application
needs for "iterator copying".
However, when itertools.tee can make no hypotheses about the nature of
the iterator it is passed as an argument, it must save in memory all
items through which one of the two 'teed' iterators, but not yet both,
have stepped. This can be quite costly in terms of memory, if the two
iterators get very far from each other in their stepping; indeed, in
some cases it may be preferable to make a list from the iterator so as
to be able to step repeatedly through the subsequence, or, if that is
too costly in terms of memory, save items to disk, again in order to be
able to iterate through them repeatedly.
This PEP proposes another idea that will, in some important cases,
allow itertools.tee to do its job with minimal cost in terms of
memory; user code may also occasionally be able to exploit the idea in
order to decide whether to copy an iterator, make a list from it, or
use an auxiliary disk file.
The key consideration is that some important iterators, such as those
which built-in function iter builds over sequences, would be
intrinsically easy to copy: just get another reference to the same
sequence, and a copy of the integer index. However, in Python 2.3,
those iterators don't expose the state, and don't support copy.copy.
The purpose of this PEP, therefore, is to have those iterator types
expose a suitable __copy__ method. Similarly, user-coded iterator
types that can provide copies of their instances, suitable for
separate and independent iteration, with limited costs in time and
space, should also expose a suitable __copy__ method. While
copy.copy also supports other ways to let a type control the way
its instances are copied, it is suggested, for simplicity, that
iterator types that support copying always do so by exposing a
__copy__ method, and not in the other ways copy.copy supports.
Having iterators expose a suitable __copy__ when feasible will afford
easy optimization of itertools.tee and similar user code, as in:
def tee(it):
it = iter(it)
try: copier = it.__copy__
except AttributeError:
# non-copyable iterator, do all the needed hard work
# [snipped!]
else:
return it, copier()
Note that this function does NOT call "copy.copy(it)", which (even
after this PEP is implemented) might well still "just happen to
succeed" for some iterator type that is implemented as a user-coded
class, without really supplying an adequate "independently iterable"
copy object as its result.
Specification
Any iterator type X may expose a method __copy__ that is callable
without arguments on any instance x of X. The method should be
exposed if and only if the iterator type can provide copyability with
reasonably little computational and memory effort. Furthermore, the
new object y returned by method __copy__ should be a new instance
of X that is iterable independently and separately from x, stepping
along the same "underlying sequence" of items.
For example, suppose a class Iter essentially duplicated the
functionality of the iter builtin for iterating on a sequence:
class Iter(object):
def __init__(self, sequence):
self.sequence = sequence
self.index = 0
def __iter__(self):
return self
def next(self):
try: result = self.sequence[self.index]
except IndexError: raise StopIteration
self.index += 1
return result
To make this Iter class compliant with this PEP, the following
addition to the body of class Iter would suffice:
def __copy__(self):
result = self.__class__(self.sequence)
result.index = self.index
return result
Note that __copy__, in this case, does not even try to copy the
sequence; if the sequence is altered while either or both of the
original and copied iterators are still stepping on it, the iteration
behavior is quite likely to go awry anyway -- it is not __copy__'s
responsibility to change this normal Python behavior for iterators
which iterate on mutable sequences (that might, perhaps, be the
specification for a __deepcopy__ method of iterators, which, however,
this PEP does not deal with).
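Putting the Iter class and its proposed __copy__ addition together (ported to Python 3 spelling, where the next() method became __next__()), a copy really does iterate independently of the original:

```python
import copy

class Iter:
    def __init__(self, sequence):
        self.sequence = sequence
        self.index = 0
    def __iter__(self):
        return self
    def __next__(self):
        try:
            result = self.sequence[self.index]
        except IndexError:
            raise StopIteration
        self.index += 1
        return result
    def __copy__(self):
        # Share the sequence, copy only the integer index state.
        result = self.__class__(self.sequence)
        result.index = self.index
        return result

it = Iter('abcd')
next(it)               # step past 'a'
snap = copy.copy(it)   # dispatches to __copy__
rest  = list(it)
again = list(snap)     # independent of the now-exhausted original
print(rest, again)     # ['b', 'c', 'd'] ['b', 'c', 'd']
```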
Consider also a "random iterator", which provides a nonterminating
sequence of results from some method of a random instance, called
with given arguments:
class RandomIterator(object):
def __init__(self, bound_method, *args):
self.call = bound_method
self.args = args
def __iter__(self):
return self
def next(self):
return self.call(*self.args)
def __copy__(self):
import copy, new
im_self = copy.copy(self.call.im_self)
method = new.instancemethod(self.call.im_func, im_self)
return self.__class__(method, *self.args)
This iterator type is slightly more general than its name implies, as
it supports calls to any bound method (or other callable, but if the
callable is not a bound method, then method __copy__ will fail). But
the use case is for the purpose of generating random streams, as in:
import random
def show5(it):
for i, result in enumerate(it):
print '%6.3f'%result,
if i==4: break
print
normit = RandomIterator(random.Random().gauss, 0, 1)
show5(normit)
copit = normit.__copy__()
show5(normit)
show5(copit)
which will display some output such as:
-0.536 1.936 -1.182 -1.690 -1.184
0.666 -0.701 1.214 0.348 1.373
0.666 -0.701 1.214 0.348 1.373
the key point being that the second and third lines are equal, because
the normit and copit iterators will step along the same "underlying
sequence". (As an aside, note that to get a copy of self.call.im_self
we must use copy.copy, NOT try getting at a __copy__ method directly,
because for example instances of random.Random support copying via
__getstate__ and __setstate__, NOT via __copy__; indeed, using
copy.copy is the normal way to get a shallow copy of any object --
copyable iterators are different because of the already-mentioned
uncertainty about the result of copy.copy supporting these "copyable
iterator" specs).
Details
Besides adding to the Python docs a recommendation that user-coded
iterator types support a __copy__ method (if and only if it can be
implemented with small costs in memory and runtime, and produce an
independently-iterable copy of an iterator object), this PEP's
implementation will specifically include the addition of copyability
to the iterators over sequences that built-in iter returns, and also
to the iterators over a dictionary returned by the methods __iter__,
iterkeys, itervalues, and iteritems of built-in type dict.
Iterators produced by generator functions will not be copyable.
However, iterators produced by the new "generator expressions" of
Python 2.4 (PEP 289 [3]) should be copyable if their underlying
iterator[s] are; the strict limitations on what is possible in a
generator expression, compared to the much vaster generality of a
generator, should make that feasible. Similarly, the iterators
produced by the built-in function enumerate, and certain functions
supplied by module itertools, should be copyable if the underlying
iterators are.
The implementation of this PEP will also include the optimization of
the new itertools.tee function mentioned in the Motivation section.
Rationale
The main use case for (shallow) copying of an iterator is the same as
for the function itertools.tee (new in 2.4). User code will not
directly attempt to copy an iterator, because it would have to deal
separately with uncopyable cases; calling itertools.tee will
internally perform the copy when appropriate, and implicitly fallback
to a maximally efficient non-copying strategy for iterators that are
not copyable. (Occasionally, user code may want more direct control,
specifically in order to deal with non-copyable iterators by other
strategies, such as making a list or saving the sequence to disk).
A tee'd iterator may serve as a "reference point", allowing processing
of a sequence to continue or resume from a known point, while the
other independent iterator can be freely advanced to "explore" a
further part of the sequence as needed. A simple example: a generator
function which, given an iterator of numbers (assumed to be positive),
returns a corresponding iterator, each of whose items is the fraction
of the total corresponding to each corresponding item of the input
iterator. The caller may pass the total as a value, if known in
advance; otherwise, the iterator returned by calling this generator
function will first compute the total.
def fractions(numbers, total=None):
if total is None:
numbers, aux = itertools.tee(numbers)
total = sum(aux)
total = float(total)
for item in numbers:
yield item / total
The ability to tee the numbers iterator allows this generator to
precompute the total, if needed, without necessarily requiring
O(N) auxiliary memory if the numbers iterator is copyable.
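The fractions generator above runs as-is in Python 3 (where / is true division, so the float() coercion is belt-and-braces):

```python
import itertools

def fractions(numbers, total=None):
    if total is None:
        # tee lets us pre-compute the total without losing the stream.
        numbers, aux = itertools.tee(numbers)
        total = sum(aux)
    total = float(total)
    for item in numbers:
        yield item / total

result = list(fractions(iter([1, 2, 3, 4])))
print(result)  # [0.1, 0.2, 0.3, 0.4]
```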
As another example of "iterator bookmarking", consider a stream of
numbers with an occasional string as a "postfix operator" now and
then. By far most frequent such operator is a '+', whereupon we must
sum all previous numbers (since the last previous operator if any, or
else since the start) and yield the result. Sometimes we find a '*'
instead, which is the same except that the previous numbers must
instead be multiplied, not summed.
def filter_weird_stream(stream):
it = iter(stream)
while True:
it, bookmark = itertools.tee(it)
total = 0
for item in it:
if item=='+':
yield total
break
elif item=='*':
product = 1
for item in bookmark:
if item=='*':
yield product
break
else:
product *= item
else:
total += item
Similar use cases of itertools.tee can support such tasks as
"undo" on a stream of commands represented by an iterator,
"backtracking" on the parse of a stream of tokens, and so on.
(Of course, in each case, one should also consider simpler
possibilities such as saving relevant portions of the sequence
into lists while stepping on the sequence with just one iterator,
depending on the details of one's task).
Here is an example, in pure Python, of how the 'enumerate'
built-in could be extended to support __copy__ if its underlying
iterator also supported __copy__:
class enumerate(object):
def __init__(self, it):
self.it = iter(it)
self.i = -1
def __iter__(self):
return self
def next(self):
self.i += 1
return self.i, self.it.next()
def __copy__(self):
result = self.__class__.__new__(self.__class__)
result.it = self.it.__copy__()
result.i = self.i
return result
Here is an example of the kind of "fragility" produced by "accidental
copyability" of an iterator -- the reason why one must NOT use
copy.copy expecting, if it succeeds, to receive as a result an
iterator which is iterable-on independently from the original. Here
is an iterator class that iterates (in preorder) on "trees" which, for
simplicity, are just nested lists -- any item that's a list is treated
as a subtree, any other item as a leaf.
class ListreeIter(object):
def __init__(self, tree):
self.tree = [tree]
self.indx = [-1]
def __iter__(self):
return self
def next(self):
if not self.indx:
raise StopIteration
self.indx[-1] += 1
try:
result = self.tree[-1][self.indx[-1]]
except IndexError:
self.tree.pop()
self.indx.pop()
return self.next()
if type(result) is not list:
return result
self.tree.append(result)
self.indx.append(-1)
return self.next()
Now, for example, the following code:
import copy
x = [ [1,2,3], [4, 5, [6, 7, 8], 9], 10, 11, [12] ]
print 'showing all items:',
it = ListreeIter(x)
for i in it:
print i,
if i==6: cop = copy.copy(it)
print
print 'showing items >6 again:'
for i in cop: print i,
print
does NOT work as intended -- the "cop" iterator gets consumed, and
exhausted, step by step as the original "it" iterator is, because
the accidental (rather than deliberate) copying performed by
copy.copy shares, rather than duplicates, the "index" list, which
is the mutable attribute it.indx (a list of numerical indices).
Thus, this "client code" of the iterator, which attempts to iterate
twice over a portion of the sequence via a copy.copy on the
iterator, is NOT correct.
Some correct solutions include using itertools.tee, i.e., changing
the first for loop into:
for i in it:
print i,
if i==6:
it, cop = itertools.tee(it)
break
for i in it: print i,
(note that we MUST break the loop in two, otherwise we'd still
be looping on the ORIGINAL value of it, which must NOT be used
further after the call to tee!!!); or making a list, i.e.:
for i in it:
print i,
if i==6:
cop = lit = list(it)
break
for i in lit: print i,
(again, the loop must be broken in two, since iterator 'it'
gets exhausted by the call list(it)).
Finally, all of these solutions would work if ListreeIter supplied
a suitable __copy__ method, as this PEP recommends:
def __copy__(self):
result = self.__class__.__new__(self.__class__)
result.tree = copy.copy(self.tree)
result.indx = copy.copy(self.indx)
return result
There is no need to get any "deeper" in the copy, but the two
mutable "index state" attributes must indeed be copied in order
to achieve a "proper" (independently iterable) iterator-copy.
The recommended solution is to have class ListreeIter supply this
__copy__ method AND have client code use itertools.tee (with
the split-in-two-parts loop as shown above). This will make
client code maximally tolerant of different iterator types it
might be using AND achieve good performance for tee'ing of this
specific iterator type at the same time.
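Putting ListreeIter together with the recommended __copy__ method (ported to Python 3, with __new__ given its required class argument), the failing copy.copy example above now behaves as intended:

```python
import copy

class ListreeIter:
    def __init__(self, tree):
        self.tree = [tree]
        self.indx = [-1]
    def __iter__(self):
        return self
    def __next__(self):
        if not self.indx:
            raise StopIteration
        self.indx[-1] += 1
        try:
            result = self.tree[-1][self.indx[-1]]
        except IndexError:
            self.tree.pop()
            self.indx.pop()
            return self.__next__()
        if type(result) is not list:
            return result
        self.tree.append(result)
        self.indx.append(-1)
        return self.__next__()
    def __copy__(self):
        # Duplicate both mutable "index state" lists; sharing them is
        # exactly the accidental-copy bug shown earlier.
        result = self.__class__.__new__(self.__class__)
        result.tree = copy.copy(self.tree)
        result.indx = copy.copy(self.indx)
        return result

x = [[1, 2, 3], [4, 5, [6, 7, 8], 9], 10, 11, [12]]
it = ListreeIter(x)
seen, cop = [], None
for i in it:
    seen.append(i)
    if i == 6:
        cop = copy.copy(it)
cop_items = list(cop)
print(seen)       # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
print(cop_items)  # [7, 8, 9, 10, 11, 12] -- the deliberate copy works
```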
References
[1] Discussion on python-dev starting at post:
http://mail.python.org/pipermail/python-dev/2003-October/038969.html
[2] Online documentation for the copy module of the standard library:
http://docs.python.org/library/copy.html
[3] PEP 289, Generator Expressions, Hettinger
http://www.python.org/dev/peps/pep-0289/
Copyright
This document has been placed in the public domain.
pep-0324 subprocess - New process module
| PEP: | 324 |
|---|---|
| Title: | subprocess - New process module |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Peter Astrand <astrand at lysator.liu.se> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 19-Nov-2003 |
| Python-Version: | 2.4 |
| Post-History: |
Abstract
This PEP describes a new module for starting and communicating
with processes.
Motivation
Starting new processes is a common task in any programming
language, and very common in a high-level language like Python.
Good support for this task is needed, because:
- Inappropriate functions for starting processes could mean a
security risk: If the program is started through the shell, and
the arguments contain shell meta characters, the result can be
disastrous. [1]
- It makes Python an even better replacement language for
over-complicated shell scripts.
Currently, Python has a large number of different functions for
process creation. This makes it hard for developers to choose.
The subprocess module provides the following enhancements over
previous functions:
- One "unified" module provides all functionality from previous
functions.
- Cross-process exceptions: Exceptions happening in the child
before the new process has started to execute are re-raised in
the parent. This means that it's easy to handle exec()
failures, for example. With popen2, for example, it's
impossible to detect if the execution failed.
- A hook for executing custom code between fork and exec. This
can be used for, for example, changing uid.
- No implicit call of /bin/sh. This means that there is no need
for escaping dangerous shell meta characters.
- All combinations of file descriptor redirection are possible.
For example, the "python-dialog" [2] package needs to spawn a process
and redirect stderr, but not stdout. This is not possible with
current functions, without using temporary files.
- With the subprocess module, it's possible to control if all open
file descriptors should be closed before the new program is
executed.
- Support for connecting several subprocesses (shell "pipe").
- Universal newline support.
- A communicate() method, which makes it easy to send stdin data
and read stdout and stderr data, without risking deadlocks.
Most people are aware of the flow control issues involved with
child process communication, but not all have the patience or
skills to write a fully correct and deadlock-free select loop.
This means that many Python applications contain race
conditions. A communicate() method in the standard library
solves this problem.
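A sketch of what communicate() buys you, using the module as it later shipped in Python 3 (sys.executable keeps the child process portable; this is an illustration, not code from the PEP):

```python
import subprocess
import sys

# Run a child Python, feed it stdin, and collect both output streams
# without a hand-written select loop: communicate() handles the flow
# control and avoids the classic pipe deadlock.
proc = subprocess.Popen(
    [sys.executable, "-c",
     "import sys; sys.stdout.write(sys.stdin.read().upper())"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
)
out, err = proc.communicate(input=b"deadlock-free\n")
print(out)              # b'DEADLOCK-FREE\n'
print(proc.returncode)  # 0
```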
Rationale
The following points summarize the design:
- subprocess was based on popen2, which is tried-and-tested.
- The factory functions in popen2 have been removed, because I
consider the class constructor equally easy to work with.
- popen2 contains several factory functions and classes for
different combinations of redirection. subprocess, however,
contains one single class. Since the subprocess module supports
12 different combinations of redirection, providing a class or
function for each of them would be cumbersome and not very
intuitive. Even with popen2, this is a readability problem.
For example, many people cannot tell the difference between
popen2.popen2 and popen2.popen4 without using the documentation.
- One small utility function is provided: subprocess.call(). It
aims to be an enhancement over os.system(), while still very
easy to use:
- It does not use the Standard C function system(), which has
limitations.
- It does not call the shell implicitly.
- No need for quoting; using an argument list.
- The return value is easier to work with.
The call() utility function accepts an 'args' argument, just
like the Popen class constructor. It waits for the command to
complete, then returns the returncode attribute. The
implementation is very simple:
def call(*args, **kwargs):
return Popen(*args, **kwargs).wait()
The motivation behind the call() function is simple: starting a
process and waiting for it to finish is a common task.
While Popen supports a wide range of options, many users have
simple needs. Many people are using os.system() today, mainly
because it provides a simple interface. Consider this example:
os.system("stty sane -F " + device)
With subprocess.call(), this would look like:
subprocess.call(["stty", "sane", "-F", device])
or, if executing through the shell:
subprocess.call("stty sane -F " + device, shell=True)
- The "preexec" functionality makes it possible to run arbitrary
code between fork and exec. One might ask why there are special
arguments for setting the environment and current directory, but
not for, for example, setting the uid. The answer is:
- Changing environment and working directory is considered
fairly common.
- Old functions like spawn() have support for an "env" argument.
- env and cwd are considered quite cross-platform: They make
sense even on Windows.
- On POSIX platforms, no extension module is required: the module
uses os.fork(), os.execvp() etc.
- On Windows platforms, the module requires either Mark Hammond's
Windows extensions[5], or a small extension module called
_subprocess.
Specification
This module defines one class called Popen:
class Popen(args, bufsize=0, executable=None,
stdin=None, stdout=None, stderr=None,
preexec_fn=None, close_fds=False, shell=False,
cwd=None, env=None, universal_newlines=False,
startupinfo=None, creationflags=0):
Arguments are:
- args should be a string, or a sequence of program arguments.
The program to execute is normally the first item in the args
sequence or string, but can be explicitly set by using the
executable argument.
On UNIX, with shell=False (default): In this case, the Popen
class uses os.execvp() to execute the child program. args
should normally be a sequence. A string will be treated as a
sequence with the string as the only item (the program to
execute).
On UNIX, with shell=True: If args is a string, it specifies the
command string to execute through the shell. If args is a
sequence, the first item specifies the command string, and any
additional items will be treated as additional shell arguments.
On Windows: the Popen class uses CreateProcess() to execute the
child program, which operates on strings. If args is a
sequence, it will be converted to a string using the
list2cmdline method. Please note that not all MS Windows
applications interpret the command line the same way: The
list2cmdline is designed for applications using the same rules
as the MS C runtime.
- bufsize, if given, has the same meaning as the corresponding
argument to the built-in open() function: 0 means unbuffered, 1
means line buffered, any other positive value means use a buffer
of (approximately) that size. A negative bufsize means to use
the system default, which usually means fully buffered. The
default value for bufsize is 0 (unbuffered).
- stdin, stdout and stderr specify the executed program's standard
input, standard output and standard error file handles,
respectively. Valid values are PIPE, an existing file
descriptor (a positive integer), an existing file object, and
None. PIPE indicates that a new pipe to the child should be
created. With None, no redirection will occur; the child's file
handles will be inherited from the parent. Additionally, stderr
can be STDOUT, which indicates that the stderr data from the
applications should be captured into the same file handle as for
stdout.
- If preexec_fn is set to a callable object, this object will be
called in the child process just before the child is executed.
- If close_fds is true, all file descriptors except 0, 1 and 2
will be closed before the child process is executed.
- If shell is true, the specified command will be executed through
the shell.
- If cwd is not None, the current directory will be changed to cwd
before the child is executed.
- If env is not None, it defines the environment variables for the
new process.
- If universal_newlines is true, the file objects stdout and
stderr are opened as text files, but lines may be terminated
by any of '\n', the Unix end-of-line convention, '\r', the
Macintosh convention or '\r\n', the Windows convention. All of
these external representations are seen as '\n' by the Python
program. Note: This feature is only available if Python is
built with universal newline support (the default). Also, the
newlines attribute of the file objects stdout, stdin and stderr
is not updated by the communicate() method.
- The startupinfo and creationflags, if given, will be passed to
the underlying CreateProcess() function. They can specify
things such as appearance of the main window and priority for
the new process. (Windows only)
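To make the argument descriptions above concrete, here is a minimal sketch that starts a child with stdout redirected to a new pipe and reads what it wrote; sys.executable stands in for a real program name for portability, and the snippet is written so it also runs on current Pythons:

```python
import sys
from subprocess import Popen, PIPE

# stdout=PIPE creates a new pipe to the child; the parent can then
# read everything the child wrote from p.stdout.
p = Popen([sys.executable, "-c", "print('hello from child')"],
          stdout=PIPE)
output = p.stdout.read()   # bytes written by the child
p.wait()                   # reap the child; sets p.returncode
```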
This module also defines two shortcut functions:
- call(*args, **kwargs):
Run command with arguments. Wait for command to complete,
then return the returncode attribute.
The arguments are the same as for the Popen constructor.
Example:
retcode = call(["ls", "-l"])
Exceptions
----------
Exceptions raised in the child process, before the new program has
started to execute, will be re-raised in the parent.
Additionally, the exception object will have one extra attribute
called 'child_traceback', which is a string containing traceback
information from the child's point of view.
The most common exception raised is OSError. This occurs, for
example, when trying to execute a non-existent file. Applications
should prepare for OSErrors.
A ValueError will be raised if Popen is called with invalid
arguments.
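A short sketch of the behavior described above; the program name below is made up and assumed not to exist, so the execution failure in the child is re-raised in the parent:

```python
from subprocess import Popen

# Executing a non-existent program raises OSError in the parent.
try:
    Popen(["definitely-not-a-real-program-xyzzy"])
    raised = False
except OSError as e:
    raised = True    # on POSIX typically "No such file or directory"
```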
Security
--------
Unlike some other popen functions, this implementation will never
call /bin/sh implicitly. This means that all characters,
including shell meta-characters, can safely be passed to child
processes.
Popen objects
-------------
Instances of the Popen class have the following methods:
poll()
Check if child process has terminated. Returns returncode
attribute.
wait()
Wait for child process to terminate. Returns returncode
attribute.
communicate(input=None)
Interact with process: Send data to stdin. Read data from
stdout and stderr, until end-of-file is reached. Wait for
process to terminate. The optional input argument should be a
string to be sent to the child process, or None, if no data
should be sent to the child.
communicate() returns a tuple (stdout, stderr).
Note: The data read is buffered in memory, so do not use this
method if the data size is large or unlimited.
The following attributes are also available:
stdin
If the stdin argument is PIPE, this attribute is a file object
that provides input to the child process. Otherwise, it is
None.
stdout
If the stdout argument is PIPE, this attribute is a file
object that provides output from the child process.
Otherwise, it is None.
stderr
If the stderr argument is PIPE, this attribute is a file object
that provides error output from the child process. Otherwise,
it is None.
pid
The process ID of the child process.
returncode
The child return code. A None value indicates that the
process hasn't terminated yet. A negative value -N indicates
that the child was terminated by signal N (UNIX only).
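The attributes and communicate() can be exercised together; the following sketch (portable, using sys.executable) pipes data through a child that upper-cases its input:

```python
import sys
from subprocess import Popen, PIPE

# stdin=PIPE and stdout=PIPE make p.stdin and p.stdout file objects;
# communicate() sends the input, reads all output, waits for the
# child, and populates returncode.
child_code = "import sys; sys.stdout.write(sys.stdin.read().upper())"
p = Popen([sys.executable, "-c", child_code], stdin=PIPE, stdout=PIPE)
stdout, stderr = p.communicate(b"quiet please")
```

Since stderr was not redirected, the second element of the returned tuple is None.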
Replacing older functions with the subprocess module
In this section, "a ==> b" means that b can be used as a
replacement for a.
Note: All functions in this section fail (more or less) silently
if the executed program cannot be found; this module raises an
OSError exception instead.
In the following examples, we assume that the subprocess module is
imported with "from subprocess import *".
Replacing /bin/sh shell backquote
---------------------------------
output=`mycmd myarg`
==>
output = Popen(["mycmd", "myarg"], stdout=PIPE).communicate()[0]
Replacing shell pipe line
-------------------------
output=`dmesg | grep hda`
==>
p1 = Popen(["dmesg"], stdout=PIPE)
p2 = Popen(["grep", "hda"], stdin=p1.stdout, stdout=PIPE)
output = p2.communicate()[0]
Replacing os.system()
---------------------
sts = os.system("mycmd" + " myarg")
==>
p = Popen("mycmd" + " myarg", shell=True)
sts = os.waitpid(p.pid, 0)
Note:
* Calling the program through the shell is usually not required.
* It's easier to look at the returncode attribute than the
exit status.
A more real-world example would look like this:
try:
retcode = call("mycmd" + " myarg", shell=True)
if retcode < 0:
print >>sys.stderr, "Child was terminated by signal", -retcode
else:
print >>sys.stderr, "Child returned", retcode
except OSError, e:
print >>sys.stderr, "Execution failed:", e
Replacing os.spawn*
-------------------
P_NOWAIT example:
pid = os.spawnlp(os.P_NOWAIT, "/bin/mycmd", "mycmd", "myarg")
==>
pid = Popen(["/bin/mycmd", "myarg"]).pid
P_WAIT example:
retcode = os.spawnlp(os.P_WAIT, "/bin/mycmd", "mycmd", "myarg")
==>
retcode = call(["/bin/mycmd", "myarg"])
Vector example:
os.spawnvp(os.P_NOWAIT, path, args)
==>
Popen([path] + args[1:])
Environment example:
os.spawnlpe(os.P_NOWAIT, "/bin/mycmd", "mycmd", "myarg", env)
==>
Popen(["/bin/mycmd", "myarg"], env={"PATH": "/usr/bin"})
Replacing os.popen*
-------------------
pipe = os.popen(cmd, 'r', bufsize)
==>
pipe = Popen(cmd, shell=True, bufsize=bufsize, stdout=PIPE).stdout
pipe = os.popen(cmd, 'w', bufsize)
==>
pipe = Popen(cmd, shell=True, bufsize=bufsize, stdin=PIPE).stdin
(child_stdin, child_stdout) = os.popen2(cmd, mode, bufsize)
==>
p = Popen(cmd, shell=True, bufsize=bufsize,
stdin=PIPE, stdout=PIPE, close_fds=True)
(child_stdin, child_stdout) = (p.stdin, p.stdout)
(child_stdin,
child_stdout,
child_stderr) = os.popen3(cmd, mode, bufsize)
==>
p = Popen(cmd, shell=True, bufsize=bufsize,
stdin=PIPE, stdout=PIPE, stderr=PIPE, close_fds=True)
(child_stdin,
child_stdout,
child_stderr) = (p.stdin, p.stdout, p.stderr)
(child_stdin, child_stdout_and_stderr) = os.popen4(cmd, mode, bufsize)
==>
p = Popen(cmd, shell=True, bufsize=bufsize,
stdin=PIPE, stdout=PIPE, stderr=STDOUT, close_fds=True)
(child_stdin, child_stdout_and_stderr) = (p.stdin, p.stdout)
Replacing popen2.*
------------------
Note: If the cmd argument to popen2 functions is a string, the
command is executed through /bin/sh. If it is a list, the command
is directly executed.
(child_stdout, child_stdin) = popen2.popen2("somestring", bufsize, mode)
==>
p = Popen(["somestring"], shell=True, bufsize=bufsize,
stdin=PIPE, stdout=PIPE, close_fds=True)
(child_stdout, child_stdin) = (p.stdout, p.stdin)
(child_stdout, child_stdin) = popen2.popen2(["mycmd", "myarg"], bufsize, mode)
==>
p = Popen(["mycmd", "myarg"], bufsize=bufsize,
stdin=PIPE, stdout=PIPE, close_fds=True)
(child_stdout, child_stdin) = (p.stdout, p.stdin)
The popen2.Popen3 and popen2.Popen4 classes basically work as
subprocess.Popen, except that:
* subprocess.Popen raises an exception if the execution fails
* the capturestderr argument is replaced with the stderr argument.
* stdin=PIPE and stdout=PIPE must be specified.
* popen2 closes all file descriptors by default, but you have to
specify close_fds=True with subprocess.Popen.
Open Issues
Some features have been requested but are not yet implemented.
This includes:
* Support for managing a whole flock of subprocesses
* Support for managing "daemon" processes
* Built-in method for killing subprocesses
While these are useful features, it's expected that these can be
added later without problems.
* expect-like functionality, including pty support.
pty support is highly platform-dependent, which is a
problem. Also, there are already other modules that provide this
kind of functionality[6].
Backwards Compatibility
Since this is a new module, no major backward compatibility issues
are expected. The module name "subprocess" might collide with
other, previous modules[3] with the same name, but the name
"subprocess" seems to be the best suggested name so far. The
first name of this module was "popen5", but this name was
considered too unintuitive. For a while, the module was called
"process", but this name is already used by Trent Mick's
module[4].
The functions and modules that this new module is trying to
replace (os.system, os.spawn*, os.popen*, popen2.*, commands.*)
are expected to be available in future Python versions for a long
time, to preserve backwards compatibility.
Reference Implementation
A reference implementation is available from
http://www.lysator.liu.se/~astrand/popen5/.
References
[1] Secure Programming for Linux and Unix HOWTO, section 8.3.
http://www.dwheeler.com/secure-programs/
[2] Python Dialog
http://pythondialog.sourceforge.net/
[3] http://www.iol.ie/~padraiga/libs/subProcess.py
[4] http://starship.python.net/crew/tmick/
[5] http://starship.python.net/crew/mhammond/win32/
[6] http://www.lysator.liu.se/~ceder/pcl-expect/
Copyright
This document has been placed in the public domain.
pep-0325 Resource-Release Support for Generators
| PEP: | 325 |
|---|---|
| Title: | Resource-Release Support for Generators |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Samuele Pedroni <pedronis at python.org> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 25-Aug-2003 |
| Python-Version: | 2.4 |
| Post-History: |
Abstract
Generators allow for natural coding and abstraction of traversal
over data. Currently if external resources needing proper timely
release are involved, generators are unfortunately not adequate.
The typical idiom for timely release is not supported, a yield
statement is not allowed in the try clause of a try-finally
statement inside a generator. The finally clause execution can be
neither guaranteed nor enforced.
This PEP proposes that the built-in generator type implement a
close method and destruction semantics, such that the restriction
on yield placement can be lifted, expanding the applicability of
generators.
Pronouncement
Rejected in favor of PEP 342 which includes substantially all of
the requested behavior in a more refined form.
Rationale
Python generators allow for natural coding of many data traversal
scenarios. Their instantiation produces iterators,
i.e. first-class objects abstracting traversal (with all the
advantages of first-classness). In this respect they match in
power and offer some advantages over the approach using iterator
methods taking a (smalltalkish) block. On the other hand, given
current limitations (no yield allowed in a try clause of a
try-finally inside a generator) the latter approach seems better
suited to encapsulating not only traversal but also exception
handling and proper resource acquisition and release.
Let's consider an example (for simplicity, files in read-mode are
used):
def all_lines(index_path):
for path in file(index_path, "r"):
for line in file(path.strip(), "r"):
yield line
this is short and to the point, but the try-finally for timely
closing of the files cannot be added. (While instead of a path, a
file, whose closing then would be responsibility of the caller,
could be passed in as argument, the same is not applicable for the
files opened depending on the contents of the index).
If we want timely release, we have to sacrifice the simplicity and
directness of the generator-only approach: (e.g.)
class AllLines:
def __init__(self,index_path):
self.index_path = index_path
self.index = None
self.document = None
def __iter__(self):
self.index = file(self.index_path,"r")
for path in self.index:
self.document = file(path.strip(),"r")
for line in self.document:
yield line
self.document.close()
self.document = None
def close(self):
if self.index:
self.index.close()
if self.document:
self.document.close()
to be used as:
all_lines = AllLines("index.txt")
try:
for line in all_lines:
...
finally:
all_lines.close()
The more convoluted solution implementing timely release seems
to offer a precious hint. What we have done is encapsulate our
traversal in an object (iterator) with a close method.
This PEP proposes that generators should grow such a close method
with such semantics that the example could be rewritten as:
# Today this is not valid Python: yield is not allowed between
# try and finally, and generator type instances support no
# close method.
def all_lines(index_path):
index = file(index_path,"r")
try:
for path in index:
document = file(path.strip(),"r")
try:
for line in document:
yield line
finally:
document.close()
finally:
index.close()
all = all_lines("index.txt")
try:
for line in all:
...
finally:
all.close() # close on generator
Currently PEP 255 [1] disallows yield inside a try clause of a
try-finally statement, because the execution of the finally clause
cannot be guaranteed as required by try-finally semantics.
The semantics of the proposed close method should be such that
while the finally clause execution still cannot be guaranteed, it
can be enforced when required. Specifically, the close method
behavior should trigger the execution of the finally clauses
inside the generator, either by forcing a return in the generator
frame or by throwing an exception in it. In situations requiring
timely resource release, close could then be explicitly invoked.
The semantics of generator destruction on the other hand should be
extended in order to implement a best-effort policy for the
general case. Specifically, destruction should invoke close().
The best-effort limitation comes from the fact that the
destructor's execution is not guaranteed in the first place.
This seems to be a reasonable compromise, the resulting global
behavior being similar to that of files and closing.
Possible Semantics
The built-in generator type should have a close method
implemented, which can then be invoked as:
gen.close()
where gen is an instance of the built-in generator type.
Generator destruction should also invoke close method behavior.
If a generator is already terminated, close should be a no-op.
Otherwise, there are two alternative solutions, Return or
Exception Semantics:
A - Return Semantics: The generator should be resumed, generator
execution should continue as if the instruction at the re-entry
point is a return. Consequently finally clauses surrounding the
re-entry point would be executed, in the case of a then allowed
try-yield-finally pattern.
Issues: is it important to be able to distinguish forced
termination by close, normal termination, exception propagation
from generator or generator-called code? In the normal case it
seems not, finally clauses should be there to work the same in all
these cases, still this semantics could make such a distinction
hard.
Except clauses, as with a normal return, are not executed. Such
clauses in legacy generators expect to be executed only for
exceptions raised by the generator or by code called from it, so
not executing them in the close case seems correct.
B - Exception Semantics: The generator should be resumed and
execution should continue as if a special-purpose exception
(e.g. CloseGenerator) has been raised at re-entry point. Close
implementation should consume and not propagate further this
exception.
Issues: should StopIteration be reused for this purpose? Probably
not. We would like close to be a harmless operation for legacy
generators, which could contain code catching StopIteration to
deal with other generators/iterators.
In general, with exception semantics, it is unclear what to do if
the generator does not terminate or we do not receive the special
exception propagated back. Other different exceptions should
probably be propagated, but consider this possible legacy
generator code:
try:
...
yield ...
...
except: # or except Exception:, etc
raise Exception("boom")
If close is invoked with the generator suspended after the yield,
the except clause would catch our special purpose exception, so we
would get a different exception propagated back. In this case that
exception ought reasonably to be consumed and ignored, but in
general it should be propagated; separating these scenarios seems
hard.
The exception approach has the advantage to let the generator
distinguish between termination cases and have more control. On
the other hand clear-cut semantics seem harder to define.
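For context: PEP 342, in whose favor this PEP was rejected, gave generators essentially the Exception Semantics sketched above, with GeneratorExit in the role of the special-purpose exception. A minimal illustration on a current Python:

```python
# close() resumes the generator with GeneratorExit raised at the yield,
# so pending finally clauses run; the exception is consumed by close().
log = []

def lines():
    try:
        yield "first"
        yield "second"
    finally:
        log.append("released")   # stands in for closing a resource

g = lines()
first = next(g)   # advance to the first yield
g.close()         # finally clause runs now, not at some later GC point
```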
Remarks
If this proposal is accepted, it should become common practice to
document whether a generator acquires resources, so that its close
method ought to be called. If a generator is no longer used,
calling close should be harmless.
On the other hand, in the typical scenario the code that
instantiated the generator should call close if required by it.
Generic code dealing with iterators/generators instantiated
elsewhere should typically not be littered with close calls.
The rare case of code that has acquired ownership of iterators,
generators, and resource-acquiring generators alike, and needs to
dispose of them properly, is easily solved:
if hasattr(iterator, 'close'):
iterator.close()
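A runnable sketch of that idiom (the helper name is made up): close() is invoked only when the iterator actually provides one, so plain iterators are handled alongside resource-holding generators:

```python
# Consume any iterator, then release it if it supports close().  On
# modern Pythons generators themselves provide close() (via PEP 342),
# so the hasattr test succeeds for them as well.
def consume_and_close(iterator):
    try:
        return list(iterator)
    finally:
        if hasattr(iterator, 'close'):
            iterator.close()
```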
Open Issues
Definitive semantics ought to be chosen. Currently Guido favors
Exception Semantics. If the generator yields a value instead of
terminating, or propagating back the special exception, a special
exception should be raised again on the generator side.
It is still unclear whether spuriously converted special
exceptions (as discussed in Possible Semantics) are a problem and
what to do about them.
Implementation issues should be explored.
Alternative Ideas
The idea that the yield placement limitation should be removed and
that generator destruction should trigger execution of finally
clauses has been proposed more than once. Alone it cannot
guarantee that timely release of resources acquired by a generator
can be enforced.
PEP 288 [2] proposes a more general solution, allowing custom
exception passing to generators. The proposal in this PEP
addresses more directly the problem of resource release. Were PEP
288 implemented, Exceptions Semantics for close could be layered
on top of it, on the other hand PEP 288 should make a separate
case for the more general functionality.
References
[1] PEP 255 Simple Generators
http://www.python.org/dev/peps/pep-0255/
[2] PEP 288 Generators Attributes and Exceptions
http://www.python.org/dev/peps/pep-0288/
Copyright
This document has been placed in the public domain.
pep-0326 A Case for Top and Bottom Values
| PEP: | 326 |
|---|---|
| Title: | A Case for Top and Bottom Values |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Josiah Carlson <jcarlson at uci.edu>, Terry Reedy <tjreedy at udel.edu> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 20-Dec-2003 |
| Python-Version: | 2.4 |
| Post-History: | 20-Dec-2003, 03-Jan-2004, 05-Jan-2004, 07-Jan-2004, 21-Feb-2004 |
Contents
Results
This PEP has been rejected by the BDFL [12]. As per the pseudo-sunset clause [13], PEP 326 is being updated one last time with the latest suggestions, code modifications, etc., and includes a link to a module [14] that implements the behavior described in the PEP. Users who desire the behavior listed in this PEP are encouraged to use the module for the reasons listed in Independent Implementations?.
Abstract
This PEP proposes two singleton constants that represent a top and bottom [3] value: Max and Min (or two similarly suggestive names [4]; see Open Issues).
As suggested by their names, Max and Min would compare higher or lower than any other object (respectively). Such behavior results in easier to understand code and fewer special cases in which a temporary minimum or maximum value is required, and an actual minimum or maximum numeric value is not limited.
Rationale
While None can be used as an absolute minimum that any value can attain [1], this may be deprecated [4] in Python 3.0 and shouldn't be relied upon.
As a replacement for None being used as an absolute minimum, as well as the introduction of an absolute maximum, the introduction of two singleton constants Max and Min address concerns for the constants to be self-documenting.
What is commonly done to deal with absolute minimum or maximum values, is to set a value that is larger than the script author ever expects the input to reach, and hope that it isn't reached.
Guido has brought up [2] the fact that there exist two constants that can be used in the interim for maximum values: sys.maxint and floating point positive infinity (1e309 will evaluate to positive infinity). However, each has its drawbacks.
On most architectures sys.maxint is arbitrarily small (2**31-1 or 2**63-1) and can be easily eclipsed by large 'long' integers or floating point numbers.
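The point is easy to demonstrate; this sketch uses sys.maxsize, the modern spelling of the platform word-size limit (sys.maxint at the time of writing):

```python
import sys

# sys.maxsize is bounded by the platform word size, so both a modestly
# large long integer and floating point infinity eclipse it.
print(sys.maxsize < 2**200)        # True: an "absolute maximum" it is not
print(sys.maxsize < float('inf'))  # True
```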
Comparing long integers larger than the largest representable floating point number against any float will result in an exception being raised:
>>> cmp(1.0, 10**309)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
OverflowError: long int too large to convert to float
Even when large integers are compared against positive infinity:
>>> cmp(1e309, 10**309)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
OverflowError: long int too large to convert to float
These same drawbacks exist when numbers are negative.
Introducing Max and Min that work as described above does not take much effort. A sample Python reference implementation of both is included.
Motivation
There are hundreds of algorithms that begin by initializing some set of values to a logical (or numeric) infinity or negative infinity. Python lacks a built-in value that consistently works as such an infinity, i.e. one that really is the most extreme value attainable. By adding Max and Min, Python would have a real maximum and minimum value, and such algorithms can become clearer due to the reduction of special cases.
Max Examples
When testing various kinds of servers, it is sometimes necessary to only serve a certain number of clients before exiting, which results in code like the following:
count = 5
def counts(stop):
i = 0
while i < stop:
yield i
i += 1
for client_number in counts(count):
handle_one_client()
When using Max as the value assigned to count, our testing server becomes a production server with minimal effort.
As another example, consider Dijkstra's shortest path algorithm on a graph with weighted edges (all positive):
- Set distances to every node in the graph to infinity.
- Set the distance to the start node to zero.
- Set visited to be an empty mapping.
- While shortest distance of a node that has not been visited is less
than infinity and the destination has not been visited.
- Get the node with the shortest distance.
- Visit the node.
- Update neighbor distances and parent pointers if necessary for neighbors that have not been visited.
- If the destination has been visited, step back through parent pointers to find the reverse of the path to be taken.
Below is an example of Dijkstra's shortest path algorithm on a graph with weighted edges, using a table. (A faster version using a heap exists, but this version is offered for its similarity to the description above; the heap version is available via older versions of this document.)
def DijkstraSP_table(graph, S, T):
table = {} #3
for node in graph.iterkeys():
#(visited, distance, node, parent)
table[node] = (0, Max, node, None) #1
table[S] = (0, 0, S, None) #2
cur = min(table.values()) #4a
while (not cur[0]) and cur[1] < Max: #4
(visited, distance, node, parent) = cur
table[node] = (1, distance, node, parent) #4b
for cdist, child in graph[node]: #4c
ndist = distance+cdist #|
if not table[child][0] and ndist < table[child][1]:#|
table[child] = (0, ndist, child, node) #|_
cur = min(table.values()) #4a
if not table[T][0]:
return None
cur = T #5
path = [T] #|
while table[cur][3] is not None: #|
path.append(table[cur][3]) #|
cur = path[-1] #|
path.reverse() #|
return path #|_
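To check the table algorithm above end to end, here is a port to current Python (plain dict iteration instead of iterkeys, with float('inf') standing in for Max); graph maps each node to a list of (edge_weight, neighbor) pairs:

```python
INF = float('inf')   # stand-in for the proposed Max singleton

def dijkstra_sp_table(graph, S, T):
    # table maps node -> (visited, distance, node, parent)
    table = {}
    for node in graph:
        table[node] = (0, INF, node, None)
    table[S] = (0, 0, S, None)
    cur = min(table.values())
    while (not cur[0]) and cur[1] < INF:
        visited, distance, node, parent = cur
        table[node] = (1, distance, node, parent)   # mark visited
        for cdist, child in graph[node]:
            ndist = distance + cdist
            if not table[child][0] and ndist < table[child][1]:
                table[child] = (0, ndist, child, node)
        cur = min(table.values())                   # next closest node
    if not table[T][0]:
        return None                                 # T unreachable
    path = [T]
    while table[path[-1]][3] is not None:           # walk parent pointers
        path.append(table[path[-1]][3])
    path.reverse()
    return path

graph = {'A': [(1, 'B'), (4, 'C')], 'B': [(2, 'C')], 'C': []}
path = dijkstra_sp_table(graph, 'A', 'C')   # ['A', 'B', 'C']: via B is shorter
```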
Readers should note that replacing Max in the above code with an arbitrarily large number does not guarantee that the shortest path distance to a node will never exceed that number. Well, with one caveat: one could certainly sum up the weights of every edge in the graph, and set the 'arbitrarily large number' to that total. However, doing so does not make the algorithm any easier to understand and has potential problems with numeric overflows.
Gustavo Niemeyer [9] points out that using a more Pythonic data structure than tuples, to store information about node distances, increases readability. Two equivalent node structures (one using None, the other using Max) and their use in a suitably modified Dijkstra's shortest path algorithm is given below.
class SuperNode:
def __init__(self, node, parent, distance, visited):
self.node = node
self.parent = parent
self.distance = distance
self.visited = visited
class MaxNode(SuperNode):
def __init__(self, node, parent=None, distance=Max,
visited=False):
SuperNode.__init__(self, node, parent, distance, visited)
def __cmp__(self, other):
return cmp((self.visited, self.distance),
(other.visited, other.distance))
class NoneNode(SuperNode):
def __init__(self, node, parent=None, distance=None,
visited=False):
SuperNode.__init__(self, node, parent, distance, visited)
def __cmp__(self, other):
pair = ((self.visited, self.distance),
(other.visited, other.distance))
if None in (self.distance, other.distance):
return -cmp(*pair)
return cmp(*pair)
def DijkstraSP_table_node(graph, S, T, Node):
table = {} #3
for node in graph.iterkeys():
table[node] = Node(node) #1
table[S] = Node(S, distance=0) #2
cur = min(table.values()) #4a
sentinel = Node(None).distance
while not cur.visited and cur.distance != sentinel: #4
cur.visited = True #4b
for cdist, child in graph[cur.node]: #4c
ndist = cur.distance+cdist #|
if not table[child].visited and\ #|
ndist < table[child].distance: #|
table[child].distance = ndist #|
table[child].parent = cur.node #|_
cur = min(table.values()) #4a
if not table[T].visited:
return None
cur = T #5
path = [T] #|
while table[cur].parent is not None: #|
path.append(table[cur].parent) #|
cur = path[-1] #|
path.reverse() #|
return path #|_
In the above, passing in either NoneNode or MaxNode would be sufficient to use either None or Max for the node distance 'infinity'. Note the additional special case required for None being used as a sentinel in NoneNode in the __cmp__ method.
This example highlights the special case handling where None is used as a sentinel value for maximum values "in the wild", even though None itself compares smaller than any other object in the standard distribution.
As an aside, it is not clear to the author that using Nodes as a replacement for tuples has increased readability significantly, if at all.
A Min Example
An example of usage for Min is an algorithm that solves the following problem [6]:
Suppose you are given a directed graph, representing a communication network. The vertices are the nodes in the network, and each edge is a communication channel. Each edge (u, v) has an associated value r(u, v), with 0 <= r(u, v) <= 1, which represents the reliability of the channel from u to v (i.e., the probability that the channel from u to v will not fail). Assume that the reliability probabilities of the channels are independent. (This implies that the reliability of any path is the product of the reliability of the edges along the path.) Now suppose you are given two nodes in the graph, A and B. Give an efficient algorithm for finding the most reliable path from A to B.
Such an algorithm is a 7 line modification to the DijkstraSP_table algorithm given above (modified lines prefixed with *):
def DijkstraSP_table(graph, S, T):
table = {} #3
for node in graph.iterkeys():
#(visited, distance, node, parent)
* table[node] = (0, Min, node, None) #1
* table[S] = (0, 1, S, None) #2
* cur = max(table.values()) #4a
* while (not cur[0]) and cur[1] > Min: #4
(visited, distance, node, parent) = cur
table[node] = (1, distance, node, parent) #4b
for cdist, child in graph[node]: #4c
* ndist = distance*cdist #|
* if not table[child][0] and ndist > table[child][1]:#|
table[child] = (0, ndist, child, node) #|_
* cur = max(table.values()) #4a
if not table[T][0]:
return None
cur = T #5
path = [T] #|
while table[cur][3] is not None: #|
path.append(table[cur][3]) #|
cur = path[-1] #|
path.reverse() #|
return path #|_
Note that there is a way of translating the graph so that it can be passed unchanged into the original DijkstraSP_table algorithm. There also exists a handful of easy methods for constructing Node objects that would work with DijkstraSP_table_node. Such translations are left as an exercise to the reader.
Other Examples
Andrew P. Lentvorski, Jr. [7] has pointed out that various data structures involving range searching have immediate use for Max and Min values. More specifically; Segment trees, Range trees, k-d trees and database keys:
...The issue is that a range can be open on one side and does not always have an initialized case.
The solutions I have seen are to either overload None as the extremum or use an arbitrary large magnitude number. Overloading None means that the built-ins can't really be used without special case checks to work around the undefined (or "wrongly defined") ordering of None. These checks tend to swamp the nice performance of built-ins like max() and min().
Choosing a large magnitude number throws away the ability of Python to cope with arbitrarily large integers and introduces a potential source of overrun/underrun bugs.
Further use examples of both Max and Min are available in the realm of graph algorithms, range searching algorithms, computational geometry algorithms, and others.
Independent Implementations?
Independent implementations of the Min/Max concept by users desiring such functionality are not likely to be compatible, and certainly will produce inconsistent orderings. The following examples seek to show how inconsistent they can be.
Let us pretend we have created proper separate implementations of MyMax, MyMin, YourMax and YourMin with the same code as given in the sample implementation (with some minor renaming):
>>> lst = [YourMin, MyMin, MyMin, YourMin, MyMax, YourMin, MyMax, YourMax, MyMax]
>>> lst.sort()
>>> lst
[YourMin, YourMin, MyMin, MyMin, YourMin, MyMax, MyMax, YourMax, MyMax]
Notice that while all the "Min"s are before the "Max"s, there is no guarantee that all instances of YourMin will come before MyMin, the reverse, or the equivalent MyMax and YourMax.
The problem is also evident when using the heapq module:
>>> lst = [YourMin, MyMin, MyMin, YourMin, MyMax, YourMin, MyMax, YourMax, MyMax]
>>> heapq.heapify(lst) #not needed, but it can't hurt
>>> while lst: print heapq.heappop(lst),
...
YourMin MyMin YourMin YourMin MyMin MyMax MyMax YourMax MyMax
Furthermore, the findmin_Max code and both versions of Dijkstra could result in incorrect output by passing in secondary versions of Max.
It has been pointed out [9] that the reference implementation given below would be incompatible with independent implementations of Max/Min. The point of this PEP is the introduction of "The One True Implementation" of "The One True Maximum" and "The One True Minimum". User-based implementations of Max and Min objects would thus be discouraged, and use of "The One True Implementation" would obviously be encouraged. Ambiguous behavior resulting from mixing users' implementations of Max and Min with "The One True Implementation" should be easy to discover through variable and/or source code introspection.
Reference Implementation
class _ExtremeType(object):
def __init__(self, cmpr, rep):
object.__init__(self)
self._cmpr = cmpr
self._rep = rep
def __cmp__(self, other):
if isinstance(other, self.__class__) and\
other._cmpr == self._cmpr:
return 0
return self._cmpr
def __repr__(self):
return self._rep
Max = _ExtremeType(1, "Max")
Min = _ExtremeType(-1, "Min")
Results of Test Run:
>>> max(Max, 2**65536)
Max
>>> min(Max, 2**65536)
20035299304068464649790... (lines removed for brevity) ...72339445587895905719156736L
>>> min(Min, -2**65536)
Min
>>> max(Min, -2**65536)
-2003529930406846464979... (lines removed for brevity) ...072339445587895905719156736L
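Since __cmp__ no longer exists on current Pythons, the same idea can be sketched with rich comparisons; the class and attribute names below mirror the reference implementation but are otherwise assumptions:

```python
from functools import total_ordering

@total_ordering
class _ExtremeType(object):
    """A singleton comparing above (sign=1) or below (sign=-1) everything."""
    def __init__(self, sign, rep):
        self._sign = sign
        self._rep = rep
    def __eq__(self, other):
        return (isinstance(other, _ExtremeType)
                and other._sign == self._sign)
    def __lt__(self, other):
        if self.__eq__(other):
            return False
        return self._sign < 0   # Min is below everything, Max above
    def __repr__(self):
        return self._rep

Max = _ExtremeType(1, "Max")
Min = _ExtremeType(-1, "Min")
```

total_ordering derives the remaining comparison methods from __eq__ and __lt__, so min(), max(), and sort() all behave as in the test run above.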
Open Issues
As the PEP was rejected, all open issues are now closed and inconsequential. The module will use the names UniversalMaximum and UniversalMinimum, because it would be very difficult to mistake what each does. For those who require a shorter name, renaming the singletons during import is suggested:

from extremes import (UniversalMaximum as uMax,
                      UniversalMinimum as uMin)
References
| [1] | RE: [Python-Dev] Re: Got None. Maybe Some?, Peters, Tim (http://mail.python.org/pipermail/python-dev/2003-December/041374.html) |
| [2] | Re: [Python-Dev] Got None. Maybe Some?, van Rossum, Guido (http://mail.python.org/pipermail/python-dev/2003-December/041352.html) |
| [3] | RE: [Python-Dev] Got None. Maybe Some?, Peters, Tim (http://mail.python.org/pipermail/python-dev/2003-December/041332.html) |
| [4] | [Python-Dev] Re: PEP 326 now online, Reedy, Terry (http://mail.python.org/pipermail/python-dev/2004-January/041685.html) |
| [5] | [Python-Dev] PEP 326 now online, Chermside, Michael (http://mail.python.org/pipermail/python-dev/2004-January/041704.html) |
| [6] | Homework 6, Problem 7, Dillencourt, Michael (link may not be valid in the future) (http://www.ics.uci.edu/~dillenco/ics161/hw/hw6.pdf) |
| [7] | RE: [Python-Dev] PEP 326 now online, Lentvorski, Andrew P., Jr. (http://mail.python.org/pipermail/python-dev/2004-January/041727.html) |
| [8] | Re: It's not really Some is it?, Ippolito, Bob (http://www.livejournal.com/users/chouyu_31/138195.html?thread=274643#t274643) |
| [9] | [Python-Dev] Re: PEP 326 now online, Niemeyer, Gustavo (http://mail.python.org/pipermail/python-dev/2004-January/042261.html); [Python-Dev] Re: PEP 326 now online, Carlson, Josiah (http://mail.python.org/pipermail/python-dev/2004-January/042272.html) |
| [11] | [Python-Dev] PEP 326 (quick location possibility), Carlson, Josiah (http://mail.python.org/pipermail/python-dev/2004-January/042275.html) |
| [12] | [Python-Dev] PEP 326 (quick location possibility), van Rossum, Guido (http://mail.python.org/pipermail/python-dev/2004-January/042306.html) |
| [13] | [Python-Dev] PEP 326 (quick location possibility), Carlson, Josiah (http://mail.python.org/pipermail/python-dev/2004-January/042300.html) |
| [14] | Recommended standard implementation of PEP 326, extremes.py, Carlson, Josiah (http://www.ics.uci.edu/~jcarlson/pep326/extremes.py) |
Changes
- Added this section.
- Added Motivation section.
- Changed markup to reStructuredText.
- Clarified Abstract, Motivation, Reference Implementation and Open Issues based on the simultaneous concepts of Max and Min.
- Added two implementations of Dijkstra's Shortest Path algorithm that show where Max can be used to remove special cases.
- Added an example of use for Min to Motivation.
- Added an example and Other Examples subheading.
- Modified Reference Implementation to instantiate both items from a single class/type.
- Removed a large number of open issues that are not within the scope of this PEP.
- Replaced an example from Max Examples, changed an example in A Min Example.
- Added some References.
- BDFL rejects PEP 326 [12]
Copyright
This document has been placed in the public domain.
pep-0327 Decimal Data Type
| PEP: | 327 |
|---|---|
| Title: | Decimal Data Type |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Facundo Batista <facundo at taniquetil.com.ar> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 17-Oct-2003 |
| Python-Version: | 2.4 |
| Post-History: | 30-Nov-2003, 02-Jan-2004, 29-Jan-2004 |
Contents
Abstract
The idea is to have a Decimal data type, for every use where decimals are needed but binary floating point is too inexact.
The Decimal data type will support the Python standard functions and operations, and must comply with the decimal arithmetic ANSI standard X3.274-1996 [1].
Decimal will be floating point (as opposed to fixed point) and will have bounded precision (the precision is the upper limit on the number of significant digits in a result). However, precision is user-settable, and a notion of significant trailing zeroes is supported so that fixed-point usage is also possible.
This work is based on code and test functions written by Eric Price, Aahz and Tim Peters. Just before Python 2.4a1, the decimal.py reference implementation was moved into the standard library; along with the documentation and the test suite, this was the work of Raymond Hettinger. Much of the explanation in this PEP is taken from Cowlishaw's work [2], comp.lang.python and python-dev.
Motivation
Here I'll explain why I think a Decimal data type is needed and why other numeric data types are not enough.
I wanted a Money data type, and after proposing a pre-PEP in comp.lang.python, the community agreed to have a numeric data type with the needed arithmetic behaviour, and then build Money over it: all the considerations about quantity of digits after the decimal point, rounding, etc., will be handled through Money. It is not the purpose of this PEP to have a data type that can be used as Money without further effort.
One of the biggest advantages of implementing a standard is that someone already thought out all the creepy cases for you. And to a standard GvR redirected me: Mike Cowlishaw's General Decimal Arithmetic specification [2]. This document defines a general purpose decimal arithmetic. A correct implementation of this specification will conform to the decimal arithmetic defined in ANSI/IEEE standard 854-1987, except for some minor restrictions, and will also provide unrounded decimal arithmetic and integer arithmetic as proper subsets.
The problem with binary float
In decimal math, there are many numbers that can't be represented with a fixed number of decimal digits, e.g. 1/3 = 0.3333333333.......
In base 2 (the way that standard floating point is calculated), 1/2 = 0.1, 1/4 = 0.01, 1/8 = 0.001, etc. Decimal 0.2 equals 2/10 equals 1/5, resulting in the binary fractional number 0.001100110011001... As you can see, the problem is that some decimal numbers can't be represented exactly in binary, resulting in small roundoff errors.
So we need a data type that represents decimal numbers exactly. Instead of a binary data type, we need a decimal one.
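The roundoff error described above is easy to see, using the decimal module that eventually shipped as a minimal sketch:

```python
from decimal import Decimal

# Binary float: 0.1 has no exact base-2 representation, so errors accumulate.
binary_residue = 0.1 + 0.1 + 0.1 - 0.3
print(binary_residue)  # a tiny nonzero value, e.g. 5.551115123125783e-17

# Decimal: the same arithmetic is exact.
decimal_residue = Decimal('0.1') + Decimal('0.1') + Decimal('0.1') - Decimal('0.3')
print(decimal_residue)  # 0.0
```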
Why floating point?
So we go to decimal, but why floating point?
Floating point numbers use a fixed quantity of digits (precision) to represent a number, working with an exponent when the number gets too big or too small. For example, with a precision of 5:
1234   ==>  1234e0
12345  ==>  12345e0
123456 ==>  12346e1
(note that in the last line the number got rounded to fit in five digits).
In contrast, we have the example of a long integer with infinite precision, meaning that you can have the number as big as you want, and you'll never lose any information.
In a fixed point number, the position of the decimal point is fixed. For a fixed point data type, check Tim Peters' FixedPoint at SourceForge [4]. I'll go with floating point because it's easier to implement the arithmetic behaviour of the standard, and a fixed point data type can then be implemented on top of Decimal.
But why can't we have a floating point number with infinite precision? It's not so easy, because of inexact divisions. E.g.: 1/3 = 0.3333333333333... ad infinitum. In that case you would need to store an infinite number of 3s, which requires infinite memory.
John Roth proposed to eliminate the division operator and force the user to use an explicit method, just to avoid this kind of trouble. This generated adverse reactions in comp.lang.python, as everybody wants to have support for the / operator in a numeric data type.
With this exposed maybe you're thinking "Hey! Can we just store the 1 and the 3 as numerator and denominator?", which takes us to the next point.
Why not rational?
Rational numbers are stored using two integer numbers, the numerator and the denominator. This implies that the arithmetic operations can't be executed directly (e.g. to add two rational numbers you first need to calculate the common denominator).
Quoting Alex Martelli:
The performance implications of the fact that summing two rationals (which take O(M) and O(N) space respectively) gives a rational which takes O(M+N) memory space is just too troublesome. There are excellent Rational implementations in both pure Python and as extensions (e.g., gmpy), but they'll always be a "niche market" IMHO. Probably worth PEPping, not worth doing without Decimal -- which is the right way to represent sums of money, a truly major use case in the real world.
Anyway, if you're interested in this data type, you maybe will want to take a look at PEP 239: Adding a Rational Type to Python.
So, what do we have?
The result is a Decimal data type, with bounded precision and floating point.
Will it be useful? I can't say it better than Alex Martelli:
Python (out of the box) doesn't let you have binary floating point numbers with whatever precision you specify: you're limited to what your hardware supplies. Decimal, be it used as a fixed or floating point number, should suffer from no such limitation: whatever bounded precision you may specify on number creation (your memory permitting) should work just as well. Most of the expense of programming simplicity can be hidden from application programs and placed in a suitable decimal arithmetic type. As per http://speleotrove.com/decimal/, a single data type can be used for integer, fixed-point, and floating-point decimal arithmetic -- and for money arithmetic which doesn't drive the application programmer crazy.
There are several uses for such a data type. As I said before, I will use it as base for Money. In this case the bounded precision is not an issue; quoting Tim Peters:
A precision of 20 would be way more than enough to account for total world economic output, down to the penny, since the beginning of time.
General Decimal Arithmetic Specification
Here I'll include information and descriptions that are part of the specification [2] (the structure of the number, the context, etc.). All the requirements included in this section are not for discussion (barring typos or other mistakes), as they are in the standard, and the PEP is just for implementing the standard.
Because of copyright restrictions, I can not copy here explanations taken from the specification, so I'll try to explain it in my own words. I firmly encourage you to read the original specification document [2] for details or if you have any doubt.
The Arithmetic Model
The specification is based on a decimal arithmetic model, as defined by the relevant standards: IEEE 854 [3], ANSI X3-274 [1], and the proposed revision [5] of IEEE 754 [6].
The model has three components:
- Numbers: just the values that the operation uses as input or output.
- Operations: addition, multiplication, etc.
- Context: a set of parameters and rules that the user can select and which govern the results of operations (for example, the precision to be used).
Numbers
Numbers may be finite or special values. The former can be represented exactly. The latter are infinities and undefined results (such as 0/0).
Finite numbers are defined by three parameters:
- Sign: 0 (positive) or 1 (negative).
- Coefficient: a non-negative integer.
- Exponent: a signed integer, the power of ten of the coefficient multiplier.
The numerical value of a finite number is given by:
(-1)**sign * coefficient * 10**exponent
Special values are named as follows:
- Infinity: a value which is infinitely large. Could be positive or negative.
- Quiet NaN ("qNaN"): represents undefined results (Not a Number). Does not cause an Invalid operation condition. The sign in a NaN has no meaning.
- Signaling NaN ("sNaN"): also Not a Number, but will cause an Invalid operation condition if used in any operation.
Context
The context is a set of parameters and rules that the user can select and which govern the results of operations (for example, the precision to be used).
The context gets that name because it surrounds the Decimal numbers, with parts of context acting as input to, and output of, operations. It's up to the application to work with one or several contexts, but definitely the idea is not to get a context per Decimal number. For example, a typical use would be to set the context's precision to 20 digits at the start of a program, and never explicitly use context again.
These definitions don't affect the internal storage of the Decimal numbers, just the way that the arithmetic operations are performed.
The context is mainly defined by the following parameters (see Context Attributes for all context attributes):
- Precision: The maximum number of significant digits that can result from an arithmetic operation (integer > 0). There is no maximum for this value.
- Rounding: The name of the algorithm to be used when rounding is necessary, one of "round-down", "round-half-up", "round-half-even", "round-ceiling", "round-floor", "round-half-down", and "round-up". See Rounding Algorithms below.
- Flags and trap-enablers: Exceptional conditions are grouped into signals, controllable individually, each consisting of a flag (boolean, set when the signal occurs) and a trap-enabler (a boolean that controls behavior). The signals are: "clamped", "division-by-zero", "inexact", "invalid-operation", "overflow", "rounded", "subnormal" and "underflow".
Default Contexts
The specification defines two default contexts, which should be easily selectable by the user.
Basic Default Context:
- flags: all set to 0
- trap-enablers: inexact, rounded, and subnormal are set to 0; all others are set to 1
- precision: is set to 9
- rounding: is set to round-half-up
Extended Default Context:
- flags: all set to 0
- trap-enablers: all set to 0
- precision: is set to 9
- rounding: is set to round-half-even
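The decimal module that shipped exposes both defaults as ready-made objects; as a sketch (`BasicContext` and `ExtendedContext` are the shipped names):

```python
from decimal import BasicContext, ExtendedContext, ROUND_HALF_UP, ROUND_HALF_EVEN

# Both defaults use precision 9 but differ in rounding and trap-enablers.
print(BasicContext.prec, BasicContext.rounding)        # 9 ROUND_HALF_UP
print(ExtendedContext.prec, ExtendedContext.rounding)  # 9 ROUND_HALF_EVEN

# The extended context has every trap-enabler switched off.
print(any(ExtendedContext.traps.values()))  # False
```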
Exceptional Conditions
The table below lists the exceptional conditions that may arise during the arithmetic operations, the corresponding signal, and the defined result. For details, see the specification [2].
| Condition | Signal | Result |
|---|---|---|
| Clamped | clamped | see spec [2] |
| Division by zero | division-by-zero | [sign,inf] |
| Inexact | inexact | unchanged |
| Invalid operation | invalid-operation | [0,qNaN] (or [s,qNaN] or [s,qNaN,d] when the cause is a signaling NaN) |
| Overflow | overflow | depends on the rounding mode |
| Rounded | rounded | unchanged |
| Subnormal | subnormal | unchanged |
| Underflow | underflow | see spec [2] |
Note: when the standard talks about "Insufficient storage", this is implementation-specific behaviour about not having enough storage to keep the internals of the number; in that case this implementation will raise MemoryError.
Regarding Overflow and Underflow, there's been a long discussion in python-dev about artificial limits. The general consensus is to keep the artificial limits only if there are important reasons to do that. Tim Peters gives us three:
...eliminating bounds on exponents effectively means overflow (and underflow) can never happen. But overflow is a valuable safety net in real life fp use, like a canary in a coal mine, giving danger signs early when a program goes insane.
Virtually all implementations of 854 use (and as IBM's standard even suggests) "forbidden" exponent values to encode non-finite numbers (infinities and NaNs). A bounded exponent can do this at virtually no extra storage cost. If the exponent is unbounded, then additional bits have to be used instead. This cost remains hidden until more time- and space- efficient implementations are attempted.
Big as it is, the IBM standard is a tiny start at supplying a complete numeric facility. Having no bound on exponent size will enormously complicate the implementations of, e.g., decimal sin() and cos() (there's then no a priori limit on how many digits of pi effectively need to be known in order to perform argument reduction).
Edward Loper gives us an example of when the limits are to be crossed: probabilities.
That said, Robert Brewer and Andrew Lentvorski want the limits to be easily modifiable by the users. Actually, this is quite possible:
>>> d1 = Decimal("1e999999999") # at the exponent limit
>>> d1
Decimal("1E+999999999")
>>> d1 * 10 # exceed the limit, got infinity
Traceback (most recent call last):
File "<pyshell#3>", line 1, in ?
d1 * 10
...
...
Overflow: above Emax
>>> getcontext().Emax = 1000000000 # increase the limit
>>> d1 * 10 # does not exceed any more
Decimal("1.0E+1000000000")
>>> d1 * 100 # exceed again
Traceback (most recent call last):
File "<pyshell#3>", line 1, in ?
d1 * 100
...
...
Overflow: above Emax
Rounding Algorithms
round-down: The discarded digits are ignored; the result is unchanged (round toward 0, truncate):
1.123 --> 1.12
1.128 --> 1.12
1.125 --> 1.12
1.135 --> 1.13
round-half-up: If the discarded digits represent greater than or equal to half (0.5) then the result should be incremented by 1; otherwise the discarded digits are ignored:
1.123 --> 1.12
1.128 --> 1.13
1.125 --> 1.13
1.135 --> 1.14
round-half-even: If the discarded digits represent greater than half (0.5) then the result coefficient is incremented by 1; if they represent less than half, then the result is not adjusted; otherwise the result is unaltered if its rightmost digit is even, or incremented by 1 if its rightmost digit is odd (to make an even digit):
1.123 --> 1.12
1.128 --> 1.13
1.125 --> 1.12
1.135 --> 1.14
round-ceiling: If all of the discarded digits are zero or if the sign is negative the result is unchanged; otherwise, the result is incremented by 1 (round toward positive infinity):
1.123 --> 1.13
1.128 --> 1.13
-1.123 --> -1.12
-1.128 --> -1.12
round-floor: If all of the discarded digits are zero or if the sign is positive the result is unchanged; otherwise, the absolute value of the result is incremented by 1 (round toward negative infinity):
1.123 --> 1.12
1.128 --> 1.12
-1.123 --> -1.13
-1.128 --> -1.13
round-half-down: If the discarded digits represent greater than half (0.5) then the result is incremented by 1; otherwise the discarded digits are ignored:
1.123 --> 1.12
1.128 --> 1.13
1.125 --> 1.12
1.135 --> 1.13
round-up: If all of the discarded digits are zero the result is unchanged, otherwise the result is incremented by 1 (round away from 0):
1.123 --> 1.13
1.128 --> 1.13
1.125 --> 1.13
1.135 --> 1.14
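The table entries above can be checked with `quantize()` and the rounding constants of the module as released (a sketch; only a few of the modes are shown):

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP, ROUND_HALF_EVEN, ROUND_CEILING

d = Decimal('1.125')
two_places = Decimal('0.01')  # quantize to two digits after the point

print(d.quantize(two_places, rounding=ROUND_DOWN))       # 1.12 (truncate)
print(d.quantize(two_places, rounding=ROUND_HALF_UP))    # 1.13 (tie rounds up)
print(d.quantize(two_places, rounding=ROUND_HALF_EVEN))  # 1.12 (tie rounds to even digit)

# round-ceiling moves toward positive infinity, so negatives truncate:
print(Decimal('-1.123').quantize(two_places, rounding=ROUND_CEILING))  # -1.12
```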
Rationale
I must separate the requirements in two sections. The first is to comply with the ANSI standard. All the requirements for this are specified in the Mike Cowlishaw's work [2]. He also provided a very large suite of test cases.
The second section of requirements (standard Python functions support, usability, etc.) is detailed from here, where I'll include all the decisions made and why, and all the subjects still being discussed.
Explicit construction
The explicit construction does not get affected by the context (there is no rounding, no limits by the precision, etc.), because the context affects just operations' results. The only exception to this is when you're Creating from Context.
From int or long
There's no loss and no need to specify any other information:
Decimal(35)
Decimal(-124)
From string
Strings containing Python decimal integer literals and Python float literals will be supported. In this transformation there is no loss of information, as the string is directly converted to Decimal (there is not an intermediate conversion through float):
Decimal("-12")
Decimal("23.2e-7")
Also, you can construct in this way all special values (Infinity and Not a Number):
Decimal("Inf")
Decimal("NaN")
From float
The initial discussion on this item was what should happen when passing floating point to the constructor:
- Decimal(1.1) == Decimal('1.1')
- Decimal(1.1) == Decimal('110000000000000008881784197001252...e-51')
- an exception is raised
Several people argued that (1) is the better option here, because it's what you expect when writing Decimal(1.1). And quoting John Roth, it's easy to implement:
It's not at all difficult to find where the actual number ends and where the fuzz begins. You can do it visually, and the algorithms to do it are quite well known.
But if I really want my number to be Decimal('110000000000000008881784197001252...e-51'), why can't I write Decimal(1.1)? Why should I expect Decimal to be "rounding" it? Remember that 1.1 is binary floating point, so I can predict the result. It's not intuitive to a beginner, but that's the way it is.
Anyway, Paul Moore showed that (1) can't work, because:
(1) says D(1.1) == D('1.1')
but 1.1 == 1.1000000000000001
so D(1.1) == D(1.1000000000000001)
together: D(1.1000000000000001) == D('1.1')
which is wrong, because if I write Decimal('1.1') it is exact, not D(1.1000000000000001). He also proposed to have an explicit conversion to float. bokr says you need to put the precision in the constructor and mwilson agreed:
d = Decimal(1.1, 1)  # take float value to 1 decimal place
d = Decimal(1.1)     # gets `places` from pre-set context
But Alex Martelli says that:
Constructing with some specified precision would be fine. Thus, I think "construction from float with some default precision" runs a substantial risk of tricking naive users.
So, the accepted solution through c.l.p is that you can not call Decimal with a float. Instead you must use a method: Decimal.from_float(). The syntax:
Decimal.from_float(floatNumber, [decimal_places])
where floatNumber is the float number origin of the construction and decimal_places are the number of digits after the decimal point where you apply a round-half-up rounding, if any. In this way you can do, for example:
Decimal.from_float(1.1, 2): The same as doing Decimal('1.1').
Decimal.from_float(1.1, 16): The same as doing Decimal('1.1000000000000001').
Decimal.from_float(1.1): The same as doing Decimal('1100000000000000088817841970012523233890533447265625e-51').
Based on later discussions, it was decided to omit from_float() from the API for Py2.4. Several ideas contributed to the thought process:
Interactions between decimal and binary floating point force the user to deal with tricky issues of representation and round-off. Avoidance of those issues is a primary reason for having the module in the first place.
The first release of the module should focus on that which is safe, minimal, and essential.
While theoretically nice, real world use cases for interactions between floats and decimals are lacking. Java included float/decimal conversions to handle an obscure case where calculations are best performed in decimal even though a legacy data structure requires the inputs and outputs to be stored in binary floating point.
If the need arises, users can use string representations as an intermediate type. The advantage of this approach is that it makes explicit the assumptions about precision and representation (no wondering what is going on under the hood).
The Java docs for BigDecimal(double val) reflected their experiences with the constructor:
The results of this constructor can be somewhat unpredictable and its use is generally not recommended.
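For the record, from_float() did eventually land (in Python 2.7/3.1), but without the decimal_places argument: the shipped method performs the exact conversion, i.e. option (2) above. A sketch:

```python
from decimal import Decimal

# Exact and clean: 0.5 is representable in binary.
print(Decimal.from_float(0.5))  # 0.5

# Exact but surprising: 0.1 carries its full binary expansion.
print(Decimal.from_float(0.1))
# 0.1000000000000000055511151231257827021181583404541015625
```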
From tuples
Aahz suggested to construct from tuples: it's easier to implement eval()'s round trip and "someone who has numeric values representing a Decimal does not need to convert them to a string."
The structure will be a tuple of three elements: sign, number and exponent. The sign is 1 or 0, the number is a tuple of decimal digits and the exponent is a signed int or long:
Decimal((1, (3, 2, 2, 5), -2)) # for -32.25
Of course, you can construct in this way all special values:
Decimal( (0, (0,), 'F') )   # for Infinity
Decimal( (0, (0,), 'n') )   # for Not a Number
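A quick check of the tuple form against the module as released (a sketch):

```python
from decimal import Decimal

# Tuple form: (sign, digit-tuple, exponent); sign 1 means negative.
print(Decimal((1, (3, 2, 2, 5), -2)))  # -32.25
print(Decimal((0, (0,), 'F')))         # Infinity
print(Decimal((0, (0,), 'n')))         # NaN
```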
From Decimal
No mystery here, just a copy.
Syntax for All Cases
Decimal(value1)
Decimal.from_float(value2, [decimal_places])
where value1 can be int, long, string, 3-tuple or Decimal, value2 can only be float, and decimal_places is an optional non-negative int.
Creating from Context
This item arose in python-dev from two sources in parallel. Ka-Ping Yee proposes to pass the context as an argument at instance creation (he wants the context he passes to be used only in creation time: "It would not be persistent"). Tony Meyer asks from_string to honor the context if it receives a parameter "honour_context" with a True value. (I don't like it, because the doc specifies that the context be honored and I don't want the method to comply with the specification regarding the value of an argument.)
Tim Peters gives us a reason to have a creation that uses context:
In general number-crunching, literals may be given to high precision, but that precision isn't free and usually isn't needed
Casey Duncan wants to use another method, not a bool arg:
I find boolean arguments a general anti-pattern, especially given we have class methods. Why not use an alternate constructor like Decimal.rounded_to_context("3.14159265").
In the process of deciding the syntax of that, Tim came up with a better idea: he proposes not to have a method in Decimal to create with a different context, but having instead a method in Context to create a Decimal instance. Basically, instead of:
D.using_context(number, context)
it will be:
context.create_decimal(number)
From Tim:
While all operations in the spec except for the two to-string operations use context, no operations in the spec support an optional local context. That the Decimal() constructor ignores context by default is an extension to the spec. We must supply a context-honoring from-string operation to meet the spec. I recommend against any concept of "local context" in any operation -- it complicates the model and isn't necessary.
So, we decided to use a context method to create a Decimal that will use (only to be created) that context in particular (for further operations it will use the context of the thread). But, a method with what name?
Tim Peters proposes three methods to create from diverse sources (from_string, from_int, from_float). I proposed to use one method, create_decimal(), without caring about the data type. Michael Chermside: "The name just fits my brain. The fact that it uses the context is obvious from the fact that it's Context method".
The community agreed with that. I think that it's OK because a newbie will not be using the creation method from Context (the separate method in Decimal to construct from float is just to prevent newbies from encountering binary floating point issues).
So, in short, if you want to create a Decimal instance using a particular context (that will be used just at creation time and not any further), you'll have to use a method of that context:
# n is any datatype accepted in Decimal(n) plus float
mycontext.create_decimal(n)
Example:
>>> # create a standard decimal instance
>>> Decimal("11.2233445566778899")
Decimal("11.2233445566778899")
>>>
>>> # create a decimal instance using the thread context
>>> thread_context = getcontext()
>>> thread_context.prec
28
>>> thread_context.create_decimal("11.2233445566778899")
Decimal("11.2233445566778899")
>>>
>>> # create a decimal instance using other context
>>> other_context = thread_context.copy()
>>> other_context.prec = 4
>>> other_context.create_decimal("11.2233445566778899")
Decimal("11.22")
Implicit construction
As the implicit construction is the consequence of an operation, it will be affected by the context as is detailed in each point.
John Roth suggested that "The other type should be handled in the same way the decimal() constructor would handle it". But Alex Martelli thinks that
this total breach with Python tradition would be a terrible mistake. 23+"43" is NOT handled in the same way as 23+int("45"), and a VERY good thing that is too. It's a completely different thing for a user to EXPLICITLY indicate they want construction (conversion) and to just happen to sum two objects one of which by mistake could be a string.
So, here I define the behaviour again for each data type.
From int or long
An int or long is treated like a Decimal explicitly constructed from Decimal(str(x)) in the current context (meaning that the to-string rules for rounding are applied and the appropriate flags are set). This guarantees that expressions like Decimal('1234567') + 13579 match the mental model of Decimal('1234567') + Decimal('13579'). That model works because all integers are representable as strings without representation error.
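That mental model, sketched with the module as released:

```python
from decimal import Decimal

# An int operand behaves like Decimal(str(x)) in the current context.
print(Decimal('1234567') + 13579)             # 1248146
print(Decimal('1234567') + Decimal('13579'))  # 1248146 -- the same result
```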
From string
Everybody agrees to raise an exception here.
From float
Aahz is strongly opposed to interact with float, suggesting an explicit conversion:
The problem is that Decimal is capable of greater precision, accuracy, and range than float.
The example of the valid python expression, 35 + 1.1, seems to suggest that Decimal(35) + 1.1 should also be valid. However, a closer look shows that it only demonstrates the feasibility of integer to floating point conversions. Hence, the correct analog for decimal floating point is 35 + Decimal(1.1). Both coercions, int-to-float and int-to-Decimal, can be done without incurring representation error.
The question of how to coerce between binary and decimal floating point is more complex. I proposed allowing the interaction with float, making an exact conversion and raising ValueError if it exceeds the precision of the current context (this is maybe too tricky, because for example with a precision of 9, Decimal(35) + 1.2 is OK but Decimal(35) + 1.1 raises an error).
This turned out to be too tricky. So tricky, in fact, that c.l.p agreed to raise TypeError in this case: you cannot mix Decimal and float.
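This is indeed how the module behaves for arithmetic in the released implementation (a sketch):

```python
from decimal import Decimal

# Implicit mixing of Decimal and float in arithmetic is rejected outright.
try:
    Decimal(35) + 1.1
except TypeError as exc:
    print("mixing Decimal and float raises TypeError:", exc)
```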
From Decimal
There isn't any issue here.
Use of Context
In the last pre-PEP I said that "The Context must be omnipresent, meaning that changes to it affects all the current and future Decimal instances". I was wrong. In response, John Roth said:
The context should be selectable for the particular usage. That is, it should be possible to have several different contexts in play at one time in an application.
In comp.lang.python, Aahz explained that the idea is to have a "context per thread". So, all the instances of a thread belong to a context, and you can change a context in thread A (and the behaviour of the instances of that thread) without changing anything in thread B.
Also, and again correcting me, he said:
(the) Context applies only to operations, not to Decimal instances; changing the Context does not affect existing instances if there are no operations on them.
Arguing about special cases when there's need to perform operations with other rules that those of the current context, Tim Peters said that the context will have the operations as methods. This way, the user "can create whatever private context object(s) it needs, and spell arithmetic as explicit method calls on its private context object(s), so that the default thread context object is neither consulted nor modified".
Python Usability
Decimal should support the basic arithmetic (+, -, *, /, //, **, %, divmod) and comparison (==, !=, <, >, <=, >=, cmp) operators in the following cases (check Implicit Construction to see what types could OtherType be, and what happens in each case):
- Decimal op Decimal
- Decimal op otherType
- otherType op Decimal
- Decimal op= Decimal
- Decimal op= otherType
Decimal should support unary operators (-, +, abs).
repr() should round trip, meaning that:
m = Decimal(...)
m == eval(repr(m))
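The round-trip property, checked against the module as released (whose repr() uses the Decimal('...') form):

```python
from decimal import Decimal

m = Decimal('1.1')
print(repr(m))             # Decimal('1.1')
print(m == eval(repr(m)))  # True -- repr() round-trips
```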
Decimal should be immutable.
Decimal should support the built-in methods:
- min, max
- float, int, long
- str, repr
- hash
- bool (0 is false, otherwise true)
There's been some discussion in python-dev about the behaviour of hash(). The community agrees that if the values are the same, the hashes of those values should also be the same. So, while Decimal(25) == 25 is True, hash(Decimal(25)) should be equal to hash(25).
The detail is that you can NOT compare Decimal to floats or strings, so we should not worry about them giving the same hashes. In short:
hash(n) == hash(Decimal(n)) # Only if n is int, long, or Decimal
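The invariant can be checked against the module as released (a sketch):

```python
from decimal import Decimal

# Equal values hash equally, so Decimals work as dict keys alongside ints.
assert Decimal(25) == 25
assert hash(Decimal(25)) == hash(25)

d = {25: 'int key'}
print(d[Decimal(25)])  # int key -- same hash, equal value
```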
Regarding str() and repr() behaviour, Ka-Ping Yee proposes that repr() have the same behaviour as str() and Tim Peters proposes that str() behave like the to-scientific-string operation from the Spec.
This is possible because (from Aahz): "The string form already contains all the necessary information to reconstruct a Decimal object".
And it also complies with the Spec; Tim Peters:
There's no requirement to have a method named "to_sci_string", the only requirement is that some way to spell to-sci-string's functionality be supplied. The meaning of to-sci-string is precisely specified by the standard, and is a good choice for both str(Decimal) and repr(Decimal).
Documentation
This section explains all the public methods and attributes of Decimal and Context.
Decimal Attributes
Decimal has no public attributes. The internal information is stored in slots and should not be accessed by end users.
Decimal Methods
Following are the conversion and arithmetic operations defined in the Spec, and how that functionality can be achieved with the actual implementation.
to-scientific-string: Use builtin function str():
>>> d = Decimal('123456789012.345')
>>> str(d)
'1.23456789E+11'

to-engineering-string: Use method to_eng_string():

>>> d = Decimal('123456789012.345')
>>> d.to_eng_string()
'123.456789E+9'

to-number: Use Context method create_decimal(). The standard constructor or from_float() constructor cannot be used because these do not use the context (as is specified in the Spec for this conversion).

abs: Use builtin function abs():

>>> d = Decimal('-15.67')
>>> abs(d)
Decimal('15.67')

add: Use operator +:

>>> d = Decimal('15.6')
>>> d + 8
Decimal('23.6')

subtract: Use operator -:

>>> d = Decimal('15.6')
>>> d - 8
Decimal('7.6')

compare: Use method compare(). This method (and not the built-in function cmp()) should only be used when dealing with special values:

>>> d = Decimal('-15.67')
>>> nan = Decimal('NaN')
>>> d.compare(23)
'-1'
>>> d.compare(nan)
'NaN'
>>> cmp(d, 23)
-1
>>> cmp(d, nan)
1

divide: Use operator /:

>>> d = Decimal('-15.67')
>>> d / 2
Decimal('-7.835')

divide-integer: Use operator //:

>>> d = Decimal('-15.67')
>>> d // 2
Decimal('-7')

max: Use method max(). Only use this method (and not the built-in function max()) when dealing with special values:

>>> d = Decimal('15')
>>> nan = Decimal('NaN')
>>> d.max(8)
Decimal('15')
>>> d.max(nan)
Decimal('NaN')

min: Use method min(). Only use this method (and not the built-in function min()) when dealing with special values:

>>> d = Decimal('15')
>>> nan = Decimal('NaN')
>>> d.min(8)
Decimal('8')
>>> d.min(nan)
Decimal('NaN')

minus: Use unary operator -:

>>> d = Decimal('-15.67')
>>> -d
Decimal('15.67')

plus: Use unary operator +:

>>> d = Decimal('-15.67')
>>> +d
Decimal('-15.67')

multiply: Use operator *:

>>> d = Decimal('5.7')
>>> d * 3
Decimal('17.1')

normalize: Use method normalize():

>>> d = Decimal('123.45000')
>>> d.normalize()
Decimal('123.45')
>>> d = Decimal('120.00')
>>> d.normalize()
Decimal('1.2E+2')

quantize: Use method quantize():

>>> d = Decimal('2.17')
>>> d.quantize(Decimal('0.001'))
Decimal('2.170')
>>> d.quantize(Decimal('0.1'))
Decimal('2.2')

remainder: Use operator %:

>>> d = Decimal('10')
>>> d % 3
Decimal('1')
>>> d % 6
Decimal('4')

remainder-near: Use method remainder_near():

>>> d = Decimal('10')
>>> d.remainder_near(3)
Decimal('1')
>>> d.remainder_near(6)
Decimal('-2')

round-to-integral-value: Use method to_integral():

>>> d = Decimal('-123.456')
>>> d.to_integral()
Decimal('-123')

same-quantum: Use method same_quantum():

>>> d = Decimal('123.456')
>>> d.same_quantum(Decimal('0.001'))
True
>>> d.same_quantum(Decimal('0.01'))
False

square-root: Use method sqrt():

>>> d = Decimal('123.456')
>>> d.sqrt()
Decimal('11.1110756')

power: Use operator **:

>>> d = Decimal('12.56')
>>> d ** 2
Decimal('157.7536')
Following are other methods and why they exist:
adjusted(): Returns the adjusted exponent. This concept is defined in the Spec: the adjusted exponent is the value of the exponent of a number when that number is expressed as though in scientific notation with one digit before any decimal point:
>>> d = Decimal('12.56')
>>> d.adjusted()
1

from_float(): Class method to create instances from float data types:

>>> d = Decimal.from_float(12.35)
>>> d
Decimal('12.3500000')

as_tuple(): Show the internal structure of the Decimal: the (sign, digits, exponent) triple. This method is not required by the Spec, but Tim Peters proposed it and the community agreed to have it (it's useful for developing and debugging):

>>> d = Decimal('123.4')
>>> d.as_tuple()
(0, (1, 2, 3, 4), -1)
>>> d = Decimal('-2.34e5')
>>> d.as_tuple()
(1, (2, 3, 4), 3)
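In the decimal module as shipped, these helpers behave essentially as shown, with two caveats: as_tuple() now returns a named DecimalTuple, and from_float() preserves the float's exact binary value rather than a rounded form. A quick check of the stable parts:

```python
from decimal import Decimal

d = Decimal('123.4')
sign, digits, exponent = d.as_tuple()
assert (sign, digits, exponent) == (0, (1, 2, 3, 4), -1)

# adjusted(): the exponent when written with one digit before the
# point, i.e. 123.4 == 1.234E+2.
assert d.adjusted() == 2

# Named tuples still compare equal to plain tuples.
assert Decimal('-2.34e5').as_tuple() == (1, (2, 3, 4), 3)
```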
Context Attributes
These are the attributes that can be changed to modify the context.
prec (int): the precision:

>>> c.prec
9

rounding (str): the rounding type (how to round):

>>> c.rounding
'half_even'

trap_enablers (dict): if trap_enablers[exception] == 1, an exception is raised when the corresponding condition occurs:

>>> c.trap_enablers[Underflow]
0
>>> c.trap_enablers[Clamped]
0

flags (dict): when a condition occurs, flags[exception] is incremented (whether or not the trap_enabler is set). Should be reset by the user of the Decimal instance:

>>> c.flags[Underflow]
0
>>> c.flags[Clamped]
0

Emin (int): minimum exponent:

>>> c.Emin
-999999999

Emax (int): maximum exponent:

>>> c.Emax
999999999

capitals (int): boolean flag to use 'E' (True/1) or 'e' (False/0) in the string representation (for example, '1.32E+2' or '1.32e+2'):

>>> c.capitals
1
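In the module as shipped these knobs exist under slightly different spellings: trap_enablers became traps (keyed by signal classes), and the rounding values are module-level constants. A sketch under those (post-PEP) names:

```python
from decimal import Context, Decimal, Inexact, ROUND_HALF_EVEN

c = Context(prec=9, Emin=-999999999, Emax=999999999)
assert c.rounding == ROUND_HALF_EVEN
assert not c.traps[Inexact]        # Inexact is not trapped by default

c.divide(Decimal(1), Decimal(3))   # 0.333333333 is an inexact result...
assert c.flags[Inexact]            # ...so the flag was raised, no exception
assert c.capitals == 1             # 'E' rather than 'e' in string output
```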
Context Methods
The following methods comply with Decimal functionality from the Spec. Be aware that the operations that are called through a specific context use that context and not the thread context.
To use these methods, note that the syntax differs depending on whether the operation is unary or binary, for example:
>>> mycontext.abs(Decimal('-2'))
'2'
>>> mycontext.multiply(Decimal('2.3'), 5)
'11.5'
So, the following are the Spec operations and conversions and how to achieve them through a context (where d is a Decimal instance and n a number that can be used in an Implicit construction):
- to-scientific-string: to_sci_string(d)
- to-engineering-string: to_eng_string(d)
- to-number: create_decimal(number), see Explicit construction for number.
- abs: abs(d)
- add: add(d, n)
- subtract: subtract(d, n)
- compare: compare(d, n)
- divide: divide(d, n)
- divide-integer: divide_int(d, n)
- max: max(d, n)
- min: min(d, n)
- minus: minus(d)
- plus: plus(d)
- multiply: multiply(d, n)
- normalize: normalize(d)
- quantize: quantize(d, d)
- remainder: remainder(d, n)
- remainder-near: remainder_near(d, n)
- round-to-integral-value: to_integral(d)
- same-quantum: same_quantum(d, d)
- square-root: sqrt(d)
- power: power(d, n)
The divmod(d, n) method is also supported through Context.
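These per-context spellings survived into the shipped Context API (power later also grew an optional third modulo argument); a brief check under that assumption:

```python
from decimal import Context, Decimal

ctx = Context(prec=9)
d = Decimal('-15.67')

assert ctx.abs(d) == Decimal('15.67')
# divide-integer truncates toward zero: -15.67 / 2 -> -7
assert ctx.divide_int(d, Decimal(2)) == Decimal('-7')
# remainder-near picks the remainder nearest zero: 10 = 2*6 - 2
assert ctx.remainder_near(Decimal(10), Decimal(6)) == Decimal('-2')
assert ctx.divmod(Decimal(10), Decimal(3)) == (Decimal('3'), Decimal('1'))
```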
These are methods that return useful information from the Context:
Etiny(): Minimum exponent considering precision.
>>> c.Emin
-999999999
>>> c.Etiny()
-1000000007
Etop(): Maximum exponent considering precision.
>>> c.Emax
999999999
>>> c.Etop()
999999991
copy(): Returns a copy of the context.
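Both values follow directly from the other attributes: Etiny() is Emin - prec + 1 and Etop() is Emax - prec + 1, which the shipped module confirms:

```python
from decimal import Context

c = Context(prec=9, Emin=-999999999, Emax=999999999)
# Etiny: the lowest exponent a subnormal result can have.
assert c.Etiny() == c.Emin - c.prec + 1 == -1000000007
# Etop: the highest exponent the most significant digit can have.
assert c.Etop() == c.Emax - c.prec + 1 == 999999991
```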
Reference Implementation
As of Python 2.4-alpha, the code has been checked into the standard library. The latest version is available from:
http://svn.python.org/view/python/trunk/Lib/decimal.py
The test cases are here:
http://svn.python.org/view/python/trunk/Lib/test/test_decimal.py
References
| [1] | ANSI standard X3.274-1996 (Programming Language REXX): http://www.rexxla.org/Standards/ansi.html |
| [2] | General Decimal Arithmetic specification (Cowlishaw): http://speleotrove.com/decimal/decarith.html (related documents and links at http://speleotrove.com/decimal/) |
| [3] | ANSI/IEEE standard 854-1987 (Radix-Independent Floating-Point Arithmetic): http://www.cs.berkeley.edu/~ejr/projects/754/private/drafts/854-1987/dir.html (unofficial text; official copies can be ordered from http://standards.ieee.org/catalog/ordering.html) |
| [4] | Tim Peters' FixedPoint at SourceForge: http://fixedpoint.sourceforge.net/ |
| [5] | IEEE 754 revision: http://grouper.ieee.org/groups/754/revision.html |
| [6] | IEEE 754 references: http://babbage.cs.qc.edu/courses/cs341/IEEE-754references.html |
Copyright
This document has been placed in the public domain.
pep-0328 Imports: Multi-Line and Absolute/Relative
| PEP: | 328 |
|---|---|
| Title: | Imports: Multi-Line and Absolute/Relative |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Aahz <aahz at pythoncraft.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 21-Dec-2003 |
| Python-Version: | 2.4, 2.5, 2.6 |
| Post-History: | 8-Mar-2004 |
Contents
Abstract
The import statement has two problems:
- Long import statements can be difficult to write, requiring various contortions to fit Pythonic style guidelines.
- Imports can be ambiguous in the face of packages; within a package, it's not clear whether import foo refers to a module within the package or some module outside the package. (More precisely, a local module or package can shadow another hanging directly off sys.path.)
For the first problem, it is proposed that parentheses be permitted to enclose multiple names, thus allowing Python's standard mechanisms for multi-line values to apply. For the second problem, it is proposed that all import statements be absolute by default (searching sys.path only) with special syntax (leading dots) for accessing package-relative imports.
Timeline
In Python 2.5, you must enable the new absolute import behavior with
from __future__ import absolute_import
You may use relative imports freely. In Python 2.6, any import statement that results in an intra-package import will raise a DeprecationWarning (this also applies to from <> import that fails to use the relative import syntax).
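The __future__ module in shipped Pythons records exactly this timeline: absolute_import became available as an opt-in in 2.5 and mandatory in 3.0:

```python
import __future__

feat = __future__.absolute_import
# (major, minor, micro, release level, serial) of the first release
# where the feature could be enabled with a __future__ import...
assert feat.getOptionalRelease() == (2, 5, 0, 'alpha', 1)
# ...and of the release where it became the only behaviour.
assert feat.getMandatoryRelease() == (3, 0, 0, 'alpha', 0)
```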
Rationale for Parentheses
Currently, if you want to import a lot of names from a module or package, you have to choose one of several unpalatable options:
Write a long line with backslash continuations:

from Tkinter import Tk, Frame, Button, Entry, Canvas, Text, \
     LEFT, DISABLED, NORMAL, RIDGE, END

Write multiple import statements:

from Tkinter import Tk, Frame, Button, Entry, Canvas, Text
from Tkinter import LEFT, DISABLED, NORMAL, RIDGE, END
(import * is not an option ;-)
Instead, it should be possible to use Python's standard grouping mechanism (parentheses) to write the import statement:
from Tkinter import (Tk, Frame, Button, Entry, Canvas, Text,
LEFT, DISABLED, NORMAL, RIDGE, END)
This part of the proposal had BDFL approval from the beginning.
Parentheses support was added to Python 2.4.
Rationale for Absolute Imports
In Python 2.4 and earlier, if you're reading a module located inside a package, it is not clear whether
import foo
refers to a top-level module or to another module inside the package. As Python's library expands, more and more existing package internal modules suddenly shadow standard library modules by accident. It's a particularly difficult problem inside packages because there's no way to specify which module is meant. To resolve the ambiguity, it is proposed that foo will always be a module or package reachable from sys.path. This is called an absolute import.
The python-dev community chose absolute imports as the default because they're the more common use case and because absolute imports can provide all the functionality of relative (intra-package) imports -- albeit at the cost of difficulty when renaming package pieces higher up in the hierarchy or when moving one package inside another.
Because this represents a change in semantics, absolute imports will be optional in Python 2.5 and 2.6 through the use of
from __future__ import absolute_import
This part of the proposal had BDFL approval from the beginning.
Rationale for Relative Imports
With the shift to absolute imports, the question arose whether relative imports should be allowed at all. Several use cases were presented, the most important of which is being able to rearrange the structure of large packages without having to edit sub-packages. In addition, a module inside a package can't easily import itself without relative imports.
Guido approved of the idea of relative imports, but there has been a lot of disagreement on the spelling (syntax). There does seem to be agreement that relative imports will require listing specific names to import (that is, import foo as a bare term will always be an absolute import).
Here are the contenders:
One from Guido:
from .foo import bar
and
from ...foo import bar
These two forms have a couple of different suggested semantics. One semantic is to make each dot represent one level. There have been many complaints about the difficulty of counting dots. Another option is to only allow one level of relative import. That misses a lot of functionality, and people still complained about missing the dot in the one-dot form. The final option is to define an algorithm for finding relative modules and packages; the objection here is "Explicit is better than implicit". (The algorithm proposed is "search up from current package directory until the ultimate package parent gets hit".)
Some people have suggested other punctuation as the separator, such as "-" or "^".
Some people have suggested using "*":
from *.foo import bar
The next set of options is conflated from several posters:
from __pkg__.__pkg__ import
and
from .__parent__.__parent__ import
Many people (Guido included) think these look ugly, but they are clear and explicit. Overall, more people prefer __pkg__ as the shorter option.
One suggestion was to allow only sibling references. In other words, you would not be able to use relative imports to refer to modules higher in the package tree. You would then be able to do either
from .spam import eggs
or
import .spam.eggs
Some people favor allowing indexed parents:
from -2.spam import eggs
In this scenario, importing from the current directory would be a simple
from .spam import eggs
Finally, some people dislike the way you have to change import to from ... import when you want to dig inside a package. They suggest completely rewriting the import syntax:
from MODULE import NAMES as RENAME searching HOW
or
import NAMES as RENAME from MODULE searching HOW

[from NAMES] [in WHERE] import ...

However, this most likely could not be implemented for Python 2.5 (too big a change), and allowing relative imports is sufficiently critical that we need something now (given that the standard import will change to absolute import). More than that, this proposed syntax has several open questions:
What is the precise proposed syntax? (Which clauses are optional under which circumstances?)
How strongly does the searching clause bind? In other words, do you write:
import foo as bar searching XXX, spam as ham searching XXX
or:
import foo as bar, spam as ham searching XXX
Guido's Decision
Guido has Pronounced [1] that relative imports will use leading dots. A single leading dot indicates a relative import, starting with the current package. Two or more leading dots give a relative import to the parent(s) of the current package, one level per dot after the first. Here's a sample package layout:
package/
__init__.py
subpackage1/
__init__.py
moduleX.py
moduleY.py
subpackage2/
__init__.py
moduleZ.py
moduleA.py
Assuming that the current file is either moduleX.py or subpackage1/__init__.py, following are correct usages of the new syntax:
from .moduleY import spam
from .moduleY import spam as ham
from . import moduleY
from ..subpackage1 import moduleY
from ..subpackage2.moduleZ import eggs
from ..moduleA import foo
from ...package import bar
from ...sys import path
Note that while that last case is legal, it is certainly discouraged ("insane" was the word Guido used).
Relative imports must always use from <> import; import <> is always absolute. Of course, absolute imports can use from <> import by omitting the leading dots. The reason import .foo is prohibited is because after
import XXX.YYY.ZZZ
then
XXX.YYY.ZZZ
is usable in an expression. But
.moduleY
is not usable in an expression.
Relative Imports and __name__
Relative imports use a module's __name__ attribute to determine that module's position in the package hierarchy. If the module's name does not contain any package information (e.g. it is set to '__main__') then relative imports are resolved as if the module were a top level module, regardless of where the module is actually located on the file system.
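The same package-relative resolution is exposed programmatically by importlib, which resolves leading dots against an explicit package argument; a small illustration (using os.path only as a convenient stand-in for a real package layout):

```python
import importlib
import os.path

# '.path' relative to package 'os' resolves the same way
# 'from . import path' would inside os/__init__.py.
mod = importlib.import_module('.path', package='os')
assert mod is os.path
```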
Relative Imports and Indirection Entries in sys.modules
When packages were introduced, the concept of an indirection entry in sys.modules came into existence [2]. When an entry in sys.modules for a module within a package had a value of None, it represented that the module actually referenced the top-level module. For instance, 'Sound.Effects.string' might have a value of None in sys.modules. That meant any import that resolved to that name actually was to import the top-level 'string' module.
This introduced an optimization for when a relative import was meant to resolve to an absolute import. But since this PEP makes a very clear delineation between absolute and relative imports, this optimization is no longer needed. When absolute/relative imports become the only import semantics available then indirection entries in sys.modules will no longer be supported.
References
For more background, see the following python-dev threads:
- Re: Christmas Wishlist
- Re: Python-Dev Digest, Vol 5, Issue 57
- Relative import
- Another Strategy for Relative Import
| [1] | http://mail.python.org/pipermail/python-dev/2004-March/043739.html |
| [2] | https://www.python.org/doc/essays/packages/ |
Copyright
This document has been placed in the public domain.
pep-0329 Treating Builtins as Constants in the Standard Library
| PEP: | 329 |
|---|---|
| Title: | Treating Builtins as Constants in the Standard Library |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Raymond Hettinger <python at rcn.com> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 18-Apr-2004 |
| Python-Version: | 2.4 |
| Post-History: | 18-Apr-2004 |
Contents
Abstract
The proposal is to add a function for treating builtin references as constants and to apply that function throughout the standard library.
Status
This PEP was rejected by its author. Though the ASPN recipe was well received, there was less willingness to consider this for inclusion in the core distribution.
The Jython implementation does not use byte codes, so its performance would suffer if the current _len=len optimizations were removed.
Also, altering byte codes is one of the least clean ways to improve performance and enable cleaner coding. A more robust solution would likely involve compiler pragma directives or metavariables indicating what can be optimized (similar to const/volatile declarations).
Motivation
The library contains code such as _len=len which is intended to create fast local references instead of slower global lookups. Though necessary for performance, these constructs clutter the code and are usually incomplete (missing many opportunities).
If the proposal is adopted, those constructs could be eliminated from the code base and at the same time improve upon their results in terms of performance.
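The hand-written idiom the proposal would replace relies on a default argument to turn a repeated global lookup into a fast local one; a minimal sketch of the before and after:

```python
def count_global(items):
    total = 0
    for item in items:
        total += len(item)    # len is fetched from builtins on every pass
    return total

def count_local(items, _len=len):
    total = 0
    for item in items:
        total += _len(item)   # _len is a local, bound once at def time
    return total

data = ['a', 'bb', 'ccc']
assert count_global(data) == count_local(data) == 6
```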
There are currently over a hundred instances of while 1 in the library. They were not replaced with the more readable while True for performance reasons (the compiler cannot eliminate the test because True is not known to always be a constant). Conversion of True to a constant will clarify the code while retaining performance.
Many other basic Python operations run much slower because of global lookups. In try/except statements, the trapped exceptions are dynamically looked up before testing whether they match. Similarly, simple identity tests such as while x is not None require the None variable to be re-looked up on every pass. Builtin lookups are especially egregious because the enclosing global scope must be checked first. These lookup chains devour cache space that is best used elsewhere.
In short, if the proposal is adopted, the code will become cleaner and performance will improve across the board.
Proposal
Add a module called codetweaks.py which contains two functions, bind_constants() and bind_all(). The first function performs constant binding and the second recursively applies it to every function and class in a target module.
For most modules in the standard library, add a pair of lines near the end of the script:
import codetweaks, sys
codetweaks.bind_all(sys.modules[__name__])
In addition to binding builtins, there are some modules (like sre_compile) where it also makes sense to bind module variables as well as builtins into constants.
Questions and Answers
Will this make everyone divert their attention to optimization issues?
Because it is done automatically, it reduces the need to think about optimizations.
In a nutshell, how does it work?
Every function has attributes containing its bytecodes (the language of the Python virtual machine) and a table of constants. The bind function scans the bytecodes for LOAD_GLOBAL instructions and checks whether the value is already known. If so, it adds that value to the constants table and replaces the opcode with LOAD_CONST.
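On a modern CPython the LOAD_GLOBAL instructions in question are easy to see with dis; the proposal's bind function would rewrite them to LOAD_CONST:

```python
import dis

def f():
    return len('abc')

# len lives in builtins, so the compiler emits a LOAD_GLOBAL for it;
# the string 'abc' is already in the constants table.
opnames = [ins.opname for ins in dis.get_instructions(f)]
assert 'LOAD_GLOBAL' in opnames
assert 'LOAD_CONST' in opnames
```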
When does it work?
When a module is imported for the first time, Python compiles the bytecode and runs the binding optimization. Subsequent imports just re-use the previous work. Each session repeats this process (the results are not saved in pyc files).
How do you know this works?
I implemented it, applied it to every module in the library, and the test suite ran without exception.
What if the module defines a variable shadowing a builtin?
This does happen. For instance, True can be redefined at the module level as True = (1==1). The sample implementation below detects the shadowing and leaves the global lookup unchanged.
Are you the first person to recognize that most global lookups are for values that never change?
No, this has long been known. Skip Montanaro provides an eloquent explanation in [1].
What if I want to replace the builtins module and supply my own implementations?
Either do this before importing a module, or just reload the module, or disable codetweaks.py (it will have a disable flag).
How susceptible is this module to changes in Python's byte coding?
It imports opcode.py to protect against renumbering. Also, it uses LOAD_CONST and LOAD_GLOBAL which are fundamental and have been around forever. That notwithstanding, the coding scheme could change and this implementation would have to change along with modules like dis which also rely on the current coding scheme.
What is the effect on startup time?
I could not measure a difference. None of the startup modules are bound except for warnings.py. Also, the binding function is very fast, making just a single pass over the code string in search of the LOAD_GLOBAL opcode.
Sample Implementation
Here is a sample implementation for codetweaks.py:
from types import ClassType, FunctionType
from opcode import opmap, HAVE_ARGUMENT, EXTENDED_ARG
LOAD_GLOBAL, LOAD_CONST = opmap['LOAD_GLOBAL'], opmap['LOAD_CONST']
ABORT_CODES = (EXTENDED_ARG, opmap['STORE_GLOBAL'])
def bind_constants(f, builtin_only=False, stoplist=[], verbose=False):
""" Return a new function with optimized global references.
Replaces global references with their currently defined values.
If not defined, the dynamic (runtime) global lookup is left undisturbed.
If builtin_only is True, then only builtins are optimized.
Variable names in the stoplist are also left undisturbed.
If verbose is True, prints each substitution as it occurs.
"""
import __builtin__
env = vars(__builtin__).copy()
stoplist = dict.fromkeys(stoplist)
if builtin_only:
stoplist.update(f.func_globals)
else:
env.update(f.func_globals)
co = f.func_code
newcode = map(ord, co.co_code)
newconsts = list(co.co_consts)
codelen = len(newcode)
i = 0
while i < codelen:
opcode = newcode[i]
if opcode in ABORT_CODES:
return f # for simplicity, only optimize common cases
if opcode == LOAD_GLOBAL:
oparg = newcode[i+1] + (newcode[i+2] << 8)
name = co.co_names[oparg]
if name in env and name not in stoplist:
value = env[name]
try:
pos = newconsts.index(value)
except ValueError:
pos = len(newconsts)
newconsts.append(value)
newcode[i] = LOAD_CONST
newcode[i+1] = pos & 0xFF
newcode[i+2] = pos >> 8
if verbose:
print name, '-->', value
i += 1
if opcode >= HAVE_ARGUMENT:
i += 2
codestr = ''.join(map(chr, newcode))
codeobj = type(co)(co.co_argcount, co.co_nlocals, co.co_stacksize,
co.co_flags, codestr, tuple(newconsts), co.co_names,
co.co_varnames, co.co_filename, co.co_name,
co.co_firstlineno, co.co_lnotab, co.co_freevars,
co.co_cellvars)
return type(f)(codeobj, f.func_globals, f.func_name, f.func_defaults,
f.func_closure)
def bind_all(mc, builtin_only=False, stoplist=[], verbose=False):
"""Recursively apply bind_constants() to functions in a module or class.
Use as the last line of the module (after everything is defined, but
before test code).
In modules that need modifiable globals, set builtin_only to True.
"""
for k, v in vars(mc).items():
if type(v) is FunctionType:
newv = bind_constants(v, builtin_only, stoplist, verbose)
setattr(mc, k, newv)
elif type(v) in (type, ClassType):
bind_all(v, builtin_only, stoplist, verbose)
def f(): pass
try:
f.func_code.code
except AttributeError: # detect non-CPython environments
bind_all = lambda *args, **kwds: 0
del f
import sys
bind_all(sys.modules[__name__]) # Optimizer, optimize thyself!
Note the automatic detection of a non-CPython environment that does not have bytecodes [3]. In that situation, the bind functions would simply return the original function unchanged. This assures that the two line additions to library modules do not impact other implementations.
The final code should add a flag to make it easy to disable binding.
References
| [1] | Optimizing Global Variable/Attribute Access http://www.python.org/dev/peps/pep-0266/ |
| [2] | ASPN Recipe for a non-private implementation http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/277940 |
| [3] | Differences between CPython and Jython http://www.jython.org/cgi-bin/faqw.py?req=show&file=faq01.003.htp |
Copyright
This document has been placed in the public domain.
pep-0330 Python Bytecode Verification
| PEP: | 330 |
|---|---|
| Title: | Python Bytecode Verification |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Michel Pelletier <michel at users.sourceforge.net> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 17-Jun-2004 |
| Python-Version: | 2.6? |
| Post-History: |
Abstract
If Python Virtual Machine (PVM) bytecode is not "well-formed" it
is possible to crash or exploit the PVM by causing various errors
such as under/overflowing the value stack or reading/writing into
arbitrary areas of the PVM program space. Most of these kinds of
errors can be eliminated by verifying that PVM bytecode does not
violate a set of simple constraints before execution.
This PEP proposes a set of constraints on the format and structure
of Python Virtual Machine (PVM) bytecode and provides an
implementation in Python of this verification process.
Pronouncement
Guido believes that a verification tool has some value. If
someone wants to add it to Tools/scripts, no PEP is required.
Such a tool may have value for validating the output from
"bytecodehacks" or from direct edits of PYC files. As security
measure, its value is somewhat limited because perfectly valid
bytecode can still do horrible things. That situation could
change if the concept of restricted execution were to be
successfully resurrected.
Motivation
The Python Virtual Machine executes Python programs that have been
compiled from the Python language into a bytecode representation.
The PVM assumes that any bytecode being executed is "well-formed"
with regard to a number of implicit constraints. Some of these
constraints are checked at run-time, but most of them are not due
to the overhead they would create.
When running in debug mode the PVM does do several run-time checks
to ensure that any particular bytecode cannot violate these
constraints that, to a degree, prevent bytecode from crashing or
exploiting the interpreter. These checks add a measurable
overhead to the interpreter, and are typically turned off in
common use.
Bytecode that is not well-formed and executed by a PVM not running
in debug mode may create a variety of fatal and non-fatal errors.
Typically, ill-formed code will cause the PVM to seg-fault and
cause the OS to immediately and abruptly terminate the
interpreter.
Conceivably, ill-formed bytecode could exploit the interpreter and
allow Python bytecode to execute arbitrary C-level machine
instructions or to modify private, internal data structures in the
interpreter. If used cleverly this could subvert any form of
security policy an application may want to apply to its objects.
Practically, it would be difficult for a malicious user to
"inject" invalid bytecode into a PVM for the purposes of
exploitation, but not impossible. Buffer overflow and memory
overwrite attacks are commonly understood, particularly when the
exploit payload is transmitted unencrypted over a network or when
a file or network security permission weakness is used as a
foothold for further attacks.
Ideally, no bytecode should ever be allowed to read or write
underlying C-level data structures to subvert the operation of the
PVM, whether the bytecode was maliciously crafted or not. A
simple pre-execution verification step could ensure that bytecode
cannot over/underflow the value stack or access other sensitive
areas of PVM program space at run-time.
This PEP proposes several validation steps that should be taken on
Python bytecode before it is executed by the PVM so that it
complies with static and structural constraints on its instructions
and their operands. These steps are simple and catch a large
class of invalid bytecode that can cause crashes. There is also
some possibility that some run-time checks can be eliminated up
front by a verification pass.
There is, of course, no way to verify that bytecode is "completely
safe", for every definition of complete and safe. Even with
bytecode verification, Python programs can and most likely in the
future will seg-fault for a variety of reasons and continue to
cause many different classes of run-time errors, fatal or not.
The verification step proposed here simply plugs an easy hole that
can cause a large class of fatal and subtle errors at the bytecode
level.
Currently, the Java Virtual Machine (JVM) verifies Java bytecode
in a way very similar to what is proposed here. The JVM
Specification version 2 [1], Sections 4.8 and 4.9 were therefore
used as a basis for some of the constraints explained below. Any
Python bytecode verification implementation at a minimum must
enforce these constraints, but may not be limited to them.
Static Constraints on Bytecode Instructions
1. The bytecode string must not be empty. (len(co_code) > 0).
2. The bytecode string cannot exceed a maximum size
(len(co_code) < sizeof(unsigned char) - 1).
3. The first instruction in the bytecode string begins at index 0.
4. Only valid byte-codes with the correct number of operands can
be in the bytecode string.
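The first and fourth constraints can be sketched against a live code object with the standard dis and opcode modules (a partial check only; full operand validation needs more work):

```python
import dis
import opcode

def f():
    return 42

co = f.__code__
valid_opcodes = set(opcode.opmap.values())

# Constraint 1: the bytecode string is non-empty.
assert len(co.co_code) > 0
# Constraint 4 (in part): every decoded instruction is a known opcode.
assert all(ins.opcode in valid_opcodes for ins in dis.get_instructions(f))
```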
Static Constraints on Bytecode Instruction Operands
1. The target of a jump instruction must be within the code
boundaries and must fall on an instruction, never between an
instruction and its operands.
2. The operand of a LOAD_* instruction must be a valid index into
its corresponding data structure.
3. The operand of a STORE_* instruction must be a valid index
into its corresponding data structure.
Structural Constraints between Bytecode Instructions
1. Each instruction must only be executed with the appropriate
number of arguments in the value stack, regardless of the
execution path that leads to its invocation.
2. If an instruction can be executed along several different
execution paths, the value stack must have the same depth prior
to the execution of the instruction, regardless of the path
taken.
3. At no point during execution can the value stack grow to a
depth greater than that implied by co_stacksize.
4. Execution never falls off the bottom of co_code.
Implementation
This PEP is the working document for a Python bytecode
verification implementation written in Python. This
implementation is not used implicitly by the PVM before executing
any bytecode, but is to be used explicitly by users concerned
about possibly invalid bytecode with the following snippet:
import verify
verify.verify(object)
The `verify` module provides a `verify` function which accepts the
same kind of arguments as `dis.dis`: classes, methods, functions,
or code objects. It verifies that the object's bytecode is
well-formed according to the specifications of this PEP.
If the code is well-formed, the call to `verify` returns silently
without error. If an error is encountered, it raises a
`VerificationError` whose argument indicates the cause of the
failure. It is up to the programmer to decide whether to handle the
error or to execute the invalid code regardless.
Phillip Eby has proposed a pseudo-code algorithm for bytecode
stack depth verification used by the reference implementation.
Verification Issues
This PEP describes only a small number of verifications. While
discussion and analysis will lead to many more, it is likely that
additional verifications, including custom, project-specific ones,
will be needed in the future. For this reason, it might be
desirable to add a verification registration interface to the test
implementation so that future verifiers can be registered. The need
for this is minimal, however, since custom verifiers can subclass
and extend the current implementation for added behavior.
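A registration interface of the kind suggested could look like the following sketch (all names here are hypothetical; the PEP does not define this API):

```python
class VerificationError(Exception):
    """Raised when a verifier rejects a code object."""

class Verifier:
    """Hypothetical registry: project-specific checks can be added
    without subclassing the stock implementation."""
    def __init__(self):
        self._checks = []

    def register(self, check):
        # usable as a decorator on check functions
        self._checks.append(check)
        return check

    def verify(self, code):
        # run every registered check; each raises on failure
        for check in self._checks:
            check(code)

verifier = Verifier()

@verifier.register
def non_empty(code):
    # static constraint 1: the bytecode string must not be empty
    if len(code.co_code) == 0:
        raise VerificationError("empty bytecode string")
```

A project would register additional checks the same way and call `verifier.verify(obj.__code__)` before trusting untrusted bytecode.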
Required Changes
Armin Rigo noted that several byte-codes will need modification in
order for their stack effect to be statically analyzed. These are
END_FINALLY, POP_BLOCK, and MAKE_CLOSURE. Armin and Guido have
already agreed on how to correct the instructions. Currently the
Python implementation punts on these instructions.
This PEP does not propose to add the verification step to the
interpreter, but only to provide the Python implementation in the
standard library for optional use. Whether or not this
verification procedure is translated into C, included with the PVM
or enforced in any way is left for future discussion.
References
[1] The Java Virtual Machine Specification 2nd Edition
http://java.sun.com/docs/books/vmspec/2nd-edition/html/ClassFile.doc.html
Copyright
This document has been placed in the public domain.
pep-0331 Locale-Independent Float/String Conversions
| PEP: | 331 |
|---|---|
| Title: | Locale-Independent Float/String Conversions |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Christian R. Reis |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 19-Jul-2003 |
| Python-Version: | 2.4 |
| Post-History: | 21-Jul-2003, 13-Aug-2003, 18-Jun-2004 |
Abstract
Support for the LC_NUMERIC locale category in Python 2.3 is
implemented only in Python-space. This causes inconsistent
behavior and thread-safety issues for applications that use
extension modules and libraries implemented in C that parse and
generate floats from strings. This document proposes a plan for
removing this inconsistency by providing and using substitute
locale-agnostic functions as necessary.
Introduction
Python provides generic localization services through the locale
module, which among other things allows localizing the display and
conversion process of numeric types. Locale categories, such as
LC_TIME and LC_COLLATE, allow configuring precisely what aspects
of the application are to be localized.
The LC_NUMERIC category specifies formatting for non-monetary
numeric information, such as the decimal separator in float and
fixed-precision numbers. Localization of the LC_NUMERIC category
is currently implemented only in Python-space; C libraries invoked
from the Python runtime are unaware of Python's LC_NUMERIC
setting. This is done to avoid changing the behavior of certain
low-level functions that are used by the Python parser and related
code [2].
However, this presents a problem for extension modules that wrap C
libraries. Applications that use these extension modules will
inconsistently display and convert floating-point values.
James Henstridge, the author of PyGTK [3], has additionally
pointed out that the setlocale() function also presents
thread-safety issues, since a thread may call the C library
setlocale() outside of the GIL, and cause Python to parse and
generate floats incorrectly.
Rationale
The inconsistency between Python and C library localization for
LC_NUMERIC is a problem for any localized application using C
extensions. The exact nature of the problem will vary depending
on the application, but it will most likely occur when parsing or
formatting a floating-point value.
Example Problem
The initial problem that motivated this PEP is related to the
GtkSpinButton [4] widget in the GTK+ UI toolkit, wrapped by the
PyGTK module. The widget can be set to numeric mode, and when
this occurs, characters typed into it are evaluated as a number.
Problems occur when LC_NUMERIC is set to a locale with a float
separator that differs from the C locale's standard (for instance,
',' instead of '.' for the Brazilian locale pt_BR). Because
LC_NUMERIC is not set at the libc level, float values are
displayed incorrectly (using '.' as a separator) in the
spinbutton's text entry, and it is impossible to enter fractional
values using the ',' separator.
This small example demonstrates reduced usability for localized
applications using this toolkit when coded in Python.
Proposal
Martin v. Löwis commented on the initial constraints for an
acceptable solution to the problem on python-dev:
- LC_NUMERIC can be set at the C library level without
breaking the parser.
- float() and str() stay locale-unaware.
- locale-aware str() and atof() stay in the locale module.
An analysis of the Python source suggests that the following
functions currently depend on LC_NUMERIC being set to the C
locale:
- Python/compile.c:parsenumber()
- Python/marshal.c:r_object()
- Objects/complexobject.c:complex_to_buf()
- Objects/complexobject.c:complex_subtype_from_string()
- Objects/floatobject.c:PyFloat_FromString()
- Objects/floatobject.c:format_float()
- Objects/stringobject.c:formatfloat()
- Modules/stropmodule.c:strop_atof()
- Modules/cPickle.c:load_float()
The proposed approach is to implement LC_NUMERIC-agnostic
functions for converting from (strtod()/atof()) and to
(snprintf()) float formats, using these functions where the
formatting should not vary according to the user-specified locale.
The locale module should also be changed to remove the
special-casing for LC_NUMERIC.
This change should also solve the aforementioned thread-safety
problems.
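The intent of the proposed C helpers can be illustrated in Python (a sketch only, with hypothetical function names; the real work happens in C, in the spirit of glib's g_ascii_strtod and g_ascii_dtostr):

```python
import locale

def localized_to_c(s):
    # Rewrite a number formatted under the current LC_NUMERIC into
    # C-locale form, so a locale-agnostic parser can handle it.
    conv = locale.localeconv()
    if conv['thousands_sep']:
        s = s.replace(conv['thousands_sep'], '')
    return s.replace(conv['decimal_point'], '.')

def c_to_localized(s):
    # The reverse direction: display a C-locale float string using
    # the user's decimal separator.
    return s.replace('.', locale.localeconv()['decimal_point'])
```

Under pt_BR, `localized_to_c('3,14')` would yield `'3.14'`, which the locale-unaware `float()` can then parse regardless of the libc locale setting.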
Potential Code Contributions
This problem was initially reported as a problem in the GTK+
libraries [5]; since then it has been correctly diagnosed as an
inconsistency in Python's implementation. However, in a fortunate
coincidence, the glib library (developed primarily for GTK+, not
to be confused with the GNU C library) implements a number of
LC_NUMERIC-agnostic functions (for an example, see [6]) for
reasons similar to those presented in this PEP.
In the same GTK+ problem report, Havoc Pennington suggested that
the glib authors would be willing to contribute this code to the
PSF, which would simplify implementation of this PEP considerably.
Alex Larsson, the original author of the glib code, submitted a
PSF Contributor Agreement [7] on 2003-08-20 [8] to ensure the code
could be safely integrated; this agreement has been received and
accepted.
Risks
There may be cross-platform issues with the provided
locale-agnostic functions, though this risk is low given that the
code supplied simply reverses any locale-dependent changes made to
floating-point numbers.
Martin and Guido pointed out potential copyright issues with the
contributed code. I believe we will have no problems in this area
as members of the GTK+ and glib teams have said they are fine with
relicensing the code, and a PSF contributor agreement has been
mailed in to ensure this safety.
Tim Peters has pointed out [9] that there are situations involving
threading in which the proposed change is insufficient to solve
the problem completely. A complete solution, however, does not
currently exist.
Implementation
An implementation was developed by Gustavo Carneiro <gjc at
inescporto.pt>, and attached to Sourceforge.net bug 774665 [10].
The final patch [11] was integrated into Python CVS by Martin v.
Löwis on 2004-06-08, as stated in the bug report.
References
[1] PEP 1, PEP Purpose and Guidelines, Warsaw, Hylton
http://www.python.org/dev/peps/pep-0001/
[2] Python locale documentation for embedding,
http://docs.python.org/library/locale.html
[3] PyGTK homepage, http://www.daa.com.au/~james/pygtk/
[4] GtkSpinButton screenshot (demonstrating problem),
http://www.async.com.br/~kiko/spin.png
[5] GNOME bug report, http://bugzilla.gnome.org/show_bug.cgi?id=114132
[6] Code submission of g_ascii_strtod and g_ascii_dtostr (later
renamed g_ascii_formatd) by Alex Larsson,
http://mail.gnome.org/archives/gtk-devel-list/2001-October/msg00114.html
[7] PSF Contributor Agreement,
http://www.python.org/psf/psf-contributor-agreement.html
[8] Alex Larsson's email confirming his agreement was mailed in,
http://mail.python.org/pipermail/python-dev/2003-August/037755.html
[9] Tim Peters' email summarizing LC_NUMERIC trouble with Spambayes,
http://mail.python.org/pipermail/python-dev/2003-September/037898.html
[10] Python bug report, http://www.python.org/sf/774665
[11] Integrated LC_NUMERIC-agnostic patch,
https://sourceforge.net/tracker/download.php?group_id=5470&atid=305470&file_id=89685&aid=774665
Copyright
This document has been placed in the public domain.
pep-0332 Byte vectors and String/Unicode Unification
| PEP: | 332 |
|---|---|
| Title: | Byte vectors and String/Unicode Unification |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Skip Montanaro <skip at pobox.com> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 11-Aug-2004 |
| Python-Version: | 2.5 |
| Post-History: |
Contents
Abstract
This PEP outlines the introduction of a raw bytes sequence object and the unification of the current str and unicode objects.
Rejection Notice
This PEP is rejected in this form. The author has expressed lack of time to continue to shepherd it, and discussion on python-dev has moved to a slightly different proposal which will (eventually) be written up as a new PEP. See the thread starting at http://mail.python.org/pipermail/python-dev/2006-February/060930.html.
Rationale
Python's current string objects are overloaded. They serve both to hold ASCII and non-ASCII character data and to also hold sequences of raw bytes which have no reasonable interpretation as displayable character sequences. This overlap hasn't been a big problem in the past, but as Python moves closer to requiring source code to be properly encoded, the use of strings to represent raw byte sequences will be more problematic. In addition, as Python's Unicode support has improved, it's easier to consider strings as ASCII-encoded Unicode objects.
Proposed Implementation
The number in parentheses indicates the Python version in which the feature will be introduced.
- Add a bytes builtin which is just a synonym for str. (2.5)
- Add a b"..." string literal which is equivalent to raw string literals, with the exception that values which conflict with the source encoding of the containing file do not generate warnings. (2.5)
- Warn about the use of variables named "bytes". (2.5 or 2.6)
- Introduce a bytes builtin which refers to a sequence distinct from the str type. (2.6)
- Make str a synonym for unicode. (3.0)
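For historical context, Python 3 eventually arrived at a comparable split, by a different route than this rejected proposal (and with immutable bytes plus a separate mutable bytearray, settling the mutability question below):

```python
data = b"\x00\xff"   # bytes literal: a sequence of raw bytes
text = "hello"       # str is a Unicode string in Python 3
assert isinstance(data, bytes) and isinstance(text, str)
assert data[1] == 255                     # indexing bytes yields ints
assert text.encode('ascii') == b"hello"   # conversions are explicit
```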
Bytes Object API
TBD.
Issues
- Can this be accomplished before Python 3.0?
- Should bytes objects be mutable or immutable? (Guido seems to like them to be mutable.)
Copyright
This document has been placed in the public domain.
pep-0333 Python Web Server Gateway Interface v1.0
| PEP: | 333 |
|---|---|
| Title: | Python Web Server Gateway Interface v1.0 |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Phillip J. Eby <pje at telecommunity.com> |
| Discussions-To: | Python Web-SIG <web-sig at python.org> |
| Status: | Final |
| Type: | Informational |
| Content-Type: | text/x-rst |
| Created: | 07-Dec-2003 |
| Post-History: | 07-Dec-2003, 08-Aug-2004, 20-Aug-2004, 27-Aug-2004, 27-Sep-2010 |
| Superseded-By: | 3333 |
Contents
Preface
Note: For an updated version of this spec that supports Python 3.x and includes community errata, addenda, and clarifications, please see PEP 3333 instead.
Abstract
This document specifies a proposed standard interface between web servers and Python web applications or frameworks, to promote web application portability across a variety of web servers.
Rationale and Goals
Python currently boasts a wide variety of web application frameworks, such as Zope, Quixote, Webware, SkunkWeb, PSO, and Twisted Web -- to name just a few [1]. This wide variety of choices can be a problem for new Python users, because generally speaking, their choice of web framework will limit their choice of usable web servers, and vice versa.
By contrast, although Java has just as many web application frameworks available, Java's "servlet" API makes it possible for applications written with any Java web application framework to run in any web server that supports the servlet API.
The availability and widespread use of such an API in web servers for Python -- whether those servers are written in Python (e.g. Medusa), embed Python (e.g. mod_python), or invoke Python via a gateway protocol (e.g. CGI, FastCGI, etc.) -- would separate choice of framework from choice of web server, freeing users to choose a pairing that suits them, while freeing framework and server developers to focus on their preferred area of specialization.
This PEP, therefore, proposes a simple and universal interface between web servers and web applications or frameworks: the Python Web Server Gateway Interface (WSGI).
But the mere existence of a WSGI spec does nothing to address the existing state of servers and frameworks for Python web applications. Server and framework authors and maintainers must actually implement WSGI for there to be any effect.
However, since no existing servers or frameworks support WSGI, there is little immediate reward for an author who implements WSGI support. Thus, WSGI must be easy to implement, so that an author's initial investment in the interface can be reasonably low.
Thus, simplicity of implementation on both the server and framework sides of the interface is absolutely critical to the utility of the WSGI interface, and is therefore the principal criterion for any design decisions.
Note, however, that simplicity of implementation for a framework author is not the same thing as ease of use for a web application author. WSGI presents an absolutely "no frills" interface to the framework author, because bells and whistles like response objects and cookie handling would just get in the way of existing frameworks' handling of these issues. Again, the goal of WSGI is to facilitate easy interconnection of existing servers and applications or frameworks, not to create a new web framework.
Note also that this goal precludes WSGI from requiring anything that is not already available in deployed versions of Python. Therefore, new standard library modules are not proposed or required by this specification, and nothing in WSGI requires a Python version greater than 2.2.2. (It would be a good idea, however, for future versions of Python to include support for this interface in web servers provided by the standard library.)
In addition to ease of implementation for existing and future frameworks and servers, it should also be easy to create request preprocessors, response postprocessors, and other WSGI-based "middleware" components that look like an application to their containing server, while acting as a server for their contained applications.
If middleware can be both simple and robust, and WSGI is widely available in servers and frameworks, it allows for the possibility of an entirely new kind of Python web application framework: one consisting of loosely-coupled WSGI middleware components. Indeed, existing framework authors may even choose to refactor their frameworks' existing services to be provided in this way, becoming more like libraries used with WSGI, and less like monolithic frameworks. This would then allow application developers to choose "best-of-breed" components for specific functionality, rather than having to commit to all the pros and cons of a single framework.
Of course, as of this writing, that day is doubtless quite far off. In the meantime, it is a sufficient short-term goal for WSGI to enable the use of any framework with any server.
Finally, it should be mentioned that the current version of WSGI does not prescribe any particular mechanism for "deploying" an application for use with a web server or server gateway. At the present time, this is necessarily implementation-defined by the server or gateway. After a sufficient number of servers and frameworks have implemented WSGI to provide field experience with varying deployment requirements, it may make sense to create another PEP, describing a deployment standard for WSGI servers and application frameworks.
Specification Overview
The WSGI interface has two sides: the "server" or "gateway" side, and the "application" or "framework" side. The server side invokes a callable object that is provided by the application side. The specifics of how that object is provided are up to the server or gateway. It is assumed that some servers or gateways will require an application's deployer to write a short script to create an instance of the server or gateway, and supply it with the application object. Other servers and gateways may use configuration files or other mechanisms to specify where an application object should be imported from, or otherwise obtained.
In addition to "pure" servers/gateways and applications/frameworks, it is also possible to create "middleware" components that implement both sides of this specification. Such components act as an application to their containing server, and as a server to a contained application, and can be used to provide extended APIs, content transformation, navigation, and other useful functions.
Throughout this specification, we will use the term "a callable" to mean "a function, method, class, or an instance with a __call__ method". It is up to the server, gateway, or application implementing the callable to choose the appropriate implementation technique for their needs. Conversely, a server, gateway, or application that is invoking a callable must not have any dependency on what kind of callable was provided to it. Callables are only to be called, not introspected upon.
The Application/Framework Side
The application object is simply a callable object that accepts two arguments. The term "object" should not be misconstrued as requiring an actual object instance: a function, method, class, or instance with a __call__ method are all acceptable for use as an application object. Application objects must be able to be invoked more than once, as virtually all servers/gateways (other than CGI) will make such repeated requests.
(Note: although we refer to it as an "application" object, this should not be construed to mean that application developers will use WSGI as a web programming API! It is assumed that application developers will continue to use existing, high-level framework services to develop their applications. WSGI is a tool for framework and server developers, and is not intended to directly support application developers.)
Here are two example application objects; one is a function, and the other is a class:
def simple_app(environ, start_response):
"""Simplest possible application object"""
status = '200 OK'
response_headers = [('Content-type', 'text/plain')]
start_response(status, response_headers)
return ['Hello world!\n']
class AppClass:
"""Produce the same output, but using a class
(Note: 'AppClass' is the "application" here, so calling it
returns an instance of 'AppClass', which is then the iterable
return value of the "application callable" as required by
the spec.
If we wanted to use *instances* of 'AppClass' as application
objects instead, we would have to implement a '__call__'
method, which would be invoked to execute the application,
and we would need to create an instance for use by the
server or gateway.
"""
def __init__(self, environ, start_response):
self.environ = environ
self.start = start_response
def __iter__(self):
status = '200 OK'
response_headers = [('Content-type', 'text/plain')]
self.start(status, response_headers)
yield "Hello world!\n"
The Server/Gateway Side
The server or gateway invokes the application callable once for each HTTP request it receives that is directed at the application. To illustrate, here is a simple CGI gateway, implemented as a function taking an application object. Note that this simple example has limited error handling, because by default an uncaught exception will be dumped to sys.stderr and logged by the web server.
import os, sys
def run_with_cgi(application):
environ = dict(os.environ.items())
environ['wsgi.input'] = sys.stdin
environ['wsgi.errors'] = sys.stderr
environ['wsgi.version'] = (1, 0)
environ['wsgi.multithread'] = False
environ['wsgi.multiprocess'] = True
environ['wsgi.run_once'] = True
if environ.get('HTTPS', 'off') in ('on', '1'):
environ['wsgi.url_scheme'] = 'https'
else:
environ['wsgi.url_scheme'] = 'http'
headers_set = []
headers_sent = []
def write(data):
if not headers_set:
raise AssertionError("write() before start_response()")
elif not headers_sent:
# Before the first output, send the stored headers
status, response_headers = headers_sent[:] = headers_set
sys.stdout.write('Status: %s\r\n' % status)
for header in response_headers:
sys.stdout.write('%s: %s\r\n' % header)
sys.stdout.write('\r\n')
sys.stdout.write(data)
sys.stdout.flush()
def start_response(status, response_headers, exc_info=None):
if exc_info:
try:
if headers_sent:
# Re-raise original exception if headers sent
raise exc_info[0], exc_info[1], exc_info[2]
finally:
exc_info = None # avoid dangling circular ref
elif headers_set:
raise AssertionError("Headers already set!")
headers_set[:] = [status, response_headers]
return write
result = application(environ, start_response)
try:
for data in result:
if data: # don't send headers until body appears
write(data)
if not headers_sent:
write('') # send headers now if body was empty
finally:
if hasattr(result, 'close'):
result.close()
Middleware: Components that Play Both Sides
Note that a single object may play the role of a server with respect to some application(s), while also acting as an application with respect to some server(s). Such "middleware" components can perform such functions as:
- Routing a request to different application objects based on the target URL, after rewriting the environ accordingly.
- Allowing multiple applications or frameworks to run side-by-side in the same process
- Load balancing and remote processing, by forwarding requests and responses over a network
- Performing content postprocessing, such as applying XSL stylesheets
The presence of middleware in general is transparent to both the "server/gateway" and the "application/framework" sides of the interface, and should require no special support. A user who desires to incorporate middleware into an application simply provides the middleware component to the server, as if it were an application, and configures the middleware component to invoke the application, as if the middleware component were a server. Of course, the "application" that the middleware wraps may in fact be another middleware component wrapping another application, and so on, creating what is referred to as a "middleware stack".
For the most part, middleware must conform to the restrictions and requirements of both the server and application sides of WSGI. In some cases, however, requirements for middleware are more stringent than for a "pure" server or application, and these points will be noted in the specification.
Here is a (tongue-in-cheek) example of a middleware component that converts text/plain responses to pig latin, using Joe Strout's piglatin.py. (Note: a "real" middleware component would probably use a more robust way of checking the content type, and should also check for a content encoding. Also, this simple example ignores the possibility that a word might be split across a block boundary.)
from piglatin import piglatin
class LatinIter:
"""Transform iterated output to piglatin, if it's okay to do so
Note that the "okayness" can change until the application yields
its first non-empty string, so 'transform_ok' has to be a mutable
truth value.
"""
def __init__(self, result, transform_ok):
if hasattr(result, 'close'):
self.close = result.close
self._next = iter(result).next
self.transform_ok = transform_ok
def __iter__(self):
return self
def next(self):
if self.transform_ok:
return piglatin(self._next())
else:
return self._next()
class Latinator:
# by default, don't transform output
transform = False
def __init__(self, application):
self.application = application
def __call__(self, environ, start_response):
transform_ok = []
def start_latin(status, response_headers, exc_info=None):
# Reset ok flag, in case this is a repeat call
del transform_ok[:]
for name, value in response_headers:
if name.lower() == 'content-type' and value == 'text/plain':
transform_ok.append(True)
# Strip content-length if present, else it'll be wrong
response_headers = [(name, value)
for name, value in response_headers
if name.lower() != 'content-length'
]
break
write = start_response(status, response_headers, exc_info)
if transform_ok:
def write_latin(data):
write(piglatin(data))
return write_latin
else:
return write
return LatinIter(self.application(environ, start_latin), transform_ok)
# Run foo_app under a Latinator's control, using the example CGI gateway
from foo_app import foo_app
run_with_cgi(Latinator(foo_app))
Specification Details
The application object must accept two positional arguments. For the sake of illustration, we have named them environ and start_response, but they are not required to have these names. A server or gateway must invoke the application object using positional (not keyword) arguments. (E.g. by calling result = application(environ, start_response) as shown above.)
The environ parameter is a dictionary object, containing CGI-style environment variables. This object must be a builtin Python dictionary (not a subclass, UserDict or other dictionary emulation), and the application is allowed to modify the dictionary in any way it desires. The dictionary must also include certain WSGI-required variables (described in a later section), and may also include server-specific extension variables, named according to a convention that will be described below.
The start_response parameter is a callable accepting two required positional arguments, and one optional argument. For the sake of illustration, we have named these arguments status, response_headers, and exc_info, but they are not required to have these names, and the application must invoke the start_response callable using positional arguments (e.g. start_response(status, response_headers)).
The status parameter is a status string of the form "999 Message here", and response_headers is a list of (header_name, header_value) tuples describing the HTTP response header. The optional exc_info parameter is described below in the sections on The start_response() Callable and Error Handling. It is used only when the application has trapped an error and is attempting to display an error message to the browser.
The start_response callable must return a write(body_data) callable that takes one positional parameter: a string to be written as part of the HTTP response body. (Note: the write() callable is provided only to support certain existing frameworks' imperative output APIs; it should not be used by new applications or frameworks if it can be avoided. See the Buffering and Streaming section for more details.)
When called by the server, the application object must return an iterable yielding zero or more strings. This can be accomplished in a variety of ways, such as by returning a list of strings, or by the application being a generator function that yields strings, or by the application being a class whose instances are iterable. Regardless of how it is accomplished, the application object must always return an iterable yielding zero or more strings.
The server or gateway must transmit the yielded strings to the client in an unbuffered fashion, completing the transmission of each string before requesting another one. (In other words, applications should perform their own buffering. See the Buffering and Streaming section below for more on how application output must be handled.)
The server or gateway should treat the yielded strings as binary byte sequences: in particular, it should ensure that line endings are not altered. The application is responsible for ensuring that the string(s) to be written are in a format suitable for the client. (The server or gateway may apply HTTP transfer encodings, or perform other transformations for the purpose of implementing HTTP features such as byte-range transmission. See Other HTTP Features, below, for more details.)
If a call to len(iterable) succeeds, the server must be able to rely on the result being accurate. That is, if the iterable returned by the application provides a working __len__() method, it must return an accurate result. (See the Handling the Content-Length Header section for information on how this would normally be used.)
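For example, a gateway might exploit an accurate len() like this (a hypothetical helper, not something the spec requires; note that len(result) counts yielded strings, not bytes):

```python
def guess_content_length(result):
    # A Content-Length can be computed up front only in the common
    # case of a one-item sequence whose single string is the body.
    if hasattr(result, '__len__') and len(result) == 1:
        try:
            return len(result[0])
        except TypeError:
            return None   # sized but not indexable: stream instead
    return None           # unknown length: stream the response
```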
If the iterable returned by the application has a close() method, the server or gateway must call that method upon completion of the current request, whether the request was completed normally, or terminated early due to an error. (This is to support resource release by the application. This protocol is intended to complement PEP 325's generator support, and other common iterables with close() methods.)
(Note: the application must invoke the start_response() callable before the iterable yields its first body string, so that the server can send the headers before any body content. However, this invocation may be performed by the iterable's first iteration, so servers must not assume that start_response() has been called before they begin iterating over the iterable.)
Finally, servers and gateways must not directly use any other attributes of the iterable returned by the application, unless it is an instance of a type specific to that server or gateway, such as a "file wrapper" returned by wsgi.file_wrapper (see Optional Platform-Specific File Handling). In the general case, only attributes specified here, or accessed via e.g. the PEP 234 iteration APIs are acceptable.
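The rules above (positional invocation, iterate the result, always call close()) can be exercised with a small test driver, sketched here for illustration only:

```python
def call_application(app, environ):
    # Capture what the application passes to start_response, then
    # iterate the result. Headers may not be set until the first
    # iteration, so they are read only after iterating.
    captured = {}
    def start_response(status, response_headers, exc_info=None):
        captured['status'] = status
        captured['headers'] = response_headers
        def write(data):
            captured.setdefault('written', []).append(data)
        return write
    result = app(environ, start_response)
    try:
        body = [chunk for chunk in result if chunk]
    finally:
        # the spec requires close() even on early termination
        if hasattr(result, 'close'):
            result.close()
    return captured['status'], captured['headers'], body
```

Running the earlier simple_app through this driver yields the status '200 OK' and the single-element body list.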
environ Variables
The environ dictionary is required to contain these CGI environment variables, as defined by the Common Gateway Interface specification [2]. The following variables must be present, unless their value would be an empty string, in which case they may be omitted, except as otherwise noted below.
- REQUEST_METHOD
- The HTTP request method, such as "GET" or "POST". This cannot ever be an empty string, and so is always required.
- SCRIPT_NAME
- The initial portion of the request URL's "path" that corresponds to the application object, so that the application knows its virtual "location". This may be an empty string, if the application corresponds to the "root" of the server.
- PATH_INFO
- The remainder of the request URL's "path", designating the virtual "location" of the request's target within the application. This may be an empty string, if the request URL targets the application root and does not have a trailing slash.
- QUERY_STRING
- The portion of the request URL that follows the "?", if any. May be empty or absent.
- CONTENT_TYPE
- The contents of any Content-Type fields in the HTTP request. May be empty or absent.
- CONTENT_LENGTH
- The contents of any Content-Length fields in the HTTP request. May be empty or absent.
- SERVER_NAME, SERVER_PORT
- When combined with SCRIPT_NAME and PATH_INFO, these variables can be used to complete the URL. Note, however, that HTTP_HOST, if present, should be used in preference to SERVER_NAME for reconstructing the request URL. See the URL Reconstruction section below for more detail. SERVER_NAME and SERVER_PORT can never be empty strings, and so are always required.
- SERVER_PROTOCOL
- The version of the protocol the client used to send the request. Typically this will be something like "HTTP/1.0" or "HTTP/1.1" and may be used by the application to determine how to treat any HTTP request headers. (This variable should probably be called REQUEST_PROTOCOL, since it denotes the protocol used in the request, and is not necessarily the protocol that will be used in the server's response. However, for compatibility with CGI we have to keep the existing name.)
- HTTP_ Variables
- Variables corresponding to the client-supplied HTTP request headers (i.e., variables whose names begin with "HTTP_"). The presence or absence of these variables should correspond with the presence or absence of the appropriate HTTP header in the request.
A server or gateway should attempt to provide as many other CGI variables as are applicable. In addition, if SSL is in use, the server or gateway should also provide as many of the Apache SSL environment variables [5] as are applicable, such as HTTPS=on and SSL_PROTOCOL. Note, however, that an application that uses any CGI variables other than the ones listed above is necessarily non-portable to web servers that do not support the relevant extensions. (For example, web servers that do not publish files will not be able to provide a meaningful DOCUMENT_ROOT or PATH_TRANSLATED.)
A WSGI-compliant server or gateway should document what variables it provides, along with their definitions as appropriate. Applications should check for the presence of any variables they require, and have a fallback plan in the event such a variable is absent.
Note: missing variables (such as REMOTE_USER when no authentication has occurred) should be left out of the environ dictionary. Also note that CGI-defined variables must be strings, if they are present at all. It is a violation of this specification for a CGI variable's value to be of any type other than str.
In addition to the CGI-defined variables, the environ dictionary may also contain arbitrary operating-system "environment variables", and must contain the following WSGI-defined variables:
| Variable | Value |
|---|---|
| wsgi.version | The tuple (1, 0), representing WSGI version 1.0. |
| wsgi.url_scheme | A string representing the "scheme" portion of the URL at which the application is being invoked. Normally, this will have the value "http" or "https", as appropriate. |
| wsgi.input | An input stream (file-like object) from which the HTTP request body can be read. (The server or gateway may perform reads on-demand as requested by the application, or it may pre-read the client's request body and buffer it in-memory or on disk, or use any other technique for providing such an input stream, according to its preference.) |
| wsgi.errors | An output stream (file-like object) to which error output can be written, for the purpose of recording program or other errors in a standardized and possibly centralized location. This should be a "text mode" stream; i.e., applications should use "\n" as a line ending, and assume that it will be converted to the correct line ending by the server/gateway. For many servers, wsgi.errors will be the server's main error log. Alternatively, this may be sys.stderr, or a log file of some sort. The server's documentation should include an explanation of how to configure this or where to find the recorded output. A server or gateway may supply different error streams to different applications, if this is desired. |
| wsgi.multithread | This value should evaluate true if the application object may be simultaneously invoked by another thread in the same process, and should evaluate false otherwise. |
| wsgi.multiprocess | This value should evaluate true if an equivalent application object may be simultaneously invoked by another process, and should evaluate false otherwise. |
| wsgi.run_once | This value should evaluate true if the server or gateway expects (but does not guarantee!) that the application will only be invoked this one time during the life of its containing process. Normally, this will only be true for a gateway based on CGI (or something similar). |
Finally, the environ dictionary may also contain server-defined variables. These variables should be named using only lower-case letters, numbers, dots, and underscores, and should be prefixed with a name that is unique to the defining server or gateway. For example, mod_python might define variables with names like mod_python.some_variable.
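As a sketch of the above, here is a hypothetical helper that assembles the smallest environ dictionary an application can rely on. Every concrete value (host name, port, scheme) is an illustrative assumption, and io.BytesIO stands in for a real request-body stream:

```python
import io
import sys

def make_environ(path='/', query=''):
    # Hypothetical helper, not part of the specification.  Values here
    # are illustrative assumptions for a simple GET request.
    return {
        'REQUEST_METHOD': 'GET',        # always required, never empty
        'SCRIPT_NAME': '',              # application mounted at the root
        'PATH_INFO': path,
        'QUERY_STRING': query,          # may be empty
        'SERVER_NAME': 'localhost',     # never an empty string
        'SERVER_PORT': '80',            # never an empty string
        'SERVER_PROTOCOL': 'HTTP/1.1',
        # WSGI-defined variables:
        'wsgi.version': (1, 0),
        'wsgi.url_scheme': 'http',
        'wsgi.input': io.BytesIO(b''),  # stand-in for the input stream
        'wsgi.errors': sys.stderr,
        'wsgi.multithread': False,
        'wsgi.multiprocess': False,
        'wsgi.run_once': True,
    }
```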
Input and Error Streams
The input and error streams provided by the server must support the following methods:
| Method | Stream | Notes |
|---|---|---|
| read(size) | input | 1 |
| readline() | input | 1, 2 |
| readlines(hint) | input | 1, 3 |
| __iter__() | input | |
| flush() | errors | 4 |
| write(str) | errors | |
| writelines(seq) | errors | |
The semantics of each method are as documented in the Python Library Reference, except for these notes as listed in the table above:
1. The server is not required to read past the client's specified Content-Length, and is allowed to simulate an end-of-file condition if the application attempts to read past that point. The application should not attempt to read more data than is specified by the CONTENT_LENGTH variable.
2. The optional "size" argument to readline() is not supported, as it may be complex for server authors to implement, and is not often used in practice.
3. Note that the hint argument to readlines() is optional for both caller and implementer. The application is free not to supply it, and the server or gateway is free to ignore it.
4. Since the errors stream may not be rewound, servers and gateways are free to forward write operations immediately, without buffering. In this case, the flush() method may be a no-op. Portable applications, however, cannot assume that output is unbuffered or that flush() is a no-op. They must call flush() if they need to ensure that output has in fact been written. (For example, to minimize intermingling of data from multiple processes writing to the same error log.)
The methods listed in the table above must be supported by all servers conforming to this specification. Applications conforming to this specification must not use any other methods or attributes of the input or errors objects. In particular, applications must not attempt to close these streams, even if they possess close() methods.
The start_response() Callable
The second parameter passed to the application object is a callable of the form start_response(status, response_headers, exc_info=None). (As with all WSGI callables, the arguments must be supplied positionally, not by keyword.) The start_response callable is used to begin the HTTP response, and it must return a write(body_data) callable (see the Buffering and Streaming section, below).
The status argument is an HTTP "status" string like "200 OK" or "404 Not Found". That is, it is a string consisting of a Status-Code and a Reason-Phrase, in that order and separated by a single space, with no surrounding whitespace or other characters. (See RFC 2616, Section 6.1.1 for more information.) The string must not contain control characters, and must not be terminated with a carriage return, linefeed, or combination thereof.
The response_headers argument is a list of (header_name, header_value) tuples. It must be a Python list; i.e. type(response_headers) is ListType, and the server may change its contents in any way it desires. Each header_name must be a valid HTTP header field-name (as defined by RFC 2616, Section 4.2), without a trailing colon or other punctuation.
Each header_value must not include any control characters, including carriage returns or linefeeds, either embedded or at the end. (These requirements are to minimize the complexity of any parsing that must be performed by servers, gateways, and intermediate response processors that need to inspect or modify response headers.)
In general, the server or gateway is responsible for ensuring that correct headers are sent to the client: if the application omits a header required by HTTP (or other relevant specifications that are in effect), the server or gateway must add it. For example, the HTTP Date: and Server: headers would normally be supplied by the server or gateway.
(A reminder for server/gateway authors: HTTP header names are case-insensitive, so be sure to take that into consideration when examining application-supplied headers!)
Applications and middleware are forbidden from using HTTP/1.1 "hop-by-hop" features or headers, any equivalent features in HTTP/1.0, or any headers that would affect the persistence of the client's connection to the web server. These features are the exclusive province of the actual web server, and a server or gateway should consider it a fatal error for an application to attempt sending them, and raise an error if they are supplied to start_response(). (For more specifics on "hop-by-hop" features and headers, please see the Other HTTP Features section below.)
The start_response callable must not actually transmit the response headers. Instead, it must store them for the server or gateway to transmit only after the first iteration of the application return value that yields a non-empty string, or upon the application's first invocation of the write() callable. In other words, response headers must not be sent until there is actual body data available, or until the application's returned iterable is exhausted. (The only possible exception to this rule is if the response headers explicitly include a Content-Length of zero.)
This delaying of response header transmission is to ensure that buffered and asynchronous applications can replace their originally intended output with error output, up until the last possible moment. For example, the application may need to change the response status from "200 OK" to "500 Internal Error", if an error occurs while the body is being generated within an application buffer.
The exc_info argument, if supplied, must be a Python sys.exc_info() tuple. This argument should be supplied by the application only if start_response is being called by an error handler. If exc_info is supplied, and no HTTP headers have been output yet, start_response should replace the currently-stored HTTP response headers with the newly-supplied ones, thus allowing the application to "change its mind" about the output when an error has occurred.
However, if exc_info is provided, and the HTTP headers have already been sent, start_response must raise an error, and should raise the exc_info tuple. That is:
raise exc_info[0], exc_info[1], exc_info[2]
This will re-raise the exception trapped by the application, and in principle should abort the application. (It is not safe for the application to attempt error output to the browser once the HTTP headers have already been sent.) The application must not trap any exceptions raised by start_response, if it called start_response with exc_info. Instead, it should allow such exceptions to propagate back to the server or gateway. See Error Handling below, for more details.
The application may call start_response more than once, if and only if the exc_info argument is provided. More precisely, it is a fatal error to call start_response without the exc_info argument if start_response has already been called within the current invocation of the application. (See the example CGI gateway above for an illustration of the correct logic.)
Note: servers, gateways, or middleware implementing start_response should ensure that no reference is held to the exc_info parameter beyond the duration of the function's execution, to avoid creating a circular reference through the traceback and frames involved. The simplest way to do this is something like:
def start_response(status, response_headers, exc_info=None):
    if exc_info:
        try:
            # do stuff w/exc_info here
            pass
        finally:
            exc_info = None    # Avoid circular ref.
The example CGI gateway provides another illustration of this technique.
Handling the Content-Length Header
If the application does not supply a Content-Length header, a server or gateway may choose one of several approaches to handling it. The simplest of these is to close the client connection when the response is completed.
Under some circumstances, however, the server or gateway may be able to either generate a Content-Length header, or at least avoid the need to close the client connection. If the application does not call the write() callable, and returns an iterable whose len() is 1, then the server can automatically determine Content-Length by taking the length of the first string yielded by the iterable.
And, if the server and client both support HTTP/1.1 "chunked encoding" [3], then the server may use chunked encoding to send a chunk for each write() call or string yielded by the iterable, thus generating a Content-Length header for each chunk. This allows the server to keep the client connection alive, if it wishes to do so. Note that the server must comply fully with RFC 2616 when doing this, or else fall back to one of the other strategies for dealing with the absence of Content-Length.
(Note: applications and middleware must not apply any kind of Transfer-Encoding to their output, such as chunking or gzipping; as "hop-by-hop" operations, these encodings are the province of the actual web server/gateway. See Other HTTP Features below, for more details.)
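The length-1 strategy above can be sketched as follows. maybe_content_length is a hypothetical gateway helper, and it assumes the result is a sequence (so the single string can be inspected) whenever len() succeeds:

```python
def maybe_content_length(result, headers):
    # Hypothetical gateway helper: if the application supplied no
    # Content-Length and returned a sequence of known length 1, derive
    # the header from that single string; otherwise leave the headers
    # alone and fall back to another strategy (e.g. closing the
    # connection, or chunked encoding).
    names = [name.lower() for name, value in headers]
    if 'content-length' in names:
        return headers
    try:
        if len(result) == 1:
            return headers + [('Content-Length', str(len(result[0])))]
    except TypeError:
        pass  # no len(): length unknown
    return headers
```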
Buffering and Streaming
Generally speaking, applications will achieve the best throughput by buffering their (modestly-sized) output and sending it all at once. This is a common approach in existing frameworks such as Zope: the output is buffered in a StringIO or similar object, then transmitted all at once, along with the response headers.
The corresponding approach in WSGI is for the application to simply return a single-element iterable (such as a list) containing the response body as a single string. This is the recommended approach for the vast majority of application functions, that render HTML pages whose text easily fits in memory.
For large files, however, or for specialized uses of HTTP streaming (such as multipart "server push"), an application may need to provide output in smaller blocks (e.g. to avoid loading a large file into memory). It's also sometimes the case that part of a response may be time-consuming to produce, but it would be useful to send ahead the portion of the response that precedes it.
In these cases, applications will usually return an iterator (often a generator-iterator) that produces the output in a block-by-block fashion. These blocks may be broken to coincide with multipart boundaries (for "server push"), or just before time-consuming tasks (such as reading another block of an on-disk file).
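A minimal sketch of such a block-by-block application, using a generator-iterator; the content strings are placeholders:

```python
def streaming_app(environ, start_response):
    # A block-by-block response: each yielded block may be transmitted
    # before the next is produced, so a slow later block does not delay
    # earlier ones.
    start_response('200 OK', [('Content-Type', 'text/plain')])
    def body():
        yield 'first block\n'   # cheap to produce, can go out at once
        yield 'second block\n'  # might follow a time-consuming task
    return body()
```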
WSGI servers, gateways, and middleware must not delay the transmission of any block; they must either fully transmit the block to the client, or guarantee that they will continue transmission even while the application is producing its next block. A server/gateway or middleware may provide this guarantee in one of three ways:
- Send the entire block to the operating system (and request that any O/S buffers be flushed) before returning control to the application; or
- Use a different thread to ensure that the block continues to be transmitted while the application produces the next block; or
- (Middleware only) send the entire block to its parent gateway/server.
By providing this guarantee, WSGI allows applications to ensure that transmission will not become stalled at an arbitrary point in their output data. This is critical for proper functioning of e.g. multipart "server push" streaming, where data between multipart boundaries should be transmitted in full to the client.
Middleware Handling of Block Boundaries
In order to better support asynchronous applications and servers, middleware components must not block iteration waiting for multiple values from an application iterable. If the middleware needs to accumulate more data from the application before it can produce any output, it must yield an empty string.
To put this requirement another way, a middleware component must yield at least one value each time its underlying application yields a value. If the middleware cannot yield any other value, it must yield an empty string.
This requirement ensures that asynchronous applications and servers can conspire to reduce the number of threads that are required to run a given number of application instances simultaneously.
Note also that this requirement means that middleware must return an iterable as soon as its underlying application returns an iterable. It is also forbidden for middleware to use the write() callable to transmit data that is yielded by an underlying application. Middleware may only use their parent server's write() callable to transmit data that the underlying application sent using a middleware-provided write() callable.
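To illustrate the empty-string rule, here is a purely hypothetical middleware that wants at least eight characters of input before producing real output. It still yields exactly one value (possibly the empty string) for each value the wrapped application yields, so an asynchronous server is never blocked:

```python
def upper_middleware(app):
    # Illustrative only: accumulates text, emitting '' as a placeholder
    # while it waits for enough data, and upper-cases complete blocks.
    def wrapper(environ, start_response):
        def generate():
            buffered = ''
            for chunk in app(environ, start_response):
                buffered += chunk
                if len(buffered) >= 8:
                    yield buffered.upper()
                    buffered = ''
                else:
                    yield ''   # placeholder: keeps the server iterating
            if buffered:
                yield buffered.upper()
        return generate()
    return wrapper
```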
The write() Callable
Some existing application framework APIs support unbuffered output in a different manner than WSGI. Specifically, they provide a "write" function or method of some kind to write an unbuffered block of data, or else they provide a buffered "write" function and a "flush" mechanism to flush the buffer.
Unfortunately, such APIs cannot be implemented in terms of WSGI's "iterable" application return value, unless threads or other special mechanisms are used.
Therefore, to allow these frameworks to continue using an imperative API, WSGI includes a special write() callable, returned by the start_response callable.
New WSGI applications and frameworks should not use the write() callable if it is possible to avoid doing so. The write() callable is strictly a hack to support imperative streaming APIs. In general, applications should produce their output via their returned iterable, as this makes it possible for web servers to interleave other tasks in the same Python thread, potentially providing better throughput for the server as a whole.
The write() callable is returned by the start_response() callable, and it accepts a single parameter: a string to be written as part of the HTTP response body, that is treated exactly as though it had been yielded by the output iterable. In other words, before write() returns, it must guarantee that the passed-in string was either completely sent to the client, or that it is buffered for transmission while the application proceeds onward.
An application must return an iterable object, even if it uses write() to produce all or part of its response body. The returned iterable may be empty (i.e. yield no non-empty strings), but if it does yield non-empty strings, that output must be treated normally by the server or gateway (i.e., it must be sent or queued immediately). Applications must not invoke write() from within their return iterable, and therefore any strings yielded by the iterable are transmitted after all strings passed to write() have been sent to the client.
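A sketch of an application that produces its whole body through write() and returns an empty iterable, as described above (plain Python 3 strings are used for brevity):

```python
def legacy_app(environ, start_response):
    # Imperative-style output: the whole body goes through the write()
    # callable returned by start_response, and the returned iterable is
    # empty (but an iterable must still be returned).
    write = start_response('200 OK', [('Content-Type', 'text/plain')])
    write('Hello, ')
    write('world!\n')
    return []
```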
Unicode Issues
HTTP does not directly support Unicode, and neither does this interface. All encoding/decoding must be handled by the application; all strings passed to or from the server must be standard Python byte strings, not Unicode objects. The result of using a Unicode object where a string object is required, is undefined.
Note also that strings passed to start_response() as a status or as response headers must follow RFC 2616 with respect to encoding. That is, they must either be ISO-8859-1 characters, or use RFC 2047 MIME encoding.
On Python platforms where the str or StringType type is in fact Unicode-based (e.g. Jython, IronPython, Python 3000, etc.), all "strings" referred to in this specification must contain only code points representable in ISO-8859-1 encoding (\u0000 through \u00FF, inclusive). It is a fatal error for an application to supply strings containing any other Unicode character or code point. Similarly, servers and gateways must not supply strings to an application containing any other Unicode characters.
Again, all strings referred to in this specification must be of type str or StringType, and must not be of type unicode or UnicodeType. And, even if a given platform allows for more than 8 bits per character in str/StringType objects, only the lower 8 bits may be used, for any value referred to in this specification as a "string".
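On a Unicode-based platform, the constraint above can be checked mechanically; check_wsgi_string is a hypothetical validator, not part of the specification:

```python
def check_wsgi_string(s):
    # Hypothetical validator: on Unicode-based platforms every "string"
    # in this specification must contain only code points \u0000-\u00FF.
    for ch in s:
        if ord(ch) > 0xFF:
            raise ValueError('non ISO-8859-1 code point: %r' % ch)
    return s
```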
Error Handling
In general, applications should try to trap their own, internal errors, and display a helpful message in the browser. (It is up to the application to decide what "helpful" means in this context.)
However, to display such a message, the application must not have actually sent any data to the browser yet, or else it risks corrupting the response. WSGI therefore provides a mechanism to either allow the application to send its error message, or be automatically aborted: the exc_info argument to start_response. Here is an example of its use:
try:
    # regular application code here
    status = "200 Froody"
    response_headers = [("content-type", "text/plain")]
    start_response(status, response_headers)
    return ["normal body goes here"]
except:
    # XXX should trap runtime issues like MemoryError, KeyboardInterrupt
    #     in a separate handler before this bare 'except:'...
    status = "500 Oops"
    response_headers = [("content-type", "text/plain")]
    start_response(status, response_headers, sys.exc_info())
    return ["error body goes here"]
If no output has been written when an exception occurs, the call to start_response will return normally, and the application will return an error body to be sent to the browser. However, if any output has already been sent to the browser, start_response will reraise the provided exception. This exception should not be trapped by the application, and so the application will abort. The server or gateway can then trap this (fatal) exception and abort the response.
Servers should trap and log any exception that aborts an application or the iteration of its return value. If a partial response has already been written to the browser when an application error occurs, the server or gateway may attempt to add an error message to the output, if the already-sent headers indicate a text/* content type that the server knows how to modify cleanly.
Some middleware may wish to provide additional exception handling services, or intercept and replace application error messages. In such cases, middleware may choose to not re-raise the exc_info supplied to start_response, but instead raise a middleware-specific exception, or simply return without an exception after storing the supplied arguments. This will then cause the application to return its error body iterable (or invoke write()), allowing the middleware to capture and modify the error output. These techniques will work as long as application authors:
- Always provide exc_info when beginning an error response
- Never trap errors raised by start_response when exc_info is being provided
HTTP 1.1 Expect/Continue
Servers and gateways that implement HTTP 1.1 must provide transparent support for HTTP 1.1's "expect/continue" mechanism. This may be done in any of several ways:
- Respond to requests containing an Expect: 100-continue request header with an immediate "100 Continue" response, and proceed normally.
- Proceed with the request normally, but provide the application with a wsgi.input stream that will send the "100 Continue" response if/when the application first attempts to read from the input stream. The read request must then remain blocked until the client responds.
- Wait until the client decides that the server does not support expect/continue, and sends the request body on its own. (This is suboptimal, and is not recommended.)
Note that these behavior restrictions do not apply for HTTP 1.0 requests, or for requests that are not directed to an application object. For more information on HTTP 1.1 Expect/Continue, see RFC 2616, sections 8.2.3 and 10.1.1.
Other HTTP Features
In general, servers and gateways should "play dumb" and allow the application complete control over its output. They should only make changes that do not alter the effective semantics of the application's response. It is always possible for the application developer to add middleware components to supply additional features, so server/gateway developers should be conservative in their implementation. In a sense, a server should consider itself to be like an HTTP "gateway server", with the application being an HTTP "origin server". (See RFC 2616, section 1.3, for the definition of these terms.)
However, because WSGI servers and applications do not communicate via HTTP, what RFC 2616 calls "hop-by-hop" headers do not apply to WSGI internal communications. WSGI applications must not generate any "hop-by-hop" headers [4], attempt to use HTTP features that would require them to generate such headers, or rely on the content of any incoming "hop-by-hop" headers in the environ dictionary. WSGI servers must handle any supported inbound "hop-by-hop" headers on their own, such as by decoding any inbound Transfer-Encoding, including chunked encoding if applicable.
Applying these principles to a variety of HTTP features, it should be clear that a server may handle cache validation via the If-None-Match and If-Modified-Since request headers and the Last-Modified and ETag response headers. However, it is not required to do this, and the application should perform its own cache validation if it wants to support that feature, since the server/gateway is not required to do such validation.
Similarly, a server may re-encode or transport-encode an application's response, but the application should use a suitable content encoding on its own, and must not apply a transport encoding. A server may transmit byte ranges of the application's response if requested by the client and the application does not natively support byte ranges. Again, however, the application should perform this function on its own if desired.
Note that these restrictions on applications do not necessarily mean that every application must reimplement every HTTP feature; many HTTP features can be partially or fully implemented by middleware components, thus freeing both server and application authors from implementing the same features over and over again.
Thread Support
Thread support, or lack thereof, is also server-dependent. Servers that can run multiple requests in parallel should also provide the option of running an application in a single-threaded fashion, so that applications or frameworks that are not thread-safe may still be used with that server.
Implementation/Application Notes
Server Extension APIs
Some server authors may wish to expose more advanced APIs, that application or framework authors can use for specialized purposes. For example, a gateway based on mod_python might wish to expose part of the Apache API as a WSGI extension.
In the simplest case, this requires nothing more than defining an environ variable, such as mod_python.some_api. But, in many cases, the possible presence of middleware can make this difficult. For example, an API that offers access to the same HTTP headers that are found in environ variables, might return different data if environ has been modified by middleware.
In general, any extension API that duplicates, supplants, or bypasses some portion of WSGI functionality runs the risk of being incompatible with middleware components. Server/gateway developers should not assume that nobody will use middleware, because some framework developers specifically intend to organize or reorganize their frameworks to function almost entirely as middleware of various kinds.
So, to provide maximum compatibility, servers and gateways that provide extension APIs that replace some WSGI functionality, must design those APIs so that they are invoked using the portion of the API that they replace. For example, an extension API to access HTTP request headers must require the application to pass in its current environ, so that the server/gateway may verify that HTTP headers accessible via the API have not been altered by middleware. If the extension API cannot guarantee that it will always agree with environ about the contents of HTTP headers, it must refuse service to the application, e.g. by raising an error, returning None instead of a header collection, or whatever is appropriate to the API.
Similarly, if an extension API provides an alternate means of writing response data or headers, it should require the start_response callable to be passed in, before the application can obtain the extended service. If the object passed in is not the same one that the server/gateway originally supplied to the application, it cannot guarantee correct operation and must refuse to provide the extended service to the application.
These guidelines also apply to middleware that adds information such as parsed cookies, form variables, sessions, and the like to environ. Specifically, such middleware should provide these features as functions which operate on environ, rather than simply stuffing values into environ. This helps ensure that information is calculated from environ after any middleware has done any URL rewrites or other environ modifications.
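For example, a middleware author following this guideline would expose parsed cookies as a function over environ, recomputed on each call, rather than a value stuffed into environ at wrapping time. The get_cookies name and its deliberately naive parsing are purely illustrative:

```python
def get_cookies(environ):
    # Recomputed from environ on every call, so later rewrites by other
    # middleware are reflected.  Naive parsing, for illustration only.
    cookies = {}
    for part in environ.get('HTTP_COOKIE', '').split(';'):
        if '=' in part:
            name, _, value = part.strip().partition('=')
            cookies[name] = value
    return cookies
```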
It is very important that these "safe extension" rules be followed by both server/gateway and middleware developers, in order to avoid a future in which middleware developers are forced to delete any and all extension APIs from environ to ensure that their mediation isn't being bypassed by applications using those extensions!
Application Configuration
This specification does not define how a server selects or obtains an application to invoke. These and other configuration options are highly server-specific matters. It is expected that server/gateway authors will document how to configure the server to execute a particular application object, and with what options (such as threading options).
Framework authors, on the other hand, should document how to create an application object that wraps their framework's functionality. The user, who has chosen both the server and the application framework, must connect the two together. However, since both the framework and the server now have a common interface, this should be merely a mechanical matter, rather than a significant engineering effort for each new server/framework pair.
Finally, some applications, frameworks, and middleware may wish to use the environ dictionary to receive simple string configuration options. Servers and gateways should support this by allowing an application's deployer to specify name-value pairs to be placed in environ. In the simplest case, this support can consist merely of copying all operating system-supplied environment variables from os.environ into the environ dictionary, since the deployer in principle can configure these externally to the server, or, in the CGI case, they may be set via the server's configuration files.
Applications should try to keep such required variables to a minimum, since not all servers will support easy configuration of them. Of course, even in the worst case, persons deploying an application can create a script to supply the necessary configuration values:
    from the_app import application

    def new_app(environ, start_response):
        environ['the_app.configval1'] = 'something'
        return application(environ, start_response)
But, most existing applications and frameworks will probably only need a single configuration value from environ, to indicate the location of their application or framework-specific configuration file(s). (Of course, applications should cache such configuration, to avoid having to re-read it upon each invocation.)
URL Reconstruction
If an application wishes to reconstruct a request's complete URL, it may do so using the following algorithm, contributed by Ian Bicking:
    from urllib import quote
    url = environ['wsgi.url_scheme'] + '://'

    if environ.get('HTTP_HOST'):
        url += environ['HTTP_HOST']
    else:
        url += environ['SERVER_NAME']

        if environ['wsgi.url_scheme'] == 'https':
            if environ['SERVER_PORT'] != '443':
                url += ':' + environ['SERVER_PORT']
        else:
            if environ['SERVER_PORT'] != '80':
                url += ':' + environ['SERVER_PORT']

    url += quote(environ.get('SCRIPT_NAME', ''))
    url += quote(environ.get('PATH_INFO', ''))

    if environ.get('QUERY_STRING'):
        url += '?' + environ['QUERY_STRING']
Note that such a reconstructed URL may not be precisely the same URI as requested by the client. Server rewrite rules, for example, may have modified the client's originally requested URL to place it in a canonical form.
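In Python 3, the quote function moved to urllib.parse, and the same algorithm is conveniently packaged as a reusable function. This is an illustrative helper, not part of the WSGI specification itself; the function name is invented:

```python
from urllib.parse import quote

def reconstruct_url(environ):
    # Illustrative helper implementing the algorithm above (Python 3).
    url = environ['wsgi.url_scheme'] + '://'
    if environ.get('HTTP_HOST'):
        # HTTP_HOST, when present, already includes any non-default port
        url += environ['HTTP_HOST']
    else:
        url += environ['SERVER_NAME']
        default_port = '443' if environ['wsgi.url_scheme'] == 'https' else '80'
        if environ['SERVER_PORT'] != default_port:
            url += ':' + environ['SERVER_PORT']
    url += quote(environ.get('SCRIPT_NAME', ''))
    url += quote(environ.get('PATH_INFO', ''))
    if environ.get('QUERY_STRING'):
        url += '?' + environ['QUERY_STRING']
    return url
```

For example, given `{'wsgi.url_scheme': 'https', 'SERVER_NAME': 'example.com', 'SERVER_PORT': '8443', 'SCRIPT_NAME': '/app', 'PATH_INFO': '/albums', 'QUERY_STRING': 'page=2'}`, this yields `https://example.com:8443/app/albums?page=2`.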
Supporting Older (<2.2) Versions of Python
Some servers, gateways, or applications may wish to support older (<2.2) versions of Python. This is especially important if Jython is a target platform, since as of this writing a production-ready version of Jython 2.2 is not yet available.
For servers and gateways, this is relatively straightforward: servers and gateways targeting pre-2.2 versions of Python must simply restrict themselves to using only a standard "for" loop to iterate over any iterable returned by an application. This is the only way to ensure source-level compatibility with both the pre-2.2 iterator protocol (discussed further below) and "today's" iterator protocol (see PEP 234).
(Note that this technique necessarily applies only to servers, gateways, or middleware that are written in Python. Discussion of how to use iterator protocol(s) correctly from other languages is outside the scope of this PEP.)
For applications, supporting pre-2.2 versions of Python is slightly more complex:
- You may not return a file object and expect it to work as an iterable, since before Python 2.2, files were not iterable. (In general, you shouldn't do this anyway, because it will perform quite poorly most of the time!) Use wsgi.file_wrapper or an application-specific file wrapper class. (See Optional Platform-Specific File Handling for more on wsgi.file_wrapper, and an example class you can use to wrap a file as an iterable.)
- If you return a custom iterable, it must implement the pre-2.2 iterator protocol. That is, provide a __getitem__ method that accepts an integer key, and raises IndexError when exhausted. (Note that built-in sequence types are also acceptable, since they also implement this protocol.)
Finally, middleware that wishes to support pre-2.2 versions of Python, and iterates over application return values or itself returns an iterable (or both), must follow the appropriate recommendations above.
(Note: It should go without saying that to support pre-2.2 versions of Python, any server, gateway, application, or middleware must also use only language features available in the target version, use 1 and 0 instead of True and False, etc.)
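Current Python still honors this legacy "sequence" protocol when a class defines __getitem__ but no __iter__, so the pre-2.2 style is easy to demonstrate. The class name here is invented for illustration:

```python
class Pre22Iterable:
    """Iterable via the pre-2.2 protocol: __getitem__ plus IndexError."""
    def __init__(self, items):
        self.items = items

    def __getitem__(self, key):
        # A for-loop calls this with keys 0, 1, 2, ... until IndexError
        if key < len(self.items):
            return self.items[key]
        raise IndexError
```

A for loop (or list()) consumes such an object exactly like a modern iterable, in both old and new Pythons.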
Optional Platform-Specific File Handling
Some operating environments provide special high-performance file-transmission facilities, such as the Unix sendfile() call. Servers and gateways may expose this functionality via an optional wsgi.file_wrapper key in the environ. An application may use this "file wrapper" to convert a file or file-like object into an iterable that it then returns, e.g.:
    if 'wsgi.file_wrapper' in environ:
        return environ['wsgi.file_wrapper'](filelike, block_size)
    else:
        return iter(lambda: filelike.read(block_size), '')
If the server or gateway supplies wsgi.file_wrapper, it must be a callable that accepts one required positional parameter, and one optional positional parameter. The first parameter is the file-like object to be sent, and the second parameter is an optional block size "suggestion" (which the server/gateway need not use). The callable must return an iterable object, and must not perform any data transmission until and unless the server/gateway actually receives the iterable as a return value from the application. (To do otherwise would prevent middleware from being able to interpret or override the response data.)
To be considered "file-like", the object supplied by the application must have a read() method that takes an optional size argument. It may have a close() method, and if so, the iterable returned by wsgi.file_wrapper must have a close() method that invokes the original file-like object's close() method. If the "file-like" object has any other methods or attributes with names matching those of Python built-in file objects (e.g. fileno()), the wsgi.file_wrapper may assume that these methods or attributes have the same semantics as those of a built-in file object.
The actual implementation of any platform-specific file handling must occur after the application returns, and the server or gateway checks to see if a wrapper object was returned. (Again, because of the presence of middleware, error handlers, and the like, it is not guaranteed that any wrapper created will actually be used.)
Apart from the handling of close(), the semantics of returning a file wrapper from the application should be the same as if the application had returned iter(filelike.read, ''). In other words, transmission should begin at the current position within the "file" at the time that transmission begins, and continue until the end is reached.
Of course, platform-specific file transmission APIs don't usually accept arbitrary "file-like" objects. Therefore, a wsgi.file_wrapper has to introspect the supplied object for things such as a fileno() (Unix-like OSes) or a java.nio.FileChannel (under Jython) in order to determine if the file-like object is suitable for use with the platform-specific API it supports.
Note that even if the object is not suitable for the platform API, the wsgi.file_wrapper must still return an iterable that wraps read() and close(), so that applications using file wrappers are portable across platforms. Here's a simple platform-agnostic file wrapper class, suitable for old (pre 2.2) and new Pythons alike:
    class FileWrapper:

        def __init__(self, filelike, blksize=8192):
            self.filelike = filelike
            self.blksize = blksize
            if hasattr(filelike, 'close'):
                self.close = filelike.close

        def __getitem__(self, key):
            data = self.filelike.read(self.blksize)
            if data:
                return data
            raise IndexError
and here is a snippet from a server/gateway that uses it to provide access to a platform-specific API:
    environ['wsgi.file_wrapper'] = FileWrapper
    result = application(environ, start_response)

    try:
        if isinstance(result, FileWrapper):
            # check if result.filelike is usable w/platform-specific
            # API, and if so, use that API to transmit the result,
            # skipping the normal iterable handling loop below
            pass
        for data in result:
            pass  # normal iterable handling: transmit each block
    finally:
        if hasattr(result, 'close'):
            result.close()
Questions and Answers
Why must environ be a dictionary? What's wrong with using a subclass?
The rationale for requiring a dictionary is to maximize portability between servers. The alternative would be to define some subset of a dictionary's methods as being the standard and portable interface. In practice, however, most servers will probably find a dictionary adequate to their needs, and thus framework authors will come to expect the full set of dictionary features to be available, since they will be there more often than not. But, if some server chooses not to use a dictionary, then there will be interoperability problems despite that server's "conformance" to spec. Therefore, making a dictionary mandatory simplifies the specification and guarantees interoperability.
Note that this does not prevent server or framework developers from offering specialized services as custom variables inside the environ dictionary. This is the recommended approach for offering any such value-added services.
Why can you call write() and yield strings/return an iterable? Shouldn't we pick just one way?
If we supported only the iteration approach, then current frameworks that assume the availability of "push" would suffer. But if we supported only pushing via write(), then server performance would suffer when transmitting, for example, large files (if a worker thread can't begin work on a new request until all of the output has been sent). Thus, this compromise allows an application framework to support both approaches, as appropriate, but with only a little more burden on the server implementor than a push-only approach would require.
What's the close() for?
When writes are done during the execution of an application object, the application can ensure that resources are released using a try/finally block. But, if the application returns an iterable, any resources used will not be released until the iterable is garbage collected. The close() idiom allows an application to release critical resources at the end of a request, and it's forward-compatible with the support for try/finally in generators that's proposed by PEP 325.
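A minimal sketch of the server-side half of this contract: the server iterates over the result inside a try block and invokes close() in a finally clause, so resources are released even if transmission fails partway. The function and parameter names here are illustrative, not mandated by the spec:

```python
def run_application(app, environ, start_response_impl, write_data):
    # Hypothetical server internals: iterate the application's
    # return value and guarantee close() is called afterward,
    # even if an error occurs during transmission.
    result = app(environ, start_response_impl)
    try:
        for data in result:
            write_data(data)
    finally:
        if hasattr(result, 'close'):
            result.close()
```

An application returning an iterable with a close() method can therefore rely on its cleanup running at the end of the request.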
Why is this interface so low-level? I want feature X! (e.g. cookies, sessions, persistence, ...)
This isn't Yet Another Python Web Framework. It's just a way for frameworks to talk to web servers, and vice versa. If you want these features, you need to pick a web framework that provides the features you want. And if that framework lets you create a WSGI application, you should be able to run it in most WSGI-supporting servers. Also, some WSGI servers may offer additional services via objects provided in their environ dictionary; see the applicable server documentation for details. (Of course, applications that use such extensions will not be portable to other WSGI-based servers.)
Why use CGI variables instead of good old HTTP headers? And why mix them in with WSGI-defined variables?
Many existing web frameworks are built heavily upon the CGI spec, and existing web servers know how to generate CGI variables. In contrast, alternative ways of representing inbound HTTP information are fragmented and lack market share. Thus, using the CGI "standard" seems like a good way to leverage existing implementations. As for mixing them with WSGI variables, separating them would just require two dictionary arguments to be passed around, while providing no real benefits.
What about the status string? Can't we just use the number, passing in 200 instead of "200 OK"?
Doing this would complicate the server or gateway, by requiring them to have a table of numeric statuses and corresponding messages. By contrast, it is easy for an application or framework author to type the extra text to go with the specific response code they are using, and existing frameworks often already have a table containing the needed messages. So, on balance it seems better to make the application/framework responsible, rather than the server or gateway.
Why is wsgi.run_once not guaranteed to run the app only once?
Because it's merely a suggestion to the application that it should "rig for infrequent running". This is intended for application frameworks that have multiple modes of operation for caching, sessions, and so forth. In a "multiple run" mode, such frameworks may preload caches, and may not write e.g. logs or session data to disk after each request. In "single run" mode, such frameworks avoid preloading and flush all necessary writes after each request.
However, in order to test an application or framework to verify correct operation in the latter mode, it may be necessary (or at least expedient) to invoke it more than once. Therefore, an application should not assume that it will definitely not be run again, just because it is called with wsgi.run_once set to True.
Feature X (dictionaries, callables, etc.) is ugly for use in application code; why don't we use objects instead?
All of these implementation choices of WSGI are specifically intended to decouple features from one another; recombining these features into encapsulated objects makes it somewhat harder to write servers or gateways, and an order of magnitude harder to write middleware that replaces or modifies only small portions of the overall functionality.
In essence, middleware wants to have a "Chain of Responsibility" pattern, whereby it can act as a "handler" for some functions, while allowing others to remain unchanged. This is difficult to do with ordinary Python objects, if the interface is to remain extensible. For example, one must use __getattr__ or __getattribute__ overrides, to ensure that extensions (such as attributes defined by future WSGI versions) are passed through.
This type of code is notoriously difficult to get 100% correct, and few people will want to write it themselves. They will therefore copy other people's implementations, but fail to update them when the person they copied from corrects yet another corner case.
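For illustration, the sort of passthrough boilerplate being described might look like this hypothetical request-object wrapper (not part of WSGI; all names invented):

```python
class RequestProxy:
    """Hypothetical middleware wrapper that overrides one attribute
    and must forward everything else, including attributes that
    future interface versions might add."""
    def __init__(self, wrapped):
        self._wrapped = wrapped

    @property
    def url_scheme(self):
        return 'https'  # the one thing this proxy changes

    def __getattr__(self, name):
        # Easy to get subtly wrong: special methods looked up on the
        # type, descriptors, and attributes set after construction
        # all need separate care.
        return getattr(self._wrapped, name)
```

With a plain dictionary, the equivalent middleware is just `environ = dict(environ); environ['wsgi.url_scheme'] = 'https'`, and unknown keys pass through for free.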
Further, this necessary boilerplate would be pure excise, a developer tax paid by middleware developers to support a slightly prettier API for application framework developers. But, application framework developers will typically only be updating one framework to support WSGI, and in a very limited part of their framework as a whole. It will likely be their first (and maybe their only) WSGI implementation, and thus they will likely implement with this specification ready to hand. Thus, the effort of making the API "prettier" with object attributes and suchlike would likely be wasted for this audience.
We encourage those who want a prettier (or otherwise improved) WSGI interface for use in direct web application programming (as opposed to web framework development) to develop APIs or frameworks that wrap WSGI for convenient use by application developers. In this way, WSGI can remain conveniently low-level for server and middleware authors, while not being "ugly" for application developers.
Proposed/Under Discussion
These items are currently being discussed on the Web-SIG and elsewhere, or are on the PEP author's "to-do" list:
- Should wsgi.input be an iterator instead of a file? This would help for asynchronous applications and chunked-encoding input streams.
- Optional extensions are being discussed for pausing iteration of an application's output until input is available or until a callback occurs.
- Add a section about synchronous vs. asynchronous apps and servers, the relevant threading models, and issues/design goals in these areas.
Acknowledgements
Thanks go to the many folks on the Web-SIG mailing list whose thoughtful feedback made this revised draft possible. Especially:
- Gregory "Grisha" Trubetskoy, author of mod_python, who beat up on the first draft as not offering any advantages over "plain old CGI", thus encouraging me to look for a better approach.
- Ian Bicking, who helped nag me into properly specifying the multithreading and multiprocess options, as well as badgering me to provide a mechanism for servers to supply custom extension data to an application.
- Tony Lownds, who came up with the concept of a start_response function that took the status and headers, returning a write function. His input also guided the design of the exception handling facilities, especially in the area of allowing for middleware that overrides application error messages.
- Alan Kennedy, whose courageous attempts to implement WSGI-on-Jython (well before the spec was finalized) helped to shape the "supporting older versions of Python" section, as well as the optional wsgi.file_wrapper facility.
- Mark Nottingham, who reviewed the spec extensively for issues with HTTP RFC compliance, especially with regard to HTTP/1.1 features that I didn't even know existed until he pointed them out.
References
| [1] | The Python Wiki "Web Programming" topic (http://www.python.org/cgi-bin/moinmoin/WebProgramming) |
| [2] | The Common Gateway Interface Specification, v 1.1, 3rd Draft (http://ken.coar.org/cgi/draft-coar-cgi-v11-03.txt) |
| [3] | "Chunked Transfer Coding" -- HTTP/1.1, section 3.6.1 (http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.6.1) |
| [4] | "End-to-end and Hop-by-hop Headers" -- HTTP/1.1, Section 13.5.1 (http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.5.1) |
| [5] | mod_ssl Reference, "Environment Variables" (http://www.modssl.org/docs/2.8/ssl_reference.html#ToC25) |
Copyright
This document has been placed in the public domain.
pep-0334 Simple Coroutines via SuspendIteration
| PEP: | 334 |
|---|---|
| Title: | Simple Coroutines via SuspendIteration |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Clark C. Evans <cce at clarkevans.com> |
| Status: | Withdrawn |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 26-Aug-2004 |
| Python-Version: | 3.0 |
| Post-History: |
Contents
Abstract
Asynchronous application frameworks such as Twisted [1] and Peak [2], are based on a cooperative multitasking via event queues or deferred execution. While this approach to application development does not involve threads and thus avoids a whole class of problems [3], it creates a different sort of programming challenge. When an I/O operation would block, a user request must suspend so that other requests can proceed. The concept of a coroutine [4] promises to help the application developer grapple with this state management difficulty.
This PEP proposes a limited approach to coroutines based on an extension to the iterator protocol [5]. Currently, an iterator may raise a StopIteration exception to indicate that it is done producing values. This proposal adds another exception to this protocol, SuspendIteration, which indicates that the given iterator may have more values to produce, but is unable to do so at this time.
Rationale
There are two current approaches to bringing coroutines to Python. Christian Tismer's Stackless [6] involves a ground-up restructuring of Python's execution model by hacking the 'C' stack. While this approach works, its operation is hard to describe and keep portable. A related approach is to compile Python code to Parrot [7], a register-based virtual machine which has coroutines. Unfortunately, neither of these solutions is portable to IronPython (CLR) or Jython (JavaVM).
It is thought that a more limited approach, based on iterators, could provide a coroutine facility to application programmers and still be portable across runtimes.
- Iterators keep their state in local variables that are not on the "C" stack. Iterators can be viewed as classes, with state stored in member variables that are persistent across calls to its next() method.
- While an uncaught exception may terminate a function's execution, an uncaught exception need not invalidate an iterator. The proposed exception, SuspendIteration, uses this feature. In other words, one call to next() raising an exception does not necessarily imply that the iterator itself is no longer capable of producing values.
There are four places where this new exception would have an impact:
- The simple generator [8] mechanism could be extended to safely 'catch' this SuspendIteration exception, stuff away its current state, and pass the exception on to the caller.
- Various iterator filters [9] in the standard library, such as itertools.izip, should be made aware of this exception so that they can transparently propagate SuspendIteration.
- Iterators generated from I/O operations, such as a file or socket reader, could be modified to have a non-blocking variety. This option would raise a subclass of SuspendIteration if the requested operation would block.
- The asyncore library could be updated to provide a basic 'runner' that pulls from an iterator; if the SuspendIteration exception is caught, then it moves on to the next iterator in its runlist [10]. External frameworks like Twisted would provide alternative implementations, perhaps based on FreeBSD's kqueue or Linux's epoll.
While these may seem dramatic changes, it is a very small amount of work compared with the utility provided by continuations.
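As a sketch of the last item, a minimal round-robin runner can be written in ordinary Python today. SuspendIteration is defined locally here because it is only proposed, and class-based iterators must be used, since current generators cannot resume after raising an exception; all names are illustrative:

```python
class SuspendIteration(Exception):
    """Proposed exception: more values may come, but not right now."""

def round_robin(iterators):
    # Minimal cooperative "runner": cycle through the runlist,
    # retrying any iterator that raises SuspendIteration and
    # dropping any that raises StopIteration.
    runlist = [(i, iter(it)) for i, it in enumerate(iterators)]
    results = [[] for _ in iterators]
    while runlist:
        still_running = []
        for i, it in runlist:
            try:
                results[i].append(next(it))
                still_running.append((i, it))
            except SuspendIteration:
                still_running.append((i, it))  # retry on the next pass
            except StopIteration:
                pass  # finished; drop from the runlist
        runlist = still_running
    return results
```

A real implementation would use select/poll (or kqueue/epoll, as noted above) to decide when a suspended iterator is worth retrying, rather than busy-looping.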
Semantics
This section will explain, at a high level, how the introduction of this new SuspendIteration exception would behave.
Simple Iterators
The current functionality of iterators is best seen with a simple example which produces two values 'one' and 'two'.
    class States:

        def __iter__(self):
            self._next = self.state_one
            return self

        def next(self):
            return self._next()

        def state_one(self):
            self._next = self.state_two
            return "one"

        def state_two(self):
            self._next = self.state_stop
            return "two"

        def state_stop(self):
            raise StopIteration

    print list(States())
An equivalent iteration could, of course, be created by the following generator:
    def States():
        yield 'one'
        yield 'two'

    print list(States())
Introducing SuspendIteration
Suppose that between producing 'one' and 'two', the generator above could block on a socket read. In this case, we would want to raise SuspendIteration to signal that the iterator is not done producing, but is unable to provide a value at the current moment.
    from random import randint
    from time import sleep

    class SuspendIteration(Exception):
        pass

    class NonBlockingResource:
        """Randomly unable to produce the second value"""

        def __iter__(self):
            self._next = self.state_one
            return self

        def next(self):
            return self._next()

        def state_one(self):
            self._next = self.state_suspend
            return "one"

        def state_suspend(self):
            rand = randint(1, 10)
            if 2 == rand:
                self._next = self.state_two
                return self.state_two()
            raise SuspendIteration()

        def state_two(self):
            self._next = self.state_stop
            return "two"

        def state_stop(self):
            raise StopIteration

    def sleeplist(iterator, timeout=0.1):
        """
        Do other things (e.g. sleep) while resource is
        unable to provide the next value
        """
        it = iter(iterator)
        retval = []
        while True:
            try:
                retval.append(it.next())
            except SuspendIteration:
                sleep(timeout)
                continue
            except StopIteration:
                break
        return retval

    print sleeplist(NonBlockingResource())
In a real-world situation, the NonBlockingResource would be a file iterator, socket handle, or other I/O based producer. The sleeplist would instead be an async reactor, such as those found in asyncore or Twisted. The non-blocking resource could, of course, be written as a generator:
    def NonBlockingResource():
        yield "one"
        while True:
            rand = randint(1, 10)
            if 2 == rand:
                break
            raise SuspendIteration()
        yield "two"
It is not necessary to add a 'suspend' keyword, since most real content generators will not live in application code; they will be low-level, I/O-based operations. Since most programmers need never be exposed to the SuspendIteration() mechanism, a keyword is not needed.
Application Iterators
The previous example is rather contrived; a more real-world example would be a web page generator which yields HTML content and pulls from a database. Note that this is an example of neither the 'producer' nor the 'consumer', but rather of a filter.
    def ListAlbums(cursor):
        cursor.execute("SELECT title, artist FROM album")
        yield '<html><body><table><tr><td>Title</td><td>Artist</td></tr>'
        for (title, artist) in cursor:
            yield '<tr><td>%s</td><td>%s</td></tr>' % (title, artist)
        yield '</table></body></html>'
The problem, of course, is that the database may block for some time before any rows are returned, and that during execution, rows may be returned in blocks of 10 or 100 at a time. Ideally, if the database blocks for the next set of rows, another user connection could be serviced. Note the complete absence of SuspendIteration in the above code. If done correctly, application developers would be able to focus on functionality rather than concurrency issues.
The iterator created by the above generator should do the magic necessary to maintain state, yet pass the exception through to a lower-level async framework. Here is an example of what the corresponding iterator would look like if coded up as a class:
    class ListAlbums:

        def __init__(self, cursor):
            self.cursor = cursor

        def __iter__(self):
            self.cursor.execute("SELECT title, artist FROM album")
            self._iter = iter(self.cursor)
            self._next = self.state_head
            return self

        def next(self):
            return self._next()

        def state_head(self):
            self._next = self.state_cursor
            return "<html><body><table><tr><td>\
                    Title</td><td>Artist</td></tr>"

        def state_tail(self):
            self._next = self.state_stop
            return "</table></body></html>"

        def state_cursor(self):
            try:
                (title, artist) = self._iter.next()
                return '<tr><td>%s</td><td>%s</td></tr>' % (title, artist)
            except StopIteration:
                self._next = self.state_tail
                return self.next()
            except SuspendIteration:
                # just pass-through
                raise

        def state_stop(self):
            raise StopIteration
Complicating Factors
While the above example is straightforward, things are a bit more complicated if the intermediate generator 'condenses' values, that is, pulls in two or more values for each value it produces. For example:
    def pair(iterLeft, iterRight):
        rhs = iter(iterRight)
        lhs = iter(iterLeft)
        while True:
            yield (rhs.next(), lhs.next())
In this case, the corresponding iterator behavior has to be a bit more subtle to handle the case of either the right or left iterator raising SuspendIteration. It seems to be a matter of decomposing the generator so as to recognize the intermediate states at which a SuspendIteration exception from the producing context could occur.
    class pair:

        def __init__(self, iterLeft, iterRight):
            self.iterLeft = iterLeft
            self.iterRight = iterRight

        def __iter__(self):
            self.rhs = iter(self.iterRight)
            self.lhs = iter(self.iterLeft)
            self._temp_rhs = None
            self._temp_lhs = None
            self._next = self.state_rhs
            return self

        def next(self):
            return self._next()

        def state_rhs(self):
            self._temp_rhs = self.rhs.next()
            self._next = self.state_lhs
            return self.next()

        def state_lhs(self):
            self._temp_lhs = self.lhs.next()
            self._next = self.state_pair
            return self.next()

        def state_pair(self):
            self._next = self.state_rhs
            return (self._temp_rhs, self._temp_lhs)
This proposal assumes that a corresponding iterator written using this class-based method is possible for existing generators. The challenge seems to be the identification of distinct states within the generator where suspension could occur.
Resource Cleanup
The current generator mechanism has an awkward interaction with exceptions: a 'yield' statement is not allowed within a try/finally block. The SuspendIteration exception raises a similar issue. The impact of this issue is not yet clear; however, it may be that rewriting the generator into a state machine, as the previous section did, would leave the situation no worse than it is today, and perhaps even remove the yield/finally restriction. More investigation is needed in this area.
API and Limitations
This proposal only covers 'suspending' a chain of iterators, and does not (of course) cover suspending general functions, methods, or "C" extension functions. While there could be no direct support for creating generators in "C" code, native "C" iterators which comply with the SuspendIteration semantics are certainly possible.
Low-Level Implementation
The author of the PEP is not yet familiar with the Python execution model to comment in this area.
References
| [1] | Twisted (http://twistedmatrix.com) |
| [2] | Peak (http://peak.telecommunity.com) |
| [3] | C10K (http://www.kegel.com/c10k.html) |
| [4] | Coroutines (http://c2.com/cgi/wiki?CallWithCurrentContinuation) |
| [5] | PEP 234, Iterators (http://www.python.org/dev/peps/pep-0234/) |
| [6] | Stackless Python (http://stackless.com) |
| [7] | Parrot /w coroutines (http://www.sidhe.org/~dan/blog/archives/000178.html) |
| [8] | PEP 255, Simple Generators (http://www.python.org/dev/peps/pep-0255/) |
| [9] | itertools - Functions creating iterators (http://docs.python.org/library/itertools.html) |
| [10] | Microthreads in Python, David Mertz (http://www-106.ibm.com/developerworks/linux/library/l-pythrd.html) |
Copyright
This document has been placed in the public domain.
pep-0335 Overloadable Boolean Operators
| PEP: | 335 |
|---|---|
| Title: | Overloadable Boolean Operators |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Gregory Ewing <greg.ewing at canterbury.ac.nz> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 29-Aug-2004 |
| Python-Version: | 3.3 |
| Post-History: | 05-Sep-2004, 30-Sep-2011, 25-Oct-2011 |
Contents
Rejection Notice
This PEP was rejected. See http://mail.python.org/pipermail/python-dev/2012-March/117510.html
Abstract
This PEP proposes an extension to permit objects to define their own meanings for the boolean operators 'and', 'or' and 'not', and suggests an efficient strategy for implementation. A prototype of this implementation is available for download.
Background
Python does not currently provide any '__xxx__' special methods corresponding to the 'and', 'or' and 'not' boolean operators. In the case of 'and' and 'or', the most likely reason is that these operators have short-circuiting semantics, i.e. the second operand is not evaluated if the result can be determined from the first operand. The usual technique of providing special methods for these operators therefore would not work.
There is no such difficulty in the case of 'not', however, and it would be straightforward to provide a special method for this operator. The rest of this proposal will therefore concentrate mainly on providing a way to overload 'and' and 'or'.
Motivation
There are many applications in which it is natural to provide custom meanings for Python operators, and in some of these, having boolean operators excluded from those able to be customised can be inconvenient. Examples include:
1. NumPy, in which almost all the operators are defined on arrays so as to perform the appropriate operation between corresponding elements, and return an array of the results. For consistency, one would expect a boolean operation between two arrays to return an array of booleans, but this is not currently possible.

   There is a precedent for an extension of this kind: comparison operators were originally restricted to returning boolean results, and rich comparisons were added so that comparisons of NumPy arrays could return arrays of booleans.

2. A symbolic algebra system, in which a Python expression is evaluated in an environment which results in it constructing a tree of objects corresponding to the structure of the expression.

3. A relational database interface, in which a Python expression is used to construct an SQL query.
A workaround often suggested is to use the bitwise operators '&', '|' and '~' in place of 'and', 'or' and 'not', but this has some drawbacks:
- The precedence of these is different in relation to the other operators, and they may already be in use for other purposes (as in example 1).
- It is aesthetically displeasing to force users to use something other than the most obvious syntax for what they are trying to express. This would be particularly acute in the case of example 3, considering that boolean operations are a staple of SQL queries.
- Bitwise operators do not provide a solution to the problem of chained comparisons such as 'a < b < c' which involve an implicit 'and' operation. Such expressions currently cannot be used at all on data types such as NumPy arrays where the result of a comparison cannot be treated as having normal boolean semantics; they must be expanded into something like (a < b) & (b < c), losing a considerable amount of clarity.
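The bitwise-operator workaround, in the expression-tree style of example 2, can be illustrated with a tiny sketch (the class and method of printing are invented for illustration):

```python
class Expr:
    """Tiny expression-tree node using the bitwise-operator workaround."""
    def __init__(self, op, *args):
        self.op, self.args = op, args

    def __and__(self, other):      # '&' stands in for 'and'
        return Expr('AND', self, other)

    def __or__(self, other):       # '|' stands in for 'or'
        return Expr('OR', self, other)

    def __invert__(self):          # '~' stands in for 'not'
        return Expr('NOT', self)

    def __repr__(self):
        if self.op == 'VAR':
            return self.args[0]
        if self.op == 'NOT':
            return 'NOT %r' % self.args[0]
        return '(%r %s %r)' % (self.args[0], self.op, self.args[1])

a, b = Expr('VAR', 'a'), Expr('VAR', 'b')
tree = a & ~b   # cannot be spelled as the more natural: a and not b
```

Note the precedence pitfall in practice: `a & b == c` parses as `a & (b == c)`, whereas `a and b == c` would parse as `a and (b == c)` for a different reason entirely.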
Rationale
The requirements for a successful solution to the problem of allowing boolean operators to be customised are:
- In the default case (where there is no customisation), the existing short-circuiting semantics must be preserved.
- There must not be any appreciable loss of speed in the default case.
- Ideally, the customisation mechanism should allow the object to provide either short-circuiting or non-short-circuiting semantics, at its discretion.
One obvious strategy, that has been previously suggested, is to pass into the special method the first argument and a function for evaluating the second argument. This would satisfy requirements 1 and 3, but not requirement 2, since it would incur the overhead of constructing a function object and possibly a Python function call on every boolean operation. Therefore, it will not be considered further here.
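The rejected strategy can be sketched in pure Python. The `logical_and` dispatcher and the `__logical_and__` method name are illustrative only, not part of the proposal; the point is that a thunk (a zero-argument function) must be built for the second operand on every operation:

```python
# Sketch of the rejected strategy: dispatch on the first operand,
# passing a thunk that evaluates the second operand on demand.
def logical_and(first, second_thunk):
    method = getattr(type(first), '__logical_and__', None)
    if method is not None:
        return method(first, second_thunk)
    # default short-circuiting semantics
    return first and second_thunk()

calls = []
def second():
    calls.append('evaluated')
    return 42

assert logical_and(0, second) == 0    # short-circuits; thunk never called
assert calls == []
assert logical_and(1, second) == 42   # thunk evaluated exactly once
```

The overhead the PEP objects to is visible here: even in the default case, a function object must be created for the second operand of every 'and' and 'or'.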
The following section proposes a strategy that addresses all three requirements. A prototype implementation [1] of this strategy is available for download.
Specification
Special Methods
At the Python level, objects may define the following special methods.
| Unary | Binary, phase 1 | Binary, phase 2 |
|---|---|---|
| __not__ | __and1__ | __and2__, __rand2__ |
|  | __or1__ | __or2__, __ror2__ |
The __not__ method, if defined, implements the 'not' operator. If it is not defined, or it returns NotImplemented, existing semantics are used.
To permit short-circuiting, processing of the 'and' and 'or' operators is split into two phases. Phase 1 occurs after evaluation of the first operand but before the second. If the first operand defines the relevant phase 1 method, it is called with the first operand as argument. If that method can determine the result without needing the second operand, it returns the result, and further processing is skipped.
If the phase 1 method determines that the second operand is needed, it returns the special value NeedOtherOperand. This triggers the evaluation of the second operand, and the calling of a relevant phase 2 method. During phase 2, the __and2__/__rand2__ and __or2__/__ror2__ method pairs work as for other binary operators.
Processing falls back to existing semantics if at any stage a relevant special method is not found or returns NotImplemented.
As a special case, if the first operand defines a phase 2 method but no corresponding phase 1 method, the second operand is always evaluated and the phase 2 method called. This allows an object which does not want short-circuiting semantics to simply implement the phase 2 methods and ignore phase 1.
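The two-phase protocol can be modelled in pure Python. NeedOtherOperand, __and1__ and __and2__ are the names the PEP proposes; the `logical_and` dispatcher below is only a model of what the interpreter's bytecode loop would do (NotImplemented fallback omitted for brevity):

```python
# Pure-Python model of the proposed two-phase 'and' protocol.
NeedOtherOperand = object()

def logical_and(first, second_thunk):
    cls = type(first)
    phase1 = getattr(cls, '__and1__', None)
    phase2 = getattr(cls, '__and2__', None)
    if phase1 is not None:
        result = phase1(first)
        if result is not NeedOtherOperand:
            return result                     # short-circuit: phase 2 skipped
        return phase2(first, second_thunk())
    if phase2 is not None:
        # phase 2 without phase 1: both operands always evaluated
        return phase2(first, second_thunk())
    return first and second_thunk()           # existing semantics

class Pair:
    """Toy element-wise container, standing in for a NumPy-like array."""
    def __init__(self, items):
        self.items = items
    def __and1__(self):
        return NeedOtherOperand               # never short-circuits
    def __and2__(self, other):
        return Pair([x and y for x, y in zip(self.items, other.items)])

p = logical_and(Pair([0, 1]), lambda: Pair([1, 1]))
assert p.items == [0, 1]
```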
Bytecodes
The patch adds four new bytecodes, LOGICAL_AND_1, LOGICAL_AND_2, LOGICAL_OR_1 and LOGICAL_OR_2. As an example of their use, the bytecode generated for an 'and' expression looks like this:
     .
     .
     .
     evaluate first operand
     LOGICAL_AND_1  L
     evaluate second operand
     LOGICAL_AND_2
L:   .
     .
     .
The LOGICAL_AND_1 bytecode performs phase 1 processing. If it determines that the second operand is needed, it leaves the first operand on the stack and continues with the following code. Otherwise it pops the first operand, pushes the result and branches to L.
The LOGICAL_AND_2 bytecode performs phase 2 processing, popping both operands and pushing the result.
Type Slots
At the C level, the new special methods are manifested as five new slots in the type object. In the patch, they are added to the tp_as_number substructure, since this allows making use of some existing code for dealing with unary and binary operators. Their existence is signalled by a new type flag, Py_TPFLAGS_HAVE_BOOLEAN_OVERLOAD.
The new type slots are:
unaryfunc nb_logical_not;
unaryfunc nb_logical_and_1;
unaryfunc nb_logical_or_1;
binaryfunc nb_logical_and_2;
binaryfunc nb_logical_or_2;
Python/C API Functions
There are also five new Python/C API functions corresponding to the new operations:
PyObject *PyObject_LogicalNot(PyObject *);
PyObject *PyObject_LogicalAnd1(PyObject *);
PyObject *PyObject_LogicalOr1(PyObject *);
PyObject *PyObject_LogicalAnd2(PyObject *, PyObject *);
PyObject *PyObject_LogicalOr2(PyObject *, PyObject *);
Alternatives and Optimisations
This section discusses some possible variations on the proposal, and ways in which the bytecode sequences generated for boolean expressions could be optimised.
Reduced special method set
For completeness, the full version of this proposal includes a mechanism for types to define their own customised short-circuiting behaviour. However, the full mechanism is not needed to address the main use cases put forward here, and it would be possible to define a simplified version that only includes the phase 2 methods. There would then only be 5 new special methods (__and2__, __rand2__, __or2__, __ror2__, __not__) with 3 associated type slots and 3 API functions.
This simplified version could be expanded to the full version later if desired.
Additional bytecodes
As defined here, the bytecode sequence for code that branches on the result of a boolean expression would be slightly longer than it currently is. For example, in Python 2.7,
if a and b:
    statement1
else:
    statement2
generates
LOAD_GLOBAL a
POP_JUMP_IF_FALSE false_branch
LOAD_GLOBAL b
POP_JUMP_IF_FALSE false_branch
<code for statement1>
JUMP_FORWARD end_branch
false_branch:
<code for statement2>
end_branch:
Under this proposal as described so far, it would become something like
LOAD_GLOBAL a
LOGICAL_AND_1 test
LOAD_GLOBAL b
LOGICAL_AND_2
test:
POP_JUMP_IF_FALSE false_branch
<code for statement1>
JUMP_FORWARD end_branch
false_branch:
<code for statement2>
end_branch:
This involves executing one extra bytecode in the short-circuiting case and two extra bytecodes in the non-short-circuiting case.
However, by introducing extra bytecodes that combine the logical operations with testing and branching on the result, it can be reduced to the same number of bytecodes as the original:
LOAD_GLOBAL a
AND1_JUMP true_branch, false_branch
LOAD_GLOBAL b
AND2_JUMP_IF_FALSE false_branch
true_branch:
<code for statement1>
JUMP_FORWARD end_branch
false_branch:
<code for statement2>
end_branch:
Here, AND1_JUMP performs phase 1 processing as above, and then examines the result. If there is a result, it is popped from the stack, its truth value is tested and a branch taken to one of two locations.
Otherwise, the first operand is left on the stack and execution continues to the next bytecode. The AND2_JUMP_IF_FALSE bytecode performs phase 2 processing, pops the result and branches if it tests false.
For the 'or' operator, there would be corresponding OR1_JUMP and OR2_JUMP_IF_TRUE bytecodes.
If the simplified version without phase 1 methods is used, then early exiting can only occur if the first operand is false for 'and' and true for 'or'. Consequently, the two-target AND1_JUMP and OR1_JUMP bytecodes can be replaced with AND1_JUMP_IF_FALSE and OR1_JUMP_IF_TRUE, these being ordinary branch instructions with only one target.
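For comparison, the short-circuit jumps the current compiler emits for 'a and b' can be inspected with the dis module (the exact opcode names vary between Python versions, so this only illustrates the shape of the existing bytecode, not the proposal):

```python
import dis

def f(a, b):
    return a and b

dis.dis(f)   # shows a conditional jump implementing the short-circuit

# Whatever the version, some jump instruction is present.
assert any('JUMP' in ins.opname for ins in dis.get_instructions(f))
```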
Optimisation of 'not'
Recent versions of Python implement a simple optimisation in which branching on a negated boolean expression is implemented by reversing the sense of the branch, saving a UNARY_NOT opcode.
Taking a strict view, this optimisation should no longer be performed, because the 'not' operator may be overridden to produce quite different results from usual. However, in typical use cases, it is not envisaged that expressions involving customised boolean operations will be used for branching -- it is much more likely that the result will be used in some other way.
Therefore, it would probably do little harm to specify that the compiler is allowed to use the laws of boolean algebra to simplify any expression that appears directly in a boolean context. If this is inconvenient, the result can always be assigned to a temporary name first.
This would allow the existing 'not' optimisation to remain, and would permit future extensions of it such as using De Morgan's laws to extend it deeper into the expression.
Usage Examples
Example 1: NumPy Arrays
#-----------------------------------------------------------------
#
# This example creates a subclass of numpy array to which
# 'and', 'or' and 'not' can be applied, producing an array
# of booleans.
#
#-----------------------------------------------------------------
from numpy import array, ndarray
class BArray(ndarray):

    def __str__(self):
        return "barray(%s)" % ndarray.__str__(self)

    def __and2__(self, other):
        return (self & other)

    def __or2__(self, other):
        return (self | other)

    def __not__(self):
        return (self == 0)

def barray(*args, **kwds):
    return array(*args, **kwds).view(type = BArray)
a0 = barray([0, 1, 2, 4])
a1 = barray([1, 2, 3, 4])
a2 = barray([5, 6, 3, 4])
a3 = barray([5, 1, 2, 4])
print "a0:", a0
print "a1:", a1
print "a2:", a2
print "a3:", a3
print "not a0:", not a0
print "a0 == a1 and a2 == a3:", a0 == a1 and a2 == a3
print "a0 == a1 or a2 == a3:", a0 == a1 or a2 == a3
Example 1 Output
a0: barray([0 1 2 4])
a1: barray([1 2 3 4])
a2: barray([5 6 3 4])
a3: barray([5 1 2 4])
not a0: barray([ True False False False])
a0 == a1 and a2 == a3: barray([False False False True])
a0 == a1 or a2 == a3: barray([False False False True])
Example 2: Database Queries
#-----------------------------------------------------------------
#
# This example demonstrates the creation of a DSL for database
# queries allowing 'and' and 'or' operators to be used to
# formulate the query.
#
#-----------------------------------------------------------------
class SQLNode(object):

    def __and2__(self, other):
        return SQLBinop("and", self, other)

    def __rand2__(self, other):
        return SQLBinop("and", other, self)

    def __eq__(self, other):
        return SQLBinop("=", self, other)

class Table(SQLNode):

    def __init__(self, name):
        self.__tablename__ = name

    def __getattr__(self, name):
        return SQLAttr(self, name)

    def __sql__(self):
        return self.__tablename__

class SQLBinop(SQLNode):

    def __init__(self, op, opnd1, opnd2):
        self.op = op.upper()
        self.opnd1 = opnd1
        self.opnd2 = opnd2

    def __sql__(self):
        return "(%s %s %s)" % (sql(self.opnd1), self.op, sql(self.opnd2))

class SQLAttr(SQLNode):

    def __init__(self, table, name):
        self.table = table
        self.name = name

    def __sql__(self):
        return "%s.%s" % (sql(self.table), self.name)

class SQLSelect(SQLNode):

    def __init__(self, targets):
        self.targets = targets
        self.where_clause = None

    def where(self, expr):
        self.where_clause = expr
        return self

    def __sql__(self):
        result = "SELECT %s" % ", ".join([sql(target) for target in self.targets])
        if self.where_clause:
            result = "%s WHERE %s" % (result, sql(self.where_clause))
        return result

def sql(expr):
    if isinstance(expr, SQLNode):
        return expr.__sql__()
    elif isinstance(expr, str):
        return "'%s'" % expr.replace("'", "''")
    else:
        return str(expr)

def select(*targets):
    return SQLSelect(targets)
#-----------------------------------------------------------------
dishes = Table("dishes")
customers = Table("customers")
orders = Table("orders")
query = select(customers.name, dishes.price, orders.amount).where(
customers.cust_id == orders.cust_id and orders.dish_id == dishes.dish_id
and dishes.name == "Spam, Eggs, Sausages and Spam")
print repr(query)
print sql(query)
Example 2 Output
<__main__.SQLSelect object at 0x1cc830>
SELECT customers.name, dishes.price, orders.amount WHERE (((customers.cust_id = orders.cust_id) AND (orders.dish_id = dishes.dish_id)) AND (dishes.name = 'Spam, Eggs, Sausages and Spam'))
Copyright
This document has been placed in the public domain.
pep-0336 Make None Callable
| PEP: | 336 |
|---|---|
| Title: | Make None Callable |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Andrew McClelland <eternalsquire at comcast.net> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 28-Oct-2004 |
| Post-History: |
Abstract
None should be a callable object that when called with any
arguments has no side effect and returns None.
BDFL Pronouncement
This PEP is rejected. It is considered a feature that None raises
an error when called. The proposal falls short in tests for
obviousness, clarity, explicitness, and necessity. The provided Switch
example is nice but easily handled by a simple lambda definition.
See python-dev discussion on 17 June 2005.
Motivation
To allow a programming style for selectable actions that is more
in accordance with the minimalistic functional programming goals
of the Python language.
Rationale
Allow the use of None in method tables as a universal no effect
rather than either (1) checking a method table entry against None
before calling, or (2) writing a local no effect method with
arguments similar to other functions in the table.
The semantics would be effectively,
class None:
    def __call__(self, *args):
        pass
How To Use
Before, checking function table entry against None:
class Select:

    def a(self, input):
        print 'a'

    def b(self, input):
        print 'b'

    def c(self, input):
        print 'c'

    def __call__(self, input):
        function = { 1 : self.a,
                     2 : self.b,
                     3 : self.c
                   }.get(input, None)
        if function: return function(input)
Before, using a local no effect method:
class Select:

    def a(self, input):
        print 'a'

    def b(self, input):
        print 'b'

    def c(self, input):
        print 'c'

    def nop(self, input):
        pass

    def __call__(self, input):
        return { 1 : self.a,
                 2 : self.b,
                 3 : self.c
               }.get(input, self.nop)(input)
After:
class Select:

    def a(self, input):
        print 'a'

    def b(self, input):
        print 'b'

    def c(self, input):
        print 'c'

    def __call__(self, input):
        return { 1 : self.a,
                 2 : self.b,
                 3 : self.c
               }.get(input, None)(input)
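The rejected behaviour is easy to approximate today with a no-op
lambda as the dictionary default, as the pronouncement suggests
(sketched here in Python 3 syntax, unlike the examples above):

```python
# A no-op callable serves as a universal "no effect" default.
nop = lambda *args, **kwds: None

def select(input):
    return {1: lambda x: 'a',
            2: lambda x: 'b',
            3: lambda x: 'c'}.get(input, nop)(input)

assert select(1) == 'a'
assert select(9) is None   # unmatched keys fall through to the no-op
```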
References
[1] Python Reference Manual, Section 3.2,
http://docs.python.org/reference/
Copyright
This document has been placed in the public domain.
pep-0337 Logging Usage in the Standard Library
| PEP: | 337 |
|---|---|
| Title: | Logging Usage in the Standard Library |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Michael P. Dubner <dubnerm at mindless.com> |
| Status: | Deferred |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 02-Oct-2004 |
| Python-Version: | 2.5 |
| Post-History: | 10-Nov-2004 |
Abstract
This PEP defines a standard for using the logging system (PEP 282
[1]) in the standard library.
Implementing this PEP will simplify development of daemon
applications. As a downside this PEP requires slight
modifications (however in a back-portable way) to a large number
of standard modules.
After implementing this PEP one can use following filtering
scheme:
logging.getLogger('py.BaseHTTPServer').setLevel(logging.FATAL)
PEP Deferral
Further exploration of the concepts covered in this PEP has been deferred
for lack of a current champion interested in promoting the goals of the
PEP and collecting and incorporating feedback, and with sufficient
available time to do so effectively.
Rationale
There are a couple of situations when output to stdout or stderr
is impractical:
- Daemon applications where the framework doesn't allow the
redirection of standard output to some file, but assumes use of
some other form of logging. Examples are syslog under *nix'es
and EventLog under WinNT+.
- GUI applications which want to output every new log entry in
separate pop-up window (i.e. fading OSD).
Also sometimes applications want to filter output entries based on
their source or severity. This requirement can't be implemented
using simple redirection.
Finally sometimes output needs to be marked with event timestamps,
which can be accomplished with ease using the logging system.
Proposal
Every module usable for daemon and GUI applications should be
rewritten to use the logging system instead of 'print' or
'sys.stdout.write'.
There should be code like this included in the beginning of every
modified module:
import logging
_log = logging.getLogger('py.<module-name>')
A prefix of 'py.' [2] must be used by all modules included in the
standard library distributed along with Python, and only by such
modules (unverifiable). The use of "_log" is intentional as we
don't want to auto-export it. For modules that use log only in
one class a logger can be created inside the class definition as
follows:
class XXX:
    __log = logging.getLogger('py.<module-name>')
Then this class can create access methods to log to this private
logger.
So "print" and "sys.std{out|err}.write" statements should be
replaced with "_log.{debug|info}", and "traceback.print_exception"
with "_log.exception" or sometimes "_log.debug('...', exc_info=1)".
Module List
Here is a (possibly incomplete) list of modules to be reworked:
- asyncore (dispatcher.log, dispatcher.log_info)
- BaseHTTPServer (BaseHTTPRequestHandler.log_request,
BaseHTTPRequestHandler.log_error,
BaseHTTPRequestHandler.log_message)
- cgi (possibly - is cgi.log used by somebody?)
- ftplib (if FTP.debugging)
- gopherlib (get_directory)
- httplib (HTTPResponse, HTTPConnection)
- ihooks (_Verbose)
- imaplib (IMAP4._mesg)
- mhlib (MH.error)
- nntplib (NNTP)
- pipes (Template.makepipeline)
- pkgutil (extend_path)
- platform (_syscmd_ver)
- poplib (if POP3._debugging)
- profile (if Profile.verbose)
- robotparser (_debug)
- sgmllib (if SGMLParser.verbose)
- shlex (if shlex.debug)
- smtpd (SMTPChannel/PureProxy where print >> DEBUGSTREAM)
- smtplib (if SMTP.debuglevel)
- SocketServer (BaseServer.handle_error)
- telnetlib (if Telnet.debuglevel)
- threading? (_Verbose._note, Thread.__bootstrap)
- timeit (Timer.print_exc)
- trace
- uu (decode)
Additionally there are a couple of modules with commented debug
output or modules where debug output should be added. For
example:
- urllib
Finally possibly some modules should be extended to provide more
debug information.
Doubtful Modules
Listed here are modules that the community will propose for
addition to the module list and modules that the community say
should be removed from the module list.
- tabnanny (check)
Guidelines for Logging Usage
Also we can provide some recommendation to authors of library
modules so they all follow the same format of naming loggers. I
propose that non-standard library modules should use loggers named
after their full names, so a module "spam" in sub-package "junk"
of package "dummy" will be named "dummy.junk.spam" and, of course,
the "__init__" module of the same sub-package will have the logger
name "dummy.junk".
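The logging module wires this naming rule up automatically, since dotted logger names form a hierarchy. A sketch for the hypothetical package layout above:

```python
import logging

# dummy/junk/__init__.py would use:
pkg_log = logging.getLogger('dummy.junk')
# dummy/junk/spam.py would use:
mod_log = logging.getLogger('dummy.junk.spam')

# Records logged to the module logger propagate up the dotted names.
assert mod_log.parent is pkg_log
```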
References
[1] PEP 282, A Logging System, Vinay Sajip, Trent Mick
http://www.python.org/dev/peps/pep-0282/
[2] http://mail.python.org/pipermail/python-dev/2004-October/049282.html
Copyright
This document has been placed in the public domain.
pep-0338 Executing modules as scripts
| PEP: | 338 |
|---|---|
| Title: | Executing modules as scripts |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Nick Coghlan <ncoghlan at gmail.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 16-Oct-2004 |
| Python-Version: | 2.5 |
| Post-History: | 8-Nov-2004, 11-Feb-2006, 12-Feb-2006, 18-Feb-2006 |
Contents
Abstract
This PEP defines semantics for executing any Python module as a script, either with the -m command line switch, or by invoking it via runpy.run_module(modulename).
The -m switch implemented in Python 2.4 is quite limited. This PEP proposes making use of the PEP 302 [4] import hooks to allow any module which provides access to its code object to be executed.
Rationale
Python 2.4 adds the command line switch -m to allow modules to be located using the Python module namespace for execution as scripts. The motivating examples were standard library modules such as pdb and profile, and the Python 2.4 implementation is fine for this limited purpose.
A number of users and developers have requested extension of the feature to also support running modules located inside packages. One example provided is pychecker's pychecker.checker module. This capability was left out of the Python 2.4 implementation because the implementation of this was significantly more complicated, and the most appropriate strategy was not at all clear.
The opinion on python-dev was that it was better to postpone the extension to Python 2.5, and go through the PEP process to help make sure we got it right.
Since that time, it has also been pointed out that the current version of -m does not support zipimport or any other kind of alternative import behaviour (such as frozen modules).
Providing this functionality as a Python module is significantly easier than writing it in C, and makes the functionality readily available to all Python programs, rather than being specific to the CPython interpreter. CPython's command line switch can then be rewritten to make use of the new module.
Scripts which execute other scripts (e.g. profile, pdb) also have the option to use the new module to provide -m style support for identifying the script to be executed.
Scope of this proposal
In Python 2.4, a module located using -m is executed just as if its filename had been provided on the command line. The goal of this PEP is to get as close as possible to making that statement also hold true for modules inside packages, or accessed via alternative import mechanisms (such as zipimport).
Prior discussions suggest it should be noted that this PEP is not about changing the idiom for making Python modules also useful as scripts (see PEP 299 [1]). That issue is considered orthogonal to the specific feature addressed by this PEP.
Current Behaviour
Before describing the new semantics, it's worth covering the existing semantics for Python 2.4 (as they are currently defined only by the source code and the command line help).
When -m is used on the command line, it immediately terminates the option list (like -c). The argument is interpreted as the name of a top-level Python module (i.e. one which can be found on sys.path).
If the module is found, and is of type PY_SOURCE or PY_COMPILED, then the command line is effectively reinterpreted from python <options> -m <module> <args> to python <options> <filename> <args>. This includes setting sys.argv[0] correctly (some scripts rely on this - Python's own regrtest.py is one example).
If the module is not found, or is not of the correct type, an error is printed.
Proposed Semantics
The semantics proposed are fairly simple: if -m is used to execute a module the PEP 302 import mechanisms are used to locate the module and retrieve its compiled code, before executing the module in accordance with the semantics for a top-level module. The interpreter does this by invoking a new standard library function runpy.run_module.
This is necessary due to the way Python's import machinery locates modules inside packages. A package may modify its own __path__ variable during initialisation. In addition, paths may be affected by *.pth files, and some packages will install custom loaders on sys.meta_path. Accordingly, the only way for Python to reliably locate the module is by importing the containing package and using the PEP 302 import hooks to gain access to the Python code.
Note that the process of locating the module to be executed may require importing the containing package. The effects of such a package import that will be visible to the executed module are:
- the containing package will be in sys.modules
- any external effects of the package initialisation (e.g. installed import hooks, loggers, atexit handlers, etc.)
Reference Implementation
A reference implementation is available on SourceForge ([2]), along with documentation for the library reference ([5]). There are two parts to this implementation. The first is a proposed standard library module runpy. The second is a modification to the code implementing the -m switch to always delegate to runpy.run_module instead of trying to run the module directly. The delegation has the form:
runpy.run_module(sys.argv[0], run_name="__main__", alter_sys=True)
run_module is the only function runpy exposes in its public API.
run_module(mod_name[, init_globals][, run_name][, alter_sys])
Execute the code of the specified module and return the resulting module globals dictionary. The module's code is first located using the standard import mechanism (refer to PEP 302 for details) and then executed in a fresh module namespace.
The optional dictionary argument init_globals may be used to pre-populate the globals dictionary before the code is executed. The supplied dictionary will not be modified. If any of the special global variables below are defined in the supplied dictionary, those definitions are overridden by the run_module function.
The special global variables __name__, __file__, __loader__ and __builtins__ are set in the globals dictionary before the module code is executed.
__name__ is set to run_name if this optional argument is supplied, and the original mod_name argument otherwise.
__loader__ is set to the PEP 302 module loader used to retrieve the code for the module (This loader may be a wrapper around the standard import mechanism).
__file__ is set to the name provided by the module loader. If the loader does not make filename information available, this argument is set to None.
__builtins__ is automatically initialised with a reference to the top level namespace of the __builtin__ module.
If the argument alter_sys is supplied and evaluates to True, then sys.argv[0] is updated with the value of __file__ and sys.modules[__name__] is updated with a temporary module object for the module being executed. Both sys.argv[0] and sys.modules[__name__] are restored to their original values before this function returns.
When invoked as a script, the runpy module finds and executes the module supplied as the first argument. It adjusts sys.argv by deleting sys.argv[0] (which refers to the runpy module itself) and then invokes run_module(sys.argv[0], run_name="__main__", alter_sys=True).
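The described behaviour can be exercised directly with the runpy API this PEP introduced (string is just a convenient side-effect-free stdlib module to execute):

```python
import runpy

# Execute the module's code in a fresh namespace, overriding __name__.
globs = runpy.run_module('string', run_name='example')

assert globs['__name__'] == 'example'   # run_name overrides __name__
assert 'ascii_lowercase' in globs       # the module's code really ran
```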
Import Statements and the Main Module
The release of 2.5b1 showed a surprising (although obvious in retrospect) interaction between this PEP and PEP 328 - explicit relative imports don't work from a main module. This is due to the fact that relative imports rely on __name__ to determine the current module's position in the package hierarchy. In a main module, the value of __name__ is always '__main__', so explicit relative imports will always fail (as they only work for a module inside a package).
Investigation into why implicit relative imports appear to work when a main module is executed directly but fail when executed using -m showed that such imports are actually always treated as absolute imports. Because of the way direct execution works, the package containing the executed module is added to sys.path, so its sibling modules are actually imported as top level modules. This can easily lead to multiple copies of the sibling modules in the application if implicit relative imports are used in modules that may be directly executed (e.g. test modules or utility scripts).
For the 2.5 release, the recommendation is to always use absolute imports in any module that is intended to be used as a main module. The -m switch provides a benefit here, as it inserts the current directory into sys.path, instead of the directory containing the main module. This means that it is possible to run a module from inside a package using -m so long as the current directory contains the top level directory for the package. Absolute imports will work correctly even if the package isn't installed anywhere else on sys.path. If the module is executed directly and uses absolute imports to retrieve its sibling modules, then the top level package directory needs to be installed somewhere on sys.path (since the current directory won't be added automatically).
Here's an example file layout:
devel/
    pkg/
        __init__.py
        moduleA.py
        moduleB.py
        test/
            __init__.py
            test_A.py
            test_B.py
So long as the current directory is devel, or devel is already on sys.path, and the test modules use absolute imports (such as import pkg.moduleA to retrieve the module under test), PEP 338 allows the tests to be run as:
python -m pkg.test.test_A
python -m pkg.test.test_B
The question of whether or not relative imports should be supported when a main module is executed with -m is something that will be revisited for Python 2.6. Permitting it would require changes to either Python's import semantics or the semantics used to indicate when a module is the main module, so it is not a decision to be made hastily.
Resolved Issues
There were some key design decisions that influenced the development of the runpy module. These are listed below.
- The special variables __name__, __file__ and __loader__ are set in a module's global namespace before the module is executed. As run_module alters these values, it does not mutate the supplied dictionary. If it did, then passing globals() to this function could have nasty side effects.
- Sometimes, the information needed to populate the special variables simply isn't available. Rather than trying to be too clever, these variables are simply set to None when the relevant information cannot be determined.
- There is no special protection on the alter_sys argument. This may result in sys.argv[0] being set to None if file name information is not available.
- The import lock is NOT used to avoid potential threading issues that arise when alter_sys is set to True. Instead, it is recommended that threaded code simply avoid using this flag.
Alternatives
The first alternative implementation considered ignored packages' __path__ variables, and looked only in the main package directory. A Python script with this behaviour can be found in the discussion of the execmodule cookbook recipe [3].
The execmodule cookbook recipe itself was the proposed mechanism in an earlier version of this PEP (before the PEP's author read PEP 302).
Both approaches were rejected as they do not meet the main goal of the -m switch -- to allow the full Python namespace to be used to locate modules for execution from the command line.
An earlier version of this PEP included some mistaken assumptions about the way exec handled locals dictionaries and code from function objects. These mistaken assumptions led to some unneeded design complexity which has now been removed - run_code shares all of the quirks of exec.
Earlier versions of the PEP also exposed a broader API that just the single run_module() function needed to implement the updates to the -m switch. In the interests of simplicity, those extra functions have been dropped from the proposed API.
After the original implementation in SVN, it became clear that holding the import lock when executing the initial application script was not correct (e.g. python -m test.regrtest test_threadedimport failed). So the run_module function only holds the import lock during the actual search for the module, and releases it before execution, even if alter_sys is set.
References
| [1] | Special __main__() function in modules (http://www.python.org/dev/peps/pep-0299/) |
| [2] | PEP 338 implementation (runpy module and -m update) (http://www.python.org/sf/1429601) |
| [3] | execmodule Python Cookbook Recipe (http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/307772) |
| [4] | New import hooks (http://www.python.org/dev/peps/pep-0302/) |
| [5] | PEP 338 documentation (for runpy module) (http://www.python.org/sf/1429605) |
Copyright
This document has been placed in the public domain.
pep-0339 Design of the CPython Compiler
| PEP: | 339 |
|---|---|
| Title: | Design of the CPython Compiler |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Brett Cannon <brett at python.org> |
| Status: | Withdrawn |
| Type: | Informational |
| Content-Type: | text/x-rst |
| Created: | 02-Feb-2005 |
| Post-History: |
Contents
Note
This PEP has been withdrawn and moved to the Python developer's guide.
Abstract
Historically (through 2.4), compilation from source code to bytecode involved two steps:
- Parse the source code into a parse tree (Parser/pgen.c)
- Emit bytecode based on the parse tree (Python/compile.c)
This is not how a standard compiler works. The usual steps for compilation are:
- Parse source code into a parse tree (Parser/pgen.c)
- Transform parse tree into an Abstract Syntax Tree (Python/ast.c)
- Transform AST into a Control Flow Graph (Python/compile.c)
- Emit bytecode based on the Control Flow Graph (Python/compile.c)
Starting with Python 2.5, the above steps are used. This change was made to simplify compilation by breaking it into discrete steps. The purpose of this document is to outline how the latter three steps of the process work.
This document does not touch on how parsing works beyond what is needed to explain compilation. Nor is it exhaustive in describing how the entire system works. You will most likely need to read some source code to have an exact understanding of all details.
Parse Trees
Python's parser is an LL(1) parser mostly based on the implementation laid out in the Dragon Book [Aho86].
The grammar file for Python can be found in Grammar/Grammar, with the numeric values of the grammar rules stored in Include/graminit.h. The numeric values for token types (literal tokens, such as :, numbers, etc.) are kept in Include/token.h. The parse tree is made up of node * structs (as defined in Include/node.h).
Querying data from the node structs can be done with the following macros (which are all defined in Include/token.h):
- CHILD(node *, int)
Returns the nth child of the node using zero-offset indexing
- RCHILD(node *, int)
Returns the nth child of the node from the right side; use negative numbers!
- NCH(node *)
Number of children the node has
- STR(node *)
String representation of the node; e.g., will return : for a COLON token
- TYPE(node *)
The type of node as specified in Include/graminit.h
- REQ(node *, TYPE)
Assert that the node is the type that is expected
- LINENO(node *)
Retrieve the line number of the source code that led to the creation of the parse rule; defined in Python/ast.c
To tie all of this together, consider the rule for 'while':
while_stmt: 'while' test ':' suite ['else' ':' suite]
The node representing this will have TYPE(node) == while_stmt, and the number of children can be 4 or 7 depending on whether there is an 'else' clause. To access what should be the first ':' and require that it be an actual ':' token, use REQ(CHILD(node, 2), COLON).
Abstract Syntax Trees (AST)
The abstract syntax tree (AST) is a high-level representation of the program structure without the necessity of containing the source code; it can be thought of as an abstract representation of the source code. The specification of the AST nodes is specified using the Zephyr Abstract Syntax Definition Language (ASDL) [Wang97].
The definition of the AST nodes for Python is found in the file Parser/Python.asdl .
Each AST node (representing statements, expressions, and several specialized types, like list comprehensions and exception handlers) is defined by the ASDL. Most definitions in the AST correspond to a particular source construct, such as an 'if' statement or an attribute lookup. The definition is independent of its realization in any particular programming language.
The following fragment of the Python ASDL construct demonstrates the approach and syntax:
module Python
{
    stmt = FunctionDef(identifier name, arguments args, stmt* body,
                       expr* decorators)
         | Return(expr? value) | Yield(expr value)
         attributes (int lineno)
}
The preceding example describes three different kinds of statements: function definitions, return statements, and yield statements. All three kinds are considered of type stmt, as shown by the '|' separating the various kinds. They all take arguments of various kinds and amounts.
Modifiers on the argument type specify the number of values needed; '?' means it is optional, '*' means 0 or more, no modifier means only one value for the argument and it is required. FunctionDef, for instance, takes an identifier for the name, 'arguments' for args, zero or more stmt arguments for 'body', and zero or more expr arguments for 'decorators'.
Do notice that something like 'arguments', which is a node type, is represented as a single AST node and not as a sequence of nodes as with stmt as one might expect.
All three kinds also have an 'attributes' argument; this is shown by the fact that 'attributes' lacks a '|' before it.
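The same ASDL structure is visible from Python itself: the modern ast module mirrors these definitions (with minor renamings over the years, e.g. 'decorators' is now decorator_list). A small sketch:

```python
import ast

# FunctionDef and Return are both kinds of stmt, exactly as in the ASDL.
tree = ast.parse("def f(x):\n    return x")
func = tree.body[0]                      # a FunctionDef node
assert isinstance(func, ast.FunctionDef)
assert func.name == "f"
ret = func.body[0]                       # a Return node inside the body
assert isinstance(ret, ast.Return)
# The 'attributes' from the ASDL (such as lineno) appear on every stmt.
assert func.lineno == 1 and ret.lineno == 2
```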
The statement definitions above generate the following C structure type:
typedef struct _stmt *stmt_ty;
struct _stmt {
        enum { FunctionDef_kind=1, Return_kind=2, Yield_kind=3 } kind;
        union {
                struct {
                        identifier name;
                        arguments_ty args;
                        asdl_seq *body;
                } FunctionDef;
                struct {
                        expr_ty value;
                } Return;
                struct {
                        expr_ty value;
                } Yield;
        } v;
        int lineno;
};
Also generated are a series of constructor functions that allocate (in this case) a stmt_ty struct with the appropriate initialization. The 'kind' field specifies which component of the union is initialized. The FunctionDef() constructor function sets 'kind' to FunctionDef_kind and initializes the 'name', 'args', 'body', and 'attributes' fields.
Memory Management
Before discussing the actual implementation of the compiler, a discussion of how memory is handled is in order. To make memory management simple, an arena is used: memory is pooled in a single location for easy allocation and removal. This removes the need for explicit memory deallocation. Because all memory allocation in the compiler registers that memory with the arena, a single call to free the arena is all that is needed to completely free all memory used by the compiler.
In general, unless you are working on the critical core of the compiler, memory management can be completely ignored. But if you are working at either the very beginning of the compiler or the end, you need to care about how the arena works. All code relating to the arena is in either Include/pyarena.h or Python/pyarena.c .
PyArena_New() will create a new arena. The returned PyArena structure will store pointers to all memory given to it, doing the bookkeeping of what memory needs to be freed when the compiler is finished with the memory it used. That freeing is done with PyArena_Free(), which only needs to be called at the strategic points where the compiler exits.
As stated above, in general you should not have to worry about memory management when working on the compiler. The technical details have been designed to be hidden from you for most cases.
The only exception comes about when managing a PyObject. Since the rest of Python uses reference counting, there is extra support added to the arena to clean up each PyObject that was allocated. These cases are very rare. However, if you've allocated a PyObject, you must tell the arena about it by calling PyArena_AddPyObject().
Parse Tree to AST
The AST is generated from the parse tree (see Python/ast.c) using the function PyAST_FromNode().
The function begins a tree walk of the parse tree, creating various AST nodes as it goes along. It does this by allocating all new nodes it needs, calling the proper AST node creation functions for any required supporting functions, and connecting them as needed.
Do realize that there is no automatic or symbolic connection between the grammar specification and the nodes in the parse tree. No help is directly provided by the parse tree as there is in yacc.
For instance, one must keep track of which node in the parse tree one is working with (e.g., if you are working with an 'if' statement you need to watch out for the ':' token to find the end of the conditional).
The functions called to generate AST nodes from the parse tree all have the name ast_for_xx, where xx is the grammar rule that the function handles (alias_for_import_name is the exception to this). These in turn call the constructor functions as defined by the ASDL grammar and contained in Python/Python-ast.c (which was generated by Parser/asdl_c.py) to create the nodes of the AST. This all leads to a sequence of AST nodes stored in asdl_seq structs.
Functions and macros for creating and using asdl_seq * types, as found in Python/asdl.c and Include/asdl.h:
- asdl_seq_new()
Allocate memory for an asdl_seq for the specified length
- asdl_seq_GET()
Get item held at a specific position in an asdl_seq
- asdl_seq_SET()
Set a specific index in an asdl_seq to the specified value
- asdl_seq_LEN(asdl_seq *)
Return the length of an asdl_seq
If you are working with statements, you must also worry about keeping track of what line number generated the statement. Currently the line number is passed as the last parameter to each stmt_ty function.
Control Flow Graphs
A control flow graph (often referenced by its acronym, CFG) is a directed graph that models the flow of a program using basic blocks that contain the intermediate representation (abbreviated "IR", and in this case is Python bytecode) within the blocks. Basic blocks themselves are a block of IR that has a single entry point but possibly multiple exit points. The single entry point is the key to basic blocks; it all has to do with jumps. An entry point is the target of something that changes control flow (such as a function call or a jump) while exit points are instructions that would change the flow of the program (such as jumps and 'return' statements). What this means is that a basic block is a chunk of code that starts at the entry point and runs to an exit point or the end of the block.
As an example, consider an 'if' statement with an 'else' block. The guard on the 'if' is a basic block which is pointed to by the basic block containing the code leading to the 'if' statement. The 'if' statement block contains jumps (which are exit points) to the true body of the 'if' and the 'else' body (which may be NULL), each of which are their own basic blocks. Both of those blocks in turn point to the basic block representing the code following the entire 'if' statement.
CFGs are usually one step away from final code output. Code is directly generated from the basic blocks (with jump targets adjusted based on the output order) by doing a post-order depth-first search on the CFG following the edges.
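The if/else example and the flattening step can be sketched in Python. The Block class and names below are hypothetical, simplified stand-ins, not CPython's actual data structures:

```python
# CFG for:  if cond: <true body>  else: <else body>  <after>
class Block:
    def __init__(self, name):
        self.name = name
        self.succ = []      # exit points: jumps to successor blocks

entry, true_body, else_body, after = (
    Block(n) for n in ("entry", "true", "else", "after"))
entry.succ = [true_body, else_body]   # the guard's two exit jumps
true_body.succ = [after]              # both bodies rejoin after the 'if'
else_body.succ = [after]

def postorder(block, seen=None, out=None):
    # Post-order depth-first search, as used when flattening the CFG.
    seen = set() if seen is None else seen
    out = [] if out is None else out
    if block in seen:
        return out
    seen.add(block)
    for s in block.succ:
        postorder(s, seen, out)
    out.append(block)
    return out

order = [b.name for b in postorder(entry)]
assert order == ["after", "true", "else", "entry"]
```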
AST to CFG to Bytecode
With the AST created, the next step is to create the CFG. The first step is to convert the AST to Python bytecode without having jump targets resolved to specific offsets (this is calculated when the CFG goes to final bytecode). Essentially, this transforms the AST into Python bytecode with control flow represented by the edges of the CFG.
Conversion is done in two passes. The first creates the namespace (variables can be classified as local, free/cell for closures, or global). With that done, the second pass essentially flattens the CFG into a list and calculates jump offsets for final output of bytecode.
The conversion process is initiated by a call to the function PyAST_Compile() in Python/compile.c . This function does both the conversion of the AST to a CFG and outputting final bytecode from the CFG. The AST to CFG step is handled mostly by two functions called by PyAST_Compile(); PySymtable_Build() and compiler_mod() . The former is in Python/symtable.c while the latter is in Python/compile.c .
PySymtable_Build() begins by entering the starting code block for the AST (passed-in) and then calling the proper symtable_visit_xx function (with xx being the AST node type). Next, the AST tree is walked with the various code blocks that delineate the reach of a local variable as blocks are entered and exited using symtable_enter_block() and symtable_exit_block(), respectively.
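The result of this symbol-table pass is observable from Python through the symtable module, which wraps Python/symtable.c. A small sketch of the classifications it produces:

```python
import symtable

# Build the symbol table for a tiny module containing one function.
table = symtable.symtable(
    "def f(a):\n    b = a\n    return b", "<string>", "exec")
func = table.get_children()[0]        # the nested table for f's block
assert func.get_name() == "f"
assert func.lookup("a").is_parameter()  # 'a' classified as a parameter
assert func.lookup("b").is_local()      # 'b' classified as a local
```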
Once the symbol table is created, it is time for CFG creation, whose code is in Python/compile.c . This is handled by several functions that break the task down by various AST node types. The functions are all named compiler_visit_xx where xx is the name of the node type (such as stmt, expr, etc.). Each function receives a struct compiler * and xx_ty where xx is the AST node type. Typically these functions consist of a large 'switch' statement, branching based on the kind of node type passed to it. Simple things are handled inline in the 'switch' statement with more complex transformations farmed out to other functions named compiler_xx with xx being a descriptive name of what is being handled.
When transforming an arbitrary AST node, use the VISIT() macro. The appropriate compiler_visit_xx function is called, based on the value passed in for <node type> (so VISIT(c, expr, node) calls compiler_visit_expr(c, node)). The VISIT_SEQ macro is very similar, but is called on AST node sequences (those values that were created as arguments to a node that used the '*' modifier). There is also VISIT_SLICE() just for handling slices.
Emission of bytecode is handled by the following macros:
- ADDOP()
add a specified opcode
- ADDOP_I()
add an opcode that takes an argument
- ADDOP_O(struct compiler *c, int op, PyObject *type, PyObject *obj)
add an opcode with the proper argument based on the position of the specified PyObject in PyObject sequence object, but with no handling of mangled names; used for when you need to do named lookups of objects such as globals, consts, or parameters where name mangling is not possible and the scope of the name is known
- ADDOP_NAME()
just like ADDOP_O, but name mangling is also handled; used for attribute loading or importing based on name
- ADDOP_JABS()
create an absolute jump to a basic block
- ADDOP_JREL()
create a relative jump to a basic block
Several helper functions also emit bytecode; they are named compiler_xx(), where xx is what the function helps with (list, boolop, etc.). A rather useful one is compiler_nameop(). This function looks up the scope of a variable and, based on the expression context, emits the proper opcode to load, store, or delete the variable.
The line number on which a statement is defined is handled by compiler_visit_stmt() and thus is not a worry.
In addition to emitting bytecode based on the AST node, handling the creation of basic blocks must be done. Below are the macros and functions used for managing basic blocks:
- NEW_BLOCK()
create block and set it as current
- NEXT_BLOCK()
basically NEW_BLOCK() plus jump from current block
- compiler_new_block()
create a block but don't use it (used for generating jumps)
Once the CFG is created, it must be flattened and then final emission of bytecode occurs. Flattening is handled using a post-order depth-first search. Once flattened, jump offsets are backpatched based on the flattening and then a PyCodeObject is created. All of this is handled by calling assemble().
Introducing New Bytecode
Sometimes a new feature requires a new opcode. But adding new bytecode is not as simple as just suddenly introducing new bytecode in the AST -> bytecode step of the compiler. Several pieces of code throughout Python depend on having correct information about what bytecode exists.
First, you must choose a name and a unique identifier number. The official list of bytecode can be found in Include/opcode.h . If the opcode is to take an argument, it must be given a unique number greater than that assigned to HAVE_ARGUMENT (as found in Include/opcode.h).
Once the name/number pair has been chosen and entered in Include/opcode.h, you must also enter it into Lib/opcode.py and Doc/library/dis.rst .
With a new bytecode you must also change what is called the magic number for .pyc files. The variable MAGIC in Python/import.c contains the number. Changing this number causes all .pyc files with the old MAGIC to be recompiled by the interpreter on import.
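In modern CPython the magic number is also exposed at the Python level through importlib, which makes its format easy to inspect (historically it lived only in Python/import.c as MAGIC):

```python
import importlib.util

# The magic number is four bytes: a two-byte version number followed
# by b'\r\n' (so that opening a .pyc in text mode corrupts it visibly).
magic = importlib.util.MAGIC_NUMBER
assert len(magic) == 4
assert magic.endswith(b"\r\n")
```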
Finally, you need to introduce the use of the new bytecode. Altering Python/compile.c and Python/ceval.c will be the primary places to change. But you will also need to change the 'compiler' package. The key files to do that are Lib/compiler/pyassem.py and Lib/compiler/pycodegen.py .
If you make a change here that can affect the output of bytecode that is already in existence, and you do not change the magic number, make sure to delete your old .py(c|o) files! Even though you will eventually change the magic number if you change the bytecode, while you are debugging your work you will be changing the bytecode output without constantly bumping the magic number. This means you end up with stale .pyc files that will not be recreated. Running find . -name '*.py[co]' -exec rm -f {} ';' should delete all .pyc files you have, forcing new ones to be created and thus allowing you to test your new bytecode properly.
Code Objects
The result of PyAST_Compile() is a PyCodeObject, which is defined in Include/code.h . And with that you now have executable Python bytecode!
Code objects (bytecode) are executed in Python/ceval.c . This file will also need a new case statement for the new opcode in the big switch statement in PyEval_EvalFrameEx().
Important Files
Parser/
Python/
- Python-ast.c
Creates C structs corresponding to the ASDL types. Also contains code for marshaling AST nodes (core ASDL types have marshaling code in asdl.c). "File automatically generated by Parser/asdl_c.py". This file must be committed separately after every grammar change is committed since the __version__ value is set to the latest grammar change revision number.
- asdl.c
Contains code to handle the ASDL sequence type. Also has code to handle marshaling the core ASDL types, such as number and identifier. Used by Python-ast.c for marshaling AST nodes.
- ast.c
Converts Python's parse tree into the abstract syntax tree.
- ceval.c
Executes byte code (aka, eval loop).
- compile.c
Emits bytecode based on the AST.
- symtable.c
Generates a symbol table from AST.
- pyarena.c
Implementation of the arena memory manager.
- import.c
Home of the magic number (named MAGIC) for bytecode versioning
Include/
- Python-ast.h
Contains the actual definitions of the C structs as generated by Python/Python-ast.c . "Automatically generated by Parser/asdl_c.py".
- asdl.h
Header for the corresponding Python/ast.c .
- ast.h
Declares PyAST_FromNode() external (from Python/ast.c).
- code.h
Header file for Objects/codeobject.c; contains definition of PyCodeObject.
- symtable.h
Header for Python/symtable.c . struct symtable and PySTEntryObject are defined here.
- pyarena.h
Header file for the corresponding Python/pyarena.c .
- opcode.h
Master list of bytecode; if this file is modified you must modify several other files accordingly (see "Introducing New Bytecode")
Objects/
- codeobject.c
Contains PyCodeObject-related code (originally in Python/compile.c).
Lib/
- opcode.py
One of the files that must be modified if Include/opcode.h is.
compiler/
- pyassem.py
One of the files that must be modified if Include/opcode.h is changed.
- pycodegen.py
One of the files that must be modified if Include/opcode.h is changed.
References
| [Aho86] | Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools, http://www.amazon.com/exec/obidos/tg/detail/-/0201100886/104-0162389-6419108 |
| [Wang97] | Daniel C. Wang, Andrew W. Appel, Jeff L. Korn, and Chris S. Serra. The Zephyr Abstract Syntax Description Language. [4] In Proceedings of the Conference on Domain-Specific Languages, pp. 213--227, 1997. |
| [1] | Skip Montanaro's Peephole Optimizer Paper (http://www.foretec.com/python/workshops/1998-11/proceedings/papers/montanaro/montanaro.html) |
| [2] | Bytecodehacks Project (http://bytecodehacks.sourceforge.net/bch-docs/bch/index.html) |
| [3] | CALL_ATTR opcode (http://www.python.org/sf/709744) |
| [4] | http://www.cs.princeton.edu/research/techreps/TR-554-97 |
| [5] | (1, 2) http://pages.cpsc.ucalgary.ca/~aycock/spark/ |
pep-0340 Anonymous Block Statements
| PEP: | 340 |
|---|---|
| Title: | Anonymous Block Statements |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Guido van Rossum |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 27-Apr-2005 |
| Post-History: |
Introduction
This PEP proposes a new type of compound statement which can be
used for resource management purposes. The new statement type
is provisionally called the block-statement because the keyword
to be used has not yet been chosen.
This PEP competes with several other PEPs: PEP 288 (Generators
Attributes and Exceptions; only the second part), PEP 310
(Reliable Acquisition/Release Pairs), and PEP 325
(Resource-Release Support for Generators).
I should clarify that using a generator to "drive" a block
statement is really a separable proposal; with just the definition
of the block statement from the PEP you could implement all the
examples using a class (similar to example 6, which is easily
turned into a template). But the key idea is using a generator to
drive a block statement; the rest is elaboration, so I'd like to
keep these two parts together.
(PEP 342, Enhanced Iterators, was originally a part of this PEP;
but the two proposals are really independent and with Steven
Bethard's help I have moved it to a separate PEP.)
Rejection Notice
I am rejecting this PEP in favor of PEP 343. See the motivational
section in that PEP for the reasoning behind this rejection. GvR.
Motivation and Summary
(Thanks to Shane Hathaway -- Hi Shane!)
Good programmers move commonly used code into reusable functions.
Sometimes, however, patterns arise in the structure of the
functions rather than the actual sequence of statements. For
example, many functions acquire a lock, execute some code specific
to that function, and unconditionally release the lock. Repeating
the locking code in every function that uses it is error prone and
makes refactoring difficult.
Block statements provide a mechanism for encapsulating patterns of
structure. Code inside the block statement runs under the control
of an object called a block iterator. Simple block iterators
execute code before and after the code inside the block statement.
Block iterators also have the opportunity to execute the
controlled code more than once (or not at all), catch exceptions,
or receive data from the body of the block statement.
A convenient way to write block iterators is to write a generator
(PEP 255). A generator looks a lot like a Python function, but
instead of returning a value immediately, generators pause their
execution at "yield" statements. When a generator is used as a
block iterator, the yield statement tells the Python interpreter
to suspend the block iterator, execute the block statement body,
and resume the block iterator when the body has executed.
The Python interpreter behaves as follows when it encounters a
block statement based on a generator. First, the interpreter
instantiates the generator and begins executing it. The generator
does setup work appropriate to the pattern it encapsulates, such
as acquiring a lock, opening a file, starting a database
transaction, or starting a loop. Then the generator yields
execution to the body of the block statement using a yield
statement. When the block statement body completes, raises an
uncaught exception, or sends data back to the generator using a
continue statement, the generator resumes. At this point, the
generator can either clean up and stop or yield again, causing the
block statement body to execute again. When the generator
finishes, the interpreter leaves the block statement.
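The generator-driven pattern described above survives today (in the
modified form that PEP 343 eventually adopted) as the with-statement
and contextlib.contextmanager; since the proposed 'block' keyword was
never added to the language, the following sketch uses those instead:

```python
from contextlib import contextmanager

trace = []

@contextmanager
def locking(label):
    trace.append("acquire " + label)       # setup before the yield
    try:
        yield                              # body of the block runs here
    finally:
        trace.append("release " + label)   # cleanup when the body is done

with locking("L"):                         # PEP 340 would spell this 'block'
    trace.append("body")

assert trace == ["acquire L", "body", "release L"]
```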
Use Cases
See the Examples section near the end.
Specification: the __exit__() Method
An optional new method for iterators is proposed, called
__exit__(). It takes up to three arguments which correspond to
the three "arguments" to the raise-statement: type, value, and
traceback. If all three arguments are None, sys.exc_info() may be
consulted to provide suitable default values.
Specification: the Anonymous Block Statement
A new statement is proposed with the syntax
    block EXPR1 as VAR1:
        BLOCK1
Here, 'block' and 'as' are new keywords; EXPR1 is an arbitrary
expression (but not an expression-list) and VAR1 is an arbitrary
assignment target (which may be a comma-separated list).
The "as VAR1" part is optional; if omitted, the assignments to
VAR1 in the translation below are omitted (but the expressions
assigned are still evaluated!).
The choice of the 'block' keyword is contentious; many
alternatives have been proposed, including not to use a keyword at
all (which I actually like). PEP 310 uses 'with' for similar
semantics, but I would like to reserve that for a with-statement
similar to the one found in Pascal and VB. (Though I just found
that the C# designers don't like 'with' [2], and I have to agree
with their reasoning.) To sidestep this issue momentarily I'm
using 'block' until we can agree on the right keyword, if any.
Note that the 'as' keyword is not contentious (it will finally be
elevated to proper keyword status).
Note that it is up to the iterator to decide whether a
block-statement represents a loop with multiple iterations; in the
most common use case BLOCK1 is executed exactly once. To the
parser, however, it is always a loop; break and continue return
transfer to the block's iterator (see below for details).
The translation is subtly different from a for-loop: iter() is
not called, so EXPR1 should already be an iterator (not just an
iterable); and the iterator is guaranteed to be notified when
the block-statement is left, regardless if this is due to a
break, return or exception:
        itr = EXPR1  # The iterator
        ret = False  # True if a return statement is active
        val = None   # Return value, if ret == True
        exc = None   # sys.exc_info() tuple if an exception is active
        while True:
            try:
                if exc:
                    ext = getattr(itr, "__exit__", None)
                    if ext is not None:
                        VAR1 = ext(*exc)   # May re-raise *exc
                    else:
                        raise exc[0], exc[1], exc[2]
                else:
                    VAR1 = itr.next()  # May raise StopIteration
            except StopIteration:
                if ret:
                    return val
                break
            try:
                ret = False
                val = exc = None
                BLOCK1
            except:
                exc = sys.exc_info()
(However, the variables 'itr' etc. are not user-visible and the
built-in names used cannot be overridden by the user.)
Inside BLOCK1, the following special translations apply:
- "break" is always legal; it is translated into:

      exc = (StopIteration, None, None)
      continue

- "return EXPR3" is only legal when the block-statement is
  contained in a function definition; it is translated into:

      exc = (StopIteration, None, None)
      ret = True
      val = EXPR3
      continue
The net effect is that break and return behave much the same as
if the block-statement were a for-loop, except that the iterator
gets a chance at resource cleanup before the block-statement is
left, through the optional __exit__() method. The iterator also
gets a chance if the block-statement is left through raising an
exception. If the iterator doesn't have an __exit__() method,
there is no difference with a for-loop (except that a for-loop
calls iter() on EXPR1).
Note that a yield-statement in a block-statement is not treated
differently. It suspends the function containing the block
*without* notifying the block's iterator. The block's iterator is
entirely unaware of this yield, since the local control flow
doesn't actually leave the block. In other words, it is *not*
like a break or return statement. When the loop that was resumed
by the yield calls next(), the block is resumed right after the
yield. (See example 7 below.) The generator finalization
semantics described below guarantee (within the limitations of all
finalization semantics) that the block will be resumed eventually.
Unlike the for-loop, the block-statement does not have an
else-clause. I think it would be confusing, and emphasize the
"loopiness" of the block-statement, while I want to emphasize its
*difference* from a for-loop. In addition, there are several
possible semantics for an else-clause, and only a very weak use
case.
Specification: Generator Exit Handling
Generators will implement the new __exit__() method API.
Generators will be allowed to have a yield statement inside a
try-finally statement.
The expression argument to the yield-statement will become
optional (defaulting to None).
When __exit__() is called, the generator is resumed but at the
point of the yield-statement the exception represented by the
__exit__ argument(s) is raised. The generator may re-raise this
exception, raise another exception, or yield another value,
except that if the exception passed in to __exit__() was
StopIteration, it ought to raise StopIteration (otherwise the
effect would be that a break is turned into continue, which is
unexpected at least). When the *initial* call resuming the
generator is an __exit__() call instead of a next() call, the
generator's execution is aborted and the exception is re-raised
without passing control to the generator's body.
When a generator that has not yet terminated is garbage-collected
(either through reference counting or by the cyclical garbage
collector), its __exit__() method is called once with
StopIteration as its first argument. Together with the
requirement that a generator ought to raise StopIteration when
__exit__() is called with StopIteration, this guarantees the
eventual activation of any finally-clauses that were active when
the generator was last suspended. Of course, under certain
circumstances the generator may never be garbage-collected. This
is no different than the guarantees that are made about finalizers
(__del__() methods) of other objects.
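This finalization idea landed in the language in modified form (via
PEP 342) as generator.close(), which resumes a suspended generator by
raising GeneratorExit at the yield so that any active finally-clauses
run; a small sketch:

```python
log = []

def gen():
    try:
        yield 1
    finally:
        log.append("finalized")   # the finally-clause is guaranteed to run

g = gen()
assert next(g) == 1   # suspend the generator inside the try-block
g.close()             # raises GeneratorExit at the yield
assert log == ["finalized"]
```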
Alternatives Considered and Rejected
- Many alternatives have been proposed for 'block'. I haven't
seen a proposal for another keyword that I like better than
'block' yet. Alas, 'block' is also not a good choice; it is a
rather popular name for variables, arguments and methods.
Perhaps 'with' is the best choice after all?
- Instead of trying to pick the ideal keyword, the block-statement
could simply have the form:
      EXPR1 as VAR1:
          BLOCK1
This is at first attractive because, together with a good choice
of function names (like those in the Examples section below)
used in EXPR1, it reads well, and feels like a "user-defined
statement". And yet, it makes me (and many others)
uncomfortable; without a keyword the syntax is very "bland",
difficult to look up in a manual (remember that 'as' is
optional), and it makes the meaning of break and continue in the
block-statement even more confusing.
- Phillip Eby has proposed to have the block-statement use
an entirely different API than the for-loop, to differentiate
between the two. A generator would have to be wrapped in a
decorator to make it support the block API. IMO this adds more
  complexity with very little benefit; and we can't really deny
that the block-statement is conceptually a loop -- it supports
break and continue, after all.
- This keeps getting proposed: "block VAR1 = EXPR1" instead of
"block EXPR1 as VAR1". That would be very misleading, since
VAR1 does *not* get assigned the value of EXPR1; EXPR1 results
in a generator which is assigned to an internal variable, and
VAR1 is the value returned by successive calls to the __next__()
method of that iterator.
- Why not change the translation to apply iter(EXPR1)? All the
examples would continue to work. But this makes the
block-statement *more* like a for-loop, while the emphasis ought
to be on the *difference* between the two. Not calling iter()
catches a bunch of misunderstandings, like using a sequence as
EXPR1.
Comparison to Thunks
Alternative semantics proposed for the block-statement turn the
block into a thunk (an anonymous function that blends into the
containing scope).
The main advantage of thunks that I can see is that you can save
the thunk for later, like a callback for a button widget (the
thunk then becomes a closure). You can't use a yield-based block
for that (except in Ruby, which uses yield syntax with a
thunk-based implementation). But I have to say that I almost see
this as an advantage: I think I'd be slightly uncomfortable seeing
a block and not knowing whether it will be executed in the normal
control flow or later. Defining an explicit nested function for
that purpose doesn't have this problem for me, because I already
know that the 'def' keyword means its body is executed later.
The other problem with thunks is that once we think of them as the
anonymous functions they are, we're pretty much forced to say that
a return statement in a thunk returns from the thunk rather than
from the containing function. Doing it any other way would cause
major weirdness when the thunk were to survive its containing
function as a closure (perhaps continuations would help, but I'm
not about to go there :-).
But then an IMO important use case for the resource cleanup
template pattern is lost. I routinely write code like this:
def findSomething(self, key, default=None):
self.lock.acquire()
try:
for item in self.elements:
if item.matches(key):
return item
return default
finally:
self.lock.release()
and I'd be bummed if I couldn't write this as:
def findSomething(self, key, default=None):
block locking(self.lock):
for item in self.elements:
if item.matches(key):
return item
return default
This particular example can be rewritten using a break:
def findSomething(self, key, default=None):
block locking(self.lock):
for item in self.elements:
if item.matches(key):
break
else:
item = default
return item
but it looks forced and the transformation isn't always that easy;
you'd be forced to rewrite your code in a single-return style
which feels too restrictive.
Also note the semantic conundrum of a yield in a thunk -- the only
reasonable interpretation is that this turns the thunk into a
generator!
Greg Ewing believes that thunks "would be a lot simpler, doing
just what is required without any jiggery pokery with exceptions
and break/continue/return statements. It would be easy to explain
what it does and why it's useful."
But in order to obtain the required local variable sharing between
the thunk and the containing function, every local variable used
or set in the thunk would have to become a 'cell' (our mechanism
for sharing variables between nested scopes). Cells slow down
access compared to regular local variables: access involves an
extra C function call (PyCell_Get() or PyCell_Set()).
Perhaps not entirely coincidentally, the last example above
(findSomething() rewritten to avoid a return inside the block)
shows that, unlike for regular nested functions, we'll want
variables *assigned to* by the thunk also to be shared with the
containing function, even if they are not assigned to outside the
thunk.
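(For comparison with today's language: Python 3 later added the nonlocal statement, which makes exactly this kind of assignment sharing explicit for regular nested functions, and under the hood it still uses the cell mechanism described above. A minimal sketch, not part of the original proposal:)

```python
def counter():
    count = 0           # becomes a "cell" because step() refers to it
    def step():
        nonlocal count  # without this, 'count = ...' would create a new local
        count += 1
        return count
    return step

step = counter()
results = [step(), step(), step()]
# The inner function and counter() share the same cell:
assert results == [1, 2, 3]
```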
Greg Ewing again: "generators have turned out to be more powerful,
because you can have more than one of them on the go at once. Is
there a use for that capability here?"
I believe there are definitely uses for this; several people have
already shown how to do asynchronous light-weight threads using
generators (e.g. David Mertz quoted in PEP 288, and Fredrik
Lundh [3]).
And finally, Greg says: "a thunk implementation has the potential
to easily handle multiple block arguments, if a suitable syntax
could ever be devised. It's hard to see how that could be done in
a general way with the generator implementation."
However, the use cases for multiple blocks seem elusive.
(Proposals have since been made to change the implementation of
thunks to remove most of these objections, but the resulting
semantics are fairly complex to explain and to implement, so IMO
that defeats the purpose of using thunks in the first place.)
Examples
(Several of these examples contain "yield None". If PEP 342 is
accepted, these can be changed to just "yield" of course.)
1. A template for ensuring that a lock, acquired at the start of a
block, is released when the block is left:
def locking(lock):
lock.acquire()
try:
yield None
finally:
lock.release()
Used as follows:
block locking(myLock):
# Code here executes with myLock held. The lock is
# guaranteed to be released when the block is left (even
# if via return or by an uncaught exception).
2. A template for opening a file that ensures the file is closed
when the block is left:
def opening(filename, mode="r"):
f = open(filename, mode)
try:
yield f
finally:
f.close()
Used as follows:
block opening("/etc/passwd") as f:
for line in f:
print line.rstrip()
3. A template for committing or rolling back a database
transaction:
def transactional(db):
try:
yield None
except:
db.rollback()
raise
else:
db.commit()
4. A template that tries something up to n times:
def auto_retry(n=3, exc=Exception):
for i in range(n):
try:
yield None
return
except exc, err:
# perhaps log exception here
continue
raise # re-raise the exception we caught earlier
Used as follows:
block auto_retry(3, IOError):
f = urllib.urlopen("http://www.python.org/dev/peps/pep-0340/")
print f.read()
5. It is possible to nest blocks and combine templates:
def locking_opening(lock, filename, mode="r"):
block locking(lock):
block opening(filename) as f:
yield f
Used as follows:
block locking_opening(myLock, "/etc/passwd") as f:
for line in f:
print line.rstrip()
(If this example confuses you, consider that it is equivalent
to using a for-loop with a yield in its body in a regular
generator which is invoking another iterator or generator
recursively; see for example the source code for os.walk().)
6. It is possible to write a regular iterator with the
semantics of example 1:
class locking:
def __init__(self, lock):
self.lock = lock
self.state = 0
def __next__(self, arg=None):
# ignores arg
if self.state:
assert self.state == 1
self.lock.release()
self.state += 1
raise StopIteration
else:
self.lock.acquire()
self.state += 1
return None
def __exit__(self, type, value=None, traceback=None):
assert self.state in (0, 1, 2)
if self.state == 1:
self.lock.release()
raise type, value, traceback
(This example is easily modified to implement the other
examples; it shows how much simpler generators are for the same
purpose.)
7. Redirect stdout temporarily:
def redirecting_stdout(new_stdout):
save_stdout = sys.stdout
try:
sys.stdout = new_stdout
yield None
finally:
sys.stdout = save_stdout
Used as follows:
block opening(filename, "w") as f:
block redirecting_stdout(f):
print "Hello world"
8. A variant on opening() that also returns an error condition:
def opening_w_error(filename, mode="r"):
try:
f = open(filename, mode)
except IOError, err:
yield None, err
else:
try:
yield f, None
finally:
f.close()
Used as follows:
block opening_w_error("/etc/passwd", "a") as f, err:
if err:
print "IOError:", err
else:
f.write("guido::0:0::/:/bin/sh\n")
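(A historical footnote for comparison: the generator-as-template idea in these examples is essentially what later shipped as contextlib.contextmanager under PEP 343. Example 2's opening() template, rewritten as a modern Python 3 sketch with the with-statement standing in for the proposed block-statement:)

```python
import contextlib

@contextlib.contextmanager
def opening(filename, mode="r"):
    f = open(filename, mode)
    try:
        yield f          # the with-body runs while suspended here
    finally:
        f.close()        # guaranteed on any exit from the body

# Used as:
# with opening("/etc/passwd") as f:
#     for line in f:
#         print(line.rstrip())
```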
Acknowledgements
In no useful order: Alex Martelli, Barry Warsaw, Bob Ippolito,
Brett Cannon, Brian Sabbey, Chris Ryland, Doug Landauer, Duncan
Booth, Fredrik Lundh, Greg Ewing, Holger Krekel, Jason Diamond,
Jim Jewett, Josiah Carlson, Ka-Ping Yee, Michael Chermside,
Michael Hudson, Neil Schemenauer, Nick Coghlan, Paul Moore,
Phillip Eby, Raymond Hettinger, Georg Brandl, Samuele
Pedroni, Shannon Behrens, Skip Montanaro, Steven Bethard, Terry
Reedy, Tim Delaney, Aahz, and others. Thanks all for the valuable
contributions!
References
[1] http://mail.python.org/pipermail/python-dev/2005-April/052821.html
[2] http://msdn.microsoft.com/vcsharp/programming/language/ask/withstatement/
[3] http://effbot.org/zone/asyncore-generators.htm
Copyright
This document has been placed in the public domain.
pep-0341 Unifying try-except and try-finally
| PEP: | 341 |
|---|---|
| Title: | Unifying try-except and try-finally |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Georg Brandl <georg at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 04-May-2005 |
| Post-History: |
Abstract
This PEP proposes a change in the syntax and semantics of try
statements to allow combined try-except-finally blocks. This
means in short that it would be valid to write
try:
<do something>
except Exception:
<handle the error>
finally:
<cleanup>
Rationale/Proposal
There are many use cases for the try-except statement and
for the try-finally statement per se; however, often one needs
to catch exceptions and execute some cleanup code afterwards.
It is slightly annoying and not very intelligible that
one has to write
f = None
try:
try:
f = open(filename)
text = f.read()
except IOError:
print 'An error occurred'
finally:
if f:
f.close()
So it is proposed that a construction like this
try:
<suite 1>
except Ex1:
<suite 2>
<more except: clauses>
else:
<suite 3>
finally:
<suite 4>
be exactly the same as the legacy
try:
try:
<suite 1>
except Ex1:
<suite 2>
<more except: clauses>
else:
<suite 3>
finally:
<suite 4>
This is backwards compatible, and every try statement that is
legal today would continue to work.
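(This unified form did become standard in Python 2.5, so the proposed semantics can be traced directly today. A small sketch of the clause ordering; the run() helper is illustrative, not from the PEP:)

```python
events = []

def run(fail):
    try:
        events.append("try")
        if fail:
            raise IOError("boom")
    except IOError:
        events.append("except")
    else:
        events.append("else")    # only when no exception was raised
    finally:
        events.append("finally") # always runs last

run(fail=True)
run(fail=False)
assert events == ["try", "except", "finally", "try", "else", "finally"]
```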
Changes to the grammar
The grammar for the try statement, which is currently
try_stmt: ('try' ':' suite (except_clause ':' suite)+
['else' ':' suite] | 'try' ':' suite 'finally' ':' suite)
would have to become
try_stmt: 'try' ':' suite
(
(except_clause ':' suite)+
['else' ':' suite]
['finally' ':' suite]
|
'finally' ':' suite
)
Implementation
As the PEP author currently does not have sufficient knowledge
of the CPython implementation, he is unfortunately not able
to deliver one. Thomas Lee has submitted a patch [2].
However, according to Guido, it should be a piece of cake to
implement [1] -- at least for a core hacker.
This patch was committed 17 December 2005, SVN revision 41740 [3].
References
[1] http://mail.python.org/pipermail/python-dev/2005-May/053319.html
[2] http://python.org/sf/1355913
[3] http://mail.python.org/pipermail/python-checkins/2005-December/048457.html
Copyright
This document has been placed in the public domain.
pep-0342 Coroutines via Enhanced Generators
| PEP: | 342 |
|---|---|
| Title: | Coroutines via Enhanced Generators |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Guido van Rossum, Phillip J. Eby |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 10-May-2005 |
| Python-Version: | 2.5 |
| Post-History: |
Introduction
This PEP proposes some enhancements to the API and syntax of
generators, to make them usable as simple coroutines. It is
basically a combination of ideas from these two PEPs, which
may be considered redundant if this PEP is accepted:
- PEP 288, Generators Attributes and Exceptions. The current PEP
covers its second half, generator exceptions (in fact the
throw() method name was taken from PEP 288). PEP 342 replaces
generator attributes, however, with a concept from an earlier
revision of PEP 288, the "yield expression".
- PEP 325, Resource-Release Support for Generators. PEP 342
ties up a few loose ends in the PEP 325 spec, to make it suitable
for actual implementation.
Motivation
Coroutines are a natural way of expressing many algorithms, such as
simulations, games, asynchronous I/O, and other forms of event-
driven programming or co-operative multitasking. Python's generator
functions are almost coroutines -- but not quite -- in that they
allow pausing execution to produce a value, but do not provide for
values or exceptions to be passed in when execution resumes. They
also do not allow execution to be paused within the "try" portion of
try/finally blocks, and therefore make it difficult for an aborted
coroutine to clean up after itself.
Also, generators cannot yield control while other functions are
executing, unless those functions are themselves expressed as
generators, and the outer generator is written to yield in response
to values yielded by the inner generator. This complicates the
implementation of even relatively simple use cases like asynchronous
communications, because calling any functions either requires the
generator to "block" (i.e. be unable to yield control), or else a
lot of boilerplate looping code must be added around every needed
function call.
However, if it were possible to pass values or exceptions *into* a
generator at the point where it was suspended, a simple co-routine
scheduler or "trampoline function" would let coroutines "call" each
other without blocking -- a tremendous boon for asynchronous
applications. Such applications could then write co-routines to
do non-blocking socket I/O by yielding control to an I/O scheduler
until data has been sent or becomes available. Meanwhile, code that
performs the I/O would simply do something like this:
data = (yield nonblocking_read(my_socket, nbytes))
in order to pause execution until the nonblocking_read() coroutine
produced a value.
In other words, with a few relatively minor enhancements to the
language and to the implementation of the generator-iterator type,
Python will be able to support performing asynchronous operations
without needing to write the entire application as a series of
callbacks, and without requiring the use of resource-intensive threads
for programs that need hundreds or even thousands of co-operatively
multitasking pseudothreads. Thus, these enhancements will give
standard Python many of the benefits of the Stackless Python fork,
without requiring any significant modification to the CPython core
or its APIs. In addition, these enhancements should be readily
implementable by any Python implementation (such as Jython) that
already supports generators.
Specification Summary
By adding a few simple methods to the generator-iterator type, and
with two minor syntax adjustments, Python developers will be able
to use generator functions to implement co-routines and other forms
of co-operative multitasking. These methods and adjustments are:
1. Redefine "yield" to be an expression, rather than a statement.
The current yield statement would become a yield expression
whose value is thrown away. A yield expression's value is
None whenever the generator is resumed by a normal next() call.
2. Add a new send() method for generator-iterators, which resumes
the generator and "sends" a value that becomes the result of the
current yield-expression. The send() method returns the next
value yielded by the generator, or raises StopIteration if the
generator exits without yielding another value.
3. Add a new throw() method for generator-iterators, which raises
an exception at the point where the generator was paused, and
which returns the next value yielded by the generator, raising
StopIteration if the generator exits without yielding another
value. (If the generator does not catch the passed-in exception,
or raises a different exception, then that exception propagates
to the caller.)
4. Add a close() method for generator-iterators, which raises
GeneratorExit at the point where the generator was paused. If
the generator then raises StopIteration (by exiting normally, or
due to already being closed) or GeneratorExit (by not catching
the exception), close() returns to its caller. If the generator
yields a value, a RuntimeError is raised. If the generator
raises any other exception, it is propagated to the caller.
close() does nothing if the generator has already exited due to
an exception or normal exit.
5. Add support to ensure that close() is called when a generator
iterator is garbage-collected.
6. Allow "yield" to be used in try/finally blocks, since garbage
collection or an explicit close() call would now allow the
finally clause to execute.
A prototype patch implementing all of these changes against the
current Python CVS HEAD is available as SourceForge patch #1223381
(http://python.org/sf/1223381).
Specification: Sending Values into Generators
New generator method: send(value)
A new method for generator-iterators is proposed, called send(). It
takes exactly one argument, which is the value that should be "sent
in" to the generator. Calling send(None) is exactly equivalent to
calling a generator's next() method. Calling send() with any other
value is the same, except that the value produced by the generator's
current yield expression will be different.
Because generator-iterators begin execution at the top of the
generator's function body, there is no yield expression to receive
a value when the generator has just been created. Therefore,
calling send() with a non-None argument is prohibited when the
generator iterator has just started, and a TypeError is raised if
this occurs (presumably due to a logic error of some kind). Thus,
before you can communicate with a coroutine you must first call
next() or send(None) to advance its execution to the first yield
expression.
As with the next() method, the send() method returns the next value
yielded by the generator-iterator, or raises StopIteration if the
generator exits normally, or has already exited. If the generator
raises an uncaught exception, it is propagated to send()'s caller.
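(These rules can be sketched in today's Python, where send() was adopted as specified; the averager() coroutine below is an illustrative example of ours, written with the Python 3 next() builtin rather than the .next() method discussed above:)

```python
def averager():
    total = 0.0
    count = 0
    average = None
    while True:
        value = yield average   # receives whatever send() passes in
        total += value
        count += 1
        average = total / count

g = averager()
assert next(g) is None      # advance to the first yield expression
assert g.send(10) == 10.0   # send() returns the next yielded value
assert g.send(30) == 20.0

# Sending a non-None value to a just-started generator is prohibited:
g2 = averager()
try:
    g2.send(1)
except TypeError:
    pass                    # there is no yield expression to receive it yet
```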
New syntax: Yield Expressions
The yield-statement will be allowed to be used on the right-hand
side of an assignment; in that case it is referred to as
yield-expression. The value of this yield-expression is None
unless send() was called with a non-None argument; see below.
A yield-expression must always be parenthesized except when it
occurs as the top-level expression on the right-hand side of an
assignment. So
x = yield 42
x = yield
x = 12 + (yield 42)
x = 12 + (yield)
foo(yield 42)
foo(yield)
are all legal, but
x = 12 + yield 42
x = 12 + yield
foo(yield 42, 12)
foo(yield, 12)
are all illegal. (Some of the edge cases are motivated by the
current legality of "yield 12, 42".)
Note that a yield-statement or yield-expression without an
expression is now legal. This makes sense: when the information
flow in the next() call is reversed, it should be possible to
yield without passing an explicit value ("yield" is of course
equivalent to "yield None").
When send(value) is called, the yield-expression that it resumes
will return the passed-in value. When next() is called, the resumed
yield-expression will return None. If the yield-expression is a
yield-statement, this returned value is ignored, similar to ignoring
the value returned by a function call used as a statement.
In effect, a yield-expression is like an inverted function call; the
argument to yield is in fact returned (yielded) from the currently
executing function, and the "return value" of yield is the argument
passed in via send().
Note: the syntactic extensions to yield make its use very similar
to that in Ruby. This is intentional. Do note that in Python the
block passes a value to the generator using "send(EXPR)" rather
than "return EXPR", and the underlying mechanism whereby control
is passed between the generator and the block is completely
different. Blocks in Python are not compiled into thunks; rather,
yield suspends execution of the generator's frame. Some edge
cases work differently; in Python, you cannot save the block for
later use, and you cannot test whether there is a block or not.
(XXX - this stuff about blocks seems out of place now, perhaps
Guido can edit to clarify.)
Specification: Exceptions and Cleanup
Let a generator object be the iterator produced by calling a
generator function. Below, 'g' always refers to a generator
object.
New syntax: yield allowed inside try-finally
The syntax for generator functions is extended to allow a
yield-statement inside a try-finally statement.
New generator method: throw(type, value=None, traceback=None)
g.throw(type, value, traceback) causes the specified exception to
be thrown at the point where the generator g is currently
suspended (i.e. at a yield-statement, or at the start of its
function body if next() has not been called yet). If the
generator catches the exception and yields another value, that is
the return value of g.throw(). If it doesn't catch the exception,
the throw() appears to raise the same exception passed it (it
"falls through"). If the generator raises another exception (this
includes the StopIteration produced when it returns) that
exception is raised by the throw() call. In summary, throw()
behaves like next() or send(), except it raises an exception at the
suspension point. If the generator is already in the closed
state, throw() just raises the exception it was passed without
executing any of the generator's code.
The effect of raising the exception is exactly as if the
statement:
raise type, value, traceback
was executed at the suspension point. The type argument must
not be None, and the type and value must be compatible. If the
value is not an instance of the type, a new exception instance
is created using the value, following the same rules that the raise
statement uses to create an exception instance. The traceback, if
supplied, must be a valid Python traceback object, or a TypeError
occurs.
Note: The name of the throw() method was selected for several
reasons. Raise is a keyword and so cannot be used as a method
name. Unlike raise (which immediately raises an exception from the
current execution point), throw() first resumes the generator, and
only then raises the exception. The word throw is suggestive of
putting the exception in another location, and is already associated
with exceptions in other languages.
Alternative method names were considered: resolve(), signal(),
genraise(), raiseinto(), and flush(). None of these seem to fit
as well as throw().
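(A sketch of throw()'s behavior as it was eventually implemented; the resilient() and fragile() generators are illustrative examples of ours, not from the PEP:)

```python
def resilient():
    while True:
        try:
            yield "ok"
        except ValueError as err:
            yield "caught %s" % err

g = resilient()
assert next(g) == "ok"
# throw() resumes the generator and raises at the suspended yield;
# the generator catches it, so its next yield becomes throw()'s result:
assert g.throw(ValueError("bad")) == "caught bad"

def fragile():
    yield 1

g2 = fragile()
next(g2)
try:
    g2.throw(KeyError("oops"))
except KeyError:
    pass  # an uncaught exception "falls through" to throw()'s caller
```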
New standard exception: GeneratorExit
A new standard exception is defined, GeneratorExit, inheriting
from Exception. A generator should handle this by re-raising it
(or just not catching it) or by raising StopIteration.
New generator method: close()
g.close() is defined by the following pseudo-code:
def close(self):
try:
self.throw(GeneratorExit)
except (GeneratorExit, StopIteration):
pass
else:
raise RuntimeError("generator ignored GeneratorExit")
# Other exceptions are not caught
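(The pseudo-code above matches how close() eventually behaved; a small Python 3 sketch of both the normal case and the "generator ignored GeneratorExit" case, with generator names of our own invention:)

```python
cleaned = []

def well_behaved():
    try:
        yield 1
    finally:
        cleaned.append(True)   # cleanup runs when close() is called

g = well_behaved()
next(g)
g.close()                      # GeneratorExit raised at the yield; the
assert cleaned == [True]       # generator exits, so close() just returns

def misbehaved():
    try:
        yield 1
    except GeneratorExit:
        yield 2                # yielding in response to GeneratorExit

g2 = misbehaved()
next(g2)
try:
    g2.close()
except RuntimeError:
    pass                       # "generator ignored GeneratorExit"
```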
New generator method: __del__()
g.__del__() is a wrapper for g.close(). This will be called when
the generator object is garbage-collected (in CPython, this is
when its reference count goes to zero). If close() raises an
exception, a traceback for the exception is printed to sys.stderr
and further ignored; it is not propagated back to the place that
triggered the garbage collection. This is consistent with the
handling of exceptions in __del__() methods on class instances.
If the generator object participates in a cycle, g.__del__() may
not be called. This is the behavior of CPython's current garbage
collector. The reason for the restriction is that the GC code
needs to "break" a cycle at an arbitrary point in order to collect
it, and from then on no Python code should be allowed to see the
objects that formed the cycle, as they may be in an invalid state.
Objects "hanging off" a cycle are not subject to this restriction.
Note that a generator object is unlikely to participate in
a cycle in practice. However, storing a generator object in a
global variable creates a cycle via the generator frame's
f_globals pointer. Another way to create a cycle would be to
store a reference to the generator object in a data structure that
is passed to the generator as an argument (e.g., if an object has
a method that's a generator, and keeps a reference to a running
iterator created by that method). Neither of these cases
is very likely given the typical patterns of generator use.
Also, in the CPython implementation of this PEP, the frame object
used by the generator should be released whenever its execution is
terminated due to an error or normal exit. This will ensure that
generators that cannot be resumed do not remain part of an
uncollectable reference cycle. This allows other code to
potentially use close() in a try/finally or "with" block (per PEP
343) to ensure that a given generator is properly finalized.
Optional Extensions
The Extended 'continue' Statement
An earlier draft of this PEP proposed a new "continue EXPR"
syntax for use in for-loops (carried over from PEP 340), that
would pass the value of EXPR into the iterator being looped over.
This feature has been withdrawn for the time being, because the
scope of this PEP has been narrowed to focus only on passing values
into generator-iterators, and not other kinds of iterators. It
was also felt by some on the Python-Dev list that adding new syntax
for this particular feature would be premature at best.
Open Issues
Discussion on python-dev has revealed some open issues. I list
them here, with my preferred resolution and its motivation. The
PEP as currently written reflects this preferred resolution.
1. What exception should be raised by close() when the generator
yields another value as a response to the GeneratorExit
exception?
I originally chose TypeError because it represents gross
misbehavior of the generator function, which should be fixed by
changing the code. But the with_template decorator class in
PEP 343 uses RuntimeError for similar offenses. Arguably they
should all use the same exception. I'd rather not introduce a
new exception class just for this purpose, since it's not an
exception that I want people to catch: I want it to turn into a
traceback which is seen by the programmer who then fixes the
code. So now I believe they should both raise RuntimeError.
There are some precedents for that: it's raised by the core
Python code in situations where endless recursion is detected,
and for uninitialized objects (and for a variety of
miscellaneous conditions).
2. Oren Tirosh has proposed renaming the send() method to feed(),
for compatibility with the "consumer interface" (see
http://effbot.org/zone/consumer.htm for the specification.)
However, looking more closely at the consumer interface, it seems
that the desired semantics for feed() are different than for
send(), because send() can't be meaningfully called on a just-
started generator. Also, the consumer interface as currently
defined doesn't include handling for StopIteration.
Therefore, it seems like it would probably be more useful to
create a simple decorator that wraps a generator function to make
it conform to the consumer interface. For example, it could
"warm up" the generator with an initial next() call, trap
StopIteration, and perhaps even provide reset() by re-invoking
the generator function.
Examples
1. A simple "consumer" decorator that makes a generator function
automatically advance to its first yield point when initially
called:
def consumer(func):
def wrapper(*args,**kw):
gen = func(*args, **kw)
gen.next()
return gen
wrapper.__name__ = func.__name__
wrapper.__dict__ = func.__dict__
wrapper.__doc__ = func.__doc__
return wrapper
2. An example of using the "consumer" decorator to create a
"reverse generator" that receives images and creates thumbnail
pages, sending them on to another consumer. Functions like
this can be chained together to form efficient processing
pipelines of "consumers" that each can have complex internal
state:
@consumer
def thumbnail_pager(pagesize, thumbsize, destination):
while True:
page = new_image(pagesize)
rows, columns = pagesize / thumbsize
pending = False
try:
for row in xrange(rows):
for column in xrange(columns):
thumb = create_thumbnail((yield), thumbsize)
page.write(
thumb, column*thumbsize.x, row*thumbsize.y
)
pending = True
except GeneratorExit:
# close() was called, so flush any pending output
if pending:
destination.send(page)
# then close the downstream consumer, and exit
destination.close()
return
else:
# we finished a page full of thumbnails, so send it
# downstream and keep on looping
destination.send(page)
@consumer
def jpeg_writer(dirname):
fileno = 1
while True:
filename = os.path.join(dirname,"page%04d.jpg" % fileno)
write_jpeg((yield), filename)
fileno += 1
# Put them together to make a function that makes thumbnail
# pages from a list of images and other parameters.
#
def write_thumbnails(pagesize, thumbsize, images, output_dir):
pipeline = thumbnail_pager(
pagesize, thumbsize, jpeg_writer(output_dir)
)
for image in images:
pipeline.send(image)
pipeline.close()
3. A simple co-routine scheduler or "trampoline" that lets
coroutines "call" other coroutines by yielding the coroutine
they wish to invoke. Any non-generator value yielded by
a coroutine is returned to the coroutine that "called" the
one yielding the value. Similarly, if a coroutine raises an
exception, the exception is propagated to its "caller". In
effect, this example emulates simple tasklets as are used
in Stackless Python, as long as you use a yield expression to
invoke routines that would otherwise "block". This is only
a very simple example, and far more sophisticated schedulers
are possible. (For example, the existing GTasklet framework
for Python (http://www.gnome.org/~gjc/gtasklet/gtasklets.html)
and the peak.events framework (http://peak.telecommunity.com/)
already implement similar scheduling capabilities, but must
currently use awkward workarounds for the inability to pass
values or exceptions into generators.)
import collections
class Trampoline:
"""Manage communications between coroutines"""
running = False
def __init__(self):
self.queue = collections.deque()
def add(self, coroutine):
"""Request that a coroutine be executed"""
self.schedule(coroutine)
def run(self):
result = None
self.running = True
try:
while self.running and self.queue:
func = self.queue.popleft()
result = func()
return result
finally:
self.running = False
def stop(self):
self.running = False
def schedule(self, coroutine, stack=(), val=None, *exc):
def resume():
value = val
try:
if exc:
value = coroutine.throw(value,*exc)
else:
value = coroutine.send(value)
except:
if stack:
# send the error back to the "caller"
self.schedule(
stack[0], stack[1], *sys.exc_info()
)
else:
# Nothing left in this pseudothread to
# handle it, let it propagate to the
# run loop
raise
if isinstance(value, types.GeneratorType):
# Yielded to a specific coroutine, push the
# current one on the stack, and call the new
# one with no args
self.schedule(value, (coroutine,stack))
elif stack:
# Yielded a result, pop the stack and send the
# value to the caller
self.schedule(stack[0], stack[1], value)
# else: this pseudothread has ended
self.queue.append(resume)
4. A simple "echo" server, and code to run it using a trampoline
(presumes the existence of "nonblocking_read",
"nonblocking_write", and other I/O coroutines, that e.g. raise
ConnectionLost if the connection is closed):
# coroutine function that echoes data back on a connected
# socket
#
def echo_handler(sock):
while True:
try:
data = yield nonblocking_read(sock)
yield nonblocking_write(sock, data)
except ConnectionLost:
pass # exit normally if connection lost
# coroutine function that listens for connections on a
# socket, and then launches a service "handler" coroutine
# to service the connection
#
def listen_on(trampoline, sock, handler):
while True:
# get the next incoming connection
connected_socket = yield nonblocking_accept(sock)
# start another coroutine to handle the connection
trampoline.add( handler(connected_socket) )
# Create a scheduler to manage all our coroutines
t = Trampoline()
# Create a coroutine instance to run the echo_handler on
# incoming connections
#
server = listen_on(
t, listening_socket("localhost","echo"), echo_handler
)
# Add the coroutine to the scheduler
t.add(server)
# loop forever, accepting connections and servicing them
# "in parallel"
#
t.run()
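(A side note on example 1 above: the manual copying of __name__, __dict__, and __doc__ is exactly what functools.wraps, added in the same Python 2.5 release, now automates. A modern sketch of the consumer decorator, paired with an illustrative accumulate() coroutine of ours:)

```python
import functools

def consumer(func):
    @functools.wraps(func)     # copies __name__, __doc__, etc. for us
    def wrapper(*args, **kw):
        gen = func(*args, **kw)
        next(gen)              # "warm up": advance to the first yield
        return gen
    return wrapper

@consumer
def accumulate():
    total = 0
    while True:
        total += (yield total)

acc = accumulate()
assert acc.send(5) == 5        # usable immediately, no manual next() needed
assert acc.send(3) == 8
assert accumulate.__name__ == "accumulate"
```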
Reference Implementation
A prototype patch implementing all of the features described in this
PEP is available as SourceForge patch #1223381
(http://python.org/sf/1223381).
This patch was committed to CVS 01-02 August 2005.
Acknowledgements
Raymond Hettinger (PEP 288) and Samuele Pedroni (PEP 325) first
formally proposed the ideas of communicating values or exceptions
into generators, and the ability to "close" generators. Timothy
Delaney suggested the title of this PEP, and Steven Bethard helped
edit a previous version. See also the Acknowledgements section
of PEP 340.
References
TBD.
Copyright
This document has been placed in the public domain.
pep-0343 The "with" Statement
| PEP: | 343 |
|---|---|
| Title: | The "with" Statement |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Guido van Rossum, Nick Coghlan |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 13-May-2005 |
| Python-Version: | 2.5 |
| Post-History: | 2-Jun-2005, 16-Oct-2005, 29-Oct-2005, 23-Apr-2006, 1-May-2006, 30-Jul-2006 |
Abstract
This PEP adds a new statement "with" to the Python language to make
it possible to factor out standard uses of try/finally statements.
In this PEP, context managers provide __enter__() and __exit__()
methods that are invoked on entry to and exit from the body of the
with statement.
Author's Note
This PEP was originally written in first person by Guido, and
subsequently updated by Nick Coghlan to reflect later discussion
on python-dev. Any first person references are from Guido's
original.
Python's alpha release cycle revealed terminology problems in this
PEP and in the associated documentation and implementation [14].
The PEP stabilised around the time of the first Python 2.5 beta
release.
Yes, the verb tense is messed up in a few places. We've been
working on this PEP for over a year now, so things that were
originally in the future are now in the past :)
Introduction
After a lot of discussion about PEP 340 and alternatives, I
decided to withdraw PEP 340 and proposed a slight variant on PEP
310. After more discussion, I have added back a mechanism for
raising an exception in a suspended generator using a throw()
method, and a close() method which throws a new GeneratorExit
exception; these additions were first proposed on python-dev in
[2] and universally approved of. I'm also changing the keyword to
'with'.
After acceptance of this PEP, the following PEPs were rejected due
to overlap:
- PEP 310, Reliable Acquisition/Release Pairs. This is the
original with-statement proposal.
- PEP 319, Python Synchronize/Asynchronize Block. Its use cases
can be covered by the current PEP by providing suitable
with-statement controllers: for 'synchronize' we can use the
"locking" template from example 1; for 'asynchronize' we can use
a similar "unlocking" template. I don't think having an
"anonymous" lock associated with a code block is all that
important; in fact it may be better to always be explicit about
the mutex being used.
PEP 340 and PEP 346 also overlapped with this PEP, but were
voluntarily withdrawn when this PEP was submitted.
Some discussion of earlier incarnations of this PEP took place on
the Python Wiki [3].
Motivation and Summary
PEP 340, Anonymous Block Statements, combined many powerful ideas:
using generators as block templates, adding exception handling and
finalization to generators, and more.  Besides praise, it received
a lot of opposition from people who didn't like the fact that it
was, under the covers, a (potential) looping construct. This
meant that break and continue in a block-statement would break or
continue the block-statement, even if it was used as a non-looping
resource management tool.
But the final blow came when I read Raymond Chen's rant about
flow-control macros [1].  Raymond argues convincingly that hiding
flow control in macros makes your code inscrutable, and I find
that his argument applies to Python as well as to C. I realized
that PEP 340 templates can hide all sorts of control flow; for
example, its example 4 (auto_retry()) catches exceptions and
repeats the block up to three times.
However, the with-statement of PEP 310 does *not* hide control
flow, in my view: while a finally-suite temporarily suspends the
control flow, in the end, the control flow resumes as if the
finally-suite wasn't there at all.
Remember, PEP 310 proposes roughly this syntax (the "VAR =" part is
optional):
    with VAR = EXPR:
        BLOCK
which roughly translates into this:
    VAR = EXPR
    VAR.__enter__()
    try:
        BLOCK
    finally:
        VAR.__exit__()
Now consider this example:
    with f = open("/etc/passwd"):
        BLOCK1
    BLOCK2
Here, just as if the first line was "if True" instead, we know
that if BLOCK1 completes without an exception, BLOCK2 will be
reached; and if BLOCK1 raises an exception or executes a non-local
goto (a break, continue or return), BLOCK2 is *not* reached. The
magic added by the with-statement at the end doesn't affect this.
(You may ask, what if a bug in the __exit__() method causes an
exception? Then all is lost -- but this is no worse than with
other exceptions; the nature of exceptions is that they can happen
*anywhere*, and you just have to live with that. Even if you
write bug-free code, a KeyboardInterrupt exception can still cause
it to exit between any two virtual machine opcodes.)
This argument almost led me to endorse PEP 310, but I had one idea
left from the PEP 340 euphoria that I wasn't ready to drop: using
generators as "templates" for abstractions like acquiring and
releasing a lock or opening and closing a file is a powerful idea,
as can be seen by looking at the examples in that PEP.
Inspired by a counter-proposal to PEP 340 by Phillip Eby I tried
to create a decorator that would turn a suitable generator into an
object with the necessary __enter__() and __exit__() methods.
Here I ran into a snag: while it wasn't too hard for the locking
example, it was impossible to do this for the opening example.
The idea was to define the template like this:
    @contextmanager
    def opening(filename):
        f = open(filename)
        try:
            yield f
        finally:
            f.close()
and to use it like this:
    with f = opening(filename):
        ...read data from f...
The problem is that in PEP 310, the result of calling EXPR is
assigned directly to VAR, and then VAR's __exit__() method is
called upon exit from BLOCK1. But here, VAR clearly needs to
receive the opened file, and that would mean that __exit__() would
have to be a method on the file.
While this can be solved using a proxy class, this is awkward and
made me realize that a slightly different translation would make
writing the desired decorator a piece of cake: let VAR receive the
result from calling the __enter__() method, and save the value of
EXPR to call its __exit__() method later. Then the decorator can
return an instance of a wrapper class whose __enter__() method
calls the generator's next() method and returns whatever next()
returns; the wrapper instance's __exit__() method calls next()
again but expects it to raise StopIteration. (Details below in
the section Optional Generator Decorator.)
So now the final hurdle was that the PEP 310 syntax:
    with VAR = EXPR:
        BLOCK1
would be deceptive, since VAR does *not* receive the value of
EXPR. Borrowing from PEP 340, it was an easy step to:
    with EXPR as VAR:
        BLOCK1
Additional discussion showed that people really liked being able
to "see" the exception in the generator, even if it was only to
log it; the generator is not allowed to yield another value, since
the with-statement should not be usable as a loop (raising a
different exception is marginally acceptable). To enable this, a
new throw() method for generators is proposed, which takes one to
three arguments representing an exception in the usual fashion
(type, value, traceback) and raises it at the point where the
generator is suspended.
Once we have this, it is a small step to proposing another
generator method, close(), which calls throw() with a special
exception, GeneratorExit. This tells the generator to exit, and
from there it's another small step to proposing that close() be
called automatically when the generator is garbage-collected.
Then, finally, we can allow a yield-statement inside a try-finally
statement, since we can now guarantee that the finally-clause will
(eventually) be executed. The usual cautions about finalization
apply -- the process may be terminated abruptly without finalizing
any objects, and objects may be kept alive forever by cycles or
memory leaks in the application (as opposed to cycles or leaks in
the Python implementation, which are taken care of by GC).
Note that we're not guaranteeing that the finally-clause is
executed immediately after the generator object becomes unused,
even though this is how it will work in CPython. This is similar
to auto-closing files: while a reference-counting implementation
like CPython deallocates an object as soon as the last reference
to it goes away, implementations that use other GC algorithms do
not make the same guarantee. This applies to Jython, IronPython,
and probably to Python running on Parrot.
(The details of the changes made to generators can now be found in
PEP 342 rather than in the current PEP.)
Use Cases
See the Examples section near the end.
Specification: The 'with' Statement
A new statement is proposed with the syntax:
    with EXPR as VAR:
        BLOCK
Here, 'with' and 'as' are new keywords; EXPR is an arbitrary
expression (but not an expression-list) and VAR is a single
assignment target. It can *not* be a comma-separated sequence of
variables, but it *can* be a *parenthesized* comma-separated
sequence of variables. (This restriction makes a future extension
possible of the syntax to have multiple comma-separated resources,
each with its own optional as-clause.)
The "as VAR" part is optional.
The translation of the above statement is:
    mgr = (EXPR)
    exit = type(mgr).__exit__  # Not calling it yet
    value = type(mgr).__enter__(mgr)
    exc = True
    try:
        try:
            VAR = value  # Only if "as VAR" is present
            BLOCK
        except:
            # The exceptional case is handled here
            exc = False
            if not exit(mgr, *sys.exc_info()):
                raise
            # The exception is swallowed if exit() returns true
    finally:
        # The normal and non-local-goto cases are handled here
        if exc:
            exit(mgr, None, None, None)
Here, the lowercase variables (mgr, exit, value, exc) are internal
variables and not accessible to the user; they will most likely be
implemented as special registers or stack positions.
The details of the above translation are intended to prescribe the
exact semantics.  If either of the relevant methods is not found
as expected, the interpreter will raise AttributeError, in the
order that they are tried (__exit__, __enter__).
Similarly, if any of the calls raises an exception, the effect is
exactly as it would be in the above code. Finally, if BLOCK
contains a break, continue or return statement, the __exit__()
method is called with three None arguments just as if BLOCK
completed normally. (I.e. these "pseudo-exceptions" are not seen
as exceptions by __exit__().)
If the "as VAR" part of the syntax is omitted, the "VAR =" part of
the translation is omitted (but mgr.__enter__() is still called).
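The call ordering that the translation prescribes can be observed directly with a small tracing manager. The following is an illustrative modern Python 3 sketch (the `Tracing` class and its log are hypothetical, not part of the PEP): `__enter__()` runs first, its return value is bound by the "as" clause, and `__exit__()` receives three None arguments on normal completion.

```python
class Tracing:
    """Hypothetical manager that records each protocol call."""
    def __init__(self, log):
        self.log = log

    def __enter__(self):
        self.log.append("enter")
        return "resource"          # this value is bound by "as VAR"

    def __exit__(self, exc_type, exc_value, tb):
        self.log.append(("exit", exc_type))
        return False               # do not swallow exceptions


log = []
with Tracing(log) as var:
    log.append(("body", var))

# log is now ["enter", ("body", "resource"), ("exit", None)]
```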
The calling convention for mgr.__exit__() is as follows. If the
finally-suite was reached through normal completion of BLOCK or
through a non-local goto (a break, continue or return statement in
BLOCK), mgr.__exit__() is called with three None arguments. If
the finally-suite was reached through an exception raised in
BLOCK, mgr.__exit__() is called with three arguments representing
the exception type, value, and traceback.
IMPORTANT: if mgr.__exit__() returns a "true" value, the exception
is "swallowed". That is, if it returns "true", execution
continues at the next statement after the with-statement, even if
an exception happened inside the with-statement. However, if the
with-statement was left via a non-local goto (break, continue or
return), this non-local return is resumed when mgr.__exit__()
returns regardless of the return value. The motivation for this
detail is to make it possible for mgr.__exit__() to swallow
exceptions, without making it too easy (since the default return
value, None, is false and this causes the exception to be
re-raised). The main use case for swallowing exceptions is to
make it possible to write the @contextmanager decorator so
that a try/except block in a decorated generator behaves exactly
as if the body of the generator were expanded in-line at the place
of the with-statement.
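The swallowing rule above can be demonstrated with a manager whose `__exit__()` returns a true value for a chosen exception type. This is a hedged modern Python 3 sketch (the `Suppress` class is illustrative, though the standard library later grew a similar `contextlib.suppress`):

```python
class Suppress:
    """Hypothetical manager that swallows one exception type."""
    def __init__(self, exc_type):
        self.exc_type = exc_type

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, tb):
        # A true return value tells the interpreter to swallow the
        # exception; the default None is false, so it re-raises.
        return exc_type is not None and issubclass(exc_type, self.exc_type)


result = []
with Suppress(ZeroDivisionError):
    1 / 0
    result.append("unreachable")   # skipped: the exception aborts the body
result.append("after with")        # reached: __exit__ returned True
```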
The motivation for passing the exception details to __exit__(), as
opposed to the argument-less __exit__() from PEP 310, was given by
the transactional() use case, example 3 below. The template in
that example must commit or roll back the transaction depending on
whether an exception occurred or not. Rather than just having a
boolean flag indicating whether an exception occurred, we pass the
complete exception information, for the benefit of an
exception-logging facility for example. Relying on sys.exc_info()
to get at the exception information was rejected; sys.exc_info()
has very complex semantics and it is perfectly possible that it
returns the exception information for an exception that was caught
ages ago. It was also proposed to add an additional boolean to
distinguish between reaching the end of BLOCK and a non-local
goto. This was rejected as too complex and unnecessary; a
non-local goto should be considered unexceptional for the purposes
of a database transaction roll-back decision.
To facilitate chaining of contexts in Python code that directly
manipulates context managers, __exit__() methods should *not*
re-raise the error that is passed in to them. It is always the
responsibility of the *caller* of the __exit__() method to do any
reraising in that case.
That way, if the caller needs to tell whether the __exit__()
invocation *failed* (as opposed to successfully cleaning up before
propagating the original error), it can do so.
If __exit__() returns without an error, this can then be
interpreted as success of the __exit__() method itself (regardless
of whether or not the original error is to be propagated or
suppressed).
However, if __exit__() propagates an exception to its caller, this
means that __exit__() *itself* has failed. Thus, __exit__()
methods should avoid raising errors unless they have actually
failed. (And allowing the original error to proceed isn't a
failure.)
Transition Plan
In Python 2.5, the new syntax will only be recognized if a future
statement is present:
    from __future__ import with_statement
This will make both 'with' and 'as' keywords. Without the future
statement, using 'with' or 'as' as an identifier will cause a
Warning to be issued to stderr.
In Python 2.6, the new syntax will always be recognized; 'with'
and 'as' are always keywords.
Generator Decorator
With PEP 342 accepted, it is possible to write a decorator
that makes it possible to use a generator that yields exactly once
to control a with-statement. Here's a sketch of such a decorator:
    class GeneratorContextManager(object):

        def __init__(self, gen):
            self.gen = gen

        def __enter__(self):
            try:
                return self.gen.next()
            except StopIteration:
                raise RuntimeError("generator didn't yield")

        def __exit__(self, type, value, traceback):
            if type is None:
                try:
                    self.gen.next()
                except StopIteration:
                    return
                else:
                    raise RuntimeError("generator didn't stop")
            else:
                try:
                    self.gen.throw(type, value, traceback)
                    raise RuntimeError("generator didn't stop after throw()")
                except StopIteration:
                    return True
                except:
                    # only re-raise if it's *not* the exception that was
                    # passed to throw(), because __exit__() must not raise
                    # an exception unless __exit__() itself failed.  But
                    # throw() has to raise the exception to signal
                    # propagation, so this fixes the impedance mismatch
                    # between the throw() protocol and the __exit__()
                    # protocol.
                    if sys.exc_info()[1] is not value:
                        raise

    def contextmanager(func):
        def helper(*args, **kwds):
            return GeneratorContextManager(func(*args, **kwds))
        return helper
This decorator could be used as follows:
    @contextmanager
    def opening(filename):
        f = open(filename)  # IOError is untouched by GeneratorContext
        try:
            yield f
        finally:
            f.close()  # Ditto for errors here (however unlikely)
A robust implementation of this decorator will be made
part of the standard library.
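That robust implementation became `contextlib.contextmanager`. As a hedged illustration in modern Python 3 (using an in-memory stream rather than a real file so the sketch is self-contained), the generator runs up to its `yield` inside `__enter__()` and resumes in `__exit__()`:

```python
import contextlib
import io

@contextlib.contextmanager
def opening(stream):
    # Code before the yield runs in __enter__(); the yielded value
    # is what "as VAR" receives.
    try:
        yield stream
    finally:
        # Code after the yield runs in __exit__(), even on error.
        stream.close()


buf = io.StringIO("line1\n")
with opening(buf) as f:
    data = f.read()

# data == "line1\n" and buf is closed afterwards
```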
Context Managers in the Standard Library
It would be possible to endow certain objects, like files,
sockets, and locks, with __enter__() and __exit__() methods so
that instead of writing:
    with locking(myLock):
        BLOCK

one could write simply:

    with myLock:
        BLOCK
I think we should be careful with this; it could lead to mistakes
like:
    f = open(filename)
    with f:
        BLOCK1
    with f:
        BLOCK2
which does not do what one might think (f is closed before BLOCK2
is entered).
OTOH such mistakes are easily diagnosed; for example, the
generator context decorator above raises RuntimeError when a
second with-statement calls f.__enter__() again. A similar error
can be raised if __enter__ is invoked on a closed file object.
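In modern Python 3 this is exactly what happens: file objects refuse re-entry once closed. A small self-contained check (the temporary file is created only for the demonstration):

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

f = open(path)
with f:
    pass              # the first with-statement closes f on exit

outcome = None
try:
    with f:           # __enter__ on a closed file raises immediately
        pass
except ValueError:
    outcome = "ValueError"

os.unlink(path)
```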
For Python 2.5, the following types have been identified as
context managers:
- file
- thread.LockType
- threading.Lock
- threading.RLock
- threading.Condition
- threading.Semaphore
- threading.BoundedSemaphore
A context manager will also be added to the decimal module to
support using a local decimal arithmetic context within the body
of a with statement, automatically restoring the original context
when the with statement is exited.
Standard Terminology
This PEP proposes that the protocol consisting of the __enter__()
and __exit__() methods be known as the "context management protocol",
and that objects that implement that protocol be known as "context
managers". [4]
The expression immediately following the with keyword in the
statement is a "context expression" as that expression provides the
main clue as to the runtime environment the context manager
establishes for the duration of the statement body.
The code in the body of the with statement and the variable name
(or names) after the as keyword don't really have special terms at
this point in time. The general terms "statement body" and "target
list" can be used, prefixing with "with" or "with statement" if the
terms would otherwise be unclear.
Given the existence of objects such as the decimal module's
arithmetic context, the term "context" is unfortunately ambiguous.
If necessary, it can be made more specific by using the terms
"context manager" for the concrete object created by the context
expression and "runtime context" or (preferably) "runtime
environment" for the actual state modifications made by the context
manager. When simply discussing use of the with statement, the
ambiguity shouldn't matter too much as the context expression fully
defines the changes made to the runtime environment.
The distinction is more important when discussing the mechanics of
the with statement itself and how to go about actually implementing
context managers.
Caching Context Managers
Many context managers (such as files and generator-based contexts)
will be single-use objects. Once the __exit__() method has been
called, the context manager will no longer be in a usable state
(e.g. the file has been closed, or the underlying generator has
finished execution).
Requiring a fresh manager object for each with statement is the
easiest way to avoid problems with multi-threaded code and nested
with statements trying to use the same context manager. It isn't
coincidental that all of the standard library context managers
that support reuse come from the threading module - they're all
already designed to deal with the problems created by threaded
and nested usage.
This means that in order to save a context manager with particular
initialisation arguments to be used in multiple with statements, it
will typically be necessary to store it in a zero-argument callable
that is then called in the context expression of each statement
rather than caching the context manager directly.
When this restriction does not apply, the documentation of the
affected context manager should make that clear.
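One way to follow this advice in modern Python 3 is to cache a zero-argument factory (for example via `functools.partial`) rather than the manager itself; the factory is then called in each context expression. A hedged sketch using a temporary file:

```python
import os
import tempfile
from functools import partial

fd, path = tempfile.mkstemp()
os.write(fd, b"hello")
os.close(fd)

# Cache a factory, not the single-use manager: each call builds a
# fresh file object with the same initialisation arguments.
make_reader = partial(open, path, "r")

with make_reader() as f:      # fresh manager for this statement
    first = f.read()
with make_reader() as f:      # works again, unlike reusing one file
    second = f.read()

os.unlink(path)
```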
Resolved Issues
The following issues were resolved by BDFL approval (and a lack
of any major objections on python-dev).
1. What exception should GeneratorContextManager raise when the
underlying generator-iterator misbehaves? The following quote is
the reason behind Guido's choice of RuntimeError for both this
and for the generator close() method in PEP 342 (from [8]):
"I'd rather not introduce a new exception class just for this
purpose, since it's not an exception that I want people to catch:
I want it to turn into a traceback which is seen by the
programmer who then fixes the code. So now I believe they
should both raise RuntimeError.
There are some precedents for that: it's raised by the core
Python code in situations where endless recursion is detected,
and for uninitialized objects (and for a variety of
miscellaneous conditions)."
2. It is fine to raise AttributeError instead of TypeError if the
relevant methods aren't present on a class involved in a with
statement. The fact that the abstract object C API raises
TypeError rather than AttributeError is an accident of history,
rather than a deliberate design decision [11].
3. Objects with __enter__/__exit__ methods are called "context
managers" and the decorator to convert a generator function
into a context manager factory is ``contextlib.contextmanager``.
There were some other suggestions [16] during the 2.5 release
cycle but no compelling arguments for switching away from the
terms that had been used in the PEP implementation were made.
Rejected Options
For several months, the PEP prohibited suppression of exceptions
in order to avoid hidden flow control. Implementation
revealed this to be a right royal pain, so Guido restored the
ability [13].
Another aspect of the PEP that caused no end of questions and
terminology debates was providing a __context__() method that
was analogous to an iterable's __iter__() method [5, 7, 9].
The ongoing problems [10, 13] with explaining what it was and why
it was and how it was meant to work eventually led to Guido
killing the concept outright [15] (and there was much rejoicing!).
The notion of using the PEP 342 generator API directly to define
the with statement was also briefly entertained [6], but quickly
dismissed as making it too difficult to write non-generator
based context managers.
Examples
The generator based examples rely on PEP 342. Also, some of the
examples are unnecessary in practice, as the appropriate objects,
such as threading.RLock, are able to be used directly in with
statements.
The tense used in the names of the example contexts is not
arbitrary. Past tense ("-ed") is used when the name refers to an
action which is done in the __enter__ method and undone in the
__exit__ method. Progressive tense ("-ing") is used when the name
refers to an action which is to be done in the __exit__ method.
1. A template for ensuring that a lock, acquired at the start of a
block, is released when the block is left:
    @contextmanager
    def locked(lock):
        lock.acquire()
        try:
            yield
        finally:
            lock.release()
Used as follows:
    with locked(myLock):
        # Code here executes with myLock held.  The lock is
        # guaranteed to be released when the block is left (even
        # if via return or by an uncaught exception).
2. A template for opening a file that ensures the file is closed
when the block is left:
    @contextmanager
    def opened(filename, mode="r"):
        f = open(filename, mode)
        try:
            yield f
        finally:
            f.close()
Used as follows:
    with opened("/etc/passwd") as f:
        for line in f:
            print line.rstrip()
3. A template for committing or rolling back a database
transaction:
    @contextmanager
    def transaction(db):
        db.begin()
        try:
            yield None
        except:
            db.rollback()
            raise
        else:
            db.commit()
4. Example 1 rewritten without a generator:
    class locked:
        def __init__(self, lock):
            self.lock = lock
        def __enter__(self):
            self.lock.acquire()
        def __exit__(self, type, value, tb):
            self.lock.release()
(This example is easily modified to implement the other
relatively stateless examples; it shows that it is easy to avoid
the need for a generator if no special state needs to be
preserved.)
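The class-based template runs unchanged in modern Python 3, which makes its behaviour easy to verify with a real `threading.Lock` (the state checks via `locked()` are added here purely for demonstration):

```python
import threading

class locked:
    """Class-based lock-holding manager, as in example 4."""
    def __init__(self, lock):
        self.lock = lock
    def __enter__(self):
        self.lock.acquire()
    def __exit__(self, exc_type, exc_value, tb):
        self.lock.release()


my_lock = threading.Lock()
with locked(my_lock):
    held_inside = my_lock.locked()   # True: acquired by __enter__
held_after = my_lock.locked()        # False: released by __exit__
```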
5. Redirect stdout temporarily:
    @contextmanager
    def stdout_redirected(new_stdout):
        save_stdout = sys.stdout
        sys.stdout = new_stdout
        try:
            yield None
        finally:
            sys.stdout = save_stdout
Used as follows:
    with opened(filename, "w") as f:
        with stdout_redirected(f):
            print "Hello world"
This isn't thread-safe, of course, but neither is doing this
same dance manually. In single-threaded programs (for example,
in scripts) it is a popular way of doing things.
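The standard library later adopted this exact save/replace/restore dance as `contextlib.redirect_stdout`. A self-contained modern Python 3 sketch, redirecting into an in-memory buffer instead of a file:

```python
import contextlib
import io
import sys

buf = io.StringIO()
saved = sys.stdout

with contextlib.redirect_stdout(buf):
    print("Hello world")          # written to buf, not the console

restored = sys.stdout is saved    # the original stream comes back
```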
6. A variant on opened() that also returns an error condition:
    @contextmanager
    def opened_w_error(filename, mode="r"):
        try:
            f = open(filename, mode)
        except IOError, err:
            yield None, err
        else:
            try:
                yield f, None
            finally:
                f.close()
Used as follows:
    with opened_w_error("/etc/passwd", "a") as (f, err):
        if err:
            print "IOError:", err
        else:
            f.write("guido::0:0::/:/bin/sh\n")
7. Another useful example would be an operation that blocks
signals. The use could be like this:
    import signal

    with signal.blocked():
        # code executed without worrying about signals
An optional argument might be a list of signals to be blocked;
by default all signals are blocked. The implementation is left
as an exercise to the reader.
8. Another use for this feature is the Decimal context. Here's a
simple example, after one posted by Michael Chermside:
    import decimal

    @contextmanager
    def extra_precision(places=2):
        c = decimal.getcontext()
        saved_prec = c.prec
        c.prec += places
        try:
            yield None
        finally:
            c.prec = saved_prec
Sample usage (adapted from the Python Library Reference):
    def sin(x):
        "Return the sine of x as measured in radians."
        with extra_precision():
            i, lasts, s, fact, num, sign = 1, 0, x, 1, x, 1
            while s != lasts:
                lasts = s
                i += 2
                fact *= i * (i-1)
                num *= x * x
                sign *= -1
                s += num / fact * sign
        # The "+s" rounds back to the original precision,
        # so this must be outside the with-statement:
        return +s
9. Here's a simple context manager for the decimal module:
    @contextmanager
    def localcontext(ctx=None):
        """Set a new local decimal context for the block"""
        # Default to using the current context
        if ctx is None:
            ctx = getcontext()
        # We set the thread context to a copy of this context
        # to ensure that changes within the block are kept
        # local to the block.
        newctx = ctx.copy()
        oldctx = decimal.getcontext()
        decimal.setcontext(newctx)
        try:
            yield newctx
        finally:
            # Always restore the original context
            decimal.setcontext(oldctx)
Sample usage:
    from decimal import localcontext, ExtendedContext

    def sin(x):
        with localcontext() as ctx:
            ctx.prec += 2
            # Rest of sin calculation algorithm
            # uses a precision 2 greater than normal
        return +s  # Convert result to normal precision

    def sin(x):
        with localcontext(ExtendedContext):
            # Rest of sin calculation algorithm
            # uses the Extended Context from the
            # General Decimal Arithmetic Specification
        return +s  # Convert result to normal context
10. A generic "object-closing" context manager:
    class closing(object):
        def __init__(self, obj):
            self.obj = obj
        def __enter__(self):
            return self.obj
        def __exit__(self, *exc_info):
            try:
                close_it = self.obj.close
            except AttributeError:
                pass
            else:
                close_it()
This can be used to deterministically close anything with a
close method, be it file, generator, or something else. It
can even be used when the object isn't guaranteed to require
closing (e.g., a function that accepts an arbitrary
iterable):
    # emulate opening():
    with closing(open("argument.txt")) as contradiction:
        for line in contradiction:
            print line

    # deterministically finalize an iterator:
    with closing(iter(data_source)) as data:
        for datum in data:
            process(datum)
(Python 2.5's contextlib module contains a version
of this context manager)
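The `contextlib.closing` shipped in the standard library can be exercised against any object with a `close()` method. A hedged modern Python 3 sketch (the `Conn` class is a made-up stand-in for a real connection object):

```python
import contextlib

class Conn:
    """Hypothetical resource with close() but no __enter__/__exit__."""
    def __init__(self):
        self.open = True
    def close(self):
        self.open = False


conn = Conn()
with contextlib.closing(conn) as c:
    was_open = c.open    # True: still open inside the block
# closing() calls conn.close() on exit, so conn.open is now False
```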
11. PEP 319 gives a use case for also having a released()
context to temporarily release a previously acquired lock;
this can be written very similarly to the locked context
manager above by swapping the acquire() and release() calls.
    class released:
        def __init__(self, lock):
            self.lock = lock
        def __enter__(self):
            self.lock.release()
        def __exit__(self, type, value, tb):
            self.lock.acquire()
Sample usage:
    with my_lock:
        # Operations with the lock held
        with released(my_lock):
            # Operations without the lock
            # e.g. blocking I/O
        # Lock is held again here
12. A "nested" context manager that automatically nests the
supplied contexts from left-to-right to avoid excessive
indentation:
    @contextmanager
    def nested(*contexts):
        exits = []
        vars = []
        try:
            try:
                for context in contexts:
                    mgr = context.__context__()
                    exit = mgr.__exit__
                    enter = mgr.__enter__
                    vars.append(enter())
                    exits.append(exit)
                yield vars
            except:
                exc = sys.exc_info()
            else:
                exc = (None, None, None)
        finally:
            while exits:
                exit = exits.pop()
                try:
                    exit(*exc)
                except:
                    exc = sys.exc_info()
                else:
                    exc = (None, None, None)
            if exc != (None, None, None):
                # sys.exc_info() may have been
                # changed by one of the exit methods
                # so provide explicit exception info
                raise exc[0], exc[1], exc[2]
Sample usage:
with nested(a, b, c) as (x, y, z):
# Perform operation
Is equivalent to:
    with a as x:
        with b as y:
            with c as z:
                # Perform operation
(Python 2.5's contextlib module contains a version
of this context manager)
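In later Python versions `contextlib.nested` was deprecated in favour of `contextlib.ExitStack`, which provides the same left-to-right entry and reverse-order unwinding with sounder error handling. A self-contained modern Python 3 sketch (the `tag` manager is invented here purely to make the ordering observable):

```python
import contextlib

entered, exited = [], []

@contextlib.contextmanager
def tag(name):
    entered.append(name)       # runs on __enter__
    try:
        yield name
    finally:
        exited.append(name)    # runs on __exit__

# ExitStack enters each manager as it is pushed and calls the
# __exit__ methods in reverse order when the block is left.
with contextlib.ExitStack() as stack:
    values = [stack.enter_context(tag(n)) for n in "abc"]
```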
Reference Implementation
This PEP was first accepted by Guido at his EuroPython
keynote, 27 June 2005.
It was accepted again later, with the __context__ method added.
The PEP was implemented in Subversion for Python 2.5a1.
The __context__() method was removed in Python 2.5a3.
Acknowledgements
Many people contributed to the ideas and concepts in this PEP,
including all those mentioned in the acknowledgements for PEP 340
and PEP 346.
Additional thanks goes to (in no meaningful order): Paul Moore,
Phillip J. Eby, Greg Ewing, Jason Orendorff, Michael Hudson,
Raymond Hettinger, Walter Dörwald, Aahz, Georg Brandl, Terry Reedy,
A.M. Kuchling, Brett Cannon, and all those that participated in the
discussions on python-dev.
References
[1] Raymond Chen's article on hidden flow control
http://blogs.msdn.com/oldnewthing/archive/2005/01/06/347666.aspx
[2] Guido suggests some generator changes that ended up in PEP 342
http://mail.python.org/pipermail/python-dev/2005-May/053885.html
[3] Wiki discussion of PEP 343
http://wiki.python.org/moin/WithStatement
[4] Early draft of some documentation for the with statement
http://mail.python.org/pipermail/python-dev/2005-July/054658.html
[5] Proposal to add the __with__ method
http://mail.python.org/pipermail/python-dev/2005-October/056947.html
[6] Proposal to use the PEP 342 enhanced generator API directly
http://mail.python.org/pipermail/python-dev/2005-October/056969.html
[7] Guido lets me (Nick Coghlan) talk him into a bad idea ;)
http://mail.python.org/pipermail/python-dev/2005-October/057018.html
[8] Guido raises some exception handling questions
http://mail.python.org/pipermail/python-dev/2005-June/054064.html
[9] Guido answers some questions about the __context__ method
http://mail.python.org/pipermail/python-dev/2005-October/057520.html
[10] Guido answers more questions about the __context__ method
http://mail.python.org/pipermail/python-dev/2005-October/057535.html
[11] Guido says AttributeError is fine for missing special methods
http://mail.python.org/pipermail/python-dev/2005-October/057625.html
[12] Original PEP 342 implementation patch
http://sourceforge.net/tracker/index.php?func=detail&aid=1223381&group_id=5470&atid=305470
[13] Guido restores the ability to suppress exceptions
http://mail.python.org/pipermail/python-dev/2006-February/061909.html
[14] A simple question kickstarts a thorough review of PEP 343
http://mail.python.org/pipermail/python-dev/2006-April/063859.html
[15] Guido kills the __context__() method
http://mail.python.org/pipermail/python-dev/2006-April/064632.html
[16] Proposal to use 'context guard' instead of 'context manager'
http://mail.python.org/pipermail/python-dev/2006-May/064676.html
Copyright
This document has been placed in the public domain.
pep-0344 Exception Chaining and Embedded Tracebacks
| PEP: | 344 |
|---|---|
| Title: | Exception Chaining and Embedded Tracebacks |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Ka-Ping Yee |
| Status: | Superseded |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 12-May-2005 |
| Python-Version: | 2.5 |
| Post-History: |  |
Numbering Note
This PEP has been renumbered to PEP 3134. The text below is the
last version submitted under the old number.
Abstract
This PEP proposes three standard attributes on exception instances:
the '__context__' attribute for implicitly chained exceptions, the
'__cause__' attribute for explicitly chained exceptions, and the
'__traceback__' attribute for the traceback. A new "raise ... from"
statement sets the '__cause__' attribute.
Motivation
During the handling of one exception (exception A), it is possible
that another exception (exception B) may occur. In today's Python
(version 2.4), if this happens, exception B is propagated outward
and exception A is lost. In order to debug the problem, it is
useful to know about both exceptions. The '__context__' attribute
retains this information automatically.
Sometimes it can be useful for an exception handler to intentionally
re-raise an exception, either to provide extra information or to
translate an exception to another type. The '__cause__' attribute
provides an explicit way to record the direct cause of an exception.
In today's Python implementation, exceptions are composed of three
parts: the type, the value, and the traceback. The 'sys' module
exposes the current exception in three parallel variables (exc_type,
exc_value, and exc_traceback), the sys.exc_info() function returns a
tuple of these three parts, and the 'raise' statement has a
three-argument form accepting these three parts. Manipulating
exceptions often requires passing these three things in parallel,
which can be tedious and error-prone. Additionally, the 'except'
statement can only provide access to the value, not the traceback.
Adding the '__traceback__' attribute to exception values makes all
the exception information accessible from a single place.
History
Raymond Hettinger [1] raised the issue of masked exceptions on
Python-Dev in January 2003 and proposed a PyErr_FormatAppend()
function that C modules could use to augment the currently active
exception with more information. Brett Cannon [2] brought up
chained exceptions again in June 2003, prompting a long discussion.
Greg Ewing [3] identified the case of an exception occurring in a
'finally' block during unwinding triggered by an original exception,
as distinct from the case of an exception occurring in an 'except'
block that is handling the original exception.
Greg Ewing [4] and Guido van Rossum [5], and probably others, have
previously mentioned adding a traceback attribute to Exception
instances. This is noted in PEP 3000.
This PEP was motivated by yet another recent Python-Dev reposting
of the same ideas [6] [7].
Rationale
The Python-Dev discussions revealed interest in exception chaining
for two quite different purposes. To handle the unexpected raising
of a secondary exception, the exception must be retained implicitly.
To support intentional translation of an exception, there must be a
way to chain exceptions explicitly. This PEP addresses both.
Several attribute names for chained exceptions have been suggested
on Python-Dev [2], including 'cause', 'antecedent', 'reason',
'original', 'chain', 'chainedexc', 'exc_chain', 'excprev',
'previous', and 'precursor'. For an explicitly chained exception,
this PEP suggests '__cause__' because of its specific meaning. For
an implicitly chained exception, this PEP proposes the name
'__context__' because the intended meaning is more specific than
temporal precedence but less specific than causation: an exception
occurs in the context of handling another exception.
This PEP suggests names with leading and trailing double-underscores
for these three attributes because they are set by the Python VM.
Only in very special cases should they be set by normal assignment.
This PEP handles exceptions that occur during 'except' blocks and
'finally' blocks in the same way. Reading the traceback makes it
clear where the exceptions occurred, so additional mechanisms for
distinguishing the two cases would only add unnecessary complexity.
This PEP proposes that the outermost exception object (the one
exposed for matching by 'except' clauses) be the most recently
raised exception for compatibility with current behaviour.
This PEP proposes that tracebacks display the outermost exception
last, because this would be consistent with the chronological order
of tracebacks (from oldest to most recent frame) and because the
actual thrown exception is easier to find on the last line.
To keep things simpler, the C API calls for setting an exception
will not automatically set the exception's '__context__'. Guido
van Rossum has expressed concerns with making such changes [8].
As for other languages, Java and Ruby both discard the original
exception when another exception occurs in a 'catch'/'rescue' or
'finally'/'ensure' clause. Perl 5 lacks built-in structured
exception handling. For Perl 6, RFC number 88 [9] proposes an exception
mechanism that implicitly retains chained exceptions in an array
named @@. In that RFC, the most recently raised exception is
exposed for matching, as in this PEP; also, arbitrary expressions
(possibly involving @@) can be evaluated for exception matching.
Exceptions in C# contain a read-only 'InnerException' property that
may point to another exception. Its documentation [10] says that
"When an exception X is thrown as a direct result of a previous
exception Y, the InnerException property of X should contain a
reference to Y." This property is not set by the VM automatically;
rather, all exception constructors take an optional 'innerException'
argument to set it explicitly. The '__cause__' attribute fulfills
the same purpose as InnerException, but this PEP proposes a new form
of 'raise' rather than extending the constructors of all exceptions.
C# also provides a GetBaseException method that jumps directly to
the end of the InnerException chain; this PEP proposes no analog.
The reason all three of these attributes are presented together in
one proposal is that the '__traceback__' attribute provides
convenient access to the traceback on chained exceptions.
Implicit Exception Chaining
Here is an example to illustrate the '__context__' attribute.
    def compute(a, b):
        try:
            a/b
        except Exception, exc:
            log(exc)

    def log(exc):
        file = open('logfile.txt')   # oops, forgot the 'w'
        print >>file, exc
        file.close()
Calling compute(0, 0) causes a ZeroDivisionError. The compute()
function catches this exception and calls log(exc), but the log()
function also raises an exception when it tries to write to a
file that wasn't opened for writing.
In today's Python, the caller of compute() gets thrown an IOError.
The ZeroDivisionError is lost. With the proposed change, the
instance of IOError has an additional '__context__' attribute that
retains the ZeroDivisionError.
The following more elaborate example demonstrates the handling of a
mixture of 'finally' and 'except' clauses:
    def main(filename):
        file = open(filename)        # oops, forgot the 'w'
        try:
            try:
                compute()
            except Exception, exc:
                log(file, exc)
        finally:
            file.clos()              # oops, misspelled 'close'

    def compute():
        1/0

    def log(file, exc):
        try:
            print >>file, exc        # oops, file is not writable
        except:
            display(exc)

    def display(exc):
        print ex                     # oops, misspelled 'exc'
Calling main() with the name of an existing file will trigger four
exceptions. The ultimate result will be an AttributeError due to
the misspelling of 'clos', whose __context__ points to a NameError
due to the misspelling of 'ex', whose __context__ points to an
IOError due to the file being read-only, whose __context__ points to
a ZeroDivisionError, whose __context__ attribute is None.
The proposed semantics are as follows:
1. Each thread has an exception context initially set to None.
2. Whenever an exception is raised, if the exception instance does
not already have a '__context__' attribute, the interpreter sets
it equal to the thread's exception context.
3. Immediately after an exception is raised, the thread's exception
context is set to the exception.
4. Whenever the interpreter exits an 'except' block by reaching the
end or executing a 'return', 'yield', 'continue', or 'break'
statement, the thread's exception context is set to None.
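These four rules were eventually implemented (this PEP was renumbered to PEP 3134 and accepted); their effect can be sketched in modern Python 3 syntax, where the behaviour is observable directly:

```python
# Sketch of rules 1-3 above, in Python 3 syntax (where this PEP,
# renumbered to PEP 3134, was eventually implemented).
try:
    try:
        1 / 0                  # original exception; rule 3 makes it
                               # the thread's exception context
    except ZeroDivisionError:
        raise KeyError('log')  # rule 2: __context__ is set implicitly
except KeyError as exc:
    caught = exc

assert isinstance(caught.__context__, ZeroDivisionError)
assert caught.__context__.__context__ is None   # rule 1: context starts as None
```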
Explicit Exception Chaining
The '__cause__' attribute on exception objects is always initialized
to None. It is set by a new form of the 'raise' statement:
    raise EXCEPTION from CAUSE
which is equivalent to:
    exc = EXCEPTION
    exc.__cause__ = CAUSE
    raise exc
In the following example, a database provides implementations for a
few different kinds of storage, with file storage as one kind. The
database designer wants errors to propagate as DatabaseError objects
so that the client doesn't have to be aware of the storage-specific
details, but doesn't want to lose the underlying error information.
    class DatabaseError(StandardError):
        pass

    class FileDatabase(Database):
        def __init__(self, filename):
            try:
                self.file = open(filename)
            except IOError, exc:
                raise DatabaseError('failed to open') from exc
If the call to open() raises an exception, the problem will be
reported as a DatabaseError, with a __cause__ attribute that reveals
the IOError as the original cause.
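The example above uses the Python 2.5-era syntax proposed by this PEP; a runnable rendering in modern Python 3 syntax (StandardError and IOError were since folded into Exception and OSError; the path is illustrative) looks like this:

```python
# Python 3 rendering of the example above; the filename is illustrative.
class DatabaseError(Exception):
    pass

class FileDatabase:
    def __init__(self, filename):
        try:
            self.file = open(filename)
        except OSError as exc:
            # 'raise ... from ...' sets __cause__ explicitly
            raise DatabaseError('failed to open') from exc

try:
    FileDatabase('/no/such/file/anywhere')
except DatabaseError as exc:
    caught = exc

assert isinstance(caught.__cause__, OSError)   # the original error survives
```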
Traceback Attribute
The following example illustrates the '__traceback__' attribute.
    def do_logged(file, work):
        try:
            work()
        except Exception, exc:
            write_exception(file, exc)
            raise exc

    from traceback import format_tb

    def write_exception(file, exc):
        ...
        type = exc.__class__
        message = str(exc)
        lines = format_tb(exc.__traceback__)
        file.write(... type ... message ... lines ...)
        ...
In today's Python, the do_logged() function would have to extract
the traceback from sys.exc_traceback or sys.exc_info()[2] and pass
both the value and the traceback to write_exception(). With the
proposed change, write_exception() simply gets one argument and
obtains the exception using the '__traceback__' attribute.
The proposed semantics are as follows:
1. Whenever an exception is caught, if the exception instance does
not already have a '__traceback__' attribute, the interpreter
sets it to the newly caught traceback.
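In Python 3, where this proposal landed, the attribute is populated automatically, so the single-argument style of write_exception() works as described; a small sketch:

```python
import traceback

# Sketch: the traceback travels with the exception object itself,
# so one argument is enough to recover the failing frames.
def work():
    raise ValueError('boom')

try:
    work()
except ValueError as exc:
    lines = traceback.format_tb(exc.__traceback__)

# the failing frame is recoverable from the attribute alone
assert any('work' in line for line in lines)
```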
Enhanced Reporting
The default exception handler will be modified to report chained
exceptions. The chain of exceptions is traversed by following the
'__cause__' and '__context__' attributes, with '__cause__' taking
priority. In keeping with the chronological order of tracebacks,
the most recently raised exception is displayed last; that is, the
display begins with the description of the innermost exception and
backs up the chain to the outermost exception. The tracebacks are
formatted as usual, with one of the lines:
The above exception was the direct cause of the following exception:
or
During handling of the above exception, another exception occurred:
between tracebacks, depending whether they are linked by __cause__
or __context__ respectively. Here is a sketch of the procedure:
    def print_chain(exc):
        if exc.__cause__:
            print_chain(exc.__cause__)
            print '\nThe above exception was the direct cause...'
        elif exc.__context__:
            print_chain(exc.__context__)
            print '\nDuring handling of the above exception, ...'
        print_exc(exc)
In the 'traceback' module, the format_exception, print_exception,
print_exc, and print_last functions will be updated to accept an
optional 'chain' argument, True by default. When this argument is
True, these functions will format or display the entire chain of
exceptions as just described. When it is False, these functions
will format or display only the outermost exception.
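The 'chain' argument exists in Python 3's traceback module as described; a sketch of the two modes:

```python
import traceback

# Sketch of the 'chain' argument as implemented in Python 3's
# traceback module.
try:
    try:
        1 / 0
    except ZeroDivisionError:
        raise KeyError('secondary')
except KeyError as exc:
    full = ''.join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    flat = ''.join(traceback.format_exception(type(exc), exc, exc.__traceback__,
                                              chain=False))

assert 'During handling of the above exception' in full
assert 'During handling of the above exception' not in flat
assert 'ZeroDivisionError' not in flat   # only the outermost exception
```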
The 'cgitb' module should also be updated to display the entire
chain of exceptions.
C API
The PyErr_Set* calls for setting exceptions will not set the
'__context__' attribute on exceptions. PyErr_NormalizeException
will always set the 'traceback' attribute to its 'tb' argument and
the '__context__' and '__cause__' attributes to None.
A new API function, PyErr_SetContext(context), will help C
programmers provide chained exception information. This function
will first normalize the current exception so it is an instance,
then set its '__context__' attribute. A similar API function,
PyErr_SetCause(cause), will set the '__cause__' attribute.
Compatibility
Chained exceptions expose the type of the most recent exception, so
they will still match the same 'except' clauses as they do now.
The proposed changes should not break any code unless it sets or
uses attributes named '__context__', '__cause__', or '__traceback__'
on exception instances. As of 2005-05-12, the Python standard
library contains no mention of such attributes.
Open Issue: Extra Information
Walter Dörwald [11] expressed a desire to attach extra information
to an exception during its upward propagation without changing its
type. This could be a useful feature, but it is not addressed by
this PEP. It could conceivably be addressed by a separate PEP
establishing conventions for other informational attributes on
exceptions.
Open Issue: Suppressing Context
As written, this PEP makes it impossible to suppress '__context__',
since setting exc.__context__ to None in an 'except' or 'finally'
clause will only result in it being set again when exc is raised.
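This issue was later resolved, after this PEP's acceptance as PEP 3134, by PEP 409 and PEP 415, which added 'raise ... from None'; a sketch of the resulting behaviour:

```python
# 'raise ... from None' (PEP 409/415) sets __suppress_context__ rather
# than clearing __context__: the default handler hides the context,
# but debuggers can still reach it.
try:
    try:
        1 / 0
    except ZeroDivisionError:
        raise KeyError('clean') from None
except KeyError as exc:
    caught = exc

assert caught.__suppress_context__
assert isinstance(caught.__context__, ZeroDivisionError)  # still retained
```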
Open Issue: Limiting Exception Types
To improve encapsulation, library implementors may want to wrap all
implementation-level exceptions with an application-level exception.
One could try to wrap exceptions by writing this:
    try:
        ... implementation may raise an exception ...
    except:
        import sys
        raise ApplicationError from sys.exc_value

or this:

    try:
        ... implementation may raise an exception ...
    except Exception, exc:
        raise ApplicationError from exc
but both are somewhat flawed. It would be nice to be able to name
the current exception in a catch-all 'except' clause, but that isn't
addressed here. Such a feature would allow something like this:
    try:
        ... implementation may raise an exception ...
    except *, exc:
        raise ApplicationError from exc
Open Issue: yield
The exception context is lost when a 'yield' statement is executed;
resuming the frame after the 'yield' does not restore the context.
Addressing this problem is out of the scope of this PEP; it is not a
new problem, as demonstrated by the following example:
    >>> def gen():
    ...     try:
    ...         1/0
    ...     except:
    ...         yield 3
    ...         raise
    ...
    >>> g = gen()
    >>> g.next()
    3
    >>> g.next()
    TypeError: exceptions must be classes, instances, or strings
    (deprecated), not NoneType
Open Issue: Garbage Collection
The strongest objection to this proposal has been that it creates
cycles between exceptions and stack frames [12]. Collection of
cyclic garbage (and therefore resource release) can be greatly
delayed.
    >>> try:
    ...     1/0
    ... except Exception, err:
    ...     pass
will introduce a cycle from err -> traceback -> stack frame -> err,
keeping all locals in the same scope alive until the next GC happens.
Today, these locals would go out of scope. There is lots of code
which assumes that "local" resources -- particularly open files -- will
be closed quickly. If closure has to wait for the next GC, a program
(which runs fine today) may run out of file handles.
Making the __traceback__ attribute a weak reference would avoid the
problems with cyclic garbage. Unfortunately, it would make saving
the Exception for later (as unittest does) more awkward, and it would
not allow as much cleanup of the sys module.
A possible alternate solution, suggested by Adam Olsen, would be to
instead turn the reference from the stack frame to the 'err' variable
into a weak reference when the variable goes out of scope [13].
Possible Future Compatible Changes
These changes are consistent with the appearance of exceptions as
a single object rather than a triple at the interpreter level.
- If PEP 340 or PEP 343 is accepted, replace the three (type, value,
traceback) arguments to __exit__ with a single exception argument.
- Deprecate sys.exc_type, sys.exc_value, sys.exc_traceback, and
sys.exc_info() in favour of a single member, sys.exception.
- Deprecate sys.last_type, sys.last_value, and sys.last_traceback
in favour of a single member, sys.last_exception.
- Deprecate the three-argument form of the 'raise' statement in
favour of the one-argument form.
- Upgrade cgitb.html() to accept a single value as its first
argument as an alternative to a (type, value, traceback) tuple.
Possible Future Incompatible Changes
These changes might be worth considering for Python 3000.
- Remove sys.exc_type, sys.exc_value, sys.exc_traceback, and
sys.exc_info().
- Remove sys.last_type, sys.last_value, and sys.last_traceback.
- Replace the three-argument sys.excepthook with a one-argument
API, and change the 'cgitb' module to match.
- Remove the three-argument form of the 'raise' statement.
- Upgrade traceback.print_exception to accept an 'exception'
argument instead of the type, value, and traceback arguments.
Acknowledgements
Brett Cannon, Greg Ewing, Guido van Rossum, Jeremy Hylton, Phillip
J. Eby, Raymond Hettinger, Walter Dörwald, and others.
References
[1] Raymond Hettinger, "Idea for avoiding exception masking"
http://mail.python.org/pipermail/python-dev/2003-January/032492.html
[2] Brett Cannon explains chained exceptions
http://mail.python.org/pipermail/python-dev/2003-June/036063.html
[3] Greg Ewing points out masking caused by exceptions during finally
http://mail.python.org/pipermail/python-dev/2003-June/036290.html
[4] Greg Ewing suggests storing the traceback in the exception object
http://mail.python.org/pipermail/python-dev/2003-June/036092.html
[5] Guido van Rossum mentions exceptions having a traceback attribute
http://mail.python.org/pipermail/python-dev/2005-April/053060.html
[6] Ka-Ping Yee, "Tidier Exceptions"
http://mail.python.org/pipermail/python-dev/2005-May/053671.html
[7] Ka-Ping Yee, "Chained Exceptions"
http://mail.python.org/pipermail/python-dev/2005-May/053672.html
[8] Guido van Rossum discusses automatic chaining in PyErr_Set*
http://mail.python.org/pipermail/python-dev/2003-June/036180.html
[9] Tony Olensky, "Omnibus Structured Exception/Error Handling Mechanism"
http://dev.perl.org/perl6/rfc/88.html
[10] MSDN .NET Framework Library, "Exception.InnerException Property"
http://msdn.microsoft.com/library/en-us/cpref/html/frlrfsystemexceptionclassinnerexceptiontopic.asp
[11] Walter Dörwald suggests wrapping exceptions to add details
http://mail.python.org/pipermail/python-dev/2003-June/036148.html
[12] Guido van Rossum restates the objection to cyclic trash
http://mail.python.org/pipermail/python-3000/2007-January/005322.html
[13] Adam Olsen suggests using a weakref from stack frame to exception
http://mail.python.org/pipermail/python-3000/2007-January/005363.html
Copyright
This document has been placed in the public domain.
pep-0345 Metadata for Python Software Packages 1.2
| PEP: | 345 |
|---|---|
| Title: | Metadata for Python Software Packages 1.2 |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Richard Jones <richard at python.org> |
| Discussions-To: | Distutils SIG |
| Status: | Accepted |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 28-Apr-2005 |
| Python-Version: | 2.5 |
| Post-History: |
Contents
- Abstract
- Fields
- Metadata-Version
- Name
- Version
- Platform (multiple use)
- Supported-Platform (multiple use)
- Summary
- Description (optional)
- Keywords (optional)
- Home-page (optional)
- Download-URL
- Author (optional)
- Author-email (optional)
- Maintainer (optional)
- Maintainer-email (optional)
- License (optional)
- Classifier (multiple use)
- Requires-Dist (multiple use)
- Provides-Dist (multiple use)
- Obsoletes-Dist (multiple use)
- Requires-Python
- Requires-External (multiple use)
- Project-URL (multiple-use)
- Version Specifiers
- Environment markers
- Summary of Differences From PEP 314
- References
- Copyright
- Acknowledgements
Abstract
This PEP describes a mechanism for adding metadata to Python distributions. It includes specifics of the field names, and their semantics and usage.
This document specifies version 1.2 of the metadata format. Version 1.0 is specified in PEP 241. Version 1.1 is specified in PEP 314.
Version 1.2 of the metadata format adds a number of optional fields designed to make third-party packaging of Python Software easier. These fields are "Requires-Python", "Requires-External", "Requires-Dist", "Provides-Dist", and "Obsoletes-Dist". This version also changes the "Platform" field. Three new fields were also added: "Maintainer", "Maintainer-email" and "Project-URL".
Last, this new version also adds environment markers.
Fields
This section specifies the names and semantics of each of the supported metadata fields.
Fields marked with "(Multiple use)" may be specified multiple times in a single PKG-INFO file. Other fields may only occur once in a PKG-INFO file. Fields marked with "(optional)" are not required to appear in a valid PKG-INFO file; all other fields must be present.
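Because the metadata follows RFC 822 conventions, a PKG-INFO file can be read with the standard library's email parser; a minimal sketch (the field values below are illustrative, not from a real project):

```python
from email.parser import HeaderParser

# Illustrative PKG-INFO content, not from a real project.
PKG_INFO = """\
Metadata-Version: 1.2
Name: BeagleVote
Version: 1.0a2
Requires-Dist: pkginfo
Requires-Dist: zope.interface (>3.5.0)
"""

msg = HeaderParser().parsestr(PKG_INFO)
assert msg['Metadata-Version'] == '1.2'
# "(multiple use)" fields come back as a list via get_all():
assert msg.get_all('Requires-Dist') == ['pkginfo', 'zope.interface (>3.5.0)']
```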
Metadata-Version
Version of the file format; "1.2" is the only legal value.
Example:
Metadata-Version: 1.2
Version
A string containing the distribution's version number. This field must be in the format specified in PEP 386.
Example:
Version: 1.0a2
Platform (multiple use)
A Platform specification describing an operating system supported by the distribution which is not listed in the "Operating System" Trove classifiers. See "Classifier" below.
Examples:
Platform: ObscureUnix
Platform: RareDOS
Supported-Platform (multiple use)
Binary distributions containing a PKG-INFO file will use the Supported-Platform field in their metadata to specify the OS and CPU for which the binary distribution was compiled. The semantics of the Supported-Platform field are not specified in this PEP.
Example:
Supported-Platform: RedHat 7.2
Supported-Platform: i386-win32-2791
Summary
A one-line summary of what the distribution does.
Example:
Summary: A module for collecting votes from beagles.
Description (optional)
A longer description of the distribution that can run to several paragraphs. Software that deals with metadata should not assume any maximum size for this field, though people shouldn't include their instruction manual as the description.
The contents of this field can be written using reStructuredText markup [1]. For programs that work with the metadata, supporting markup is optional; programs can also display the contents of the field as-is. This means that authors should be conservative in the markup they use.
To support empty lines and lines with indentation with respect to the RFC 822 format, any CRLF character has to be suffixed by 7 spaces followed by a pipe ("|") char. As a result, the Description field is encoded into a folded field that can be interpreted by RFC822 parser [2].
Example:
Description: This project provides powerful math functions
|For example, you can use `sum()` to sum numbers:
|
|Example::
|
| >>> sum(1, 2)
| 3
|
This encoding implies that any occurrences of a CRLF followed by 7 spaces and a pipe char have to be replaced by a single CRLF when the field is unfolded using an RFC 822 reader.
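The folding rule can be sketched as a pair of helper functions (using '\n' line endings for brevity where the spec speaks of CRLF):

```python
# Sketch of the Description folding rule: each newline inside the field
# body is suffixed by seven spaces and a pipe when folded, and the
# reverse substitution unfolds it.  '\n' stands in for CRLF here.
def fold_description(body):
    return body.replace('\n', '\n       |')

def unfold_description(folded):
    return folded.replace('\n       |', '\n')

body = ('This project provides powerful math functions\n'
        'For example::\n'
        '\n'
        '    >>> sum(1, 2)\n'
        '    3')
assert unfold_description(fold_description(body)) == body
assert '\n       |For example' in fold_description(body)
```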
Keywords (optional)
A list of additional keywords to be used to assist searching for the distribution in a larger catalog.
Example:
Keywords: dog puppy voting election
Home-page (optional)
A string containing the URL for the distribution's home page.
Example:
Home-page: http://www.example.com/~cschultz/bvote/
Download-URL
A string containing the URL from which this version of the distribution can be downloaded. (This means that the URL can't be something like ".../BeagleVote-latest.tgz", but instead must be ".../BeagleVote-0.45.tgz".)
Author (optional)
A string containing the author's name at a minimum; additional contact information may be provided.
Example:
Author: C. Schultz, Universal Features Syndicate,
Los Angeles, CA <cschultz@peanuts.example.com>
Author-email (optional)
A string containing the author's e-mail address. It can contain a name and e-mail address in the legal forms for a RFC-822 From: header.
Example:
Author-email: "C. Schultz" <cschultz@example.com>
Maintainer (optional)
A string containing the maintainer's name at a minimum; additional contact information may be provided.
Note that this field is intended for use when a project is being maintained by someone other than the original author: it should be omitted if it is identical to Author.
Example:
Maintainer: C. Schultz, Universal Features Syndicate,
Los Angeles, CA <cschultz@peanuts.example.com>
Maintainer-email (optional)
A string containing the maintainer's e-mail address. It can contain a name and e-mail address in the legal forms for a RFC-822 From: header.
Note that this field is intended for use when a project is being maintained by someone other than the original author: it should be omitted if it is identical to Author-email.
Example:
Maintainer-email: "C. Schultz" <cschultz@example.com>
License (optional)
Text indicating the license covering the distribution where the license is not a selection from the "License" Trove classifiers. See "Classifier" below. This field may also be used to specify a particular version of a license which is named via the Classifier field, or to indicate a variation or exception to such a license.
Examples:
License: This software may only be obtained by sending the
author a postcard, and then the user promises not
to redistribute it.
License: GPL version 3, excluding DRM provisions
Classifier (multiple use)
Each entry is a string giving a single classification value for the distribution. Classifiers are described in PEP 301 [3].
Examples:
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console (Text Based)
Requires-Dist (multiple use)
Each entry contains a string naming some other distutils project required by this distribution.
The format of a requirement string is identical to that of a distutils project name (e.g., as found in the Name: field), optionally followed by a version declaration within parentheses.
The distutils project names should correspond to names as found on the Python Package Index [4].
Version declarations must follow the rules described in Version Specifiers.
Examples:
Requires-Dist: pkginfo
Requires-Dist: PasteDeploy
Requires-Dist: zope.interface (>3.5.0)
Provides-Dist (multiple use)
Each entry contains a string naming a Distutils project which is contained within this distribution. This field must include the project identified in the Name field, followed by the version: Name (Version).
A distribution may provide additional names, e.g. to indicate that multiple projects have been bundled together. For instance, source distributions of the ZODB project have historically included the transaction project, which is now available as a separate distribution. Installing such a source distribution satisfies requirements for both ZODB and transaction.
A distribution may also provide a "virtual" project name, which does not correspond to any separately-distributed project: such a name might be used to indicate an abstract capability which could be supplied by one of multiple projects. E.g., multiple projects might supply RDBMS bindings for use by a given ORM: each project might declare that it provides ORM-bindings, allowing other projects to depend only on having at most one of them installed.
A version declaration may be supplied and must follow the rules described in Version Specifiers. The distribution's version number will be implied if none is specified.
Examples:
Provides-Dist: OtherProject
Provides-Dist: AnotherProject (3.4)
Provides-Dist: virtual_package
Obsoletes-Dist (multiple use)
Each entry contains a string describing a distutils project's distribution which this distribution renders obsolete, meaning that the two projects should not be installed at the same time.
Version declarations can be supplied. Version numbers must be in the format specified in Version Specifiers.
The most common use of this field will be in case a project name changes, e.g. Gorgon 2.3 gets subsumed into Torqued Python 1.0. When you install Torqued Python, the Gorgon distribution should be removed.
Examples:
Obsoletes-Dist: Gorgon
Obsoletes-Dist: OtherProject (<3.0)
Requires-Python
This field specifies the Python version(s) that the distribution is guaranteed to be compatible with.
Version numbers must be in the format specified in Version Specifiers.
Examples:
Requires-Python: 2.5
Requires-Python: >2.1
Requires-Python: >=2.3.4
Requires-Python: >=2.5,<2.7
Requires-External (multiple use)
Each entry contains a string describing some dependency in the system that the distribution is to be used with. This field is intended to serve as a hint to downstream project maintainers, and has no semantics which are meaningful to the distutils distribution.
The format of a requirement string is a name of an external dependency, optionally followed by a version declaration within parentheses.
Because they refer to non-Python software releases, version numbers for this field are not required to conform to the format specified in PEP 386: they should correspond to the version scheme used by the external dependency.
Notice that there is no particular rule for the strings to be used.
Examples:
Requires-External: C
Requires-External: libpng (>=1.5)
Project-URL (multiple-use)
A string containing a browsable URL for the project and a label for it, separated by a comma.
Example:
Bug Tracker, http://bitbucket.org/tarek/distribute/issues/
The label is free text limited to 32 characters.
Version Specifiers
Version specifiers are a series of conditional operators and version numbers, separated by commas. Conditional operators must be one of "<", ">", "<=", ">=", "==" and "!=".
Any number of conditional operators can be specified, e.g. the string ">1.0, !=1.3.4, <2.0" is a legal version declaration. The comma (",") is equivalent to the and operator.
Each version number must be in the format specified in PEP 386.
When a version is provided, it always includes all versions that start with the same value. For example, the "2.5" version of Python will include versions like "2.5.2" or "2.5.3". Pre- and post-releases in that case are excluded. So in our example, versions like "2.5a1" are not included when "2.5" is used. If the first version of the range is required, it has to be given explicitly; in our example, it will be "2.5.0".
Notice that some projects might omit the ".0" suffix for the first release of the "2.5.x" series:
- 2.5
- 2.5.1
- 2.5.2
- etc.
In that case, "2.5.0" will have to be used explicitly to distinguish that first release from the "2.5" notation that represents the full range. It is a recommended practice to use version schemes of the same length for a series to avoid this problem entirely.
Some Examples:
- Requires-Dist: zope.interface (3.1): any version that starts with 3.1, excluding post or pre-releases.
- Requires-Dist: zope.interface (3.1.0): any version that starts with 3.1.0, excluding post or pre-releases. Since that particular project doesn't use more than 3 digits, it also means "only the 3.1.0 release".
- Requires-Python: 3: Any Python 3 version, no matter which one, excluding post- or pre-releases.
- Requires-Python: >=2.6,<3: Any version of Python 2.6 or 2.7, including post releases of 2.6, pre and post releases of 2.7. It excludes pre releases of Python 3.
- Requires-Python: 2.6.2: Equivalent to ">=2.6.2,<2.6.3". So this includes only Python 2.6.2. Of course, if Python were numbered with 4 digits, it would have included all versions of the 2.6.2 series.
- Requires-Python: 2.5.0: Equivalent to ">=2.5.0,<2.5.1".
- Requires-Dist: zope.interface (3.1,!=3.1.3): any version that starts with 3.1, excluding post or pre-releases of 3.1 and excluding any version that starts with "3.1.3". For this particular project, this means: "any version of the 3.1 series but not 3.1.3". This is equivalent to: ">=3.1,!=3.1.3,<3.2".
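The rules above can be sketched with a naive checker (illustrative only: it handles plain dotted versions, ignores PEP 386 pre/post releases, and treats "!=" as exact inequality rather than prefix exclusion):

```python
def parse(v):
    # naive dotted-version parse, e.g. '3.1.2' -> (3, 1, 2)
    return tuple(int(p) for p in v.split('.'))

def matches(version, spec):
    ops = {'<': lambda a, b: a < b, '<=': lambda a, b: a <= b,
           '>': lambda a, b: a > b, '>=': lambda a, b: a >= b,
           '==': lambda a, b: a == b, '!=': lambda a, b: a != b}
    for clause in spec.split(','):          # the comma means "and"
        clause = clause.strip()
        for op in ('<=', '>=', '==', '!=', '<', '>'):
            if clause.startswith(op):
                if not ops[op](parse(version), parse(clause[len(op):].strip())):
                    return False
                break
        else:
            # a bare version means "starts with": "3.1" matches any 3.1.x
            wanted = parse(clause)
            if parse(version)[:len(wanted)] != wanted:
                return False
    return True

assert matches('1.5', '>1.0, !=1.3.4, <2.0')
assert not matches('1.3.4', '>1.0, !=1.3.4, <2.0')
assert matches('3.1.2', '3.1,!=3.1.3')
```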
Environment markers
An environment marker is a marker that can be added at the end of a field after a semi-colon (";"), to add a condition about the execution environment.
Here are some example of fields using such markers:
Requires-Dist: pywin32 (>1.0); sys.platform == 'win32'
Obsoletes-Dist: pywin31; sys.platform == 'win32'
Requires-Dist: foo (1,!=1.3); platform.machine == 'i386'
Requires-Dist: bar; python_version == '2.4' or python_version == '2.5'
Requires-External: libxslt; 'linux' in sys.platform
The micro-language behind this is the simplest possible: it compares only strings, with the == and in operators (and their opposites), and with the ability to combine expressions. It is also easy for non-Pythoneers to understand.
The pseudo-grammar is
EXPR [in|==|!=|not in] EXPR [or|and] ...
where EXPR belongs to any of those:
- python_version = '%s.%s' % (sys.version_info[0], sys.version_info[1])
- python_full_version = sys.version.split()[0]
- os.name = os.name
- sys.platform = sys.platform
- platform.version = platform.version()
- platform.machine = platform.machine()
- platform.python_implementation = platform.python_implementation()
- a free string, like '2.4', or 'win32'
Notice that in is restricted to strings, meaning that it is not possible to use other sequences like tuples or lists on the right side.
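The marker semantics can be sketched by binding exactly the permitted names and evaluating the expression (eval() here is a demonstration shortcut, not something a production parser should use):

```python
import os
import platform
import sys
from types import SimpleNamespace

def marker_environment():
    # Exactly the names listed above; each dotted name resolves to a string.
    return {
        'python_version': '%s.%s' % sys.version_info[:2],
        'python_full_version': sys.version.split()[0],
        'os': SimpleNamespace(name=os.name),
        'sys': SimpleNamespace(platform=sys.platform),
        'platform': SimpleNamespace(
            version=platform.version(),
            machine=platform.machine(),
            python_implementation=platform.python_implementation()),
    }

def evaluate_marker(marker):
    # Demonstration shortcut: a real implementation parses the grammar.
    return bool(eval(marker, {'__builtins__': {}}, marker_environment()))

assert evaluate_marker('os.name == %r' % os.name)
assert not evaluate_marker("python_version == '0.0'")
assert evaluate_marker("'a' in 'abc' or python_version == '0.0'")
```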
The fields that benefit from this marker are:
- Requires-Python
- Requires-External
- Requires-Dist
- Provides-Dist
- Obsoletes-Dist
- Classifier
Summary of Differences From PEP 314
- Metadata-Version is now 1.2.
- Added the environment markers.
- Changed fields:
- Platform
- Author
- Added fields:
- Maintainer
- Maintainer-email
- Requires-Python
- Requires-External
- Requires-Dist
- Provides-Dist
- Obsoletes-Dist
- Project-URL
- Deprecated fields:
- Requires (in favor of Requires-Dist)
- Provides (in favor of Provides-Dist)
- Obsoletes (in favor of Obsoletes-Dist)
References
This document specifies version 1.2 of the metadata format. Version 1.0 is specified in PEP 241. Version 1.1 is specified in PEP 314.
| [1] | reStructuredText markup: http://docutils.sourceforge.net/ |
| [2] | RFC 822 Long Header Fields: http://www.freesoft.org/CIE/RFC/822/7.htm |
| [3] | PEP 301, Package Index and Metadata for Distutils: http://www.python.org/dev/peps/pep-0301/ |
| [4] | http://pypi.python.org/pypi/ |
Copyright
This document has been placed in the public domain.
Acknowledgements
Fred Drake, Anthony Baxter and Matthias Klose have all contributed to the ideas presented in this PEP.
Tres Seaver, Jim Fulton, Marc-André Lemburg, Martin von Löwis, Tarek Ziadé, David Lyon and other people at the Distutils-SIG have contributed to the new updated version.
pep-0346 User Defined ("with") Statements
| PEP: | 346 |
|---|---|
| Title: | User Defined ("with") Statements |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Nick Coghlan <ncoghlan at gmail.com> |
| Status: | Withdrawn |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 6-May-2005 |
| Python-Version: | 2.5 |
| Post-History: |
Contents
- Abstract
- Author's Note
- Introduction
- Relationship with other PEPs
- User defined statements
- Generators
- Default value for yield
- Template generator decorator: statement_template
- Template generator wrapper: __enter__() method
- Template generator wrapper: __exit__() method
- Injecting exceptions into generators
- Generator finalisation
- Generator finalisation: TerminateIteration exception
- Generator finalisation: __del__() method
- Deterministic generator finalisation
- Generators as user defined statement templates
- Examples
- Open Issues
- Rejected Options
- Having the basic construct be a looping construct
- Allowing statement templates to suppress exceptions
- Differentiating between non-exceptional exits
- Not injecting raised exceptions into generators
- Making all generators statement templates
- Using do as the keyword
- Not having a keyword
- Enhancing try statements
- Having the template protocol directly reflect try statements
- Iterator finalisation (WITHDRAWN)
- Acknowledgements
- References
- Copyright
Abstract
This PEP is a combination of PEP 310's "Reliable Acquisition/Release Pairs" with the "Anonymous Block Statements" of Guido's PEP 340. This PEP aims to take the good parts of PEP 340, blend them with parts of PEP 310 and rearrange the lot into an elegant whole. It borrows from various other PEPs in order to paint a complete picture, and is intended to stand on its own.
Author's Note
During the discussion of PEP 340, I maintained drafts of this PEP as PEP 3XX on my own website (since I didn't have CVS access to update a submitted PEP fast enough to track the activity on python-dev).
Since the first draft of this PEP, Guido wrote PEP 343 as a simplified version of PEP 340. PEP 343 (at the time of writing) uses the exact same semantics for the new statements as this PEP, but uses a slightly different mechanism to allow generators to be used to write statement templates. However, Guido has indicated that he intends to accept a new PEP being written by Raymond Hettinger that will integrate PEP 288 and PEP 325, and will permit a generator decorator like the one described in this PEP to be used to write statement templates for PEP 343. The other difference was the choice of keyword ('with' versus 'do') and Guido has stated he will organise a vote on that in the context of PEP 343.
Accordingly, the version of this PEP submitted for archiving on python.org is to be WITHDRAWN immediately after submission. PEP 343 and the combined generator enhancement PEP will cover the important ideas.
Introduction
This PEP proposes that Python's ability to reliably manage resources be enhanced by the introduction of a new with statement that allows factoring out of arbitrary try/finally and some try/except/else boilerplate. The new construct is called a 'user defined statement', and the associated class definitions are called 'statement templates'.
The above is the main point of the PEP. However, if that was all it said, then PEP 310 would be sufficient and this PEP would be essentially redundant. Instead, this PEP recommends additional enhancements that make it natural to write these statement templates using appropriately decorated generators. A side effect of those enhancements is that it becomes important to appropriately deal with the management of resources inside generators.
This is quite similar to PEP 343, but the exceptions that occur are re-raised inside the generator's frame, and the issue of generator finalisation needs to be addressed as a result. The template generator decorator suggested by this PEP also creates reusable templates, rather than the single use templates of PEP 340.
In comparison to PEP 340, this PEP eliminates the ability to suppress exceptions, and makes the user defined statement a non-looping construct. The other main difference is the use of a decorator to turn generators into statement templates, and the incorporation of ideas for addressing iterator finalisation.
If all that seems like an ambitious operation... well, Guido was the one to set the bar that high when he wrote PEP 340 :)
Relationship with other PEPs
This PEP competes directly with PEP 310 [1], PEP 340 [2] and PEP 343 [3], as those PEPs all describe alternative mechanisms for handling deterministic resource management.
It does not compete with PEP 342 [4] which splits off PEP 340's enhancements related to passing data into iterators. The associated changes to the for loop semantics would be combined with the iterator finalisation changes suggested in this PEP. User defined statements would not be affected.
Neither does this PEP compete with the generator enhancements described in PEP 288 [5]. While this PEP proposes the ability to inject exceptions into generator frames, it is an internal implementation detail, and does not require making that ability publicly available to Python code. PEP 288 is, in part, about making that implementation detail easily accessible.
This PEP would, however, make the generator resource release support described in PEP 325 [6] redundant - iterators which require finalisation should provide an appropriate implementation of the statement template protocol.
User defined statements
To steal the motivating example from PEP 310, correct handling of a synchronisation lock currently looks like this:
the_lock.acquire()
try:
    # Code here executes with the lock held
finally:
    the_lock.release()
Like PEP 310, this PEP proposes that such code be able to be written as:
with the_lock:
    # Code here executes with the lock held
These user defined statements are primarily designed to allow easy factoring of try blocks that are not easily converted to functions. This is most commonly the case when the exception handling pattern is consistent, but the body of the try block changes. With a user-defined statement, it is straightforward to factor out the exception handling into a statement template, with the body of the try clause provided inline in the user code.
The term 'user defined statement' reflects the fact that the meaning of a with statement is governed primarily by the statement template used, and programmers are free to create their own statement templates, just as they are free to create their own iterators for use in for loops.
Usage syntax for user defined statements
The proposed syntax is simple:
with EXPR1 [as VAR1]:
    BLOCK1
Semantics for user defined statements
the_stmt = EXPR1
stmt_enter = getattr(the_stmt, "__enter__", None)
stmt_exit = getattr(the_stmt, "__exit__", None)
if stmt_enter is None or stmt_exit is None:
    raise TypeError("Statement template required")
VAR1 = stmt_enter()  # Omit 'VAR1 =' if no 'as' clause
exc = (None, None, None)
try:
    try:
        BLOCK1
    except:
        exc = sys.exc_info()
        raise
finally:
    stmt_exit(*exc)
Other than VAR1, none of the local variables shown above will be visible to user code. Like the iteration variable in a for loop, VAR1 is visible in both BLOCK1 and code following the user defined statement.
Note that the statement template can only react to exceptions, it cannot suppress them. See Rejected Options for an explanation as to why.
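Python as actually released adopted PEP 343's variant of this protocol, so the expansion can be observed directly today; note one deliberate difference, namely that in released Python an __exit__() returning a true value suppresses the exception, which this PEP disallows. A minimal recording template (the class is mine, for illustration):

```python
class Recorder:
    """Statement template that records the protocol calls it receives."""
    def __init__(self):
        self.calls = []

    def __enter__(self):
        self.calls.append("enter")
        return self  # becomes VAR1 in 'with ... as VAR1'

    def __exit__(self, exc_type, value, traceback):
        self.calls.append(("exit", exc_type))
        # Returning None (falsey): the exception, if any, propagates.

r = Recorder()
try:
    with r:
        raise ValueError("boom")
except ValueError:
    pass  # __exit__ saw the exception but could not suppress it

assert r.calls == ["enter", ("exit", ValueError)]
```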
Statement template protocol: __enter__
The __enter__() method takes no arguments, and if it raises an exception, BLOCK1 is never executed. If this happens, the __exit__() method is not called. The value returned by this method is assigned to VAR1 if the as clause is used. Objects with no other value to return should generally return self rather than None, to permit in-place creation in the with statement.
Statement templates should use this method to set up the conditions that are to exist during execution of the statement (e.g. acquisition of a synchronisation lock).
Statement templates which are not always usable (e.g. closed file objects) should raise a RuntimeError if an attempt is made to call __enter__() when the template is not in a valid state.
Statement template protocol: __exit__
The __exit__() method accepts three arguments which correspond to the three "arguments" to the raise statement: type, value, and traceback. All arguments are always supplied, and will be set to None if no exception occurred. This method will be called exactly once by the with statement machinery if the __enter__() method completes successfully.
Statement templates perform their exception handling in this method. If the first argument is None, it indicates non-exceptional completion of BLOCK1 - execution either reached the end of the block, or early completion was forced using a return, break or continue statement. Otherwise, the three arguments reflect the exception that terminated BLOCK1.
Any exceptions raised by the __exit__() method are propagated to the scope containing the with statement. If the user code in BLOCK1 also raised an exception, that exception would be lost, and replaced by the one raised by the __exit__() method.
Factoring out arbitrary exception handling
Consider the following exception handling arrangement:
SETUP_BLOCK
try:
    try:
        TRY_BLOCK
    except exc_type1, exc:
        EXCEPT_BLOCK1
    except exc_type2, exc:
        EXCEPT_BLOCK2
    except:
        EXCEPT_BLOCK3
    else:
        ELSE_BLOCK
finally:
    FINALLY_BLOCK
It can be roughly translated to a statement template as follows:
class my_template(object):

    def __init__(self, *args):
        # Any required arguments (e.g. a file name)
        # get stored in member variables
        # The various BLOCK's will need updating to reflect
        # that.

    def __enter__(self):
        SETUP_BLOCK

    def __exit__(self, exc_type, value, traceback):
        try:
            try:
                if exc_type is not None:
                    raise exc_type, value, traceback
            except exc_type1, exc:
                EXCEPT_BLOCK1
            except exc_type2, exc:
                EXCEPT_BLOCK2
            except:
                EXCEPT_BLOCK3
            else:
                ELSE_BLOCK
        finally:
            FINALLY_BLOCK
Which can then be used as:
with my_template(*args):
    TRY_BLOCK
However, there are two important semantic differences between this code and the original try statement.
Firstly, in the original try statement, if a break, return or continue statement is encountered in TRY_BLOCK, only FINALLY_BLOCK will be executed as the statement completes. With the statement template, ELSE_BLOCK will also execute, as these statements are treated like any other non-exceptional block termination. For use cases where it matters, this is likely to be a good thing (see transaction in the Examples), as this hole where neither the except nor the else clause gets executed is easy to forget when writing exception handlers.
Secondly, the statement template will not suppress any exceptions. If, for example, the original code suppressed the exc_type1 and exc_type2 exceptions, then this would still need to be done inline in the user code:
try:
    with my_template(*args):
        TRY_BLOCK
except (exc_type1, exc_type2):
    pass
However, even in these cases where the suppression of exceptions needs to be made explicit, the amount of boilerplate repeated at the calling site is significantly reduced (See Rejected Options for further discussion of this behaviour).
In general, not all of the clauses will be needed. For resource handling (like files or synchronisation locks), it is possible to simply execute the code that would have been part of FINALLY_BLOCK in the __exit__() method. This can be seen in the following implementation that makes synchronisation locks into statement templates as mentioned at the beginning of this section:
# New methods of synchronisation lock objects
def __enter__(self):
    self.acquire()
    return self

def __exit__(self, *exc_info):
    self.release()
Generators
With their ability to suspend execution, and return control to the calling frame, generators are natural candidates for writing statement templates. Adding user defined statements to the language does not require the generator changes described in this section, thus making this PEP an obvious candidate for a phased implementation (with statements in phase 1, generator integration in phase 2). The suggested generator updates allow arbitrary exception handling to be factored out like this:
@statement_template
def my_template(*arguments):
    SETUP_BLOCK
    try:
        try:
            yield
        except exc_type1, exc:
            EXCEPT_BLOCK1
        except exc_type2, exc:
            EXCEPT_BLOCK2
        except:
            EXCEPT_BLOCK3
        else:
            ELSE_BLOCK
    finally:
        FINALLY_BLOCK
Notice that, unlike the class based version, none of the blocks need to be modified, as shared values are local variables of the generator's internal frame, including the arguments passed in by the invoking code. The semantic differences noted earlier (all non-exceptional block termination triggers the else clause, and the template is unable to suppress exceptions) still apply.
Default value for yield
When creating a statement template with a generator, the yield statement will often be used solely to return control to the body of the user defined statement, rather than to return a useful value.
Accordingly, if this PEP is accepted, yield, like return, will supply a default value of None (i.e. yield and yield None will become equivalent statements).
This same change is being suggested in PEP 342. Obviously, it would only need to be implemented once if both PEPs were accepted :)
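This change did land in the language; in modern Python a bare yield behaves exactly as described:

```python
def g():
    yield  # equivalent to 'yield None'

assert next(g()) is None
```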
Template generator decorator: statement_template
As with PEP 343, a new decorator is suggested that wraps a generator in an object with the appropriate statement template semantics. Unlike PEP 343, the templates suggested here are reusable, as the generator is instantiated anew in each call to __enter__(). Additionally, any exceptions that occur in BLOCK1 are re-raised in the generator's internal frame:
class template_generator_wrapper(object):

    def __init__(self, func, func_args, func_kwds):
        self.func = func
        self.args = func_args
        self.kwds = func_kwds
        self.gen = None

    def __enter__(self):
        if self.gen is not None:
            raise RuntimeError("Enter called without exit!")
        self.gen = self.func(*self.args, **self.kwds)
        try:
            return self.gen.next()
        except StopIteration:
            raise RuntimeError("Generator didn't yield")

    def __exit__(self, *exc_info):
        if self.gen is None:
            raise RuntimeError("Exit called without enter!")
        try:
            try:
                if exc_info[0] is not None:
                    self.gen._inject_exception(*exc_info)
                else:
                    self.gen.next()
            except StopIteration:
                pass
            else:
                raise RuntimeError("Generator didn't stop")
        finally:
            self.gen = None

def statement_template(func):
    def factory(*args, **kwds):
        return template_generator_wrapper(func, args, kwds)
    return factory
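This design survives in today's standard library in a close relative: PEP 342 gave generators a public throw() method (the public form of _inject_exception used here), and contextlib.contextmanager plays the role of statement_template, re-raising exceptions from the with body at the point of the yield. A modern sketch of the transaction idea, with FakeDB as a stand-in of mine rather than anything from the PEP:

```python
from contextlib import contextmanager

@contextmanager
def transaction(db):
    # Exceptions raised in the with body are re-raised here, at the
    # point of the yield, via the generator's throw() method.
    try:
        yield
    except Exception:
        db.rollback()
        raise
    else:
        db.commit()

class FakeDB:
    """Recording stand-in for a database connection."""
    def __init__(self):
        self.log = []
    def commit(self):
        self.log.append("commit")
    def rollback(self):
        self.log.append("rollback")

db = FakeDB()
with transaction(db):
    pass                          # normal exit -> commit
try:
    with transaction(db):
        raise KeyError("oops")    # exceptional exit -> rollback
except KeyError:
    pass

assert db.log == ["commit", "rollback"]
```

One difference from this PEP's wrapper: each call to transaction(db) produces a single-use context manager, whereas the wrapper above re-instantiates the generator on every __enter__() and is therefore reusable.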
Template generator wrapper: __enter__() method
The template generator wrapper has an __enter__() method that creates a new instance of the contained generator, and then invokes next() once. It will raise a RuntimeError if the last generator instance has not been cleaned up, or if the generator terminates instead of yielding a value.
Template generator wrapper: __exit__() method
The template generator wrapper has an __exit__() method that simply invokes next() on the generator if no exception is passed in. If an exception is passed in, it is re-raised in the contained generator at the point of the last yield statement.
In either case, the generator wrapper will raise a RuntimeError if the internal frame does not terminate as a result of the operation. The __exit__() method will always clean up the reference to the used generator instance, permitting __enter__() to be called again.
A StopIteration raised by the body of the user defined statement may be inadvertently suppressed inside the __exit__() method, but this is unimportant, as the originally raised exception still propagates correctly.
Injecting exceptions into generators
To implement the __exit__() method of the template generator wrapper, it is necessary to inject exceptions into the internal frame of the generator. This is new implementation level behaviour that has no current Python equivalent.
The injection mechanism (referred to as _inject_exception in this PEP) raises an exception in the generator's frame with the specified type, value and traceback information. This means that the exception looks like the original if it is allowed to propagate.
For the purposes of this PEP, there is no need to make this capability available outside the Python implementation code.
Generator finalisation
To support resource management in template generators, this PEP will eliminate the restriction on yield statements inside the try block of a try/finally statement. Accordingly, generators which require the use of a file or some such object can ensure the object is managed correctly through the use of try/finally or with statements.
This restriction will likely need to be lifted globally - it would be difficult to restrict it so that it was only permitted inside generators used to define statement templates. Accordingly, this PEP includes suggestions designed to ensure generators which are not used as statement templates are still finalised appropriately.
Generator finalisation: TerminateIteration exception
A new exception is proposed:
class TerminateIteration(Exception): pass
The new exception is injected into a generator in order to request finalisation. It should not be suppressed by well-behaved code.
Generator finalisation: __del__() method
To ensure a generator is finalised eventually (within the limits of Python's garbage collection), generators will acquire a __del__() method with the following semantics:
def __del__(self):
    try:
        self._inject_exception(TerminateIteration, None, None)
    except TerminateIteration:
        pass
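This proposal was ultimately realised, with different spelling, by PEP 342: generators acquired a close() method that injects a GeneratorExit exception at the paused yield, so try/finally blocks inside a generator run during finalisation. A small sketch of the released behaviour:

```python
log = []

def reader():
    # Stands in for a generator holding a resource.
    try:
        yield 1
        yield 2
    finally:
        log.append("finalised")

g = reader()
assert next(g) == 1
g.close()  # injects GeneratorExit at the paused yield
assert log == ["finalised"]
```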
Deterministic generator finalisation
There is a simple way to provide deterministic finalisation of generators - give them appropriate __enter__() and __exit__() methods:
def __enter__(self):
    return self

def __exit__(self, *exc_info):
    try:
        self._inject_exception(TerminateIteration, None, None)
    except TerminateIteration:
        pass
Then any generator can be finalised promptly by wrapping the relevant for loop inside a with statement:
with all_lines(filenames) as lines:
    for line in lines:
        print line
(See the Examples for the definition of all_lines, and the reason it requires prompt finalisation)
Compare the above example to the usage of file objects:
with open(filename) as f:
    for line in f:
        print line
Generators as user defined statement templates
When used to implement a user defined statement, a generator should yield only once on a given control path. The result of that yield will then be provided as the result of the generator's __enter__() method. Having a single yield on each control path ensures that the internal frame will terminate when the generator's __exit__() method is called. Multiple yield statements on a single control path will result in a RuntimeError being raised by the __exit__() method when the internal frame fails to terminate correctly. Such an error indicates a bug in the statement template.
To respond to exceptions, or to clean up resources, it is sufficient to wrap the yield statement in an appropriately constructed try statement. If execution resumes after the yield without an exception, the generator knows that the body of the with statement completed without incident.
Examples
A template for ensuring that a lock, acquired at the start of a block, is released when the block is left:
# New methods on synchronisation locks
def __enter__(self):
    self.acquire()
    return self

def __exit__(self, *exc_info):
    self.release()

Used as follows:
with myLock:
    # Code here executes with myLock held. The lock is
    # guaranteed to be released when the block is left (even
    # if via return or by an uncaught exception).

A template for opening a file that ensures the file is closed when the block is left:
# New methods on file objects
def __enter__(self):
    if self.closed:
        raise RuntimeError, "Cannot reopen closed file handle"
    return self

def __exit__(self, *args):
    self.close()

Used as follows:
with open("/etc/passwd") as f:
    for line in f:
        print line.rstrip()

A template for committing or rolling back a database transaction:
@statement_template
def transaction(db):
    try:
        yield
    except:
        db.rollback()
    else:
        db.commit()

Used as follows:
with transaction(the_db):
    make_table(the_db)
    add_data(the_db)
    # Getting to here automatically triggers a commit
    # Any exception automatically triggers a rollback

It is possible to nest blocks and combine templates:
@statement_template
def lock_opening(lock, filename, mode="r"):
    with lock:
        with open(filename, mode) as f:
            yield f

Used as follows:
with lock_opening(myLock, "/etc/passwd") as f:
    for line in f:
        print line.rstrip()

Redirect stdout temporarily:
@statement_template
def redirected_stdout(new_stdout):
    save_stdout = sys.stdout
    try:
        sys.stdout = new_stdout
        yield
    finally:
        sys.stdout = save_stdout

Used as follows:
with open(filename, "w") as f:
    with redirected_stdout(f):
        print "Hello world"

A variant on open() that also returns an error condition:
@statement_template
def open_w_error(filename, mode="r"):
    try:
        f = open(filename, mode)
    except IOError, err:
        yield None, err
    else:
        try:
            yield f, None
        finally:
            f.close()

Used as follows:
with open_w_error("/etc/passwd", "a") as (f, err):
    if err:
        print "IOError:", err
    else:
        f.write("guido::0:0::/:/bin/sh\n")

Find the first file with a specific header:
for name in filenames:
    with open(name) as f:
        if f.read(2) == 0xFEB0:
            break

Find the first item you can handle, holding a lock for the entire loop, or just for each iteration:
with lock:
    for item in items:
        if handle(item):
            break

for item in items:
    with lock:
        if handle(item):
            break

Hold a lock while inside a generator, but release it when returning control to the outer scope:
@statement_template
def released(lock):
    lock.release()
    try:
        yield
    finally:
        lock.acquire()

Used as follows:
with lock:
    for item in items:
        with released(lock):
            yield item

Read the lines from a collection of files (e.g. processing multiple configuration sources):
def all_lines(filenames):
    for name in filenames:
        with open(name) as f:
            for line in f:
                yield line

Used as follows:
with all_lines(filenames) as lines:
    for line in lines:
        update_config(line)

Not all uses need to involve resource management:
@statement_template
def tag(*args, **kwds):
    name = cgi.escape(args[0])
    if kwds:
        kwd_pairs = ["%s=%s" % (cgi.escape(key), cgi.escape(value))
                     for key, value in kwds.items()]
        print '<%s %s>' % (name, " ".join(kwd_pairs))
    else:
        print '<%s>' % name
    yield
    print '</%s>' % name

Used as follows:
with tag('html'):
    with tag('head'):
        with tag('title'):
            print 'A web page'
    with tag('body'):
        for par in pars:
            with tag('p'):
                print par
        with tag('a', href="http://www.python.org"):
            print "Not a dead parrot!"

From PEP 343, another useful example would be an operation that blocks signals. The use could be like this:
from signal import blocked_signals

with blocked_signals():
    # code executed without worrying about signals

An optional argument might be a list of signals to be blocked; by default all signals are blocked. The implementation is left as an exercise to the reader.
Another use for this feature is for Decimal contexts:
# New methods on decimal Context objects
def __enter__(self):
    if self._old_context is not None:
        raise RuntimeError("Already suspending other Context")
    self._old_context = getcontext()
    setcontext(self)

def __exit__(self, *args):
    setcontext(self._old_context)
    self._old_context = None

Used as follows:
with decimal.Context(precision=28):
    # Code here executes with the given context
    # The context always reverts after this statement
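The decimal module as shipped today provides exactly this via localcontext(), which restores the previous context when the block exits:

```python
from decimal import Decimal, localcontext

with localcontext() as ctx:
    ctx.prec = 4                      # only in effect inside the block
    inside = Decimal(1) / Decimal(7)

outside = Decimal(1) / Decimal(7)     # default precision (28) restored

assert str(inside) == "0.1429"
assert len(str(outside)) > len(str(inside))
```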
Open Issues
None, as this PEP has been withdrawn.
Rejected Options
Having the basic construct be a looping construct
The major issue with this idea, as illustrated by PEP 340's block statements, is that it causes problems with factoring try statements that are inside loops, and contain break and continue statements (as these statements would then apply to the block construct, instead of the original loop). As a key goal is to be able to factor out arbitrary exception handling (other than suppression) into statement templates, this is a definite problem.
There is also an understandability problem, as can be seen in the Examples. In the example showing acquisition of a lock either for an entire loop, or for each iteration of the loop, if the user defined statement was itself a loop, moving it from outside the for loop to inside the for loop would have major semantic implications, beyond those one would expect.
Finally, with a looping construct, there are significant problems with TOOWTDI, as it is frequently unclear whether a particular situation should be handled with a conventional for loop or the new looping construct. With the current PEP, there is no such problem - for loops continue to be used for iteration, and the new with statements are used to factor out exception handling.
Another issue, specifically with PEP 340's anonymous block statements, is that they make it quite difficult to write statement templates directly (i.e. not using a generator). This problem is addressed by the current proposal, as can be seen by the relative simplicity of the various class based implementations of statement templates in the Examples.
Allowing statement templates to suppress exceptions
Earlier versions of this PEP gave statement templates the ability to suppress exceptions. The BDFL expressed concern over the associated complexity, and I agreed after reading an article by Raymond Chen about the evils of hiding flow control inside macros in C code [7].
Removing the suppression ability eliminated a whole lot of complexity from both the explanation and implementation of user defined statements, further supporting it as the correct choice. Older versions of the PEP had to jump through some horrible hoops to avoid inadvertently suppressing exceptions in __exit__() methods - that issue does not exist with the current suggested semantics.
There was one example (auto_retry) that actually used the ability to suppress exceptions. This use case, while not quite as elegant, has significantly more obvious control flow when written out in full in the user code:
def attempts(num_tries):
    return reversed(xrange(num_tries))

for retry in attempts(3):
    try:
        make_attempt()
    except IOError:
        if not retry:
            raise
For what it's worth, the perverse could still write this as:
for attempt in auto_retry(3, IOError):
    try:
        with attempt:
            make_attempt()
    except FailedAttempt:
        pass
To protect the innocent, the code to actually support that is not included here.
Differentiating between non-exceptional exits
Earlier versions of this PEP allowed statement templates to distinguish between exiting the block normally, and exiting via a return, break or continue statement. The BDFL flirted with a similar idea in PEP 343 and its associated discussion. This added significant complexity to the description of the semantics, and it required each and every statement template to decide whether or not those statements should be treated like exceptions, or like a normal mechanism for exiting the block.
This template-by-template decision process raised great potential for confusion - consider if one database connector provided a transaction template that treated early exits like an exception, whereas a second connector treated them as normal block termination.
Accordingly, this PEP now uses the simplest solution - early exits appear identical to normal block termination as far as the statement template is concerned.
Not injecting raised exceptions into generators
PEP 343 suggests simply invoking next() unconditionally on generators used to define statement templates. This means the template generators end up looking rather unintuitive, and the retention of the ban against yielding inside try/finally means that Python's exception handling capabilities cannot be used to deal with management of multiple resources.
The alternative which this PEP advocates (injecting raised exceptions into the generator frame) means that multiple resources can be managed elegantly, as shown by lock_opening in the Examples.
Making all generators statement templates
Separating the template object from the generator itself makes it possible to have reusable generator templates. That is, the following code will work correctly if this PEP is accepted:
open_it = lock_opening(parrot_lock, "dead_parrot.txt")

with open_it as f:
    # use the file for a while

with open_it as f:
    # use the file again
The second benefit is that iterator generators and template generators are very different things - the decorator keeps that distinction clear, and prevents one being used where the other is required.
Finally, requiring the decorator allows the native methods of generator objects to be used to implement generator finalisation.
Using do as the keyword
do was an alternative keyword proposed during the PEP 340 discussion. It reads well with appropriately named functions, but it reads poorly when used with methods, or with objects that provide native statement template support.
When do was first suggested, the BDFL had rejected PEP 310's with keyword, based on a desire to use it for a Pascal/Delphi style with statement. Since then, the BDFL has retracted this objection, as he no longer intends to provide such a statement. This change of heart was apparently based on the C# developers reasons for not providing the feature [8].
Not having a keyword
This is an interesting option, and can be made to read quite well. However, it's awkward to look up in the documentation for new users, and strikes some as being too magical. Accordingly, this PEP goes with a keyword based suggestion.
Enhancing try statements
This suggestion involves giving bare try statements a signature similar to that proposed for with statements.
I think that trying to write a with statement as an enhanced try statement makes as much sense as trying to write a for loop as an enhanced while loop. That is, while the semantics of the former can be explained as a particular way of using the latter, the former is not an instance of the latter. The additional semantics added around the more fundamental statement result in a new construct, and the two different statements shouldn't be confused.
This can be seen by the fact that the 'enhanced' try statement still needs to be explained in terms of a 'non-enhanced' try statement. If it's something different, it makes more sense to give it a different name.
Having the template protocol directly reflect try statements
One suggestion was to have separate methods in the protocol to cover different parts of the structure of a generalised try statement. Using the terms try, except, else and finally, we would have something like:
class my_template(object):
    def __init__(self, *args):
        # Any required arguments (e.g. a file name)
        # get stored in member variables.
        # The various BLOCKs will need to be updated
        # to reflect that.
    def __try__(self):
        SETUP_BLOCK
    def __except__(self, exc, value, traceback):
        if isinstance(exc, exc_type1):
            EXCEPT_BLOCK1
        elif isinstance(exc, exc_type2):
            EXCEPT_BLOCK2
        else:
            EXCEPT_BLOCK3
    def __else__(self):
        ELSE_BLOCK
    def __finally__(self):
        FINALLY_BLOCK
Aside from preferring the addition of two method slots rather than four, I consider it significantly easier to be able to simply reproduce a slightly modified version of the original try statement code in the __exit__() method (as shown in Factoring out arbitrary exception handling), rather than have to split the functionality amongst several different methods (or figure out which method to use if not all clauses are used by the template).
To make this discussion less theoretical, here is the transaction example implemented using both the two method and the four method protocols instead of a generator. Both implementations guarantee a commit if a break, return or continue statement is encountered (as does the generator-based implementation in the Examples section):
class transaction_2method(object):
    def __init__(self, db):
        self.db = db
    def __enter__(self):
        pass
    def __exit__(self, exc_type, *exc_details):
        if exc_type is None:
            self.db.commit()
        else:
            self.db.rollback()

class transaction_4method(object):
    def __init__(self, db):
        self.db = db
        self.commit = False
    def __try__(self):
        self.commit = True
    def __except__(self, exc_type, exc_value, traceback):
        self.db.rollback()
        self.commit = False
    def __else__(self):
        pass
    def __finally__(self):
        if self.commit:
            self.db.commit()
            self.commit = False
There are two more minor points, relating to the specific method names in the suggestion. The name of the __try__() method is misleading, as SETUP_BLOCK executes before the try statement is entered, and the name of the __else__() method is unclear in isolation, as numerous other Python statements include an else clause.
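To make the two-method protocol concrete, here is a rough, runnable sketch (in modern Python 3 spelling) of how a statement expansion could drive the two-method transaction template. The helpers run_with and FakeDB are invented for illustration and are not part of the proposal:

```python
import sys

class transaction_2method(object):
    def __init__(self, db):
        self.db = db
    def __enter__(self):
        pass
    def __exit__(self, exc_type, *exc_details):
        if exc_type is None:
            self.db.commit()
        else:
            self.db.rollback()

class FakeDB(object):
    """Stand-in database that just records what happened."""
    def __init__(self):
        self.log = []
    def commit(self):
        self.log.append("commit")
    def rollback(self):
        self.log.append("rollback")

def run_with(template, block):
    # Rough equivalent of the statement expansion: enter the template,
    # run the block, and report success or failure to __exit__().
    template.__enter__()
    try:
        block()
    except BaseException:
        template.__exit__(*sys.exc_info())
        raise
    else:
        template.__exit__(None, None, None)

db = FakeDB()
run_with(transaction_2method(db), lambda: None)        # clean exit -> commit
try:
    run_with(transaction_2method(db), lambda: 1 // 0)  # error exit -> rollback
except ZeroDivisionError:
    pass
print(db.log)  # → ['commit', 'rollback']
```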
Iterator finalisation (WITHDRAWN)
The ability to use user defined statements inside generators is likely to increase the need for deterministic finalisation of iterators, as resource management is pushed inside the generators, rather than being handled externally as is currently the case.
The PEP currently suggests handling this by making all generators statement templates, and using with statements to handle finalisation. However, earlier versions of this PEP suggested the following, more complex solution, which allowed the author of a generator to flag the need for finalisation and have for loops deal with it automatically. It is included here as a long, detailed rejected option.
Iterator protocol addition: __finish__
An optional new method for iterators is proposed, called __finish__(). It takes no arguments, and should not return anything.
The __finish__ method is expected to clean up all resources the iterator has open. Iterators with a __finish__() method are called 'finishable iterators' for the remainder of the PEP.
Best effort finalisation
A finishable iterator should ensure that it provides a __del__ method that also performs finalisation (e.g. by invoking the __finish__() method). This allows Python to still make a best effort at finalisation in the event that deterministic finalisation is not applied to the iterator.
Deterministic finalisation
If the iterator used in a for loop has a __finish__() method, the enhanced for loop semantics will guarantee that the method is executed, regardless of the means of exiting the loop. This is important for iterator generators that utilise user defined statements or the now permitted try/finally statements, or for new iterators that rely on timely finalisation to release allocated resources (e.g. releasing a thread or database connection back into a pool).
for loop syntax
No changes are suggested to for loop syntax. This is just to define the statement parts needed for the description of the semantics:
for VAR1 in EXPR1:
    BLOCK1
else:
    BLOCK2
Updated for loop semantics
When the target iterator does not have a __finish__() method, a for loop will execute as follows (i.e. no change from the status quo):
itr = iter(EXPR1)
exhausted = False
while True:
    try:
        VAR1 = itr.next()
    except StopIteration:
        exhausted = True
        break
    BLOCK1
if exhausted:
    BLOCK2
When the target iterator has a __finish__() method, a for loop will execute as follows:
itr = iter(EXPR1)
exhausted = False
try:
    while True:
        try:
            VAR1 = itr.next()
        except StopIteration:
            exhausted = True
            break
        BLOCK1
    if exhausted:
        BLOCK2
finally:
    itr.__finish__()
The implementation will need to take some care to avoid incurring the try/finally overhead when the iterator does not have a __finish__() method.
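The expansion above can be sketched as a helper function in modern Python 3 spelling (next(itr) rather than the PEP-era itr.next()). The names for_with_finish and Resource are invented for this sketch; only __finish__ comes from the proposal:

```python
def for_with_finish(iterable, block, else_block=None):
    # Sketch of the proposed semantics: run the loop body via block(),
    # run else_block() on normal exhaustion, and guarantee __finish__()
    # is called if the iterator provides one.
    itr = iter(iterable)
    finish = getattr(itr, "__finish__", None)
    try:
        while True:
            try:
                item = next(itr)
            except StopIteration:
                if else_block is not None:
                    else_block()   # the loop's else clause
                break
            block(item)
    finally:
        if finish is not None:
            finish()

class Resource(object):
    """Toy finishable iterator (invented for this sketch)."""
    def __init__(self, data):
        self._it = iter(data)
        self.closed = False
    def __iter__(self):
        return self
    def __next__(self):
        return next(self._it)
    def __finish__(self):
        self.closed = True   # e.g. return a connection to a pool

seen = []
res = Resource([1, 2, 3])
for_with_finish(res, seen.append)
print(seen, res.closed)  # → [1, 2, 3] True
```

Note how the try/finally is only needed when a __finish__() method exists, which is the overhead the paragraph above warns about.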
Generator iterator finalisation: __finish__() method
When enabled with the appropriate decorator, generators will have a __finish__() method that raises TerminateIteration in the internal frame:
def __finish__(self):
    try:
        self._inject_exception(TerminateIteration)
    except TerminateIteration:
        pass
A decorator (e.g. needs_finish()) is required to enable this feature, so that existing generators (which are not expecting finalisation) continue to work as expected.
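For illustration, here is a sketch of what such a decorator could look like in modern Python, where the generator close() method (added later by PEP 342) plays the role of injecting TerminateIteration. The decorator body is a reconstruction, not the PEP's reference implementation; only the names needs_finish and __finish__ come from the text:

```python
def needs_finish(genfunc):
    """Wrap a generator function so its iterators grow a __finish__() method."""
    class FinishableGen(object):
        def __init__(self, gen):
            self._gen = gen
        def __iter__(self):
            return self
        def __next__(self):
            return next(self._gen)
        def __finish__(self):
            # close() raises GeneratorExit inside the generator frame and
            # swallows it, mirroring the TerminateIteration mechanism above.
            self._gen.close()
    def wrapper(*args, **kwds):
        return FinishableGen(genfunc(*args, **kwds))
    return wrapper

log = []

@needs_finish
def numbers():
    try:
        yield 1
        yield 2
    finally:
        log.append("finalised")

itr = numbers()
print(next(itr))   # → 1
itr.__finish__()
print(log)         # → ['finalised']
```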
Partial iteration of finishable iterators
Partial iteration of a finishable iterator is possible, although it requires some care to ensure the iterator is still finalised promptly (it was made finishable for a reason!). First, we need a class to enable partial iteration of a finishable iterator by hiding the iterator's __finish__() method from the for loop:
class partial_iter(object):
    def __init__(self, iterable):
        self.itr = iter(iterable)
    def __iter__(self):
        return self
    def next(self):
        return self.itr.next()
Secondly, an appropriate statement template is needed to ensure that the iterator is finished eventually:
@statement_template
def finishing(iterable):
    itr = iter(iterable)
    itr_finish = getattr(itr, "__finish__", None)
    if itr_finish is None:
        yield itr
    else:
        try:
            yield partial_iter(itr)
        finally:
            itr_finish()
This can then be used as follows:
do finishing(finishable_itr) as itr:
    for header_item in itr:
        if end_of_header(header_item):
            break
        # process header item
    for body_item in itr:
        # process body item
Note that none of the above is needed for an iterator that is not finishable - without a __finish__() method, it will not be promptly finalised by the for loop, and hence inherently allows partial iteration. Allowing partial iteration of non-finishable iterators as the default behaviour is a key element in keeping this addition to the iterator protocol backwards compatible.
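The header/body pattern above can be demonstrated with a self-contained modern sketch (CountingLines is an invented toy finishable iterator; in current Python one would normally reach for a with statement and a context manager instead):

```python
class CountingLines(object):
    """Toy finishable iterator over a list of lines (invented example)."""
    def __init__(self, lines):
        self._lines = iter(lines)
        self.finished = False
    def __iter__(self):
        return self
    def __next__(self):
        return next(self._lines)
    def __finish__(self):
        self.finished = True   # e.g. release a file handle here

itr = CountingLines(["header", "", "body1", "body2"])
header = []
for line in itr:        # partial iteration: stop at the blank separator
    if not line:
        break
    header.append(line)
body = list(itr)        # resume from where the first loop left off
itr.__finish__()        # explicit finalisation, as the template would do
print(header, body)  # → ['header'] ['body1', 'body2']
```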
Acknowledgements
The acknowledgements section for PEP 340 applies, since this text grew out of the discussion of that PEP, but additional thanks go to Michael Hudson, Paul Moore and Guido van Rossum for writing PEP 310 and PEP 340 in the first place, and to (in no meaningful order) Fredrik Lundh, Phillip J. Eby, Steven Bethard, Josiah Carlson, Greg Ewing, Tim Delaney and Arnold deVos for prompting particular ideas that made their way into this text.
References
| [1] | Reliable Acquisition/Release Pairs (http://www.python.org/dev/peps/pep-0310/) |
| [2] | Anonymous block statements (http://www.python.org/dev/peps/pep-0340/) |
| [3] | Anonymous blocks, redux (http://www.python.org/dev/peps/pep-0343/) |
| [4] | Enhanced Iterators (http://www.python.org/dev/peps/pep-0342/) |
| [5] | Generator Attributes and Exceptions (http://www.python.org/dev/peps/pep-0288/) |
| [6] | Resource-Release Support for Generators (http://www.python.org/dev/peps/pep-0325/) |
| [7] | A rant against flow control macros (http://blogs.msdn.com/oldnewthing/archive/2005/01/06/347666.aspx) |
| [8] | Why doesn't C# have a 'with' statement? (http://msdn.microsoft.com/vcsharp/programming/language/ask/withstatement/) |
Copyright
This document has been placed in the public domain.
pep-0347 Migrating the Python CVS to Subversion
| PEP: | 347 |
|---|---|
| Title: | Migrating the Python CVS to Subversion |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Martin von Löwis <martin at v.loewis.de> |
| Discussions-To: | <python-dev at python.org> |
| Status: | Final |
| Type: | Process |
| Content-Type: | text/x-rst |
| Created: | 14-Jul-2004 |
| Post-History: | 14-Jul-2004 |
Contents
Abstract
The Python source code is currently managed in a CVS repository on sourceforge.net. This PEP proposes to move it to a Subversion repository on svn.python.org.
Rationale
This change has two aspects: moving from CVS to Subversion, and moving from SourceForge to python.org. For each, a rationale will be given.
Moving to Subversion
CVS has a number of limitations that have been eliminated by Subversion. For the development of Python, the most notable improvements are:
- the ability to rename files and directories, and to remove directories, while keeping the history of these files.
- support for change sets (sets of correlated changes to multiple files) through global revision numbers. Change sets are transactional.
- atomic, fast tagging: a cvs tag might take many minutes; a Subversion tag (svn cp) will complete quickly, and atomically. Likewise, branches are very efficient.
- support for offline diffs, which is useful when creating patches.
Moving to python.org
SourceForge has kindly provided important infrastructure over the past years. Unfortunately, the attention that SF received has also caused repeated overload situations in the past, to which the SF operators could not always respond in a timely manner. In particular, for CVS, they had to reduce the load on the primary CVS server by introducing a second, read-only CVS server for anonymous access. This server is regularly synchronized, but lags behind the read-write CVS repository between synchronizations. As a result, users without commit access can see recent changes to the repository only after a delay.
On python.org, it would be possible to make the repository accessible for anonymous access.
Migration Procedure
To move the Python CVS repository, the following steps need to be executed. The steps are elaborated upon in the following sections.
- Collect SSH keys for all current committers, along with usernames to appear in commit messages.
- At the beginning of the migration, announce that the repository on SourceForge is closed.
- 24 hours after the last commit, download the CVS repository.
- Convert the CVS repository into a Subversion repository.
- Publish the repository with write access for committers, and read-only anonymous access.
- Disable CVS access on SF.
Collect SSH keys
After some discussion, svn+ssh was selected as the best method for write access to the repository. Developers can continue to use their SSH keys, but they must be installed on python.org.
In order to avoid having to create a new Unix user for each developer, a single account should be used, with command= attributes in the authorized_keys files.
The lines in the authorized_keys file should read like this (wrapped for better readability):
command="/usr/bin/svnserve --root=/svnroot -t --tunnel-user='<username>'",no-port-forwarding, no-X11-forwarding,no-agent-forwarding,no-pty ssh-dss <key> <comment>
For the usernames, developers' real names should be used instead of their SF account names, so that people can be more easily identified in log messages.
Administrator Access
Administrator access to the pythondev account should be granted to all current admins of the Python SF project. To distinguish between shell login and svnserve login, admins need to maintain two keys. Using OpenSSH, the following procedure can be used to create a second key:
cd .ssh
ssh-keygen -t DSA -f pythondev -C <user>@pythondev
vi config
In the config file, the following lines need to be added:
Host pythondev
    Hostname dinsdale.python.org
    User pythondev
    IdentityFile ~/.ssh/pythondev
Then, shell login becomes possible through "ssh pythondev".
Downloading the CVS Repository
The CVS repository can be downloaded from
http://cvs.sourceforge.net/cvstarballs/python-cvsroot.tar.bz2
Since this tarball is generated only once a day, some time must pass after the repository freeze before the tarball can be picked up. It should be verified that the last commit, as recorded on the python-commits mailing list, is indeed included in the tarball.
After the conversion, the converted CVS tarball should be kept forever on www.python.org/archive/python-cvsroot-<date>.tar.bz2
Converting the CVS Repository
The Python CVS repository contains two modules: distutils and python. The python module is further structured into dist and nondist, where dist only contains src (the python code proper). nondist contains various subdirectories.
These should be reorganized in the Subversion repository to get shorter URLs, following the <project>/{trunk,tags,branches} structure. A project will be created for each nondist directory, plus for src (called python), plus distutils. Reorganizing the repository is best done in the CVS tree, as shown below.
The fsfs backend should be used as the repository format (which requires Subversion 1.1). The fsfs backend has the advantage of being more backup-friendly, as it allows incremental repository backups, without requiring any dump commands to be run.
The conversion should be done using the cvs2svn utility, available e.g. in the cvs2svn Debian package. As cvs2svn does not currently support the project/trunk structure, each project needs to be converted separately. To get each conversion result into a separate directory in the target repository, svnadmin load must be used.
Subversion has a different view on binary-vs-text files than CVS. To correctly carry the CVS semantics forward, svn:eol-style should be set to native on all files that are not marked binary in the CVS.
In summary, the conversion script is:
#!/bin/sh
rm cvs2svn-*
rm -rf python py.new
tar xjf python-cvsroot.tar.bz2
rm -rf python/CVSROOT
svnadmin create --fs-type fsfs py.new
mv python/python python/orig
mv python/orig/dist/src python/python
mv python/orig/nondist/* python
# nondist/nondist is empty
rmdir python/nondist
rm -rf python/orig
for a in python/*
do
    b=`basename $a`
    cvs2svn -q --dump-only --encoding=latin1 --force-branch=cnri-16-start \
        --force-branch=descr-branch --force-branch=release152p1-patches \
        --force-tag=r16b1 $a
    svn mkdir -m"Conversion to SVN" file:///`pwd`/py.new/$b
    svnadmin load -q --parent-dir $b py.new < cvs2svn-dump
    rm cvs2svn-dump
done
Sample results of this conversion are available at
http://www.dcl.hpi.uni-potsdam.de/pysvn/
Publish the Repository
The repository should be published at http://svn.python.org/projects. Read-write access should be granted to all current SF committers through svn+ssh://pythondev@svn.python.org/; read-only anonymous access through WebDAV should also be granted.
As an option, websvn (available e.g. from the Debian websvn package) could be provided. Unfortunately, in the test installation, websvn breaks because it runs out of memory.
The current SF project admins should get write access to the authorized_keys2 file of the pythondev account.
Disable CVS
It appears that CVS cannot be disabled entirely. Only the user interface can be removed from the project page; the repository itself remains available. If desired, write access to the python and distutils modules can be disabled through a CVS commitinfo entry.
Discussion
Several alternatives had been suggested to the procedure above. The rejected alternatives are briefly discussed here:

Create multiple repositories, one for python and one for distutils. This would have allowed even shorter URLs, but was rejected because a single repository supports moving code across projects.

Several people suggested creating the project/trunk structure through standard cvs2svn, followed by renames. This would have the disadvantage that old revisions would use different path names than recent revisions; the suggested approach through dump files works without renames.
Several people also expressed concern about the administrative overhead that hosting the repository on python.org would cause to pydotorg admins. As a specific alternative, BerliOS has been suggested. The pydotorg admins themselves haven't objected to the additional workload; migrating the repository again if they get overworked is an option.
Different authentication strategies were discussed. As alternatives to svn+ssh were suggested
- Subversion over WebDAV, using SSL and basic authentication, with pydotorg-generated passwords mailed to the user. People did not like that approach, since they would need to store the password on disk (because they can't remember it); this is a security risk.
- Subversion over WebDAV, using SSL client certificates. This would work, but would require us to administer a certificate authority.
Instead of hosting this on python.org, people suggested hosting it elsewhere. One issue is whether this alternative should be free or commercial; several people suggested that a commercial host would be preferable, to reduce the load on the volunteers. In particular:
Greg Stein suggested http://www.wush.net/subversion.php. They offer 5 GB for $90/month, with 200 GB download/month. The data is on a RAID drive and fully backed up. Anonymous access and email commit notifications are supported. wush.net elaborated the following details:
- The machine would be a Virtuozzo Virtual Private Server (VPS), hosted at PowerVPS.
- The default repository URL would be http://python.wush.net/svn/projectname/, but anything else could be arranged.
- we would get SSH login to the machine, with sudo capabilities.
- They have a Web interface for management of the various SVN repositories that we want to host, and to manage user accounts. While svn+ssh would be supported, the user interface does not yet support it.
- For offsite mirroring/backup, they suggest using rsync instead of downloading repository tarballs.
Bob Ippolito reported that they had used wush.net for a commercial project for about 6 months, after which time they left wush.net, because the service was down for three days, with nobody reachable, and no explanation when it came back.
Copyright
This document has been placed in the public domain.
pep-0348 Exception Reorganization for Python 3.0
| PEP: | 348 |
|---|---|
| Title: | Exception Reorganization for Python 3.0 |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Brett Cannon <brett at python.org> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 28-Jul-2005 |
| Post-History: |
Contents
- Abstract
- Rationale For Wanting Change
- Philosophy of Reorganization
- New Hierarchy
- Differences Compared to Python 2.4
- Required Superclass for raise
- Bare except Clauses Catch Exception
- Transition Plan
- Rejected Ideas
- DeprecationWarning Inheriting From PendingDeprecationWarning
- AttributeError Inheriting From TypeError or NameError
- Removal of EnvironmentError
- Introduction of MacError and UnixError
- SystemError Subclassing SystemExit
- ControlFlowException Under Exception
- Rename NameError to NamespaceError
- Renaming Existing Exceptions
- Have EOFError Subclass IOError
- Have MemoryError and SystemError Have a Common Superclass
- Common Superclass for PendingDeprecationWarning and DeprecationWarning
- Removing WindowsError
- Superclass for KeyboardInterrupt and SystemExit
- Acknowledgements
- References
- Copyright
Note
This PEP has been rejected [20].
Abstract
Python, as of version 2.4, has 38 exceptions (including warnings) in the built-in namespace in a rather shallow hierarchy. These classes have come about over the years without a chance to learn from experience. This PEP proposes doing a reorganization of the hierarchy for Python 3.0 when backwards-compatibility is not as much of an issue.
Along with this reorganization, adding a requirement that all objects passed to a raise statement must inherit from a specific superclass is proposed. This is to have guarantees about the basic interface of exceptions and to further enhance the natural hierarchy of exceptions.
Lastly, bare except clauses will be changed to be semantically equivalent to except Exception. Most people currently use bare except clauses for this purpose, and with the exception hierarchy reorganization this becomes a viable default.
Rationale For Wanting Change
Exceptions are a critical part of Python. While exceptions are traditionally used to signal errors in a program, they have also grown to be used for flow control for things such as iterators.
While their importance is great, there is a lack of structure to them. This stems from the fact that any object can be raised as an exception. Because of this you have no guarantee in terms of what kind of object will be raised, destroying any possible hierarchy raised objects might adhere to.
But exceptions do have a hierarchy, showing the severity of the exception. The hierarchy also groups related exceptions together to simplify catching them in except clauses. To allow people to be able to rely on this hierarchy, a common superclass that all raise objects must inherit from is being proposed. It also allows guarantees about the interface to raised objects to be made (see PEP 344 [2]). A discussion about all of this has occurred before on python-dev [4].
As bare except clauses stand now, they catch all exceptions. While this can be handy, it is rather overreaching for the common case. Thanks to having a required superclass, catching all exceptions is as easy as catching just one specific exception. This allows bare except clauses to be used for a more useful purpose. Once again, this has been discussed on python-dev [5].
Finally, slight changes to the exception hierarchy will make it much more reasonable in terms of structure. With some minor rearranging, exceptions that should not typically be caught can be allowed to propagate to the top of the execution stack, terminating the interpreter as intended.
Philosophy of Reorganization
For the reorganization of the hierarchy, there was a general philosophy followed that developed from discussion of earlier drafts of this PEP [7], [8], [9], [10], [11], [12]. First and foremost was to not break anything that works. This meant that renaming exceptions was out of the question unless the name was deemed severely bad. This also meant no removal of exceptions unless they were viewed as truly misplaced. The introduction of new exceptions were only done in situations where there might be a use for catching a superclass of a category of exceptions. Lastly, existing exceptions would have their inheritance tree changed only if it was felt they were truly misplaced to begin with.
For all new exceptions, the proper suffix had to be chosen. For those that signal an error, "Error" is to be used. If the exception is a warning, then "Warning". "Exception" is to be used when none of the other suffixes are proper to use and no specific suffix is a better fit.
After that it came down to choosing which exceptions should and should not inherit from Exception. This was for the purpose of making bare except clauses more useful.
Lastly, the entire existing hierarchy had to inherit from the new exception meant to act as the required superclass for all exceptions to inherit from.
New Hierarchy
Note
Exceptions flagged with "stricter inheritance" will no longer inherit from a certain class. A "broader inheritance" flag means a class has been added to the exception's inheritance tree. All comparisons are against the Python 2.4 exception hierarchy.
+-- BaseException (new; broader inheritance for subclasses)
    +-- Exception
        +-- GeneratorExit (defined in PEP 342 [1])
        +-- StandardError
            +-- ArithmeticError
                +-- DivideByZeroError
                +-- FloatingPointError
                +-- OverflowError
            +-- AssertionError
            +-- AttributeError
            +-- EnvironmentError
                +-- IOError
                    +-- EOFError
                +-- OSError
            +-- ImportError
            +-- LookupError
                +-- IndexError
                +-- KeyError
            +-- MemoryError
            +-- NameError
                +-- UnboundLocalError
            +-- NotImplementedError (stricter inheritance)
            +-- SyntaxError
                +-- IndentationError
                    +-- TabError
            +-- TypeError
            +-- RuntimeError
            +-- UnicodeError
                +-- UnicodeDecodeError
                +-- UnicodeEncodeError
                +-- UnicodeTranslateError
            +-- ValueError
            +-- ReferenceError
            +-- StopIteration
            +-- SystemError
        +-- Warning
            +-- DeprecationWarning
            +-- FutureWarning
            +-- PendingDeprecationWarning
            +-- RuntimeWarning
            +-- SyntaxWarning
            +-- UserWarning
        +-- WindowsError
    +-- KeyboardInterrupt (stricter inheritance)
    +-- SystemExit (stricter inheritance)
Differences Compared to Python 2.4
A more thorough explanation of terms is needed when discussing inheritance changes. Inheritance changes result in either broader or more restrictive inheritance. "Broader" is when a class has an inheritance tree like cls, A and then becomes cls, B, A. "Stricter" is the reverse.
BaseException
The superclass that all exceptions must inherit from. Its name was chosen to reflect that it is at the base of the exception hierarchy while being an exception itself. "Raisable" was considered as a name, but was passed over because it did not properly reflect the fact that the class is itself an exception.
Direct inheritance of BaseException is not expected, and will be discouraged for the general case. Most user-defined exceptions should inherit from Exception instead. This allows catching Exception to continue to work in the common case of catching all exceptions that should be caught. Direct inheritance of BaseException should only be done in cases where an entirely new category of exception is desired.
But, for cases where all exceptions should be caught blindly, except BaseException will work.
KeyboardInterrupt and SystemExit
Both exceptions are no longer under Exception. This is to allow bare except clauses to act as a more viable default case by catching exceptions that inherit from Exception. With both KeyboardInterrupt and SystemExit acting as signals that the interpreter is expected to exit, catching them in the common case is the wrong semantics.
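This particular split was later adopted (via PEP 352), so it can be verified directly in any modern Python 3:

```python
# KeyboardInterrupt and SystemExit derive from BaseException but
# deliberately not from Exception, exactly as proposed here.
assert issubclass(KeyboardInterrupt, BaseException)
assert not issubclass(KeyboardInterrupt, Exception)
assert not issubclass(SystemExit, Exception)

try:
    raise SystemExit(0)
except Exception:
    caught = "Exception"
except BaseException:
    caught = "BaseException"
print(caught)  # → BaseException
```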
NotImplementedError
Inherits from Exception instead of from RuntimeError.
Originally inheriting from RuntimeError, NotImplementedError does not have any direct relation to the exception meant for use in user code as a quick-and-dirty exception. Thus it now directly inherits from Exception.
Required Superclass for raise
By requiring all objects passed to a raise statement to inherit from a specific superclass, all exceptions are guaranteed to have certain attributes. If PEP 344 [2] is accepted, the attributes outlined there will be guaranteed to be on all exceptions raised. This should help facilitate debugging by making the querying of information from exceptions much easier.
The proposed hierarchy has BaseException as the required base class.
Implementation
Enforcement is straightforward. Modifying RAISE_VARARGS to do an inheritance check first before raising an exception should be enough. For the C API, all functions that set an exception will have the same inheritance check applied.
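This requirement was also eventually adopted (again via PEP 352): modern CPython rejects non-BaseException objects at raise time, which a quick check demonstrates:

```python
# Raising an object that does not derive from BaseException (here, a
# pre-reorganization style string exception) is a TypeError.
try:
    raise "not an exception"
except TypeError as err:
    msg = str(err)
print(msg)  # the error message mentions BaseException
```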
Bare except Clauses Catch Exception
In most existing Python 2.4 code, bare except clauses are too broad in the exceptions they catch. Typically only exceptions that signal an error are desired to be caught. This means that exceptions that are used to signify that the interpreter should exit should not be caught in the common case.
With KeyboardInterrupt and SystemExit moved to inherit from BaseException instead of Exception, changing bare except clauses to act as except Exception becomes a much more reasonable default. This change also will break very little code since these semantics are what most people want for bare except clauses.
The complete removal of bare except clauses has been argued for. The case has been made that they violate both Only One Way To Do It (OOWTDI) and Explicit Is Better Than Implicit (EIBTI) as listed in the Zen of Python [18]. But Practicality Beats Purity (PBP), also in the Zen of Python, trumps both of these in this case. The BDFL has stated that bare except clauses will work this way [17].
Implementation
The compiler will emit the bytecode for except Exception whenever a bare except clause is reached.
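This part of the proposal, by contrast, was never adopted: in today's CPython a bare except clause still catches BaseException-derived exceptions rather than being compiled as except Exception, as a quick check shows:

```python
# A bare except still catches SystemExit (a BaseException subclass),
# so it is NOT equivalent to `except Exception` in modern Python.
try:
    raise SystemExit(0)
except:          # bare except
    caught = True
print(caught)  # → True
```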
Transition Plan
Because of the complexity and clutter that would be required to add all features planned in this PEP, the transition plan is very simple. In Python 2.5 BaseException is added. In Python 3.0, all remaining features (required superclass, change in inheritance, bare except clauses becoming the same as except Exception) will go into effect. Making all of this work in a backwards-compatible way in Python 2.5 would require very deep hacks in the exception machinery, which could be error-prone and lead to a slowdown in performance for little benefit.
To help with the transition, the documentation will be changed to reflect several programming guidelines:
- When one wants to catch all exceptions, catch BaseException
- To catch all exceptions that do not represent the termination of the interpreter, catch Exception explicitly
- Explicitly catch KeyboardInterrupt and SystemExit; don't rely on inheritance from Exception to lead to the capture
- Always catch NotImplementedError explicitly instead of relying on the inheritance from RuntimeError
The documentation for the 'exceptions' module [6], tutorial [19], and PEP 290 [3] will all require updating.
Rejected Ideas
DeprecationWarning Inheriting From PendingDeprecationWarning
This was originally proposed because a DeprecationWarning can be viewed as a PendingDeprecationWarning that is being removed in the next version. But since enough people thought the inheritance could logically work the other way around, the idea was dropped.
AttributeError Inheriting From TypeError or NameError
Viewing attributes as part of the interface of a type caused the idea of inheriting from TypeError. But that partially defeats the thinking of duck typing and thus the idea was dropped.
Inheriting from NameError was suggested because objects can be viewed as having their own namespace where the attributes live and when an attribute is not found it is a namespace failure. This was also dropped as a possibility since not everyone shared this view.
Removal of EnvironmentError
Originally proposed based on the idea that EnvironmentError was an unneeded distinction, the BDFL overruled this idea [13].
Introduction of MacError and UnixError
Proposed to add symmetry to WindowsError, the BDFL said they won't be used enough [13]. The idea of then removing WindowsError was proposed and accepted as reasonable, thus completely negating the idea of adding these exceptions.
SystemError Subclassing SystemExit
Proposed because a SystemError is meant to lead to a system exit, the idea was removed since CriticalError indicates this better.
ControlFlowException Under Exception
It has been suggested that ControlFlowException should inherit from Exception. This idea has been rejected based on the thinking that control flow exceptions typically do not all need to be caught by a single except clause.
Rename NameError to NamespaceError
NameError is considered more succinct and leaves open no possible mistyping of the capitalization of "Namespace" [14].
Renaming RuntimeError or Introducing SimpleError
The thinking was that RuntimeError was in no way an obvious name for an exception meant to be used when a situation did not call for the creation of a new exception. The renaming was rejected on the basis that the exception is already used throughout the interpreter [15]. Rejection of SimpleError was founded on the thought that people should be free to use whatever exception they choose and not have one so blatantly suggested [16].
Renaming Existing Exceptions
Various renamings were suggested but none garnered more than a +0 vote (e.g. renaming ReferenceError to WeakReferenceError). The thinking was that the existing names were fine and no one had actively complained about them. To minimize backwards-compatibility issues and extra pain for existing Python programmers, the renamings were dropped.
Have EOFError Subclass IOError
The original thought was that since EOFError deals directly with I/O, it should subclass IOError. But since EOFError is used more as a signal that an event has occurred (the exhaustion of an I/O port), it should not subclass such a specific error exception.
Have MemoryError and SystemError Have a Common Superclass
Both classes deal with the interpreter, so why not have them have a common superclass? Because one of them means that the interpreter is in a state that it should not recover from while the other does not.
Common Superclass for PendingDeprecationWarning and DeprecationWarning
Grouping the deprecation warning exceptions together makes intuitive sense. But this idea does not extend well when one considers how rarely either warning is used, let alone at the same time.
Removing WindowsError
Originally proposed based on the idea that such a platform-specific exception should not be in the built-in namespace. It turns out, though, that enough code exists that uses the exception to warrant its staying.
Superclass for KeyboardInterrupt and SystemExit
Proposed to make it easier to catch exceptions that do not inherit from Exception, and to ease the transition to the new hierarchy, the idea was rejected by the BDFL [17] on the argument that existing code did not show enough instances of the pair of exceptions being caught together to justify cluttering the built-in namespace.
Acknowledgements
Thanks to Robert Brewer, Josiah Carlson, Nick Coghlan, Timothy Delaney, Jack Diedrich, Fred L. Drake, Jr., Philip J. Eby, Greg Ewing, James Y. Knight, MA Lemburg, Guido van Rossum, Stephen J. Turnbull, Raymond Hettinger, and everyone else I missed for participating in the discussion.
References
| [1] | PEP 342 (Coroutines via Enhanced Generators) http://www.python.org/dev/peps/pep-0342/ |
| [2] | (1, 2) PEP 344 (Exception Chaining and Embedded Tracebacks) http://www.python.org/dev/peps/pep-0344/ |
| [3] | PEP 290 (Code Migration and Modernization) http://www.python.org/dev/peps/pep-0290/ |
| [4] | python-dev Summary (An exception is an exception, unless it doesn't inherit from Exception) http://www.python.org/dev/summary/2004-08-01_2004-08-15.html#an-exception-is-an-exception-unless-it-doesn-t-inherit-from-exception |
| [5] | python-dev email (PEP, take 2: Exception Reorganization for Python 3.0) http://mail.python.org/pipermail/python-dev/2005-August/055116.html |
| [6] | exceptions module http://docs.python.org/library/exceptions.html |
| [7] | python-dev thread (Pre-PEP: Exception Reorganization for Python 3.0) http://mail.python.org/pipermail/python-dev/2005-July/055020.html, http://mail.python.org/pipermail/python-dev/2005-August/055065.html |
| [8] | python-dev thread (PEP, take 2: Exception Reorganization for Python 3.0) http://mail.python.org/pipermail/python-dev/2005-August/055103.html |
| [9] | python-dev thread (Reorg PEP checked in) http://mail.python.org/pipermail/python-dev/2005-August/055138.html |
| [10] | python-dev thread (Major revision of PEP 348 committed) http://mail.python.org/pipermail/python-dev/2005-August/055199.html |
| [11] | python-dev thread (Exception Reorg PEP revised yet again) http://mail.python.org/pipermail/python-dev/2005-August/055292.html |
| [12] | python-dev thread (PEP 348 (exception reorg) revised again) http://mail.python.org/pipermail/python-dev/2005-August/055412.html |
| [13] | (1, 2) python-dev email (Pre-PEP: Exception Reorganization for Python 3.0) http://mail.python.org/pipermail/python-dev/2005-July/055019.html |
| [14] | python-dev email (PEP, take 2: Exception Reorganization for Python 3.0) http://mail.python.org/pipermail/python-dev/2005-August/055159.html |
| [15] | python-dev email (Exception Reorg PEP checked in) http://mail.python.org/pipermail/python-dev/2005-August/055149.html |
| [16] | python-dev email (Exception Reorg PEP checked in) http://mail.python.org/pipermail/python-dev/2005-August/055175.html |
| [17] | (1, 2) python-dev email (PEP 348 (exception reorg) revised again) http://mail.python.org/pipermail/python-dev/2005-August/055423.html |
| [18] | PEP 20 (The Zen of Python) http://www.python.org/dev/peps/pep-0020/ |
| [19] | Python Tutorial http://docs.python.org/tutorial/ |
| [20] | python-dev email (Bare except clauses in PEP 348) http://mail.python.org/pipermail/python-dev/2005-August/055676.html |
Copyright
This document has been placed in the public domain.
pep-0349 Allow str() to return unicode strings
| PEP: | 349 |
|---|---|
| Title: | Allow str() to return unicode strings |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Neil Schemenauer <nas at arctrix.com> |
| Status: | Deferred |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 02-Aug-2005 |
| Python-Version: | 2.5 |
| Post-History: | 06-Aug-2005 |
Abstract
This PEP proposes to change the str() built-in function so that it
can return unicode strings. This change would make it easier to
write code that works with either string type and would also make
some existing code handle unicode strings. The C function
PyObject_Str() would remain unchanged and the function
PyString_New() would be added instead.
Rationale
Python has had a Unicode string type for some time now but use of
it is not yet widespread. There is a large amount of Python code
that assumes that string data is represented as str instances.
The long term plan for Python is to phase out the str type and use
unicode for all string data. Clearly, a smooth migration path
must be provided.
We need to upgrade existing libraries, written for str instances,
to be made capable of operating in an all-unicode string world.
We can't change to an all-unicode world until all essential
libraries are made capable for it. Upgrading the libraries in one
shot does not seem feasible. A more realistic strategy is to
individually make the libraries capable of operating on unicode
strings while preserving their current all-str environment
behaviour.
First, we need to be able to write code that can accept unicode
instances without attempting to coerce them to str instances. Let
us label such code as Unicode-safe. Unicode-safe libraries can be
used in an all-unicode world.
Second, we need to be able to write code that, when provided only
str instances, will not create unicode results. Let us label such
code as str-stable. Libraries that are str-stable can be used by
libraries and applications that are not yet Unicode-safe.
Sometimes it is simple to write code that is both str-stable and
Unicode-safe. For example, the following function just works:
def appendx(s):
    return s + 'x'
That's not too surprising since the unicode type is designed to
make the task easier. The principle is that when str and unicode
instances meet, the result is a unicode instance. One notable
difficulty arises when code requires a string representation of an
object; an operation traditionally accomplished by using the str()
built-in function.
Using the current str() function makes the code not Unicode-safe.
Replacing a str() call with a unicode() call makes the code not
str-stable. Changing str() so that it could return unicode
instances would solve this problem. As a further benefit, some code
that is currently not Unicode-safe because it uses str() would
become Unicode-safe.
Specification
A Python implementation of the str() built-in follows:
def str(s):
    """Return a nice string representation of the object.  The
    return value is a str or unicode instance.
    """
    if type(s) is str or type(s) is unicode:
        return s
    r = s.__str__()
    if not isinstance(r, (str, unicode)):
        raise TypeError('__str__ returned non-string')
    return r
The following function would be added to the C API and would be the
equivalent of the str() built-in (ideally it would be called
PyObject_Str, but changing that function could cause a massive number
of compatibility problems):
PyObject *PyString_New(PyObject *);
A reference implementation is available on Sourceforge [1] as a
patch.
Backwards Compatibility
Some code may require that str() returns a str instance. In the
standard library, only one such case has been found so far. The
function email.header_decode() requires a str instance and the
email.Header.decode_header() function tries to ensure this by
calling str() on its argument. The code was fixed by changing
the line "header = str(header)" to:
if isinstance(header, unicode):
    header = header.encode('ascii')
Whether this is truly a bug is questionable since decode_header()
really operates on byte strings, not character strings. Code that
passes it a unicode instance could itself be considered buggy.
Alternative Solutions
A new built-in function could be added instead of changing str().
Doing so would introduce virtually no backwards compatibility
problems. However, since the compatibility problems are expected to be
rare, changing str() seems preferable to adding a new built-in.
The basestring type could be changed to have the proposed behaviour,
rather than changing str(). However, that would be confusing
behaviour for an abstract base type.
References
[1] http://www.python.org/sf/1266570
Copyright
This document has been placed in the public domain.
pep-0350 Codetags
| PEP: | 350 |
|---|---|
| Title: | Codetags |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Micah Elliott <mde at tracos.org> |
| Status: | Rejected |
| Type: | Informational |
| Content-Type: | text/x-rst |
| Created: | 27-Jun-2005 |
| Post-History: | 10-Aug-2005, 26-Sep-2005 |
Contents
Rejection Notice
This PEP has been rejected. While the community may be interested, there is no desire to make the standard library conform to this standard.
Abstract
This informational PEP aims to provide guidelines for consistent use of codetags, which would enable the construction of standard utilities to take advantage of the codetag information, as well as making Python code more uniform across projects. Codetags also represent a very lightweight programming micro-paradigm and become useful for project management, documentation, change tracking, and project health monitoring. This is submitted as a PEP because its ideas are thought to be Pythonic, although the concepts are not unique to Python programming. Herein are the definition of codetags, the philosophy behind them, a motivation for standardized conventions, some examples, a specification, a toolset description, and possible objections to the Codetag project/paradigm.
This PEP is also living as a wiki [1] for people to add comments.
What Are Codetags?
Programmers widely use ad-hoc code comment markup conventions to serve as reminders of sections of code that need closer inspection or review. Examples of markup include FIXME, TODO, XXX, BUG, but there are many more in wide use in existing software. Such markup will henceforth be referred to as codetags. These codetags may show up in application code, unit tests, scripts, general documentation, or wherever suitable.
Codetags have been under discussion and in use (hundreds of codetags in the Python 2.4 sources) in many places (e.g., c2 [3]) for many years. See References for further historic and current information.
Philosophy
If you subscribe to most of these values, then codetags will likely be useful for you.
- As much information as possible should be contained inside the source code (application code or unit tests). This, along with the use of codetags, impedes duplication. Most documentation can be generated from that source code; e.g., by using help2man, man2html, docutils, epydoc/pydoc, ctdoc, etc.
- Information should be almost never duplicated -- it should be recorded in a single original format and all other locations should be automatically generated from the original, or simply be referenced. This is famously known as the Single Point Of Truth (SPOT) or Don't Repeat Yourself (DRY) rule.
- Documentation that gets into customers' hands should be auto-generated from single sources into all other output formats. People want documentation in many forms. It is thus important to have a documentation system that can generate all of these.
- The developers are the documentation team. They write the code and should know the code the best. There should not be a dedicated, disjoint documentation team for any non-huge project.
- Plain text (with non-invasive markup) is the best format for writing anything. All other formats are to be generated from the plain text.
Codetag design was influenced by the following goals:
- Comments should be short whenever possible.
- Codetag fields should be optional and of minimal length. Default values and custom fields can be set by individual code shops.
- Codetags should be minimalistic. The quicker it is to jot something down, the more likely it is to get jotted.
- The most common use of codetags will only have zero to two fields specified, and these should be the easiest to type and read.
Motivation
Various productivity tools can be built around codetags.
See Tools.
Encourages consistency.
Historically, a subset of these codetags has been used informally in the majority of source code in existence, whether in Python or in other languages. Tags have been used in an inconsistent manner with different spellings, semantics, format, and placement. For example, some programmers might include datestamps and/or user identifiers, limit to a single line or not, spell the codetag differently than others, etc.
Encourages adherence to SPOT/DRY principle.
E.g., generating a roadmap dynamically from codetags instead of keeping TODOs in sync with separate roadmap document.
Easy to remember.
All codetags must be concise, intuitive, and semantically non-overlapping with others. The format must also be simple.
Use not required/imposed.
If you don't use codetags already, there's no obligation to start, and no risk of affecting code (but see Objections). A small subset can be adopted and the Tools will still be useful (a few codetags have probably already been adopted on an ad-hoc basis anyway). Also it is very easy to identify and remove (and possibly record) a codetag that is no longer deemed useful.
Gives a global view of code.
Tools can be used to generate documentation and reports.
A logical location for capturing CRCs/Stories/Requirements.
The XP community often does not electronically capture Stories, but codetags seem like a good place to locate them.
Extremely lightweight process.
Creating tickets in a tracking system for every thought degrades development velocity. Even if a ticketing system is employed, codetags are useful for simply containing links to those tickets.
Examples
This shows a simple codetag as commonly found in sources everywhere (with the addition of a trailing <>):
# FIXME: Seems like this loop should be finite. <>
while True: ...
The following contrived example demonstrates a typical use of codetags. It uses some of the available fields to specify the assignees (a pair of programmers with initials MDE and CLE), the Date of expected completion (Week 14), and the Priority of the item (2):
# FIXME: Seems like this loop should be finite. <MDE,CLE d:14w p:2>
while True: ...
This codetag shows a bug with fields describing author, discovery (origination) date, due date, and priority:
# BUG: Crashes if run on Sundays.
# <MDE 2005-09-04 d:14w p:2>
if day == 'Sunday': ...
Here is a demonstration of how not to use codetags. This has many problems: 1) Codetags cannot share a line with code; 2) Missing colon after mnemonic; 3) A codetag referring to codetags is usually useless, and worse, it is not completable; 4) No need to have a bunch of fields for a trivial codetag; 5) Fields with unknown values (t:XXX) should not be used:
i = i + 1   # TODO Add some more codetags.
# <JRNewbie 2005-04-03 d:2005-09-03 t:XXX d:14w p:0 s:inprogress>
Specification
This describes the format: syntax, mnemonic names, fields, and semantics, and also the separate DONE File.
General Syntax
Each codetag should be inside a comment, and can be any number of lines. It should not share a line with code. It should match the indentation of surrounding code. The end of the codetag is marked by a pair of angle brackets <> containing optional fields, which must not be split onto multiple lines. It is preferred to have a codetag in # comments instead of string comments. There can be multiple fields per codetag, all of which are optional.
In short, a codetag consists of a mnemonic, a colon, commentary text, an opening angle bracket, an optional list of fields, and a closing angle bracket. E.g.,
# MNEMONIC: Some (maybe multi-line) commentary. <field field ...>
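As an illustration, a single-line codetag in this syntax can be matched by a small parser. The regular expression and helper below are an informal sketch of the grammar above, not part of this specification (multi-line commentary is not handled):

```python
import re

# Illustrative sketch only: parse "# MNEMONIC: commentary <fields>".
# The regex is an assumption about the grammar, not a normative definition.
CODETAG_RE = re.compile(
    r"#\s*(?P<mnemonic>[A-Z?!]{3,})"   # mnemonic, e.g. FIXME, TODO, ???, !!!
    r":\s*(?P<comment>.*?)"            # commentary text
    r"\s*<(?P<fields>[^>]*)>"          # terminating <...> with optional fields
)

def parse_codetag(line):
    """Return (mnemonic, comment, fields) or None if no codetag is present."""
    m = CODETAG_RE.search(line)
    if m is None:
        return None
    return m.group("mnemonic"), m.group("comment"), m.group("fields").split()
```

For example, parse_codetag("# FIXME: Seems wrong. <MDE p:2>") would yield ("FIXME", "Seems wrong.", ["MDE", "p:2"]).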
Mnemonics
The codetags of interest are listed below, using the following format:
- TODO (MILESTONE, MLSTN, DONE, YAGNI, TBD, TOBEDONE)
- To do: Informal tasks/features that are pending completion.
- FIXME (XXX, DEBUG, BROKEN, REFACTOR, REFACT, RFCTR, OOPS, SMELL, NEEDSWORK, INSPECT)
- Fix me: Areas of problematic or ugly code needing refactoring or cleanup.
- BUG (BUGFIX)
- Bugs: Reported defects tracked in bug database.
- NOBUG (NOFIX, WONTFIX, DONTFIX, NEVERFIX, UNFIXABLE, CANTFIX)
- Will Not Be Fixed: Problems that are well-known but will never be addressed due to design problems or domain limitations.
- REQ (REQUIREMENT, STORY)
- Requirements: Satisfactions of specific, formal requirements.
- RFE (FEETCH, NYI, FR, FTRQ, FTR)
- Requests For Enhancement: Roadmap items not yet implemented.
- IDEA
- Ideas: Possible RFE candidates, but less formal than RFE.
- ??? (QUESTION, QUEST, QSTN, WTF)
- Questions: Misunderstood details.
- !!! (ALERT)
- Alerts: In need of immediate attention.
- HACK (CLEVER, MAGIC)
- Hacks: Temporary code to force inflexible functionality, or simply a test change, or a workaround for a known problem.
- PORT (PORTABILITY, WKRD)
- Portability: Workarounds specific to OS, Python version, etc.
- CAVEAT (CAV, CAVT, WARNING, CAUTION)
- Caveats: Implementation details/gotchas that stand out as non-intuitive.
- NOTE (HELP)
- Notes: Sections where a code reviewer found something that needs discussion or further investigation.
- FAQ
- Frequently Asked Questions: Interesting areas that require external explanation.
- GLOSS (GLOSSARY)
- Glossary: Definitions for project glossary.
- SEE (REF, REFERENCE)
- See: Pointers to other code, web link, etc.
- TODOC (DOCDO, DODOC, NEEDSDOC, EXPLAIN, DOCUMENT)
- Needs Documentation: Areas of code that still need to be documented.
- CRED (CREDIT, THANKS)
- Credits: Accreditations for external provision of enlightenment.
- STAT (STATUS)
- Status: File-level statistical indicator of maturity of this file.
- RVD (REVIEWED, REVIEW)
- Reviewed: File-level indicator that review was conducted.
File-level codetags might be better suited as properties in the revision control system, but might still be appropriately specified in a codetag.
Some of these are temporary (e.g., FIXME) while others are persistent (e.g., REQ). A mnemonic was chosen over a synonym using three criteria: descriptiveness, length (shorter is better), and frequency of common use.
Choosing between FIXME and XXX is difficult. XXX seems to be more common, but much less descriptive. Furthermore, XXX is a useful placeholder in a piece of code having a value that is unknown. Thus FIXME is the preferred spelling. Sun says [4] that XXX and FIXME are slightly different, giving XXX higher severity. However, with decades of chaos on this topic, and too many millions of developers who won't be influenced by Sun, it is easy to rightly call them synonyms.
DONE is always a completed TODO item, but this should probably be indicated through the revision control system and/or a completion recording mechanism (see DONE File).
It may be a useful metric to count NOTE tags: a high count may indicate a design (or other) problem. But of course the majority of codetags indicate areas of code needing some attention.
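A NOTE-count metric of this sort could be gathered with a helper along these lines (a hypothetical sketch, not one of the proposed tools):

```python
import re
from collections import Counter

# Illustrative sketch only: tally codetag mnemonics across source text,
# e.g. to watch the NOTE count as a rough code-health indicator.
MNEMONIC_RE = re.compile(r"#\s*([A-Z]{3,}):")

def count_mnemonics(source):
    """Return a Counter mapping each mnemonic to its occurrence count."""
    return Counter(MNEMONIC_RE.findall(source))
```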
An FAQ is probably more appropriately documented in a wiki where users can more easily view and contribute.
Fields
All fields are optional. The proposed standard fields are described in this section. Note that upper case field characters are intended to be replaced.
The Originator/Assignee and Origination Date/Week fields are the most common and don't usually require a prefix.
This lengthy list of fields is liable to scare people (the intended minimalists) away from adopting codetags, but keep in mind that these only exist to support programmers who either 1) like to keep BUG or RFE codetags in a complete form, or 2) are using codetags as their complete and only tracking system. In other words, many of these fields will be used very rarely. They are gathered largely from industry-wide conventions, and example sources include GCC Bugzilla [5] and Python's SourceForge [6] tracking systems.
- AAA[,BBB]...
- List of Originator or Assignee initials (the context determines which unless both should exist). It is also okay to use usernames such as MicahE instead of initials. Initials (in upper case) are the preferred form.
- a:AAA[,BBB]...
- List of Assignee initials. This is necessary only in (rare) cases where a codetag has both an assignee and an originator, and they are different. Otherwise the a: prefix is omitted, and context determines the intent. E.g., FIXME usually has an Assignee, and NOTE usually has an Originator, but if a FIXME was originated (and initialed) by a reviewer, then the assignee's initials would need an a: prefix.
- YYYY[-MM[-DD]] or WW[.D]w
- The Origination Date indicating when the comment was added, in ISO 8601 [2] format (digits and hyphens only). Or Origination Week, an alternative form for specifying an Origination Date. A day of the week can be optionally specified. The w suffix is necessary for distinguishing from a date.
- d:YYYY[-MM[-DD]] or d:WW[.D]w
- Due Date (d) target completion (estimate). Or Due Week (d), an alternative to specifying a Due Date.
- p:N
- Priority (p) level. Range (N) is from 0..3 with 3 being the highest. 0..3 are analogous to low, medium, high, and showstopper/critical. The Severity field could be factored into this single number, and doing so is recommended since having both is subject to varying interpretation. The range and order should be customizable. The existence of this field is important for any tool that itemizes codetags. Thus a (customizable) default value should be supported.
- t:NNNN
- Tracker (t) number corresponding to associated Ticket ID in separate tracking system.
The following fields are also available but expected to be less common.
- c:AAAA
- Category (c) indicating some specific area affected by this item.
- s:AAAA
- Status (s) indicating state of item. Examples are "unexplored", "understood", "inprogress", "fixed", "done", "closed". Note that when an item is completed it is probably better to remove the codetag and record it in a DONE File.
- i:N
- Development cycle Iteration (i). Useful for grouping codetags into completion target groups.
- r:N
- Development cycle Release (r). Useful for grouping codetags into completion target groups.
To summarize, the non-prefixed fields are initials and origination date, and the prefixed fields are: assignee (a), due (d), priority (p), tracker (t), category (c), status (s), iteration (i), and release (r).
It should be possible for groups to define or add their own fields, and these should have upper case prefixes to distinguish them from the standard set. Examples of custom fields are Operating System (O), Severity (S), Affected Version (A), Customer (C), etc.
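The classification rule above (unprefixed initials or dates, one-letter standard prefixes) can be sketched as a small helper. This is an informal illustration drawn from the field list, not part of the specification, and it ignores custom upper-case prefixes:

```python
import re

# Illustrative sketch only: classify a single field from inside <...>.
# The regex and prefix set are assumptions drawn from the field list above.
DATE_RE = re.compile(r"^\d{4}(-\d\d(-\d\d)?)?$|^\d\d?(\.\d)?w$")
PREFIXES = set("adptcsir")  # a:, d:, p:, t:, c:, s:, i:, r:

def classify_field(field):
    """Return a tuple describing the field's kind."""
    if len(field) > 2 and field[1] == ":" and field[0] in PREFIXES:
        return ("prefixed", field[0], field[2:])
    if DATE_RE.match(field):
        return ("date", field)
    return ("initials", field)
```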
DONE File
Some codetags have an ability to be completed (e.g., FIXME, TODO, BUG). It is often important to retain completed items by recording them with a completion date stamp. Such completed items are best stored in a single location, global to a project (or maybe a package). The proposed format is most easily described by an example, say ~/src/fooproj/DONE:
# TODO: Recurse into subdirs only on blue
# moons. <MDE 2003-09-26>
[2005-09-26 Oops, I underestimated this one a bit. Should have used Warsaw's First Law!]

# FIXME: ...
...
You can see that the codetag is copied verbatim from the original source file. The date stamp is then entered on the following line with an optional post-mortem commentary. The entry is terminated by a blank line (\n\n).
It may sound burdensome to have to delete codetag lines every time one gets completed. But in practice it is quite easy to setup a Vim or Emacs mapping to auto-record a codetag deletion in this format (sans the commentary).
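Such a mapping might produce entries with a helper along the following lines (a hypothetical sketch; the function name and signature are assumptions, not part of the proposal):

```python
import datetime

# Illustrative sketch only: format a completed codetag as a DONE File
# entry -- the verbatim codetag line, a date-stamped line with optional
# post-mortem commentary in brackets, and a terminating blank line.
def done_entry(codetag_line, commentary="", when=None):
    when = when or datetime.date.today().isoformat()
    note = f" {commentary}" if commentary else ""
    return f"{codetag_line}\n[{when}{note}]\n\n"
```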
Tools
Currently, programmers (and sometimes analysts) typically use grep to generate a list of items corresponding to a single codetag. However, various hypothetical productivity tools could take advantage of a consistent codetag format. Some example tools follow.
- Document Generator
- Possible docs: glossary, roadmap, manpages
- Codetag History
- Track (with revision control system interface) when a BUG tag (or any codetag) originated/resolved in a code section
- Code Statistics
- A project Health-O-Meter
- Codetag Lint
- Notify of invalid use of codetags, and aid in porting to codetags
- Story Manager/Browser
- An electronic means to replace XP notecards. In MVC terms, the codetag is the Model, and the Story Manager could be a graphical Viewer/Controller to do visual rearrangement, prioritization, and assignment, milestone management.
- Any Text Editor
- Used for changing, removing, adding, rearranging, recording codetags.
There are some tools already in existence that take advantage of a smaller set of pseudo-codetags (see References). There is also an example codetags implementation under way, known as the Codetag Project [7].
Objections
| Objection: | Extreme Programming argues that such codetags should not ever exist in code since the code is the documentation. |
|---|---|
| Defense: | Maybe you should put the codetags in the unit test files instead. Besides, it's tough to generate documentation from uncommented source code. |
| Objection: | Too much existing code has not followed proposed guidelines. |
|---|---|
| Defense: | [Simple] utilities (ctlint) could convert existing codes. |
| Objection: | Causes duplication with tracking system. |
|---|---|
| Defense: | Not really, unless fields are abused. If an item exists in the tracker, a simple ticket number in the codetag tracker field is sufficient. Maybe a duplicated title would be acceptable. Furthermore, it's too burdensome to have a ticket filed for every item that pops into a developer's mind on-the-go. Additionally, the tracking system could possibly be obviated for simple or small projects that can reasonably fit the relevant data into a codetag. |
| Objection: | Codetags are ugly and clutter code. |
|---|---|
| Defense: | That is a good point. But I'd still rather have such info in a single place (the source code) than various other documents, likely getting duplicated or forgotten about. The completed codetags can be sent off to the DONE File, or to the bit bucket. |
| Objection: | Codetags (and all comments) get out of date. |
|---|---|
| Defense: | Not so much if other sources (externally visible documentation) depend on their being accurate. |
| Objection: | Codetags tend to only rarely have estimated completion dates of any sort. OK, the fields are optional, but you want to suggest fields that actually will be widely used. |
|---|---|
| Defense: | If an item is inestimable don't bother with specifying a date field. Using tools to display items with order and/or color by due date and/or priority, it is easier to make estimates. Having your roadmap be a dynamic reflection of your codetags makes you much more likely to keep the codetags accurate. |
| Objection: | Named variables for the field parameters in the <> should be used instead of cryptic one-character prefixes. I.e., <MDE p:3> should rather be <author=MDE, priority=3>. |
|---|---|
| Defense: | It is just too much typing/verbosity to spell out fields. I argue that p:3 i:2 is as readable as priority=3, iteration=2 and is much more likely to be typed and remembered (see bullet C in Philosophy). In this case practicality beats purity. There are not many fields to keep track of, so one-letter prefixes are suitable. |
| Objection: | Synonyms should be deprecated since it is better to have a single way to spell something. |
|---|---|
| Defense: | Many programmers prefer short mnemonic names, especially in comments. This is why short mnemonics were chosen as the primary names. However, others feel that an explicit spelling is less confusing and less prone to error. There will always be two camps on this subject. Thus synonyms (and complete, full spellings) should remain supported. |
| Objection: | It is cruel to use [for mnemonics] opaque acronyms and abbreviations which drop vowels; it's hard to figure these things out. On that basis I hate: MLSTN RFCTR RFE FEETCH, NYI, FR, FTRQ, FTR WKRD RVDBY |
|---|---|
| Defense: | Mnemonics are preferred since they are pretty easy to remember and take up less space. If programmers didn't like dropping vowels we would be able to fit very little code on a line. The space is important for those who write comments that often fit on a single line. But when using a canon everywhere it is much less likely to get something to fit on a line. |
| Objection: | It takes too long to type the fields. |
|---|---|
| Defense: | Then don't use (most or any of) them, especially if you're the only programmer. Terminating a codetag with <> is a small chore, and in doing so you enable the use of the proposed tools. Editor auto-completion of codetags is also useful: You can program your editor to stamp a template (e.g. # FIXME . <MDE {date}>) with just a keystroke or two. |
| Objection: | WorkWeek is an obscure and uncommon time unit. |
|---|---|
| Defense: | That's true, but it is a highly suitable unit of granularity for estimation/targeting purposes, and it is very compact. ISO 8601 [2] format is widely understood, but it only allows you to specify either a specific day (restrictive) or a month (broad). |
| Objection: | I aesthetically dislike for the comment to be terminated with <> in the empty field case. |
|---|---|
| Defense: | It is necessary to have a terminator since codetags may be followed by non-codetag comments. Or codetags could be limited to a single line, but that's prohibitive. I can't think of any single-character terminator that is appropriate and significantly better than <>. Maybe @ could be a terminator, but then most codetags will have an unnecessary @. |
| Objection: | I can't use codetags when writing HTML, or less specifically, XML. Maybe @fields@ would be better than <fields> as the delimiters. |
|---|---|
| Defense: | Maybe you're right, but <> looks nicer whenever applicable. XML/SGML could use @ while more common programming languages stick to <>. |
References
Some other tools have approached defining/exploiting codetags. See http://tracos.org/codetag/wiki/Links.
| [1] | http://tracos.org/codetag/wiki/Pep |
| [2] | (1, 2) http://en.wikipedia.org/wiki/ISO_8601 |
| [3] | http://c2.com/cgi/wiki?FixmeComment |
| [4] | http://java.sun.com/docs/codeconv/html/CodeConventions.doc9.html#395 |
| [5] | http://gcc.gnu.org/bugzilla/ |
| [6] | http://sourceforge.net/tracker/?group_id=5470 |
| [7] | http://tracos.org/codetag |
pep-0351 The freeze protocol
| PEP: | 351 |
|---|---|
| Title: | The freeze protocol |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Barry Warsaw <barry at python.org> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 14-Apr-2005 |
| Post-History: | |
Contents
Abstract
This PEP describes a simple protocol for requesting a frozen, immutable copy of a mutable object. It also defines a new built-in function which uses this protocol to provide an immutable copy on any cooperating object.
Rejection Notice
This PEP was rejected. For a rationale, see this thread on python-dev [1].
Rationale
Built-in objects such as dictionaries and sets accept only immutable objects as keys. This means that mutable objects like lists cannot be used as keys to a dictionary. However, a Python programmer can convert a list to a tuple; the two objects are similar, but the latter is immutable, and can be used as a dictionary key.
It is conceivable that third party objects also have similar mutable and immutable counterparts, and it would be useful to have a standard protocol for conversion of such objects.
sets.Set objects expose a "protocol for automatic conversion to immutable" so that you can create sets.Sets of sets.Sets. PEP 218 deliberately dropped this feature from built-in sets. This PEP advances that the feature is still useful and proposes a standard mechanism for its support.
Proposal
It is proposed that a new built-in function called freeze() be added.
If freeze() is passed an immutable object, as determined by hash() on that object not raising a TypeError, then the object is returned directly.
If freeze() is passed a mutable object (i.e. hash() of that object raises a TypeError), then freeze() will call that object's __freeze__() method to get an immutable copy. If the object does not have a __freeze__() method, then a TypeError is raised.
Sample implementations
Here is a Python implementation of the freeze() built-in:
def freeze(obj):
    try:
        hash(obj)
        return obj
    except TypeError:
        freezer = getattr(obj, '__freeze__', None)
        if freezer:
            return freezer()
        raise TypeError('object is not freezable')
Here are some code samples which show the intended semantics:
class xset(set):
    def __freeze__(self):
        return frozenset(self)

class xlist(list):
    def __freeze__(self):
        return tuple(self)

class imdict(dict):
    def __hash__(self):
        return id(self)
    def _immutable(self, *args, **kws):
        raise TypeError('object is immutable')
    __setitem__ = _immutable
    __delitem__ = _immutable
    clear       = _immutable
    update      = _immutable
    setdefault  = _immutable
    pop         = _immutable
    popitem     = _immutable

class xdict(dict):
    def __freeze__(self):
        return imdict(self)
>>> s = set([1, 2, 3])
>>> {s: 4}
Traceback (most recent call last):
File "<stdin>", line 1, in ?
TypeError: set objects are unhashable
>>> t = freeze(s)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "/usr/tmp/python-lWCjBK.py", line 9, in freeze
TypeError: object is not freezable
>>> t = xset(s)
>>> u = freeze(t)
>>> {u: 4}
{frozenset([1, 2, 3]): 4}
>>> x = 'hello'
>>> freeze(x) is x
True
>>> d = xdict(a=7, b=8, c=9)
>>> hash(d)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
TypeError: dict objects are unhashable
>>> hash(freeze(d))
-1210776116
>>> {d: 4}
Traceback (most recent call last):
File "<stdin>", line 1, in ?
TypeError: dict objects are unhashable
>>> {freeze(d): 4}
{{'a': 7, 'c': 9, 'b': 8}: 4}
Reference implementation
Patch 1335812 [2] provides the C implementation of this feature. It adds the freeze() built-in, along with implementations of the __freeze__() method for lists and sets. Dictionaries are not easily freezable in current Python, so an implementation of dict.__freeze__() is not provided yet.
Open issues
- Should we define a similar protocol for thawing frozen objects?
- Should dicts and sets automatically freeze their mutable keys?
- Should we support "temporary freezing" (perhaps with a method called __congeal__()) a la __as_temporarily_immutable__() in sets.Set?
- For backward compatibility with sets.Set, should we support __as_immutable__()? Or should __freeze__() just be renamed to __as_immutable__()?
References
| [1] | http://mail.python.org/pipermail/python-dev/2006-February/060793.html |
| [2] | http://sourceforge.net/tracker/index.php?func=detail&aid=1335812&group_id=5470&atid=305470 |
Copyright
This document has been placed in the public domain.
pep-0352 Required Superclass for Exceptions
| PEP: | 352 |
|---|---|
| Title: | Required Superclass for Exceptions |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Brett Cannon, Guido van Rossum |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 27-Oct-2005 |
| Post-History: |
Contents
Abstract
In Python 2.4 and before, any (classic) class can be raised as an exception. The plan for 2.5 was to allow new-style classes, but this makes the problem worse -- it would mean any class (or instance) can be raised! This is a problem as it prevents any guarantees from being made about the interface of exceptions. This PEP proposes introducing a new superclass that all raised objects must inherit from. Imposing the restriction will allow a standard interface for exceptions to exist that can be relied upon. It also leads to a known hierarchy for all exceptions to adhere to.
One might counter that requiring a specific base class for a particular interface is unPythonic. However, in the specific case of exceptions there's a good reason (which has generally been agreed to on python-dev): requiring hierarchy helps code that wants to catch exceptions by making it possible to catch all exceptions explicitly by writing except BaseException: instead of except *:. [2]
Introducing a new superclass for exceptions also gives us the chance to rearrange the exception hierarchy slightly for the better. As it currently stands, all exceptions in the built-in namespace inherit from Exception. This is a problem since this includes two exceptions (KeyboardInterrupt and SystemExit) that often need to be excepted from the application's exception handling: the default behavior of shutting the interpreter down without a traceback is usually more desirable than whatever the application might do (with the possible exception of applications that emulate Python's interactive command loop with >>> prompt). Changing it so that these two exceptions inherit from the common superclass instead of Exception will make it easy for people to write except clauses that are not overreaching and not catch exceptions that should propagate up.
Requiring a Common Superclass
This PEP proposes introducing a new exception named BaseException that is a new-style class and has a single attribute, args. Below is the code as the exception will work in Python 3.0 (how it will work in Python 2.x is covered in the Transition Plan section):
class BaseException(object):

    """Superclass representing the base of the exception hierarchy.

    Provides an 'args' attribute that contains all arguments passed
    to the constructor.  Suggested practice, though, is that only a
    single string argument be passed to the constructor.

    """

    def __init__(self, *args):
        self.args = args

    def __str__(self):
        if len(self.args) == 1:
            return str(self.args[0])
        else:
            return str(self.args)

    def __repr__(self):
        return "%s(*%s)" % (self.__class__.__name__, repr(self.args))
No restriction is placed upon what may be passed in for args for backwards-compatibility reasons. In practice, though, only a single string argument should be used. This keeps the string representation of the exception a useful, human-readable message about the exception; this is why the __str__ method special-cases a length-1 args value. Programmatic information (e.g., an error code number) should be stored as a separate attribute in a subclass.
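The subclassing guidance above can be sketched as follows (a minimal sketch in modern Python; DeviceError and errno_code are invented names for illustration, not part of the PEP):

```python
class DeviceError(Exception):
    """Illustrative subclass: readable message in args, code kept separate.

    The attribute name 'errno_code' is invented for this example.
    """

    def __init__(self, message, errno_code):
        # Pass only the human-readable message up to the base class,
        # so str(exc) stays a useful single-string message.
        Exception.__init__(self, message)
        self.errno_code = errno_code

err = DeviceError("device not responding", 110)
print(str(err))        # just the message, via the length-1 args case
print(err.errno_code)  # programmatic detail lives outside args
```

Because only the message lands in args, the exception still prints cleanly while the error code stays available for programmatic handling.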
The raise statement will be changed to require that any object passed to it must inherit from BaseException. This will make sure that all exceptions fall within a single hierarchy that is anchored at BaseException [2]. This also guarantees a basic interface that is inherited from BaseException. The change to raise will be enforced starting in Python 3.0 (see the Transition Plan below).
With BaseException being the root of the exception hierarchy, Exception will now inherit from it.
Exception Hierarchy Changes
With the exception hierarchy now even more important since it has a basic root, a change to the existing hierarchy is called for. As it stands now, if one wants to catch all exceptions that signal an error and do not mean the interpreter should be allowed to exit, you must specify all but two exceptions specifically in an except clause or catch the two exceptions separately and then re-raise them and have all other exceptions fall through to a bare except clause:
except (KeyboardInterrupt, SystemExit):
    raise
except:
    ...
That is needlessly explicit. This PEP proposes moving KeyboardInterrupt and SystemExit to inherit directly from BaseException.
BaseException
|- KeyboardInterrupt
|- SystemExit
|- Exception
   |- (all other current built-in exceptions)
Doing this makes catching Exception more reasonable. It would catch only exceptions that signify errors. Exceptions that signal that the interpreter should exit will not be caught and thus be allowed to propagate up and allow the interpreter to terminate.
KeyboardInterrupt has been moved since users typically expect an application to exit when they press the interrupt key (usually Ctrl-C). If people have overly broad except clauses the expected behaviour does not occur.
SystemExit has been moved for similar reasons. Since the exception is raised when sys.exit() is called, the interpreter should normally be allowed to terminate. Unfortunately overly broad except clauses can prevent the explicitly requested exit from occurring.
To make sure that people catch Exception most of the time, various parts of the documentation and tutorials will need to be updated to strongly suggest that Exception be what programmers want to use. Bare except clauses or catching BaseException directly should be discouraged based on the fact that KeyboardInterrupt and SystemExit almost always should be allowed to propagate up.
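A small sketch (modern Python, with invented helper names) of why the narrower idiom is preferred: except Exception handles ordinary errors while letting interpreter-exit exceptions propagate:

```python
def risky(kind):
    # Invented helper: raises an ordinary error or requests interpreter exit.
    if kind == "error":
        raise ValueError("bad input")
    raise SystemExit(2)

def handle(kind):
    try:
        risky(kind)
    except Exception as exc:
        # Catches ValueError and friends, but NOT SystemExit or
        # KeyboardInterrupt, which inherit directly from BaseException.
        return "handled: %s" % exc
    return "no error"

print(handle("error"))        # the ValueError is caught inside handle()
try:
    handle("exit")
except SystemExit as exc:     # propagated past 'except Exception'
    print("exit requested, code", exc.code)
```

Had handle() used a bare except or except BaseException, the requested exit would have been swallowed.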
Transition Plan
Since semantic changes to Python are being proposed, a transition plan is needed. The goal is to end up with the new semantics being used in Python 3.0 while providing a smooth transition for 2.x code. All deprecations mentioned in the plan will lead to the removal of the semantics starting in the version following the initial deprecation.
Here is BaseException as implemented in the 2.x series:
import warnings

class BaseException(object):

    """Superclass representing the base of the exception hierarchy.

    The __getitem__ method is provided for backwards-compatibility
    and will be deprecated at some point.  The 'message' attribute
    is also deprecated.

    """

    def __init__(self, *args):
        self.args = args

    def __str__(self):
        if len(self.args) == 1:
            return str(self.args[0])
        return str(self.args)

    def __repr__(self):
        func_args = repr(self.args) if self.args else "()"
        return self.__class__.__name__ + func_args

    def __getitem__(self, index):
        """Index into arguments passed in during instantiation.

        Provided for backwards-compatibility and will be
        deprecated.

        """
        return self.args[index]

    def _get_message(self):
        """Method for 'message' property."""
        warnings.warn("the 'message' attribute has been deprecated "
                      "since Python 2.6")
        return self.args[0] if len(self.args) == 1 else ''

    message = property(_get_message,
                       doc="access the 'message' attribute; "
                           "deprecated and provided only for "
                           "backwards-compatibility")
Deprecation of features in Python 2.9 is optional. This is because it is not known at this time if Python 2.9 (which is slated to be the last version in the 2.x series) will actively deprecate features that will not be in 3.0. It is conceivable that no deprecation warnings will be used in 2.9 since there could be such a difference between 2.9 and 3.0 that it would make 2.9 too "noisy" in terms of warnings. Thus the proposed deprecation warnings for Python 2.9 will be revisited when development of that version begins, to determine if they are still desired.
- Python 2.5 [done]
- all standard exceptions become new-style classes [done]
- introduce BaseException [done]
- Exception, KeyboardInterrupt, and SystemExit inherit from BaseException [done]
- deprecate raising string exceptions [done]
- Python 2.6 [done]
- deprecate catching string exceptions [done]
- deprecate message attribute (see Retracted Ideas) [done]
- Python 2.7 [done]
- deprecate raising exceptions that do not inherit from BaseException
- Python 3.0 [done]
- drop everything that was deprecated above:
- string exceptions (both raising and catching) [done]
- all exceptions must inherit from BaseException [done]
- drop __getitem__, message [done]
Retracted Ideas
A previous version of this PEP that was implemented in Python 2.5 included a 'message' attribute on BaseException. Its purpose was to begin a transition to BaseException accepting only a single argument. This was to tighten the interface and to force people to use attributes in subclasses to carry arbitrary information with an exception instead of cramming it all into args.
Unfortunately, while implementing the removal of the args attribute in Python 3.0 at the PyCon 2007 sprint [4], it was discovered that the transition was very painful, especially for C extension modules. It was decided that it would be better to deprecate the message attribute in Python 2.6 (and remove it in Python 2.7 and Python 3.0) and consider a more long-term transition strategy in Python 3.0 to remove multiple-argument support in BaseException in preference of accepting only a single argument. Thus the introduction of message and the original deprecation of args has been retracted.
References
| [1] | PEP 348 (Exception Reorganization for Python 3.0) http://www.python.org/dev/peps/pep-0348/ |
| [2] | (1, 2) python-dev Summary for 2004-08-01 through 2004-08-15 http://www.python.org/dev/summary/2004-08-01_2004-08-15.html#an-exception-is-an-exception-unless-it-doesn-t-inherit-from-exception |
| [3] | SF patch #1104669 (new-style exceptions) http://www.python.org/sf/1104669 |
| [4] | python-3000 email ("How far to go with cleaning up exceptions") http://mail.python.org/pipermail/python-3000/2007-March/005911.html |
Copyright
This document has been placed in the public domain.
pep-0353 Using ssize_t as the index type
| PEP: | 353 |
|---|---|
| Title: | Using ssize_t as the index type |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Martin von Löwis <martin at v.loewis.de> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 18-Dec-2005 |
| Post-History: |
Contents
Abstract
In Python 2.4, indices of sequences are restricted to the C type int. On 64-bit machines, sequences therefore cannot use the full address space, and are restricted to 2**31 elements. This PEP proposes to change this, introducing a platform-specific index type Py_ssize_t. An implementation of the proposed change is in http://svn.python.org/projects/python/branches/ssize_t.
Rationale
64-bit machines are becoming more popular, and the size of main memory increases beyond 4 GiB. On such machines, Python currently is limited, in that sequences (strings, unicode objects, tuples, lists, array.arrays, ...) cannot contain more than 2 GiElements.
Today, very few machines have enough memory to represent larger lists: as each pointer is 8 bytes (on a 64-bit machine), one needs 16 GiB just to hold the pointers of such a list; with data in the list, the memory consumption grows even more. However, there are three container types for which users request improvements today:
- strings (currently restricted to 2GiB)
- mmap objects (likewise; plus the system typically won't keep the whole object in memory concurrently)
- Numarray objects (from Numerical Python)
As the proposed change will cause incompatibilities on 64-bit machines, it should be carried out while such machines are not in wide use (IOW, as early as possible).
Specification
A new type Py_ssize_t is introduced, which has the same size as the compiler's size_t type, but is signed. It will be a typedef for ssize_t where available.
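The size relationship can be observed from Python itself via the standard ctypes module (a sketch; ctypes.c_ssize_t simply mirrors the platform's signed size type):

```python
import ctypes

# The signed and unsigned size types always have the same width;
# only the signedness differs.
signed_width = ctypes.sizeof(ctypes.c_ssize_t)
unsigned_width = ctypes.sizeof(ctypes.c_size_t)
print(signed_width, unsigned_width)
assert signed_width == unsigned_width
```

On a typical 64-bit platform both widths are 8 bytes, which is exactly why int (still 4 bytes there) cannot index the full address space.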
The internal representation of the length fields of all container types is changed from int to ssize_t, for all types included in the standard distribution. In particular, PyObject_VAR_HEAD is changed to use Py_ssize_t, affecting all extension modules that use that macro.
All occurrences of index and length parameters and results are changed to use Py_ssize_t, including the sequence slots in type objects, and the buffer interface.
New conversion functions, PyInt_FromSsize_t and PyInt_AsSsize_t, are introduced. PyInt_FromSsize_t will transparently return a long int object if the value exceeds LONG_MAX; PyInt_AsSsize_t will transparently process long int objects.
New function pointer typedefs ssizeargfunc, ssizessizeargfunc, ssizeobjargproc, ssizessizeobjargproc, and lenfunc are introduced. The buffer interface function types are now called readbufferproc, writebufferproc, segcountproc, and charbufferproc.
A new conversion code 'n' is introduced for PyArg_ParseTuple, Py_BuildValue, PyObject_CallFunction and PyObject_CallMethod. This code operates on Py_ssize_t.
The conversion codes 's#' and 't#' will output Py_ssize_t if the macro PY_SSIZE_T_CLEAN is defined before Python.h is included, and continue to output int if that macro isn't defined.
At places where a conversion from size_t/Py_ssize_t to int is necessary, the strategy for conversion is chosen on a case-by-case basis (see next section).
To prevent loading extension modules that assume a 32-bit size type into an interpreter that has a 64-bit size type, Py_InitModule4 is renamed to Py_InitModule4_64.
Conversion guidelines
Module authors have the choice whether they support this PEP in their code or not; if they support it, they have the choice of different levels of compatibility.
If a module is not converted to support this PEP, it will continue to work unmodified on a 32-bit system. On a 64-bit system, compile-time errors and warnings might be issued, and the module might crash the interpreter if the warnings are ignored.
Conversion of a module can either attempt to continue using int indices, or use Py_ssize_t indices throughout.
If the module should continue to use int indices, care must be taken when calling functions that return Py_ssize_t or size_t, in particular, for functions that return the length of an object (this includes the strlen function and the sizeof operator). A good compiler will warn when a Py_ssize_t/size_t value is truncated into an int. In these cases, three strategies are available:
statically determine that the size can never exceed an int (e.g. when taking the sizeof a struct, or the strlen of a file pathname). In this case, write:
some_int = Py_SAFE_DOWNCAST(some_value, Py_ssize_t, int);
This will add an assertion in debug mode that the value really fits into an int, and just add a cast otherwise.
statically determine that the value shouldn't overflow an int unless there is a bug in the C code somewhere. Test whether the value is smaller than INT_MAX, and raise an InternalError if it isn't.
otherwise, check whether the value fits an int, and raise a ValueError if it doesn't.
The same care must be taken for tp_as_sequence slots. In addition, the signatures of these slots change, and the slots must be explicitly recast (e.g. from intargfunc to ssizeargfunc). Compatibility with previous Python versions can be achieved with the test:
#if PY_VERSION_HEX < 0x02050000 && !defined(PY_SSIZE_T_MIN)
typedef int Py_ssize_t;
#define PY_SSIZE_T_MAX INT_MAX
#define PY_SSIZE_T_MIN INT_MIN
#endif
and then using Py_ssize_t in the rest of the code. For the tp_as_sequence slots, additional typedefs might be necessary; alternatively, by replacing:
PyObject* foo_item(struct MyType* obj, int index)
{
...
}
with:
PyObject* foo_item(PyObject* _obj, Py_ssize_t index)
{
struct MyType* obj = (struct MyType*)_obj;
...
}
it becomes possible to drop the cast entirely; the type of foo_item should then match the sq_item slot in all Python versions.
If the module should be extended to use Py_ssize_t indices, all usages of the type int should be reviewed, to see whether it should be changed to Py_ssize_t. The compiler will help in finding the spots, but a manual review is still necessary.
Particular care must be taken for PyArg_ParseTuple calls: they all need to be checked for s# and t# converters, and PY_SSIZE_T_CLEAN must be defined before including Python.h if the calls have been updated accordingly.
Fredrik Lundh has written a scanner [1] which checks the code of a C module for usage of APIs whose signature has changed.
Discussion
Why not size_t
An initial attempt to implement this feature tried to use size_t. It quickly turned out that this cannot work: Python uses negative indices in many places (to indicate counting from the end). Even in places where size_t would be usable, too many reformulations of code were necessary, e.g. in loops like:
for(index = length-1; index >= 0; index--)
This loop will never terminate if index is changed from int to size_t.
Why not Py_intptr_t
Conceptually, Py_intptr_t and Py_ssize_t are different things: Py_intptr_t needs to be the same size as void*, and Py_ssize_t the same size as size_t. These could differ, e.g. on machines where pointers have segment and offset. On current flat-address space machines, there is no difference, so for all practical purposes, Py_intptr_t would have worked as well.
Doesn't this break much code?
With the changes proposed, code breakage is fairly minimal. On a 32-bit system, no code will break, as Py_ssize_t is just a typedef for int.
On a 64-bit system, the compiler will warn in many places. If these warnings are ignored, the code will continue to work as long as the container sizes don't exceed 2**31, i.e. it will work nearly as well as it does currently. There are two exceptions to this statement: if the extension module implements the sequence protocol, it must be updated, or the calling conventions will be wrong. The other exception is the places where Py_ssize_t is output through a pointer (rather than a return value); this applies most notably to codecs and slice objects.
If the conversion of the code is made, the same code can continue to work on earlier Python releases.
Doesn't this consume too much memory?
One might think that using Py_ssize_t in all tuples, strings, lists, etc. is a waste of space. This is not true, though: on a 32-bit machine, there is no change. On a 64-bit machine, the size of many containers doesn't change, e.g.
- in lists and tuples, a pointer immediately follows the ob_size member. This means that the compiler currently inserts 4 padding bytes; with the change, these padding bytes become part of the size.
- in strings, the ob_shash field follows ob_size. This field is of type long, which is a 64-bit type on most 64-bit systems (except Win64), so the compiler inserts padding before it as well.
Open Issues
Marc-Andre Lemburg commented that complete backwards compatibility with existing source code should be preserved. In particular, functions that have Py_ssize_t* output arguments should continue to run correctly even if the callers pass int*.
It is not clear what strategy could be used to implement that requirement.
Copyright
This document has been placed in the public domain.
pep-0354 Enumerations in Python
| PEP: | 354 |
|---|---|
| Title: | Enumerations in Python |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Ben Finney <ben+python at benfinney.id.au> |
| Status: | Superseded |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 20-Dec-2005 |
| Python-Version: | 2.6 |
| Post-History: | 20-Dec-2005 |
| Superseded-By: | 435 |
Contents
Rejection Notice
This PEP has been rejected. This doesn't slot nicely into any of the existing modules (like collections), and the Python standard library eschews having lots of individual data structures in their own modules. Also, the PEP has generated no widespread interest. For those who need enumerations, there are cookbook recipes and PyPI packages that meet these needs.
Note: this PEP was superseded by PEP 435, which was accepted in May 2013.
Abstract
This PEP specifies an enumeration data type for Python.
An enumeration is an exclusive set of symbolic names bound to arbitrary unique values. Values within an enumeration can be iterated and compared, but the values have no inherent relationship to values outside the enumeration.
Motivation
The properties of an enumeration are useful for defining an immutable, related set of constant values that have a defined sequence but no inherent semantic meaning. Classic examples are days of the week (Sunday through Saturday) and school assessment grades ('A' through 'D', and 'F'). Other examples include error status values and states within a defined process.
It is possible to simply define a sequence of values of some other basic type, such as int or str, to represent discrete arbitrary values. However, an enumeration ensures that such values are distinct from any others, and that operations without meaning ("Wednesday times two") are not defined for these values.
Specification
An enumerated type is created from a sequence of arguments to the type's constructor:
>>> Weekdays = enum('sun', 'mon', 'tue', 'wed', 'thu', 'fri', 'sat')
>>> Grades = enum('A', 'B', 'C', 'D', 'F')
Enumerations with no values are meaningless. The exception EnumEmptyError is raised if the constructor is called with no value arguments.
The values are bound to attributes of the new enumeration object:
>>> today = Weekdays.mon
The values can be compared:
>>> if today == Weekdays.fri:
...     print "Get ready for the weekend"
Values within an enumeration cannot be meaningfully compared except with values from the same enumeration. The comparison operation functions return NotImplemented [1] when a value from an enumeration is compared against any value not from the same enumeration or of a different type:
>>> gym_night = Weekdays.wed
>>> gym_night.__cmp__(Weekdays.mon)
1
>>> gym_night.__cmp__(Weekdays.wed)
0
>>> gym_night.__cmp__(Weekdays.fri)
-1
>>> gym_night.__cmp__(23)
NotImplemented
>>> gym_night.__cmp__("wed")
NotImplemented
>>> gym_night.__cmp__(Grades.B)
NotImplemented
This allows the operation to succeed, evaluating to a boolean value:
>>> gym_night = Weekdays.wed
>>> gym_night < Weekdays.mon
False
>>> gym_night < Weekdays.wed
False
>>> gym_night < Weekdays.fri
True
>>> gym_night < 23
False
>>> gym_night > 23
True
>>> gym_night > "wed"
True
>>> gym_night > Grades.B
True
Coercing a value from an enumeration to a str results in the string that was specified for that value when constructing the enumeration:
>>> gym_night = Weekdays.wed >>> str(gym_night) 'wed'
The sequence index of each value from an enumeration is exported as an integer via that value's index attribute:
>>> gym_night = Weekdays.wed >>> gym_night.index 3
An enumeration can be iterated, returning its values in the sequence they were specified when the enumeration was created:
>>> print [str(day) for day in Weekdays] ['sun', 'mon', 'tue', 'wed', 'thu', 'fri', 'sat']
Values from an enumeration are hashable, and can be used as dict keys:
>>> plans = {}
>>> plans[Weekdays.sat] = "Feed the horse"
The normal usage of enumerations is to provide a set of possible values for a data type, which can then be used to map to other information about the values:
>>> for report_grade in Grades:
...     report_students[report_grade] = \
...         [s for s in students if s.grade == report_grade]
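The interface described above can be condensed into a short sketch (written in modern Python; the EnumValue class and all internals are assumptions of this sketch, and both the real PyPI enum package and PEP 435 differ):

```python
class EnumEmptyError(ValueError):
    """Raised when an enumeration is constructed with no values."""

class EnumValue(object):
    """One symbolic value: knows its enumeration, index, and string key."""

    def __init__(self, enumtype, index, key):
        self.enumtype, self.index, self.key = enumtype, index, key

    def __str__(self):
        return self.key

    def __hash__(self):
        return hash((id(self.enumtype), self.index))

    def _other_index(self, other):
        # Only values from the same enumeration are comparable.
        if getattr(other, "enumtype", None) is not self.enumtype:
            return None
        return other.index

    def __eq__(self, other):
        idx = self._other_index(other)
        return NotImplemented if idx is None else self.index == idx

    def __lt__(self, other):
        idx = self._other_index(other)
        return NotImplemented if idx is None else self.index < idx

class enum(object):
    """Container of EnumValue attributes, iterable in creation order."""

    def __init__(self, *keys):
        if not keys:
            raise EnumEmptyError("at least one value is required")
        self._values = tuple(EnumValue(self, i, k)
                             for i, k in enumerate(keys))
        for value in self._values:
            setattr(self, value.key, value)

    def __iter__(self):
        return iter(self._values)

Weekdays = enum('sun', 'mon', 'tue', 'wed', 'thu', 'fri', 'sat')
print([str(day) for day in Weekdays])
print(Weekdays.wed.index)
```

Note one deliberate simplification: returning NotImplemented from an ordering comparison raises TypeError in Python 3 rather than falling back to the arbitrary cross-type ordering the PEP's Python 2 examples show.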
Rationale -- Other designs considered
All in one class
Some implementations have the enumeration and its values all as attributes of a single object or class.
This PEP specifies a design where the enumeration is a container, and the values are simple comparables. It was felt that attempting to place all the properties of enumeration within a single class complicates the design without apparent benefit.
Metaclass for creating enumeration classes
The enumerations specified in this PEP are instances of an enum type. Some alternative designs implement each enumeration as its own class, and a metaclass to define common properties of all enumerations.
One motivation for having a class (rather than an instance) for each enumeration is to allow subclasses of enumerations, extending and altering an existing enumeration. A class, though, implies that instances of that class will be created; it is difficult to imagine what it means to have separate instances of a "days of the week" class, where each instance contains all days. This usually leads to having each class follow the Singleton pattern, further complicating the design.
In contrast, this PEP specifies enumerations that are not expected to be extended or modified. It is, of course, possible to create a new enumeration from the string values of an existing one, or even subclass the enum type if desired.
Hiding attributes of enumerated values
A previous design had the enumerated values hiding as much as possible about their implementation, to the point of not exporting the string key and sequence index.
The design in this PEP acknowledges that programs will often find it convenient to know the enumerated value's enumeration type, sequence index, and string key specified for the value. These are exported by the enumerated value as attributes.
Implementation
This design is based partly on a recipe [2] from the Python Cookbook.
The PyPI package enum [3] provides a Python implementation of the data types described in this PEP.
References and Footnotes
| [1] | The NotImplemented return value from comparison operations signals the Python interpreter to attempt alternative comparisons or other fallbacks. <http://docs.python.org/reference/datamodel.html#the-standard-type-hierarchy> |
| [2] | "First Class Enums in Python", Zoran Isailovski, Python Cookbook recipe 413486 <http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/413486> |
| [3] | Python Package Index, package enum <http://cheeseshop.python.org/pypi/enum/> |
Copyright
This document has been placed in the public domain.
pep-0355 Path - Object oriented filesystem paths
| PEP: | 355 |
|---|---|
| Title: | Path - Object oriented filesystem paths |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Bjรถrn Lindqvist <bjourne at gmail.com> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 24-Jan-2006 |
| Python-Version: | 2.5 |
| Post-History: |
Rejection Notice
This PEP has been rejected (in this form). The proposed path class
is the ultimate kitchen sink; but the notion that it's better to
implement *all* functionality that uses a path as a method on a single
class is an anti-pattern. (E.g. why not open()? Or execfile()?)
Subclassing from str is a particularly bad idea; many string
operations make no sense when applied to a path. This PEP has
lingered, and while the discussion flares up from time to time,
it's time to put this PEP out of its misery. A less far-fetched
proposal might be more palatable.
Abstract
This PEP describes a new class, Path, to be added to the os
module, for handling paths in an object oriented fashion. The
"weak" deprecation of various related functions is also discussed
and recommended.
Background
The ideas expressed in this PEP are not recent, but have been
debated in the Python community for many years. Many have felt
that the API for manipulating file paths as offered in the os.path
module is inadequate. The first proposal for a Path object was
raised by Just van Rossum on python-dev in 2001 [2]. In 2003,
Jason Orendorff released version 1.0 of the "path module" which
was the first public implementation that used objects to represent
paths [3].
The path module quickly became very popular and numerous attempts
were made to get the path module included in the Python standard
library; [4], [5], [6], [7].
This PEP summarizes the ideas and suggestions people have
expressed about the path module and proposes that a modified
version should be included in the standard library.
Motivation
Dealing with filesystem paths is a common task in any programming
language, and very common in a high-level language like Python.
Good support for this task is needed, because:
- Almost every program uses paths to access files. It makes sense
that a task that is performed so often should be as intuitive
and as easy to perform as possible.
- It makes Python an even better replacement language for
over-complicated shell scripts.
Currently, Python has a large number of different functions
scattered over half a dozen modules for handling paths. This
makes it hard for newbies and experienced developers alike to
choose the right method.
The Path class provides the following enhancements over the
current common practice:
- One "unified" object provides all functionality from previous
functions.
- Subclassability - the Path object can be extended to support
paths other than filesystem paths. The programmer does not need
to learn a new API, but can reuse his or her knowledge of Path
to deal with the extended class.
- With all related functionality in one place, the right approach
is easier to learn as one does not have to hunt through many
different modules for the right functions.
- Python is an object oriented language. Just as files,
datetimes and sockets are objects, so are paths; they are not
merely strings to be passed to functions. Path objects are
inherently a pythonic idea.
- Path takes advantage of properties. Properties make for more
readable code.
if imgpath.ext == '.jpg':
    jpegdecode(imgpath)
Is better than:
if os.path.splitext(imgpath)[1] == '.jpg':
    jpegdecode(imgpath)
Rationale
The following points summarize the design:
- Path extends from string, therefore all code which expects
string pathnames need not be modified and no existing code will
break.
- A Path object can be created either by using the classmethod
Path.cwd, by instantiating the class with a string representing
a path or by using the default constructor which is equivalent
to Path(".").
- Path provides common pathname manipulation, pattern expansion,
pattern matching and other high-level file operations including
copying. Basically Path provides everything path-related except
the manipulation of file contents, for which file objects are
better suited.
- Platform incompatibilities are dealt with by simply not
providing system-specific methods on platforms that lack them.
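The construction rules described above can be sketched with a minimal
str subclass. This is a hypothetical stand-in for the proposed class,
not the reference implementation:

```python
import os

class Path(str):
    """Minimal sketch of the proposed constructor semantics."""

    def __new__(cls, *args):
        # No arguments: equivalent to Path(os.curdir), i.e. Path(".")
        if not args:
            return str.__new__(cls, os.curdir)
        # Otherwise the arguments are concatenated with os.path.join()
        return str.__new__(cls, os.path.join(*args))

    @classmethod
    def cwd(cls):
        return cls(os.getcwd())

assert Path() == os.curdir
assert Path("foo", "bar") == os.path.join("foo", "bar")
assert isinstance(Path.cwd(), Path)
```

Because the class subclasses str, any of these objects can be passed
directly to existing functions that expect string pathnames.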
Specification
This class defines the following public interface (docstrings have
been extracted from the reference implementation, and shortened
for brevity; see the reference implementation for more detail):
class Path(str):
# Special Python methods:
def __new__(cls, *args) => Path
"""
Creates a new path object concatenating the *args. *args
may only contain Path objects or strings. If *args is
empty, Path(os.curdir) is created.
"""
def __repr__(self): ...
def __add__(self, more): ...
def __radd__(self, other): ...
# Alternative constructor.
def cwd(cls): ...
# Operations on path strings:
def abspath(self) => Path
"""Returns the absolute path of self as a new Path object."""
def normcase(self): ...
def normpath(self): ...
def realpath(self): ...
def expanduser(self): ...
def expandvars(self): ...
def basename(self): ...
def expand(self): ...
def splitpath(self) => (Path, str)
"""p.splitpath() -> Return (p.parent, p.name)."""
def stripext(self) => Path
"""p.stripext() -> Remove one file extension from the path."""
def splitunc(self): ... [1]
def splitall(self): ...
def relpath(self): ...
def relpathto(self, dest): ...
# Properties about the path:
parent => Path
"""This Path's parent directory as a new path object."""
name => str
"""The name of this file or directory without the full path."""
ext => str
"""
The file extension or an empty string if Path refers to a
file without an extension or a directory.
"""
drive => str
"""
The drive specifier. Always empty on systems that don't
use drive specifiers.
"""
namebase => str
"""
The same as path.name, but with one file extension
stripped off.
"""
uncshare[1]
# Operations that return lists of paths:
def listdir(self, pattern = None): ...
def dirs(self, pattern = None): ...
def files(self, pattern = None): ...
def walk(self, pattern = None): ...
def walkdirs(self, pattern = None): ...
def walkfiles(self, pattern = None): ...
def match(self, pattern) => bool
"""Returns True if self.name matches the given pattern."""
def matchcase(self, pattern) => bool
"""
Like match() but is guaranteed to be case sensitive even
on platforms with case insensitive filesystems.
"""
def glob(self, pattern): ...
# Methods for retrieving information about the filesystem
# path:
def exists(self): ...
def isabs(self): ...
def isdir(self): ...
def isfile(self): ...
def islink(self): ...
def ismount(self): ...
def samefile(self, other): ... [1]
def atime(self): ...
"""Last access time of the file."""
def mtime(self): ...
"""Last-modified time of the file."""
def ctime(self): ...
"""
Return the system's ctime which, on some systems (like
Unix) is the time of the last change, and, on others (like
Windows), is the creation time for path.
"""
def size(self): ...
def access(self, mode): ... [1]
def stat(self): ...
def lstat(self): ...
def statvfs(self): ... [1]
def pathconf(self, name): ... [1]
# Methods for manipulating information about the filesystem
# path.
def utime(self, times) => None
def chmod(self, mode) => None
def chown(self, uid, gid) => None [1]
def rename(self, new) => None
def renames(self, new) => None
# Create/delete operations on directories
def mkdir(self, mode = 0777): ...
def makedirs(self, mode = 0777): ...
def rmdir(self): ...
def removedirs(self): ...
# Modifying operations on files
def touch(self): ...
def remove(self): ...
def unlink(self): ...
# Modifying operations on links
def link(self, newpath): ...
def symlink(self, newlink): ...
def readlink(self): ...
def readlinkabs(self): ...
# High-level functions from shutil
def copyfile(self, dst): ...
def copymode(self, dst): ...
def copystat(self, dst): ...
def copy(self, dst): ...
def copy2(self, dst): ...
def copytree(self, dst, symlinks = True): ...
def move(self, dst): ...
def rmtree(self, ignore_errors = False, onerror = None): ...
# Special stuff from os
def chroot(self): ... [1]
def startfile(self): ... [1]
Replacing older functions with the Path class
In this section, "a ==> b" means that b can be used as a
replacement for a.
In the following examples, we assume that the Path class is
imported with "from path import Path".
1. Replacing os.path.join
--------------------------
os.path.join(os.getcwd(), "foobar")
==>
Path(Path.cwd(), "foobar")
os.path.join("foo", "bar", "baz")
==>
Path("foo", "bar", "baz")
2. Replacing os.path.splitext
------------------------------
fname = "Python2.4.tar.gz"
os.path.splitext(fname)[1]
==>
fname = Path("Python2.4.tar.gz")
fname.ext
Or if you want both parts:
fname = "Python2.4.tar.gz"
base, ext = os.path.splitext(fname)
==>
fname = Path("Python2.4.tar.gz")
base, ext = fname.namebase, fname.ext
3. Replacing glob.glob
-----------------------
lib_dir = "/lib"
libs = glob.glob(os.path.join(lib_dir, "*.so"))
==>
lib_dir = Path("/lib")
libs = lib_dir.files("*.so")
Deprecations
Introducing this module to the standard library introduces a need
for the "weak" deprecation of a number of existing modules and
functions. These modules and functions are so widely used that
they cannot be truly deprecated, as in generating a
DeprecationWarning. Here "weak deprecation" means notes in the
documentation only.
The table below lists the existing functionality that should be
deprecated.
Path method/property Deprecates function
-------------------- -------------------
normcase() os.path.normcase()
normpath() os.path.normpath()
realpath() os.path.realpath()
expanduser() os.path.expanduser()
expandvars() os.path.expandvars()
parent os.path.dirname()
name os.path.basename()
splitpath() os.path.split()
drive os.path.splitdrive()
ext os.path.splitext()
splitunc() os.path.splitunc()
__new__() os.path.join(), os.curdir
listdir() os.listdir() [fnmatch.filter()]
match() fnmatch.fnmatch()
matchcase() fnmatch.fnmatchcase()
glob() glob.glob()
exists() os.path.exists()
isabs() os.path.isabs()
isdir() os.path.isdir()
isfile() os.path.isfile()
islink() os.path.islink()
ismount() os.path.ismount()
samefile() os.path.samefile()
atime() os.path.getatime()
ctime() os.path.getctime()
mtime() os.path.getmtime()
size() os.path.getsize()
cwd() os.getcwd()
access() os.access()
stat() os.stat()
lstat() os.lstat()
statvfs() os.statvfs()
pathconf() os.pathconf()
utime() os.utime()
chmod() os.chmod()
chown() os.chown()
rename() os.rename()
renames() os.renames()
mkdir() os.mkdir()
makedirs() os.makedirs()
rmdir() os.rmdir()
removedirs() os.removedirs()
remove() os.remove()
unlink() os.unlink()
link() os.link()
symlink() os.symlink()
readlink() os.readlink()
chroot() os.chroot()
startfile() os.startfile()
copyfile() shutil.copyfile()
copymode() shutil.copymode()
copystat() shutil.copystat()
copy() shutil.copy()
copy2() shutil.copy2()
copytree() shutil.copytree()
move() shutil.move()
rmtree() shutil.rmtree()
The Path class deprecates the whole of os.path, shutil, fnmatch
and glob. A big chunk of os is also deprecated.
Closed Issues
A number of contentious issues have been resolved since this PEP
first appeared on python-dev:
* The __div__() method was removed. Overloading the / (division)
operator may be "too much magic" and make path concatenation
appear to be division. The method can always be re-added later
if the BDFL so desires. In its place, __new__() got an *args
argument that accepts both Path and string objects. The *args
are concatenated with os.path.join() which is used to construct
the Path object. These changes obsoleted the problematic
joinpath() method which was removed.
* The methods and the properties getatime()/atime,
getctime()/ctime, getmtime()/mtime and getsize()/size duplicated
each other. These methods and properties have been merged to
atime(), ctime(), mtime() and size(). They are methods rather
than properties because their values may change unexpectedly.
The following example is not guaranteed to always pass the
assertion:
p = Path("foobar")
s = p.size()
assert p.size() == s
Open Issues
Some functionality of Jason Orendorff's path module has been
omitted:
* Function for opening a path - better handled by the builtin
open().
* Functions for reading and writing whole files - better handled
by file objects' own read() and write() methods.
* A chdir() function may be a worthy inclusion.
* A deprecation schedule needs to be set up. How much
functionality should Path implement? How much of existing
functionality should it deprecate and when?
* The name obviously has to be either "path" or "Path," but where
should it live? In its own module or in os?
* Due to Path subclassing either str or unicode, the following
non-magic, public methods are available on Path objects:
capitalize(), center(), count(), decode(), encode(),
endswith(), expandtabs(), find(), index(), isalnum(),
isalpha(), isdigit(), islower(), isspace(), istitle(),
isupper(), join(), ljust(), lower(), lstrip(), replace(),
rfind(), rindex(), rjust(), rsplit(), rstrip(), split(),
splitlines(), startswith(), strip(), swapcase(), title(),
translate(), upper(), zfill()
On python-dev it has been debated whether this inheritance is
sane. Most participants said that most string methods don't make
sense in the context of filesystem paths -- they are just dead
weight. The other position, also argued on
python-dev, is that inheriting from string is very convenient
because it allows code to "just work" with Path objects without
having to be adapted for them.
One of the problems is that at the Python level, there is no way
to make an object "string-like enough," so that it can be passed
to the builtin function open() (and other builtins expecting a
string or buffer), unless the object inherits from either str or
unicode. Therefore, to not inherit from string requires changes
in CPython's core.
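The "dead weight" point can be illustrated with a bare str subclass
standing in for the proposed Path (a sketch, not the reference
implementation): every string method is inherited whether or not it
makes sense for a path, and those methods return plain str, silently
dropping the subclass:

```python
class Path(str):
    """Bare str subclass, standing in for the proposed Path."""

p = Path("/usr/local/bin")
# upper() is meaningless for a path, yet available; worse, it
# returns a plain str, so the Path type is silently lost.
assert type(p.upper()) is str
# Concatenation likewise falls back to str.
assert type(p + "/python") is str
```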
The functions and modules that this new module is trying to
replace (os.path, shutil, fnmatch, glob and parts of os) are
expected to be available in future Python versions for a long
time, to preserve backwards compatibility.
Reference Implementation
Currently, the Path class is implemented as a thin wrapper around
the standard library modules fnmatch, glob, os, os.path and
shutil. The intention of this PEP is to move functionality from
the aforementioned modules to Path while they are being
deprecated.
For more detail and an implementation see:
http://wiki.python.org/moin/PathModule
Examples
In this section, "a ==> b" means that b can be used as a
replacement for a.
1. Make all python files in a directory executable
------------------------------------------------------
DIR = '/usr/home/guido/bin'
for f in os.listdir(DIR):
if f.endswith('.py'):
path = os.path.join(DIR, f)
os.chmod(path, 0755)
==>
for f in Path('/usr/home/guido/bin').files("*.py"):
f.chmod(0755)
2. Delete emacs backup files
----------------------------
def delete_backups(arg, dirname, names):
for name in names:
if name.endswith('~'):
os.remove(os.path.join(dirname, name))
os.path.walk(os.environ['HOME'], delete_backups, None)
==>
d = Path(os.environ['HOME'])
for f in d.walkfiles('*~'):
f.remove()
3. Finding the relative path to a file
--------------------------------------
b = Path('/users/peter/')
a = Path('/users/peter/synergy/tiki.txt')
a.relpathto(b)
4. Splitting a path into directory and filename
-----------------------------------------------
os.path.split("/path/to/foo/bar.txt")
==>
Path("/path/to/foo/bar.txt").splitpath()
5. List all Python scripts in the current directory tree
--------------------------------------------------------
list(Path().walkfiles("*.py"))
References and Footnotes
[1] Method is not guaranteed to be available on all platforms.
[2] "(idea) subclassable string: path object?", van Rossum, 2001
http://mail.python.org/pipermail/python-dev/2001-August/016663.html
[3] "path module v1.0 released", Orendorff, 2003
http://mail.python.org/pipermail/python-announce-list/2003-January/001984.html
[4] "Some RFE for review", Birkenfeld, 2005
http://mail.python.org/pipermail/python-dev/2005-June/054438.html
[5] "path module", Orendorff, 2003
http://mail.python.org/pipermail/python-list/2003-July/174289.html
[6] "PRE-PEP: new Path class", Roth, 2004
http://mail.python.org/pipermail/python-list/2004-January/201672.html
[7] http://wiki.python.org/moin/PathClass
Copyright
This document has been placed in the public domain.
pep-0356 Python 2.5 Release Schedule
| PEP: | 356 |
|---|---|
| Title: | Python 2.5 Release Schedule |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Neal Norwitz, Guido van Rossum, Anthony Baxter |
| Status: | Final |
| Type: | Informational |
| Created: | 07-Feb-2006 |
| Python-Version: | 2.5 |
| Post-History: |
Abstract
This document describes the development and release schedule for
Python 2.5. The schedule primarily concerns itself with PEP-sized
items. Small features may be added up to and including the first
beta release. Bugs may be fixed until the final release.
There will be at least two alpha releases, two beta releases, and
one release candidate. The release date is planned for
12 September 2006.
Release Manager
Anthony Baxter has volunteered to be Release Manager.
Martin von Loewis is building the Windows installers,
Ronald Oussoren is building the Mac installers,
Fred Drake the doc packages and
Sean Reifschneider the RPMs.
Release Schedule
alpha 1: April 5, 2006 [completed]
alpha 2: April 27, 2006 [completed]
beta 1: June 20, 2006 [completed]
beta 2: July 11, 2006 [completed]
beta 3: August 3, 2006 [completed]
rc 1: August 17, 2006 [completed]
rc 2: September 12, 2006 [completed]
final: September 19, 2006 [completed]
Completed features for 2.5
PEP 308: Conditional Expressions
PEP 309: Partial Function Application
PEP 314: Metadata for Python Software Packages v1.1
PEP 328: Absolute/Relative Imports
PEP 338: Executing Modules as Scripts
PEP 341: Unified try-except/try-finally to try-except-finally
PEP 342: Coroutines via Enhanced Generators
PEP 343: The "with" Statement
(still need updates in Doc/ref and for the contextlib module)
PEP 352: Required Superclass for Exceptions
PEP 353: Using ssize_t as the index type
PEP 357: Allowing Any Object to be Used for Slicing
- ASCII became the default coding
- AST-based compiler
- Access to C AST from Python through new _ast module
- any()/all() builtin truth functions
New standard library modules
- cProfile -- suitable for profiling long running applications
with minimal overhead
- ctypes -- optional component of the windows installer
- ElementTree and cElementTree -- by Fredrik Lundh
- hashlib -- adds support for SHA-224, -256, -384, and -512
(replaces old md5 and sha modules)
- msilib -- for creating MSI files and bdist_msi in distutils.
- pysqlite
- uuid
- wsgiref
Other notable features
- Added support for reading shadow passwords (http://python.org/sf/579435)
- Added support for the Unicode 4.1 UCD
- Added PEP 302 zipfile/__loader__ support to the following modules:
warnings, linecache, inspect, traceback, site, and doctest
- Added pybench Python benchmark suite -- by Marc-Andre Lemburg
- Add write support for mailboxes from the code in sandbox/mailbox.
(Owner: A.M. Kuchling. It would still be good if another person
would take a look at the new code.)
- Support for building "fat" Mac binaries (Intel and PPC)
- Add new icons for Windows with the new Python logo?
- New utilities in functools to help write wrapper functions that
support naive introspection (e.g. having f.__name__ return
the original function name).
- Upgrade pyexpat to use expat 2.0.
- Python core now compiles cleanly with g++
Possible features for 2.5
Each feature below should be implemented prior to beta 1 or
will require BDFL approval for inclusion in 2.5.
- Modules under consideration for inclusion:
- Add new icons for MacOS and Unix with the new Python logo?
(Owner: ???)
MacOS: http://hcs.harvard.edu/~jrus/python/prettified-py-icons.png
- Check the various bits of code in Demo/ all still work, update or
remove the ones that don't.
(Owner: Anthony)
- All modules in Modules/ should be updated to be ssize_t clean.
(Owner: Neal)
Deferred until 2.6:
- bdist_deb in distutils package
http://mail.python.org/pipermail/python-dev/2006-February/060926.html
- bdist_egg in distutils package
- pure python pgen module
(Owner: Guido)
- Remove the fpectl module?
- Make everything in Modules/ build cleanly with g++
Open issues
- Bugs that need resolving before release, ie, they block release:
None
- Bugs deferred until 2.5.1 (or later)
http://python.org/sf/1544279 - Socket module is not thread-safe
http://python.org/sf/1541420 - tools and demo missing from windows
http://python.org/sf/1542451 - crash with continue in nested try/finally
http://python.org/sf/1475523 - gettext.py bug (owner: Martin v. Loewis)
http://python.org/sf/1467929 - %-formatting and dicts
http://python.org/sf/1446043 - unicode() does not raise LookupError
- The PEP 302 changes to (at least) pkgutil, runpy and pydoc must be
documented.
- test_zipfile64 takes too long and too much disk space for
most of the buildbots. How should this be handled?
It is currently disabled.
- should C modules listed in "Undocumented modules" be removed too?
"timing" (listed as obsolete), "cl" (listed as possibly not up-to-date),
and "sv" (listed as obsolete hardware specific).
Copyright
This document has been placed in the public domain.
pep-0357 Allowing Any Object to be Used for Slicing
| PEP: | 357 |
|---|---|
| Title: | Allowing Any Object to be Used for Slicing |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Travis Oliphant <oliphant at ee.byu.edu> |
| Status: | Final |
| Type: | Standards Track |
| Created: | 09-Feb-2006 |
| Python-Version: | 2.5 |
| Post-History: |
Abstract
This PEP proposes adding an nb_index slot in PyNumberMethods and an
__index__ special method so that arbitrary objects can be used
whenever integers are explicitly needed in Python, such as in slice
syntax (from which the slot gets its name).
Rationale
Currently integers and long integers play a special role in
slicing in that they are the only objects allowed in slice
syntax. In other words, if X is an object implementing the
sequence protocol, then X[obj1:obj2] is only valid if obj1 and
obj2 are both integers or long integers. There is no way for obj1
and obj2 to tell Python that they could be reasonably used as
indexes into a sequence. This is an unnecessary limitation.
In NumPy, for example, there are 8 different integer scalars
corresponding to unsigned and signed integers of 8, 16, 32, and 64
bits. These type-objects could reasonably be used as integers in
many places where Python expects true integers but cannot inherit from
the Python integer type because of incompatible memory layouts.
There should be some way to be able to tell Python that an object can
behave like an integer.
It is not possible to use the nb_int (and __int__ special method)
for this purpose because that method is used to *coerce* objects
to integers. It would be inappropriate to allow every object that
can be coerced to an integer to be used as an integer everywhere
Python expects a true integer. For example, if __int__ were used
to convert an object to an integer in slicing, then float objects
would be allowed in slicing and x[3.2:5.8] would not raise an error
as it should.
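Today's Python shows this distinction in action: float defines
__int__ (explicit coercion) but deliberately omits __index__, so
floats are rejected as slice bounds:

```python
# Explicit coercion via __int__-style conversion works...
assert int(3.2) == 3

# ...but a float cannot be used in slice syntax, because it
# does not define __index__.
try:
    "abcdef"[3.2:5.8]
    raised = False
except TypeError:
    raised = True
assert raised
```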
Proposal
Add an nb_index slot to PyNumberMethods, and a corresponding
__index__ special method. Objects could define a function to
place in the nb_index slot that returns a Python integer
(either an int or a long). This integer can
then be appropriately converted to a Py_ssize_t value whenever
Python needs one such as in PySequence_GetSlice,
PySequence_SetSlice, and PySequence_DelSlice.
Specification:
1) The nb_index slot will have the following signature
PyObject *index_func (PyObject *self)
The returned object must be a Python IntType or
Python LongType. NULL should be returned on
error with an appropriate error set.
2) The __index__ special method will have the signature
def __index__(self):
return obj
where obj must be either an int or a long.
3) 3 new abstract C-API functions will be added
a) The first checks to see if the object supports the index
slot and if it is filled in.
int PyIndex_Check(obj)
This will return true if the object defines the nb_index
slot.
b) The second is a simple wrapper around the nb_index call that
raises PyExc_TypeError if the call is not available or if it
doesn't return an int or long. Because the
PyIndex_Check is performed inside the PyNumber_Index call
you can call it directly and manage any error rather than
check for compatibility first.
PyObject *PyNumber_Index (PyObject *obj)
c) The third call helps deal with the common situation of
actually needing a Py_ssize_t value from the object to use for
indexing or other needs.
Py_ssize_t PyNumber_AsSsize_t(PyObject *obj, PyObject *exc)
The function calls the nb_index slot of obj if it is
available and then converts the returned Python integer into
a Py_ssize_t value. If this goes well, then the value is
returned. The second argument allows control over what
happens if the integer returned from nb_index cannot fit
into a Py_ssize_t value.
If exc is NULL, then the returned value will be clipped to
PY_SSIZE_T_MAX or PY_SSIZE_T_MIN depending on whether the
nb_index slot of obj returned a positive or negative
integer. If exc is non-NULL, then it is the error object
that will be set to replace the PyExc_OverflowError that was
raised when the Python integer or long was converted to Py_ssize_t.
4) A new operator.index(obj) function will be added that calls the
equivalent of obj.__index__() and raises an error if obj does not
implement the special method.
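A sketch of the Python-level behavior the specification describes,
using a hypothetical integer-like class that is not an int subclass:

```python
import operator

class IndexLike:
    """Hypothetical object usable as an integer via __index__."""
    def __init__(self, value):
        self.value = value
    def __index__(self):
        return self.value

seq = list(range(10))
# Usable anywhere Python needs a true integer, e.g. slice syntax.
assert seq[IndexLike(2):IndexLike(5)] == [2, 3, 4]
# operator.index() calls __index__ ...
assert operator.index(IndexLike(7)) == 7
# ... and raises TypeError for objects without it, such as float.
try:
    operator.index(3.2)
    raised = False
except TypeError:
    raised = True
assert raised
```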
Implementation Plan
1) Add the nb_index slot in object.h and modify typeobject.c to
create the __index__ method
2) Change the ISINT macro in ceval.c to ISINDEX and alter it to
accommodate objects with the index slot defined.
3) Change the _PyEval_SliceIndex function to accommodate objects
with the index slot defined.
4) Change all builtin objects (e.g. lists) that use the as_mapping
slots for subscript access and use a special-check for integers to
check for the slot as well.
5) Add the nb_index slot to integers and long_integers
(which just return themselves)
6) Add PyNumber_Index C-API to return an integer from any
Python Object that has the nb_index slot.
7) Add the operator.index(x) function.
8) Alter arrayobject.c and mmapmodule.c to use the new C-API for their
sub-scripting and other needs.
9) Add unit-tests
Discussion Questions
Speed:
Implementation should not slow down Python because integers and long
integers used as indexes will complete in the same number of
instructions. The only change will be that what used to generate
an error will now be acceptable.
Why not use nb_int which is already there?
The nb_int method is used for coercion and so means something
fundamentally different from what is requested here. This PEP
proposes a way for something that *can* already be thought of as
an integer to communicate that information to Python when it needs
an integer. The biggest example of why using nb_int would be a bad
thing is that float objects already define the nb_int method, but
float objects *should not* be used as indexes in a sequence.
Why the name __index__?
Some questions were raised regarding the name __index__ when other
interpretations of the slot are possible. For example, the slot
can be used any time Python requires an integer internally (such
as in "mystring" * 3). The name was suggested by Guido because
slicing syntax is the biggest reason for having such a slot and
in the end no better name emerged. See the discussion thread:
http://mail.python.org/pipermail/python-dev/2006-February/thread.html#60594
for examples of names that were suggested such as "__discrete__" and
"__ordinal__".
Why return PyObject * from nb_index?
Initially Py_ssize_t was selected as the return type for the
nb_index slot. However, this led to an inability to track and
distinguish overflow and underflow errors without ugly and brittle
hacks. As the nb_index slot is used in at least 3 different ways
in the Python core (to get an integer, to get a slice end-point,
and to get a sequence index), there is quite a bit of flexibility
needed to handle all these cases. The importance of having the
necessary flexibility to handle all the use cases is critical.
For example, the initial implementation that returned Py_ssize_t for
nb_index led to the discovery that on a 32-bit machine with >=2GB of RAM
s = 'x' * (2**100) works but len(s) was clipped at 2147483647.
Several fixes were suggested but eventually it was decided that
nb_index needed to return a Python Object similar to the nb_int
and nb_long slots in order to handle overflow correctly.
Why can't __index__ return any object with the nb_index method?
This would allow infinite recursion in many different ways that are not
easy to check for. This restriction is similar to the requirement that
__nonzero__ return an int or a bool.
Reference Implementation
Submitted as patch 1436368 to SourceForge.
Copyright
This document is placed in the public domain.
pep-0358 The "bytes" Object
| PEP: | 358 |
|---|---|
| Title: | The "bytes" Object |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Neil Schemenauer <nas at arctrix.com>, Guido van Rossum <guido at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 15-Feb-2006 |
| Python-Version: | 2.6, 3.0 |
| Post-History: |
Update
This PEP has partially been superseded by PEP 3137.
Abstract
This PEP outlines the introduction of a raw bytes sequence type.
Adding the bytes type is one step in the transition to Unicode
based str objects which will be introduced in Python 3.0.
The PEP describes how the bytes type should work in Python 2.6, as
well as how it should work in Python 3.0. (Occasionally there are
differences because in Python 2.6, we have two string types, str
and unicode, while in Python 3.0 we will only have one string
type, whose name will be str but whose semantics will be like the
2.6 unicode type.)
Motivation
Python's current string objects are overloaded. They serve to hold
both sequences of characters and sequences of bytes. This
overloading of purpose leads to confusion and bugs. In future
versions of Python, string objects will be used for holding
character data. The bytes object will fulfil the role of a byte
container. Eventually the unicode type will be renamed to str
and the old str type will be removed.
Specification
A bytes object stores a mutable sequence of integers that are in
the range 0 to 255. Unlike string objects, indexing a bytes
object returns an integer. Assigning or comparing an object that
is not an integer to an element causes a TypeError exception.
Assigning an element to a value outside the range 0 to 255 causes
a ValueError exception. The .__len__() method of bytes returns
the number of integers stored in the sequence (i.e. the number of
bytes).
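These semantics largely match what eventually shipped as bytearray
in Python 3, which can serve as a working illustration (note this
uses bytearray, not the PEP's own proposed type):

```python
b = bytearray([10, 20, 30])
assert b[0] == 10   # indexing returns an integer, not a string
assert len(b) == 3  # __len__ counts bytes

b[1] = 255          # in-range integer assignment is allowed
try:
    b[2] = 256      # outside the range 0..255
    raised = False
except ValueError:
    raised = True
assert raised

try:
    b[0] = "x"      # assigning a non-integer
    raised_type = False
except TypeError:
    raised_type = True
assert raised_type
```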
The constructor of the bytes object has the following signature:
bytes([initializer[, encoding]])
If no arguments are provided then a bytes object containing zero
elements is created and returned. The initializer argument can be
a string (in 2.6, either str or unicode), an iterable of integers,
or a single integer. The pseudo-code for the constructor
(optimized for clear semantics, not for speed) is:
def bytes(initializer=0, encoding=None):
if isinstance(initializer, int): # In 2.6, int -> (int, long)
initializer = [0]*initializer
elif isinstance(initializer, basestring):
if isinstance(initializer, unicode): # In 3.0, "if True"
if encoding is None:
# In 3.0, raise TypeError("explicit encoding required")
encoding = sys.getdefaultencoding()
initializer = initializer.encode(encoding)
initializer = [ord(c) for c in initializer]
else:
if encoding is not None:
raise TypeError("no encoding allowed for this initializer")
tmp = []
for c in initializer:
if not isinstance(c, int):
raise TypeError("initializer must be iterable of ints")
if not 0 <= c < 256:
raise ValueError("initializer element out of range")
tmp.append(c)
initializer = tmp
new = <new bytes object of length len(initializer)>
for i, c in enumerate(initializer):
new[i] = c
return new
The .__repr__() method returns a string that can be evaluated to
generate a new bytes object containing a bytes literal:
>>> bytes([10, 20, 30])
b'\n\x14\x1e'
The object has a .decode() method equivalent to the .decode()
method of the str object. The object has a classmethod .fromhex()
that takes a string of characters from the set [0-9a-fA-F ] and
returns a bytes object (similar to binascii.unhexlify). For
example:
>>> bytes.fromhex('5c5350ff')
b'\\SP\xff'
>>> bytes.fromhex('5c 53 50 ff')
b'\\SP\xff'
The object has a .hex() method that does the reverse conversion
(similar to binascii.hexlify):
>>> bytes([92, 83, 80, 255]).hex()
'5c5350ff'
The bytes object has some methods similar to list methods, and
others similar to str methods. Here is a complete list of
methods, with their approximate signatures:
.__add__(bytes) -> bytes
.__contains__(int | bytes) -> bool
.__delitem__(int | slice) -> None
.__delslice__(int, int) -> None
.__eq__(bytes) -> bool
.__ge__(bytes) -> bool
.__getitem__(int | slice) -> int | bytes
.__getslice__(int, int) -> bytes
.__gt__(bytes) -> bool
.__iadd__(bytes) -> bytes
.__imul__(int) -> bytes
.__iter__() -> iterator
.__le__(bytes) -> bool
.__len__() -> int
.__lt__(bytes) -> bool
.__mul__(int) -> bytes
.__ne__(bytes) -> bool
.__reduce__(...) -> ...
.__reduce_ex__(...) -> ...
.__repr__() -> str
.__reversed__() -> bytes
.__rmul__(int) -> bytes
.__setitem__(int | slice, int | iterable[int]) -> None
.__setslice__(int, int, iterable[int]) -> None
.append(int) -> None
.count(int) -> int
.decode(str) -> str | unicode # in 3.0, only str
.endswith(bytes) -> bool
.extend(iterable[int]) -> None
.find(bytes) -> int
.index(bytes | int) -> int
.insert(int, int) -> None
.join(iterable[bytes]) -> bytes
.partition(bytes) -> (bytes, bytes, bytes)
.pop([int]) -> int
.remove(int) -> None
.replace(bytes, bytes) -> bytes
.rindex(bytes | int) -> int
.rpartition(bytes) -> (bytes, bytes, bytes)
.split(bytes) -> list[bytes]
.startswith(bytes) -> bool
.reverse() -> None
.rfind(bytes) -> int
.rindex(bytes | int) -> int
.rsplit(bytes) -> list[bytes]
.translate(bytes, [bytes]) -> bytes
Note the conspicuous absence of .isupper(), .upper(), and friends.
(But see "Open Issues" below.) There is no .__hash__() because
the object is mutable. There is no use case for a .sort() method.
The bytes type also supports the buffer interface, supporting
reading and writing binary (but not character) data.
Out of Scope Issues
* Python 3k will have a much different I/O subsystem. Deciding
how that I/O subsystem will work and interact with the bytes
object is out of the scope of this PEP. The expectation however
is that binary I/O will read and write bytes, while text I/O
will read strings. Since the bytes type supports the buffer
interface, the existing binary I/O operations in Python 2.6 will
support bytes objects.
* It has been suggested that a special method named .__bytes__()
be added to the language to allow objects to be converted into
byte arrays. This decision is out of scope.
* A bytes literal of the form b"..." is also proposed. This is
the subject of PEP 3112.
Open Issues
* The .decode() method is redundant since a bytes object b can
also be decoded by calling unicode(b, <encoding>) (in 2.6) or
str(b, <encoding>) (in 3.0). Do we need encode/decode methods
at all? In a sense the spelling using a constructor is cleaner.
* Need to specify the methods still more carefully.
* Pickling and marshalling support need to be specified.
* Should all those list methods really be implemented?
* A case could be made for supporting .ljust(), .rjust(),
.center() with a mandatory second argument.
* A case could be made for supporting .split() with a mandatory
argument.
* A case could even be made for supporting .islower(), .isupper(),
.isspace(), .isalpha(), .isalnum(), .isdigit() and the
corresponding conversions (.lower() etc.), using the ASCII
definitions for letters, digits and whitespace. If this is
accepted, the cases for .ljust(), .rjust(), .center() and
.split() become much stronger, and they should have default
arguments as well, using an ASCII space or all ASCII whitespace
(for .split()).
Frequently Asked Questions
Q: Why have the optional encoding argument when the encode method of
Unicode objects does the same thing?
A: In the current version of Python, the encode method returns a str
object and we cannot change that without breaking code. The
construct bytes(s.encode(...)) is expensive because it has to
copy the byte sequence multiple times. Also, Python generally
provides two ways of converting an object of type A into an
object of type B: ask an A instance to convert itself to a B, or
ask the type B to create a new instance from an A. Depending on
what A and B are, both APIs make sense; sometimes reasons of
decoupling require that A can't know about B, in which case you
have to use the latter approach; sometimes B can't know about A,
in which case you have to use the former.
Q: Why does bytes ignore the encoding argument if the initializer is
a str? (This only applies to 2.6.)
A: There is no sane meaning that the encoding can have in that case.
str objects *are* byte arrays and they know nothing about the
encoding of character data they contain. We need to assume that
the programmer has provided a str object that already uses the
desired encoding. If you need something other than a pure copy of
the bytes then you need to first decode the string. For example:
bytes(s.decode(encoding1), encoding2)
Q: Why not have the encoding argument default to Latin-1 (or some
other encoding that covers the entire byte range) rather than
ASCII?
A: The system default encoding for Python is ASCII. It seems least
confusing to use that default. Also, in Py3k, using Latin-1 as
the default might not be what users expect. For example, they
might prefer a Unicode encoding. Any default will not always
work as expected. At least ASCII will complain loudly if you try
to encode non-ASCII data.
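The "complain loudly" behavior is easy to demonstrate with a strict codec (a sketch in modern Python, where str.encode() ultimately came to default to UTF-8 rather than ASCII):

```python
# A strict codec rejects non-ASCII data instead of silently mangling it.
try:
    'caf\u00e9'.encode('ascii')
except UnicodeEncodeError:
    result = 'rejected'
else:
    result = 'accepted'
assert result == 'rejected'
```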
Copyright
This document has been placed in the public domain.
pep-0359 The "make" Statement
| PEP: | 359 |
|---|---|
| Title: | The "make" Statement |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Steven Bethard <steven.bethard at gmail.com> |
| Status: | Withdrawn |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 05-Apr-2006 |
| Python-Version: | 2.6 |
| Post-History: | 05-Apr-2006, 06-Apr-2006, 13-Apr-2006 |
Contents
Abstract
This PEP proposes a generalization of the class-declaration syntax, the make statement. The proposed syntax and semantics parallel the syntax for class definition, and so:
make <callable> <name> <tuple>:
<block>
is translated into the assignment:
<name> = <callable>("<name>", <tuple>, <namespace>)
where <namespace> is the dict created by executing <block>. This is mostly syntactic sugar for:
class <name> <tuple>:
__metaclass__ = <callable>
<block>
and is intended to help more clearly express the intent of the statement when something other than a class is being created. Of course, other syntax for such a statement is possible, but it is hoped that by keeping a strong parallel to the class statement, an understanding of how classes and metaclasses work will translate into an understanding of how the make-statement works as well.
The PEP is based on a suggestion [1] from Michele Simionato on the python-dev list.
Withdrawal Notice
This PEP was withdrawn at Guido's request [2]. Guido didn't like it, and in particular didn't like how the property use-case puts the instance methods of a property at a different level than other instance methods and requires fixed names for the property functions.
Motivation
Class statements provide two nice facilities to Python:
- They execute a block of statements and provide the resulting bindings as a dict to the metaclass.
- They encourage DRY (don't repeat yourself) by allowing the class being created to know the name it is being assigned.
Thus in a simple class statement like:
class C(object):
x = 1
def foo(self):
return 'bar'
the metaclass (type) gets called with something like:
C = type('C', (object,), {'x':1, 'foo':<function foo at ...>})
The class statement is just syntactic sugar for the above assignment statement, but clearly a very useful sort of syntactic sugar. It avoids not only the repetition of C, but also simplifies the creation of the dict by allowing it to be expressed as a series of statements.
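That desugaring can be verified by hand in any Python: calling type() directly with the name, bases, and namespace produces a class equivalent to the one the class statement creates.

```python
# The desugaring described above, written out explicitly with type().
def foo(self):
    return 'bar'

C = type('C', (object,), {'x': 1, 'foo': foo})
assert C().x == 1
assert C().foo() == 'bar'
```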
Historically, type instances (a.k.a. class objects) have been the only objects blessed with this sort of syntactic support. The make statement aims to extend this support to other sorts of objects where such syntax would also be useful.
Example: simple namespaces
Let's say I have some attributes in a module that I access like:
mod.thematic_roletype
mod.opinion_roletype
mod.text_format
mod.html_format
and since "Namespaces are one honking great idea", I'd like to be able to access these attributes instead as:
mod.roletypes.thematic
mod.roletypes.opinion
mod.format.text
mod.format.html
I currently have two main options:
- Turn the module into a package, turn roletypes and format into submodules, and move the attributes to the submodules.
- Create roletypes and format classes, and move the attributes to the classes.
The former is a fair chunk of refactoring work, and produces two tiny modules without much content. The latter keeps the attributes local to the module, but creates classes when there is no intention of ever creating instances of those classes.
In situations like this, it would be nice to simply be able to declare a "namespace" to hold the few attributes. With the new make statement, I could introduce my new namespaces with something like:
make namespace roletypes:
thematic = ...
opinion = ...
make namespace format:
text = ...
html = ...
and keep my attributes local to the module without making classes that are never intended to be instantiated. One definition of namespace that would make this work is:
class namespace(object):
def __init__(self, name, args, kwargs):
self.__dict__.update(kwargs)
Given this definition, at the end of the make-statements above, roletypes and format would be namespace instances.
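Since the make statement was never added to Python, the same effect today requires spelling out the call the statement would have made (a sketch; the attribute values here are placeholders):

```python
# The namespace class from the PEP, invoked explicitly with the
# (name, args, kwargs) triple the make statement would have supplied.
class namespace(object):
    def __init__(self, name, args, kwargs):
        self.__dict__.update(kwargs)

roletypes = namespace('roletypes', (), {'thematic': 1, 'opinion': 2})
assert roletypes.thematic == 1
assert roletypes.opinion == 2
```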
Example: GUI objects
In GUI toolkits, objects like frames and panels are often associated with attributes and functions. With the make-statement, code that looks something like:
root = Tkinter.Tk()
frame = Tkinter.Frame(root)
frame.pack()
def say_hi():
print "hi there, everyone!"
hi_there = Tkinter.Button(frame, text="Hello", command=say_hi)
hi_there.pack(side=Tkinter.LEFT)
root.mainloop()
could be rewritten to group the Button's function with its declaration:
root = Tkinter.Tk()
frame = Tkinter.Frame(root)
frame.pack()
make Tkinter.Button hi_there(frame):
text = "Hello"
def command():
print "hi there, everyone!"
hi_there.pack(side=Tkinter.LEFT)
root.mainloop()
Example: custom descriptors
Since descriptors are used to customize access to an attribute, it's often useful to know the name of that attribute. Current Python doesn't give an easy way to find this name and so a lot of custom descriptors, like Ian Bicking's setonce descriptor [3], have to hack around this somehow. With the make-statement, you could create a setonce attribute like:
class A(object):
...
make setonce x:
"A's x attribute"
...
where the setonce descriptor would be defined like:
class setonce(object):
def __init__(self, name, args, kwargs):
self._name = '_setonce_attr_%s' % name
self.__doc__ = kwargs.pop('__doc__', None)
def __get__(self, obj, type=None):
if obj is None:
return self
return getattr(obj, self._name)
def __set__(self, obj, value):
try:
getattr(obj, self._name)
except AttributeError:
setattr(obj, self._name, value)
else:
raise AttributeError("Attribute already set")
def set(self, obj, value):
setattr(obj, self._name, value)
def __delete__(self, obj):
delattr(obj, self._name)
Note that unlike the original implementation, the private attribute name is stable since it uses the name of the descriptor, and therefore instances of class A are pickleable.
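The descriptor above can be exercised without the make statement by instantiating it directly with the attribute name (a sketch restating the PEP's definition in runnable Python 3, minus the unused set/delete helpers):

```python
# setonce as defined above, driven by an explicit constructor call.
class setonce(object):
    def __init__(self, name, args, kwargs):
        self._name = '_setonce_attr_%s' % name
        self.__doc__ = kwargs.pop('__doc__', None)
    def __get__(self, obj, type=None):
        if obj is None:
            return self
        return getattr(obj, self._name)
    def __set__(self, obj, value):
        try:
            getattr(obj, self._name)
        except AttributeError:
            setattr(obj, self._name, value)  # first assignment wins
        else:
            raise AttributeError("Attribute already set")

class A(object):
    x = setonce('x', (), {'__doc__': "A's x attribute"})

a = A()
a.x = 1
assert a.x == 1
try:
    a.x = 2                 # second assignment is rejected
except AttributeError:
    pass
assert a.x == 1
assert A.x.__doc__ == "A's x attribute"
```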
Example: property namespaces
Python's property type takes three function arguments and a docstring argument which, though relevant only to the property, must be declared before it and then passed as arguments to the property call, e.g.:
class C(object):
...
def get_x(self):
...
def set_x(self):
...
x = property(get_x, set_x, "the x of the frobulation")
This issue has been brought up before, and Guido [4] and others [5] have briefly mused over alternate property syntaxes to make declaring properties easier. With the make-statement, the following syntax could be supported:
class C(object):
...
make block_property x:
'''The x of the frobulation'''
def fget(self):
...
def fset(self):
...
with the following definition of block_property:
def block_property(name, args, block_dict):
fget = block_dict.pop('fget', None)
fset = block_dict.pop('fset', None)
fdel = block_dict.pop('fdel', None)
doc = block_dict.pop('__doc__', None)
assert not block_dict
return property(fget, fset, fdel, doc)
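Without the make statement, block_property can still be exercised by passing the would-be block namespace as an explicit dict (a sketch; the lambda getter is a stand-in for a real fget):

```python
# block_property as defined above, called with an explicit namespace.
def block_property(name, args, block_dict):
    fget = block_dict.pop('fget', None)
    fset = block_dict.pop('fset', None)
    fdel = block_dict.pop('fdel', None)
    doc = block_dict.pop('__doc__', None)
    assert not block_dict
    return property(fget, fset, fdel, doc)

class C(object):
    x = block_property('x', (), {
        'fget': lambda self: 42,
        '__doc__': 'The x of the frobulation',
    })

assert C().x == 42
assert C.x.__doc__ == 'The x of the frobulation'
```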
Example: interfaces
Guido [6] and others have occasionally suggested introducing interfaces into python. Most suggestions have offered syntax along the lines of:
interface IFoo:
"""Foo blah blah"""
def fumble(name, count):
"""docstring"""
but since there is currently no way in Python to declare an interface in this manner, most implementations of Python interfaces use class objects instead, e.g. Zope's:
class IFoo(Interface):
"""Foo blah blah"""
def fumble(name, count):
"""docstring"""
With the new make-statement, these interfaces could instead be declared as:
make Interface IFoo:
"""Foo blah blah"""
def fumble(name, count):
"""docstring"""
which makes the intent (that this is an interface, not a class) much clearer.
Specification
Python will translate a make-statement:
make <callable> <name> <tuple>:
<block>
into the assignment:
<name> = <callable>("<name>", <tuple>, <namespace>)
where <namespace> is the dict created by executing <block>. The <tuple> expression is optional; if not present, an empty tuple will be assumed.
A patch is available implementing these semantics [7].
The make-statement introduces a new keyword, make. Thus in Python 2.6, the make-statement will have to be enabled using from __future__ import make_statement.
Open Issues
Keyword
Does the make keyword break too much code? Originally, the make statement used the keyword create (a suggestion due to Nick Coghlan). However, investigations into the standard library [8] and Zope+Plone code [9] revealed that create would break a lot more code, so make was adopted as the keyword instead. However, there are still a few instances where make would break code. Is there a better keyword for the statement?
Some possible keywords and their counts in the standard library (plus some installed packages):
- make - 2 (both in tests)
- create - 19 (including existing function in imaplib)
- build - 83 (including existing class in distutils.command.build)
- construct - 0
- produce - 0
The make-statement as an alternate constructor
Currently, there are not many functions which have the signature (name, args, kwargs). That means that something like:
make dict params:
x = 1
y = 2
is currently impossible because the dict constructor has a different signature. Does this sort of thing need to be supported? One suggestion, by Carl Banks, would be to add a __make__ magic method that if found would be called instead of __call__. For types, the __make__ method would be identical to __call__ and thus unnecessary, but dicts could support the make-statement by defining a __make__ method on the dict type that looks something like:
def __make__(cls, name, args, kwargs):
return cls(**kwargs)
Of course, rather than adding another magic method, the dict type could just grow a classmethod something like dict.fromblock that could be used like:
make dict.fromblock params:
x = 1
y = 2
So the question is, will many types want to use the make-statement as an alternate constructor? And if so, does that alternate constructor need to have the same name as the original constructor?
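The classmethod alternative can be sketched on a dict subclass (fromblock is a hypothetical name from the discussion above, not a real dict method):

```python
# A dict subclass whose alternate constructor accepts the
# (name, args, kwargs) triple the make statement would pass.
class BlockDict(dict):
    @classmethod
    def fromblock(cls, name, args, kwargs):
        return cls(**kwargs)

params = BlockDict.fromblock('params', (), {'x': 1, 'y': 2})
assert params == {'x': 1, 'y': 2}
```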
Customizing the dict in which the block is executed
Should users of the make-statement be able to determine in which dict object the code is executed? This would allow the make-statement to be used in situations where a normal dict object would not suffice, e.g. if order and repeated names must be allowed. Allowing this sort of customization could allow XML to be written without repeating element names, and with nesting of make-statements corresponding to nesting of XML elements:
make Element html:
make Element body:
text('before first h1')
make Element h1:
attrib(style='first')
text('first h1')
tail('after first h1')
make Element h1:
attrib(style='second')
text('second h1')
tail('after second h1')
If the make-statement tried to get the dict in which to execute its block by calling the callable's __make_dict__ method, the following code would allow the make-statement to be used as above:
class Element(object):
class __make_dict__(dict):
def __init__(self, *args, **kwargs):
self._super = super(Element.__make_dict__, self)
self._super.__init__(*args, **kwargs)
self.elements = []
self.text = None
self.tail = None
self.attrib = {}
def __getitem__(self, name):
try:
return self._super.__getitem__(name)
except KeyError:
if name in ['attrib', 'text', 'tail']:
return getattr(self, 'set_%s' % name)
else:
return globals()[name]
def __setitem__(self, name, value):
self._super.__setitem__(name, value)
self.elements.append(value)
def set_attrib(self, **kwargs):
self.attrib = kwargs
def set_text(self, text):
self.text = text
def set_tail(self, text):
self.tail = text
def __new__(cls, name, args, edict):
get_element = etree.ElementTree.Element
result = get_element(name, attrib=edict.attrib)
result.text = edict.text
result.tail = edict.tail
for element in edict.elements:
result.append(element)
return result
Note, however, that the code to support this is somewhat fragile -- it has to magically populate the namespace with attrib, text and tail, and it assumes that every name binding inside the make statement body is creating an Element. As it stands, this code would break with the introduction of a simple for-loop to any one of the make-statement bodies, because the for-loop would bind a name to a non-Element object. This could be worked around by adding some sort of isinstance check or attribute examination, but this still results in a somewhat fragile solution.
It has also been pointed out that the with-statement can provide equivalent nesting with a much more explicit syntax:
with Element('html') as html:
with Element('body') as body:
body.text = 'before first h1'
with Element('h1', style='first') as h1:
h1.text = 'first h1'
h1.tail = 'after first h1'
with Element('h1', style='second') as h1:
h1.text = 'second h1'
h1.tail = 'after second h1'
And if the repetition of the element names here is too much of a DRY violation, it is also possible to eliminate all as-clauses except for the first by adding a few methods to Element. [10]
So are there real use-cases for executing the block in a dict of a different type? And if so, should the make-statement be extended to support them?
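For comparison, the same tree can be built today with the standard library's ElementTree API, which the with-statement sketch above approximates:

```python
# Building the nested document with xml.etree.ElementTree.
import xml.etree.ElementTree as ET

html = ET.Element('html')
body = ET.SubElement(html, 'body')
body.text = 'before first h1'
h1 = ET.SubElement(body, 'h1', style='first')
h1.text = 'first h1'
h1.tail = 'after first h1'
h1 = ET.SubElement(body, 'h1', style='second')
h1.text = 'second h1'
h1.tail = 'after second h1'

assert [e.get('style') for e in html.iter('h1')] == ['first', 'second']
```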
Optional Extensions
Remove the make keyword
It might be possible to remove the make keyword so that such statements would begin with the callable being called, e.g.:
namespace ns:
badger = 42
def spam():
...
interface C(...):
...
However, almost all other Python statements begin with a keyword, and removing the keyword would make it harder to look up this construct in the documentation. Additionally, this would add some complexity in the grammar and so far I (Steven Bethard) have not been able to implement the feature without the keyword.
Removing __metaclass__ in Python 3000
As a side-effect of its generality, the make-statement mostly eliminates the need for the __metaclass__ attribute in class objects. Thus in Python 3000, instead of:
class <name> <bases-tuple>:
__metaclass__ = <metaclass>
<block>
metaclasses could be supported by using the metaclass as the callable in a make-statement:
make <metaclass> <name> <bases-tuple>:
<block>
Removing the __metaclass__ hook would simplify the BUILD_CLASS opcode a bit.
Removing class statements in Python 3000
In the most extreme application of make-statements, the class statement itself could be deprecated in favor of make type statements.
References
| [1] | Michele Simionato's original suggestion (http://mail.python.org/pipermail/python-dev/2005-October/057435.html) |
| [2] | Guido requests withdrawal (http://mail.python.org/pipermail/python-3000/2006-April/000936.html) |
| [3] | Ian Bicking's setonce descriptor (http://blog.ianbicking.org/easy-readonly-attributes.html) |
| [4] | Guido ponders property syntax (http://mail.python.org/pipermail/python-dev/2005-October/057404.html) |
| [5] | Namespace-based property recipe (http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/442418) |
| [6] | Python interfaces (http://www.artima.com/weblogs/viewpost.jsp?thread=86641) |
| [7] | Make Statement patch (http://ucsu.colorado.edu/~bethard/py/make_statement.patch) |
| [8] | Instances of create in the stdlib (http://mail.python.org/pipermail/python-list/2006-April/335159.html) |
| [9] | Instances of create in Zope+Plone (http://mail.python.org/pipermail/python-list/2006-April/335284.html) |
| [10] | Eliminate as-clauses in with-statement XML (http://mail.python.org/pipermail/python-list/2006-April/336774.html) |
Copyright
This document has been placed in the public domain.
pep-0360 Externally Maintained Packages
| PEP: | 360 |
|---|---|
| Title: | Externally Maintained Packages |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Brett Cannon <brett at python.org> |
| Status: | Final |
| Type: | Process |
| Content-Type: | text/x-rst |
| Created: | 30-May-2006 |
| Post-History: |
Contents
Warning
No new modules are to be added to this PEP. It has been deemed dangerous to codify external maintenance of any code checked into Python's code repository. Code contributors should expect Python's development methodology to be used for any and all code checked into Python's code repository.
Abstract
There are many great pieces of Python software developed outside of the Python standard library (a.k.a., the "stdlib"). Sometimes it makes sense to incorporate these externally maintained packages into the stdlib in order to fill a gap in the tools provided by Python.
But by having the packages maintained externally it means Python's developers do not have direct control over the packages' evolution and maintenance. Some package developers prefer to have bug reports and patches go through them first instead of being directly applied to Python's repository.
This PEP is meant to record details of packages in the stdlib that are maintained outside of Python's repository. Specifically, it is meant to keep track of any specific maintenance needs for each package. It should be mentioned that changes needed in order to fix bugs and keep the code running on all of Python's supported platforms will be done directly in Python's repository without worrying about going through the contact developer. This is so that Python itself is not held up by a single bug and allows the whole process to scale as needed.
It also is meant to allow people to know which version of a package is released with which version of Python.
Externally Maintained Packages
The section title is the name of the package as it is known outside of the Python standard library. The "standard library name" is what the package is named within Python. The "contact person" is the Python developer in charge of maintaining the package. The "synchronisation history" lists what external version of the package was included in each version of Python (if different from the previous Python release).
ElementTree
| Web site: | http://effbot.org/zone/element-index.htm |
|---|---|
| Standard library name: | |
| xml.etree | |
| Contact person: | Fredrik Lundh |
Fredrik has ceded ElementTree maintenance to the core Python development team [1].
Expat XML parser
| Web site: | http://www.libexpat.org/ |
|---|---|
| Standard library name: | |
| N/A (this refers to the parser itself, and not the Python bindings) | |
| Contact person: | None |
Optik
| Web site: | http://optik.sourceforge.net/ |
|---|---|
| Standard library name: | |
| optparse | |
| Contact person: | Greg Ward |
External development seems to have ceased. For new applications, optparse itself has been largely superseded by argparse.
References
| [1] | Fredrik's handing over of ElementTree (http://mail.python.org/pipermail/python-dev/2012-February/116389.html) |
| [2] | Web-SIG mailing list (http://mail.python.org/mailman/listinfo/web-sig) |
Copyright
This document has been placed in the public domain.
pep-0361 Python 2.6 and 3.0 Release Schedule
| PEP: | 361 |
|---|---|
| Title: | Python 2.6 and 3.0 Release Schedule |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Neal Norwitz, Barry Warsaw |
| Status: | Final |
| Type: | Informational |
| Created: | 29-June-2006 |
| Python-Version: | 2.6 and 3.0 |
| Post-History: | 17-Mar-2008 |
Abstract
This document describes the development and release schedule for
Python 2.6 and 3.0. The schedule primarily concerns itself with
PEP-sized items. Small features may be added up to and including
the first beta release. Bugs may be fixed until the final
release.
There will be at least two alpha releases, two beta releases, and
one release candidate. The releases are planned for October 2008.
Python 2.6 is not only the next advancement in the Python 2
series, it is also a transitional release, helping developers
begin to prepare their code for Python 3.0. As such, many
features are being backported from Python 3.0 to 2.6. Thus, it
makes sense to release both versions at the same time. The
precedent for this was set with the Python 1.6 and 2.0 releases.
Until rc, we will be releasing Python 2.6 and 3.0 in lockstep, on
a monthly release cycle. The releases will happen on the first
Wednesday of every month through the beta testing cycle. Because
Python 2.6 is ready sooner, and because we have outside deadlines
we'd like to meet, we've decided to split the rc releases. Thus
Python 2.6 final is currently planned to come out two weeks before
Python 3.0 final.
Release Manager and Crew
2.6/3.0 Release Manager: Barry Warsaw
Windows installers: Martin v. Loewis
Mac installers: Ronald Oussoren
Documentation: Georg Brandl
RPMs: Sean Reifschneider
Release Lifespan
Python 3.0 is no longer being maintained for any purpose.
Python 2.6.9 is the final security-only source-only maintenance
release of the Python 2.6 series. With its release on October 29,
2013, all official support for Python 2.6 has ended. Python 2.6
is no longer being maintained for any purpose.
Release Schedule
Feb 29 2008: Python 2.6a1 and 3.0a3 are released
Apr 02 2008: Python 2.6a2 and 3.0a4 are released
May 08 2008: Python 2.6a3 and 3.0a5 are released
Jun 18 2008: Python 2.6b1 and 3.0b1 are released
Jul 17 2008: Python 2.6b2 and 3.0b2 are released
Aug 20 2008: Python 2.6b3 and 3.0b3 are released
Sep 12 2008: Python 2.6rc1 is released
Sep 17 2008: Python 2.6rc2 and 3.0rc1 released
Oct 01 2008: Python 2.6 final released
Nov 06 2008: Python 3.0rc2 released
Nov 21 2008: Python 3.0rc3 released
Dec 03 2008: Python 3.0 final released
Dec 04 2008: Python 2.6.1 final released
Apr 14 2009: Python 2.6.2 final released
Oct 02 2009: Python 2.6.3 final released
Oct 25 2009: Python 2.6.4 final released
Mar 19 2010: Python 2.6.5 final released
Aug 24 2010: Python 2.6.6 final released
Jun 03 2011: Python 2.6.7 final released (security-only)
Apr 10 2012: Python 2.6.8 final released (security-only)
Oct 29 2013: Python 2.6.9 final released (security-only)
Completed features for 3.0
See PEP 3000 [#pep3000] and PEP 3100 [#pep3100] for details on the
Python 3.0 project.
Completed features for 2.6
PEPs:
- 352: Raising a string exception now triggers a TypeError.
Attempting to catch a string exception raises DeprecationWarning.
BaseException.message has been deprecated. [#pep352]
- 358: The "bytes" Object [#pep358]
- 366: Main module explicit relative imports [#pep366]
- 370: Per user site-packages directory [#pep370]
- 3112: Bytes literals in Python 3000 [#pep3112]
- 3127: Integer Literal Support and Syntax [#pep3127]
- 371: Addition of the multiprocessing package [#pep371]
New modules in the standard library:
- json
- new enhanced turtle module
- ast
Deprecated modules and functions in the standard library:
- buildtools
- cfmfile
- commands.getstatus()
- macostools.touched()
- md5
- MimeWriter
- mimify
- popen2, os.popen[234]()
- posixfile
- sets
- sha
Modules removed from the standard library:
- gopherlib
- rgbimg
- macfs
Warnings for features removed in Py3k:
- builtins: apply, callable, coerce, dict.has_key, execfile,
reduce, reload
- backticks and <>
- float args to xrange
- coerce and all its friends
- comparing by default comparison
- {}.has_key()
- file.xreadlines
- softspace removal for print() function
- removal of modules because of PEP 4/3100/3108
Other major features:
- with/as will be keywords
- a __dir__() special method to control dir() was added [1]
- AtheOS support stopped.
- warnings module implemented in C
- compile() takes an AST and can convert to byte code
Possible features for 2.6
New features *should* be implemented prior to alpha2, particularly
any C modifications or behavioral changes. New features *must* be
implemented prior to beta1 or will require Release Manager approval.
The following PEPs are being worked on for inclusion in 2.6: None.
Each non-trivial feature listed here that is not a PEP must be
discussed on python-dev. Other enhancements include:
- distutils replacement (requires a PEP)
New modules in the standard library:
- winerror
http://python.org/sf/1505257
(Patch rejected, module should be written in C)
- setuptools
BDFL pronouncement for inclusion in 2.5:
http://mail.python.org/pipermail/python-dev/2006-April/063964.html
PJE's withdrawal from 2.5 for inclusion in 2.6:
http://mail.python.org/pipermail/python-dev/2006-April/064145.html
Modules to gain a DeprecationWarning (as specified for Python 2.6
or through negligence):
- rfc822
- mimetools
- multifile
- compiler package (or a Py3K warning instead?)
- Convert Parser/*.c to use the C warnings module rather than printf
- Add warnings for Py3k features removed:
* __getslice__/__setslice__/__delslice__
* float args to PyArgs_ParseTuple
* __cmp__?
* other comparison changes?
* int division?
* All PendingDeprecationWarnings (e.g. exceptions)
* using zip() result as a list
* the exec statement (use function syntax)
* function attributes that start with func_* (should use __*__)
* the L suffix for long literals
* renaming of __nonzero__ to __bool__
* multiple inheritance with classic classes? (MRO might change)
* properties and classic classes? (instance attrs shadow property)
- use __bool__ method if available and there's no __nonzero__
- Check the various bits of code in Demo/ and Tools/ all still work,
update or remove the ones that don't.
- All modules in Modules/ should be updated to be ssize_t clean.
- All of Python (including Modules/) should compile cleanly with g++
- Start removing deprecated features and generally moving towards Py3k
- Replace all old style tests (operate on import) with unittest or doctest
- Add tests for all untested modules
- Document undocumented modules/features
- bdist_deb in distutils package
http://mail.python.org/pipermail/python-dev/2006-February/060926.html
- bdist_egg in distutils package
- pure python pgen module
(Owner: Guido)
Deferral to 2.6:
http://mail.python.org/pipermail/python-dev/2006-April/064528.html
- Remove the fpectl module?
Deferred until 2.7
None
Open issues
How should import warnings be handled?
http://mail.python.org/pipermail/python-dev/2006-June/066345.html
http://python.org/sf/1515609
http://python.org/sf/1515361
References
.. [1] Adding a __dir__() magic method
http://mail.python.org/pipermail/python-dev/2006-July/067139.html
.. [#pep358] PEP 358 (The "bytes" Object)
http://www.python.org/dev/peps/pep-0358
.. [#pep366] PEP 366 (Main module explicit relative imports)
http://www.python.org/dev/peps/pep-0366
.. [#pep367] PEP 367 (New Super)
http://www.python.org/dev/peps/pep-0367
.. [#pep371] PEP 371 (Addition of the multiprocessing package)
http://www.python.org/dev/peps/pep-0371
.. [#pep3000] PEP 3000 (Python 3000)
http://www.python.org/dev/peps/pep-3000
.. [#pep3100] PEP 3100 (Miscellaneous Python 3.0 Plans)
http://www.python.org/dev/peps/pep-3100
.. [#pep3112] PEP 3112 (Bytes literals in Python 3000)
http://www.python.org/dev/peps/pep-3112
.. [#pep3127] PEP 3127 (Integer Literal Support and Syntax)
http://www.python.org/dev/peps/pep-3127
.. _Google calendar:
http://www.google.com/calendar/ical/b6v58qvojllt0i6ql654r1vh00%40group.calendar.google.com/public/basic.ics
Copyright
This document has been placed in the public domain.
pep-0362 Function Signature Object
| PEP: | 362 |
|---|---|
| Title: | Function Signature Object |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Brett Cannon <brett at python.org>, Jiwon Seo <seojiwon at gmail.com>, Yury Selivanov <yselivanov at sprymix.com>, Larry Hastings <larry at hastings.org> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 21-Aug-2006 |
| Python-Version: | 3.3 |
| Post-History: | 04-Jun-2012 |
| Resolution: | http://mail.python.org/pipermail/python-dev/2012-June/120682.html |
Contents
Abstract
Python has always supported powerful introspection capabilities, including introspecting functions and methods (for the rest of this PEP, "function" refers to both functions and methods). By examining a function object you can fully reconstruct the function's signature. Unfortunately this information is stored in an inconvenient manner, and is spread across a half-dozen deeply nested attributes.
This PEP proposes a new representation for function signatures. The new representation contains all necessary information about a function and its parameters, and makes introspection easy and straightforward.
However, this object does not replace the existing function metadata, which is used by Python itself to execute those functions. The new metadata object is intended solely to make function introspection easier for Python programmers.
Signature Object
A Signature object represents the call signature of a function and its return annotation. For each parameter accepted by the function it stores a Parameter object in its parameters collection.
A Signature object has the following public attributes and methods:
- return_annotation : object
The "return" annotation for the function. If the function has no "return" annotation, this attribute is set to Signature.empty.
- parameters : OrderedDict
An ordered mapping of parameters' names to the corresponding Parameter objects.
- bind(*args, **kwargs) -> BoundArguments
Creates a mapping from positional and keyword arguments to parameters. Raises a TypeError if the passed arguments do not match the signature.
- bind_partial(*args, **kwargs) -> BoundArguments
Works the same way as bind(), but allows the omission of some required arguments (mimicking functools.partial behavior). Raises a TypeError if the passed arguments do not match the signature.
- replace(parameters=<optional>, *, return_annotation=<optional>) -> Signature
Creates a new Signature instance based on the instance replace was invoked on. It is possible to pass different parameters and/or return_annotation to override the corresponding properties of the base signature. To remove return_annotation from the copied Signature, pass in Signature.empty.
Note that the '=<optional>' notation means that the argument is optional. This notation applies throughout the rest of this PEP.
Signature objects are immutable. Use Signature.replace() to make a modified copy:
>>> def foo() -> None:
...     pass
>>> sig = signature(foo)
>>> new_sig = sig.replace(return_annotation="new return annotation")
>>> new_sig is not sig
True
>>> new_sig.return_annotation != sig.return_annotation
True
>>> new_sig.parameters == sig.parameters
True
>>> new_sig = new_sig.replace(return_annotation=new_sig.empty)
>>> new_sig.return_annotation is Signature.empty
True
There are two ways to instantiate a Signature class:
- Signature(parameters=<optional>, *, return_annotation=Signature.empty)
Default Signature constructor. Accepts an optional sequence of Parameter objects, and an optional return_annotation. The parameters sequence is validated to check that there are no parameters with duplicate names, and that the parameters are in the right order, i.e. positional-only first, then positional-or-keyword, etc.
- Signature.from_function(function)
Returns a Signature object reflecting the signature of the function passed in.
It's possible to test Signatures for equality. Two signatures are equal when their parameters are equal, their positional and positional-only parameters appear in the same order, and they have equal return annotations.
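For instance, two independently defined functions with identical parameter lists compare as equal signatures; a minimal sketch using the module-level inspect.signature() accessor described later in this PEP:

```python
from inspect import signature

def f(a, b=1, *args, **kwargs):
    pass

def g(a, b=1, *args, **kwargs):
    pass

def h(a, c=1, *args, **kwargs):
    pass

# Equal parameters (names, kinds, defaults, annotations) and equal
# return annotations make the signatures equal; a differing
# parameter name ('b' vs 'c') breaks equality.
assert signature(f) == signature(g)
assert signature(f) != signature(h)
```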
Changes to the Signature object, or to any of its data members, do not affect the function itself.
Signature also implements __str__:
>>> str(Signature.from_function((lambda *args: None)))
'(*args)'
>>> str(Signature())
'()'
Parameter Object
Python's expressive syntax means functions can accept many different kinds of parameters with many subtle semantic differences. We propose a rich Parameter object designed to represent any possible function parameter.
A Parameter object has the following public attributes and methods:
- name : str
The name of the parameter as a string. Must be a valid Python identifier (with the exception of POSITIONAL_ONLY parameters, whose name may be set to None).
- default : object
The default value for the parameter. If the parameter has no default value, this attribute is set to Parameter.empty.
- annotation : object
The annotation for the parameter. If the parameter has no annotation, this attribute is set to Parameter.empty.
- kind
Describes how argument values are bound to the parameter. Possible values:
Parameter.POSITIONAL_ONLY - value must be supplied as a positional argument.
Python has no explicit syntax for defining positional-only parameters, but many built-in and extension module functions (especially those that accept only one or two parameters) accept them.
Parameter.POSITIONAL_OR_KEYWORD - value may be supplied as either a keyword or positional argument (this is the standard binding behaviour for functions implemented in Python.)
Parameter.KEYWORD_ONLY - value must be supplied as a keyword argument. Keyword only parameters are those which appear after a "*" or "*args" entry in a Python function definition.
Parameter.VAR_POSITIONAL - a tuple of positional arguments that aren't bound to any other parameter. This corresponds to a "*args" parameter in a Python function definition.
Parameter.VAR_KEYWORD - a dict of keyword arguments that aren't bound to any other parameter. This corresponds to a "**kwargs" parameter in a Python function definition.
Always use the Parameter.* constants when setting and checking the value of the kind attribute.
- replace(*, name=<optional>, kind=<optional>, default=<optional>, annotation=<optional>) -> Parameter
Creates a new Parameter instance based on the instance replace was invoked on. To override a Parameter attribute, pass the corresponding argument. To remove an attribute from a Parameter, pass Parameter.empty.
Parameter constructor:
- Parameter(name, kind, *, annotation=Parameter.empty, default=Parameter.empty)
Instantiates a Parameter object. name and kind are required, while annotation and default are optional.
Two parameters are equal when they have equal names, kinds, defaults, and annotations.
Parameter objects are immutable. Instead of modifying a Parameter object, you can use Parameter.replace() to create a modified copy like so:
>>> param = Parameter('foo', Parameter.KEYWORD_ONLY, default=42)
>>> str(param)
'foo=42'
>>> str(param.replace())
'foo=42'
>>> str(param.replace(default=Parameter.empty, annotation='spam'))
"foo:'spam'"
BoundArguments Object
Result of a Signature.bind call. Holds the mapping of arguments to the function's parameters.
Has the following public attributes:
- arguments : OrderedDict
An ordered, mutable mapping of parameters' names to arguments' values. Contains only explicitly bound arguments. Arguments for which bind() relied on a default value are skipped.
- args : tuple
Tuple of positional argument values. Dynamically computed from the 'arguments' attribute.
- kwargs : dict
Dict of keyword argument values. Dynamically computed from the 'arguments' attribute.
The arguments attribute should be used in conjunction with Signature.parameters for any argument-processing purposes.
args and kwargs properties can be used to invoke functions:
def test(a, *, b):
    ...

sig = signature(test)
ba = sig.bind(10, b=20)
test(*ba.args, **ba.kwargs)
Arguments which could be passed as part of either *args or **kwargs will be included only in the BoundArguments.args attribute. Consider the following example:
def test(a=1, b=2, c=3):
    pass

sig = signature(test)
ba = sig.bind(a=10, c=13)

>>> ba.args
(10,)
>>> ba.kwargs
{'c': 13}
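Because bind() skips arguments that would be satisfied by defaults, code that wants a complete name-to-value mapping can fill the gaps from Signature.parameters. A minimal sketch using only the attributes defined in this PEP (later Python versions added a BoundArguments.apply_defaults() convenience method for this):

```python
from inspect import signature

def test(a=1, b=2, c=3):
    pass

sig = signature(test)
ba = sig.bind(a=10, c=13)

# Fill in defaults for parameters that bind() left out.
for name, param in sig.parameters.items():
    if name not in ba.arguments and param.default is not param.empty:
        ba.arguments[name] = param.default

# ba.arguments now also contains the defaulted b=2
assert dict(ba.arguments) == {'a': 10, 'b': 2, 'c': 13}
```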
Implementation
The implementation adds a new function signature() to the inspect module. The function is the preferred way of getting a Signature for a callable object.
The function implements the following algorithm:
If the object is not callable - raise a TypeError
If the object has a __signature__ attribute and if it is not None - return it
If it has a __wrapped__ attribute, return signature(object.__wrapped__)
If the object is an instance of FunctionType, construct and return a new Signature for it
If the object is a bound method, construct and return a new Signature object, with its first parameter (usually self or cls) removed. (classmethod and staticmethod are supported too. Since both are descriptors, the former returns a bound method, and the latter returns its wrapped function.)
If the object is an instance of functools.partial, construct a new Signature from its partial.func attribute, and account for already bound partial.args and partial.kwargs
If the object is a class or metaclass:
- If the object's type has a __call__ method defined in its MRO, return a Signature for it
- If the object has a __new__ method defined in its MRO, return a Signature object for it
- If the object has a __init__ method defined in its MRO, return a Signature object for it
Return signature(object.__call__)
Note that the Signature object is created in a lazy manner, and is not automatically cached. However, the user can manually cache a Signature by storing it in the __signature__ attribute.
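For example, a user can avoid recomputation by storing the result back on the callable; since signature() checks the __signature__ attribute first, subsequent calls return the cached object (a minimal sketch):

```python
import inspect

def frequently_introspected(a, b=2):
    pass

# Cache the Signature on the function itself; signature() will now
# return this exact object instead of reconstructing it each time.
frequently_introspected.__signature__ = inspect.signature(frequently_introspected)

assert inspect.signature(frequently_introspected) is frequently_introspected.__signature__
```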
An implementation for Python 3.3 can be found at [1]. The python issue tracking the patch is [2].
Design Considerations
No implicit caching of Signature objects
The first PEP design had a provision for implicit caching of Signature objects in the inspect.signature() function. However, this has the following downsides:
- If the Signature object is cached, then any changes to the function it describes will not be reflected in it. However, if caching is needed, it can always be done manually and explicitly
- It is better to reserve the __signature__ attribute for the cases when there is a need to explicitly set to a Signature object that is different from the actual one
Some functions may not be introspectable
Some functions may not be introspectable in certain implementations of Python. For example, in CPython, built-in functions defined in C provide no metadata about their arguments. Adding support for them is out of scope for this PEP.
Signature and Parameter equivalence
We assume that parameter names have semantic significance--two signatures are equal only when their corresponding parameters are equal and have the exact same names. Users who want looser equivalence tests, perhaps ignoring names of VAR_KEYWORD or VAR_POSITIONAL parameters, will need to implement those themselves.
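A hedged sketch of such a looser test, here ignoring only the names of VAR_POSITIONAL and VAR_KEYWORD parameters (loosely_equal is a hypothetical helper, not part of this proposal):

```python
from inspect import signature, Parameter

VAR_KINDS = (Parameter.VAR_POSITIONAL, Parameter.VAR_KEYWORD)

def loosely_equal(sig_a, sig_b):
    """Compare two Signatures, ignoring *args/**kwargs parameter names."""
    def key(sig):
        return ([(None if p.kind in VAR_KINDS else p.name,
                  p.kind, p.default, p.annotation)
                 for p in sig.parameters.values()],
                sig.return_annotation)
    return key(sig_a) == key(sig_b)

def f(x, *args, **kwargs): pass
def g(x, *a, **kw): pass

assert loosely_equal(signature(f), signature(g))  # *args/**kw names ignored
assert signature(f) != signature(g)               # strict equality still fails
```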
Examples
Visualizing Callable Objects' Signature
Let's define some classes and functions:
from inspect import signature
from functools import partial, wraps
class FooMeta(type):
    def __new__(mcls, name, bases, dct, *, bar:bool=False):
        return super().__new__(mcls, name, bases, dct)

    def __init__(cls, name, bases, dct, **kwargs):
        return super().__init__(name, bases, dct)

class Foo(metaclass=FooMeta):
    def __init__(self, spam:int=42):
        self.spam = spam

    def __call__(self, a, b, *, c) -> tuple:
        return a, b, c

    @classmethod
    def spam(cls, a):
        return a

def shared_vars(*shared_args):
    """Decorator factory that defines shared variables that are
    passed to every invocation of the function"""
    def decorator(f):
        @wraps(f)
        def wrapper(*args, **kwargs):
            full_args = shared_args + args
            return f(*full_args, **kwargs)
        # Override signature
        sig = signature(f)
        sig = sig.replace(tuple(sig.parameters.values())[1:])
        wrapper.__signature__ = sig
        return wrapper
    return decorator

@shared_vars({})
def example(_state, a, b, c):
    return _state, a, b, c

def format_signature(obj):
    return str(signature(obj))
Now, in the python REPL:
>>> format_signature(FooMeta)
'(name, bases, dct, *, bar:bool=False)'
>>> format_signature(Foo)
'(spam:int=42)'
>>> format_signature(Foo.__call__)
'(self, a, b, *, c) -> tuple'
>>> format_signature(Foo().__call__)
'(a, b, *, c) -> tuple'
>>> format_signature(Foo.spam)
'(a)'
>>> format_signature(partial(Foo().__call__, 1, c=3))
'(b, *, c=3) -> tuple'
>>> format_signature(partial(partial(Foo().__call__, 1, c=3), 2, c=20))
'(*, c=20) -> tuple'
>>> format_signature(example)
'(a, b, c)'
>>> format_signature(partial(example, 1, 2))
'(c)'
>>> format_signature(partial(partial(example, 1, b=2), c=3))
'(b=2, c=3)'
Annotation Checker
import inspect
import functools
def checktypes(func):
    '''Decorator to verify arguments and return types

    Example:

        >>> @checktypes
        ... def test(a:int, b:str) -> int:
        ...     return int(a * b)

        >>> test(10, '1')
        1111111111

        >>> test(10, 1)
        Traceback (most recent call last):
          ...
        ValueError: test: wrong type of 'b' argument, 'str' expected, got 'int'
    '''

    sig = inspect.signature(func)

    types = {}
    for param in sig.parameters.values():
        # Iterate through the function's parameters and build the
        # mapping of argument types
        type_ = param.annotation
        if type_ is param.empty or not inspect.isclass(type_):
            # Missing annotation or not a type, skip it
            continue

        types[param.name] = type_

        # If the argument has a type specified, let's check that its
        # default value (if present) conforms with the type.
        if param.default is not param.empty and not isinstance(param.default, type_):
            raise ValueError("{func}: wrong type of a default value for {arg!r}". \
                             format(func=func.__qualname__, arg=param.name))

    def check_type(sig, arg_name, arg_type, arg_value):
        # Internal function that encapsulates arguments type checking
        if not isinstance(arg_value, arg_type):
            raise ValueError("{func}: wrong type of {arg!r} argument, " \
                             "{exp!r} expected, got {got!r}". \
                             format(func=func.__qualname__, arg=arg_name,
                                    exp=arg_type.__name__,
                                    got=type(arg_value).__name__))

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Let's bind the arguments
        ba = sig.bind(*args, **kwargs)
        for arg_name, arg in ba.arguments.items():
            # And iterate through the bound arguments
            try:
                type_ = types[arg_name]
            except KeyError:
                continue
            else:
                # OK, we have a type for the argument; let's get the
                # corresponding parameter description from the signature object
                param = sig.parameters[arg_name]
                if param.kind == param.VAR_POSITIONAL:
                    # If this parameter is a variable-argument parameter,
                    # then we need to check each of its values
                    for value in arg:
                        check_type(sig, arg_name, type_, value)
                elif param.kind == param.VAR_KEYWORD:
                    # If this parameter is a variable-keyword-argument parameter:
                    for subname, value in arg.items():
                        check_type(sig, arg_name + ':' + subname, type_, value)
                else:
                    # And, finally, if this parameter is a regular one:
                    check_type(sig, arg_name, type_, arg)

        result = func(*ba.args, **ba.kwargs)

        # The last bit - let's check that the result is correct
        return_type = sig.return_annotation
        if (return_type is not sig.empty and
                isinstance(return_type, type) and
                not isinstance(result, return_type)):
            raise ValueError('{func}: wrong return type, {exp} expected, got {got}'. \
                             format(func=func.__qualname__, exp=return_type.__name__,
                                    got=type(result).__name__))
        return result

    return wrapper
Acceptance
PEP 362 was accepted by Guido on Friday, June 22, 2012 [3]. The reference implementation was committed to trunk later that day.
References
| [1] | pep362 branch (https://bitbucket.org/1st1/cpython/overview) |
| [2] | issue 15008 (http://bugs.python.org/issue15008) |
| [3] | "A Desperate Plea For Introspection (aka: BDFAP Needed)" (http://mail.python.org/pipermail/python-dev/2012-June/120682.html) |
Copyright
This document has been placed in the public domain.
pep-0363 Syntax For Dynamic Attribute Access
| PEP: | 363 |
|---|---|
| Title: | Syntax For Dynamic Attribute Access |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Ben North <ben at redfrontdoor.org> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 29-Jan-2007 |
| Post-History: | 12-Feb-2007 |
Abstract
Dynamic attribute access is currently possible using the "getattr"
and "setattr" builtins. The present PEP suggests a new syntax to
make such access easier, allowing the coder for example to write
x.('foo_%d' % n) += 1
z = y.('foo_%d' % n).('bar_%s' % s)
instead of
attr_name = 'foo_%d' % n
setattr(x, attr_name, getattr(x, attr_name) + 1)
z = getattr(getattr(y, 'foo_%d' % n), 'bar_%s' % s)
Rationale
Dictionary access and indexing both have a friendly invocation
syntax: instead of x.__getitem__(12) the coder can write x[12].
This also allows the use of subscripted elements in an augmented
assignment, as in "x[12] += 1". The present proposal brings this
ease-of-use to dynamic attribute access too.
Attribute access is currently possible in two ways:
* When the attribute name is known at code-writing time, the
".NAME" trailer can be used, as in
x.foo = 42
y.bar += 100
* When the attribute name is computed dynamically at run-time, the
"getattr" and "setattr" builtins must be used:
x = getattr(y, 'foo_%d' % n)
setattr(z, 'bar_%s' % s, 99)
The "getattr" builtin also allows the coder to specify a default
value to be returned in the event that the object does not have
an attribute of the given name:
x = getattr(y, 'foo_%d' % n, 0)
This PEP describes a new syntax for dynamic attribute access ---
"x.(expr)" --- with examples given in the Abstract above.
(The new syntax could also allow the provision of a default value in
the "get" case, as in:
x = y.('foo_%d' % n, None)
This 2-argument form of dynamic attribute access would not be
permitted as the target of an (augmented or normal) assignment. The
"Discussion" section below includes opinions specifically on the
2-argument extension.)
Finally, the new syntax can be used with the "del" statement, as in
del x.(attr_name)
Impact On Existing Code
The proposed new syntax is not currently valid, so no existing
well-formed programs have their meaning altered by this proposal.
Across all "*.py" files in the 2.5 distribution, there are around
600 uses of "getattr", "setattr" or "delattr". They break down as
follows (figures have some room for error because they were
arrived at by partially-manual inspection):
c.300 uses of plain "getattr(x, attr_name)", which could be
replaced with the new syntax;
c.150 uses of the 3-argument form, i.e., with the default
value; these could be replaced with the 2-argument form
of the new syntax (the cases break down into c.125 cases
where the attribute name is a literal string, and c.25
where it's only known at run-time);
c.5 uses of the 2-argument form with a literal string
attribute name, which I think could be replaced with the
standard "x.attribute" syntax;
c.120 uses of setattr, of which 15 use getattr to find the
new value; all could be replaced with the new syntax,
the 15 where getattr is also involved would show a
particular increase in clarity;
c.5 uses which would have to stay as "getattr" because they
are calls of a variable named "getattr" whose default
value is the builtin "getattr";
c.5 uses of the 2-argument form, inside a try/except block
which catches AttributeError and uses a default value
instead; these could use 2-argument form of the new
syntax;
c.10 uses of "delattr", which could use the new syntax.
As examples, the line
setattr(self, attr, change_root(self.root, getattr(self, attr)))
from Lib/distutils/command/install.py could be rewritten
self.(attr) = change_root(self.root, self.(attr))
and the line
setattr(self, method_name, getattr(self.metadata, method_name))
from Lib/distutils/dist.py could be rewritten
self.(method_name) = self.metadata.(method_name)
Performance Impact
Initial pystone measurements are inconclusive, but suggest there may
be a performance penalty of around 1% in the pystones score with the
patched version. One suggestion is that this is because the longer
main loop in ceval.c hurts the cache behaviour, but this has not
been confirmed.
On the other hand, measurements suggest a speed-up of around 40--45%
for dynamic attribute access.
Error Cases
Only strings are permitted as attribute names, so for instance the
following error is produced:
>>> x.(99) = 8
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: attribute name must be string, not 'int'
This is handled by the existing PyObject_GetAttr function.
Draft Implementation
A draft implementation adds a new alternative to the "trailer"
clause in Grammar/Grammar; a new AST type, "DynamicAttribute" in
Python.asdl, with accompanying changes to symtable.c, ast.c, and
compile.c, and three new opcodes (load/store/del) with
accompanying changes to opcode.h and ceval.c. The patch consists
of c.180 additional lines in the core code, and c.100 additional
lines of tests. It is available as sourceforge patch #1657573 [1].
Mailing Lists Discussion
Initial posting of this PEP in draft form was to python-ideas on
20070209 [2], and the response was generally positive. The PEP was
then posted to python-dev on 20070212 [3], and an interesting
discussion ensued. A brief summary:
Initially, there was reasonable (but not unanimous) support for the
idea, although the precise choice of syntax had a more mixed
reception. Several people thought the "." would be too easily
overlooked, with the result that the syntax could be confused with a
method/function call. A few alternative syntaxes were suggested:
obj.(foo)
obj.[foo]
obj.{foo}
obj{foo}
obj.*foo
obj->foo
obj<-foo
obj@[foo]
obj.[[foo]]
with "obj.[foo]" emerging as the preferred one. In this initial
discussion, the two-argument form was universally disliked, so it
was to be taken out of the PEP.
Discussion then took a step back to whether this particular feature
provided enough benefit to justify new syntax. As well as requiring
coders to become familiar with the new syntax, there would also be
the problem of backward compatibility --- code using the new syntax
would not run on older pythons.
Instead of new syntax, a new "wrapper class" was proposed, with the
following specification / conceptual implementation suggested by
Martin von Loewis:
class attrs:
    def __init__(self, obj):
        self.obj = obj
    def __getitem__(self, name):
        return getattr(self.obj, name)
    def __setitem__(self, name, value):
        return setattr(self.obj, name, value)
    def __delitem__(self, name):
        return delattr(self.obj, name)
    def __contains__(self, name):
        return hasattr(self.obj, name)
This was considered a cleaner and more elegant solution to the
original problem. (Another suggestion was a mixin class providing
dictionary-style access to an object's attributes.)
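To make the comparison concrete, here is a runnable sketch of the wrapper-class approach (with the wrapper consistently delegating to self.obj), showing how it recovers the ease-of-use the proposed syntax aimed for; Record is an arbitrary demo class:

```python
class attrs:
    """Dictionary-style dynamic attribute access (wrapper-class sketch)."""
    def __init__(self, obj):
        self.obj = obj
    def __getitem__(self, name):
        return getattr(self.obj, name)
    def __setitem__(self, name, value):
        setattr(self.obj, name, value)
    def __delitem__(self, name):
        delattr(self.obj, name)
    def __contains__(self, name):
        return hasattr(self.obj, name)

class Record:
    pass

x = Record()
n = 3
attrs(x)['foo_%d' % n] = 1    # instead of setattr(x, 'foo_%d' % n, 1)
attrs(x)['foo_%d' % n] += 1   # the augmented-assignment case from the Abstract

assert x.foo_3 == 2
assert 'foo_3' in attrs(x)
```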
The decision was made that the present PEP did not meet the burden
of proof for the introduction of new syntax, a view which had been
put forward by some from the beginning of the discussion. The
wrapper class idea was left open as a possibility for a future PEP.
References
[1] Sourceforge patch #1657573
http://sourceforge.net/tracker/index.php?func=detail&aid=1657573&group_id=5470&atid=305470
[2] http://mail.python.org/pipermail/python-ideas/2007-February/000210.html
and following posts
[3] http://mail.python.org/pipermail/python-dev/2007-February/070939.html
and following posts
Copyright
This document has been placed in the public domain.
pep-0364 Transitioning to the Py3K Standard Library
| PEP: | 364 |
|---|---|
| Title: | Transitioning to the Py3K Standard Library |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Barry Warsaw <barry at python.org> |
| Status: | Withdrawn |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 01-Mar-2007 |
| Python-Version: | 2.6 |
| Post-History: |
Contents
Abstract
PEP 3108 describes the reorganization of the Python standard library for the Python 3.0 release [1]. This PEP describes a mechanism for transitioning from the Python 2.x standard library to the Python 3.0 standard library. This transition will allow and encourage Python programmers to use the new Python 3.0 library names starting with Python 2.6, while maintaining the old names for backward compatibility. In this way, a Python programmer will be able to write forward compatible code without sacrificing interoperability with existing Python programs.
Rationale
PEP 3108 presents a rationale for Python standard library (stdlib) reorganization. The reader is encouraged to consult that PEP for details about why and how the library will be reorganized. Should PEP 3108 be accepted in part or in whole, then it is advantageous to allow Python programmers to begin the transition to the new stdlib module names in Python 2.x, so that they can write forward compatible code starting with Python 2.6.
Note that PEP 3108 proposes to remove some "silly old stuff", i.e. modules that are no longer useful or necessary. The PEP you are reading does not address this because there are no forward compatibility issues for modules that are to be removed, except to stop using such modules.
This PEP concerns only the mechanism by which mappings from old stdlib names to new stdlib names are maintained. Please consult PEP 3108 for all specific module renaming proposals. Specifically see the section titled Modules to Rename for guidelines on the old name to new name mappings. The few examples in this PEP are given for illustrative purposes only and should not be used for specific renaming recommendations.
Supported Renamings
There are at least 4 use cases explicitly supported by this PEP:
- Simple top-level package name renamings, such as StringIO to stringio;
- Sub-package renamings where the package name may or may not be renamed, such as email.MIMEText to email.mime.text;
- Extension module renaming, such as cStringIO to cstringio;
- Third party renaming of any of the above.
Two use cases supported by this PEP include renaming simple top-level modules, such as StringIO, as well as modules within packages, such as email.MIMEText.
In the former case, PEP 3108 currently recommends StringIO be renamed to stringio, following PEP 8 recommendations [2].
In the latter case, the email 4.0 package distributed with Python 2.5 already renamed email.MIMEText to email.mime.text, although it did so in a one-off, uniquely hackish way inside the email package. The mechanism described in this PEP is general enough to handle all module renamings, obviating the need for the Python 2.5 hack (except for backward compatibility with earlier Python versions).
An additional use case is to support the renaming of C extension modules. As long as the new name for the C module is importable, it can be remapped to the new name. E.g. cStringIO renamed to cstringio.
Third party package renaming is also supported, via several public interfaces accessible by any Python module.
Remappings are not performed recursively.
.mv files
Remapping files are called .mv files; the suffix was chosen to be evocative of the Unix mv(1) command. An .mv file is a simple line-oriented text file. All blank lines and lines that start with a # are ignored. All other lines must contain two whitespace separated fields. The first field is the old module name, and the second field is the new module name. Both module names must be specified using their full dotted-path names. Here is an example .mv file from Python 2.6:
# Map the various string i/o libraries to their new names
StringIO stringio
cStringIO cstringio
.mv files can appear anywhere in the file system, and there is a programmatic interface provided to parse them, and register the remappings inside them. By default, when Python starts up, all the .mv files in the oldlib package are read, and their remappings are automatically registered. This is where all the module remappings should be specified for top-level Python 2.x standard library modules.
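The line format above is simple enough that a parser fits in a few lines. This is an illustrative stand-alone sketch (parse_mv_file is a hypothetical helper; the PEP's real entry points are the read_mv_file and read_directory_mv_files methods described under Programmatic Interface):

```python
def parse_mv_file(path):
    """Parse a .mv remapping file into an {old: new} dict.

    Blank lines and lines starting with '#' are ignored; every other
    line must hold two whitespace-separated dotted module names.
    """
    mapping = {}
    with open(path) as fp:
        for line in fp:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            old, new = line.split()
            mapping[old] = new
    return mapping
```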
Implementation Specification
This section provides the full specification for how module renamings in Python 2.x are implemented. The central mechanism relies on various import hooks as described in PEP 302 [3]. Specifically sys.path_importer_cache, sys.path, and sys.meta_path are all employed to provide the necessary functionality.
When Python's import machinery is initialized, the oldlib package is imported. Inside oldlib there is a class called OldStdlibLoader. This class implements the PEP 302 interface and is automatically instantiated, with zero arguments. The constructor reads all the .mv files from the oldlib package directory, automatically registering all the remappings found in those .mv files. This is how the Python 2.x standard library is remapped.
The OldStdlibLoader class should not be instantiated by other Python modules. Instead, access the global OldStdlibLoader instance via sys.stdlib_remapper. Use this instance if you want programmatic access to the remapping machinery.
One important implementation detail: as needed by the PEP 302 API, a magic string is added to sys.path, and module __path__ attributes in order to hook in our remapping loader. This magic string is currently <oldlib> and some changes were necessary to Python's site.py file in order to treat all sys.path entries starting with < as special. Specifically, no attempt is made to make them absolute file names (since they aren't file names at all).
In order for the remapping import hooks to work, the module or package must be physically located under its new name. This is because the import hooks catch only modules that are not already imported, and cannot be imported by Python's built-in import rules. Thus, if a module has been moved, say from Lib/StringIO.py to Lib/stringio.py, and the former's .pyc file has been removed, then without the remapper, this would fail:
import StringIO
Instead, with the remapper, this failing import will be caught, the old name will be looked up in the registered remappings, and in this case, the new name stringio will be found. The remapper then attempts to import the new name, and if that succeeds, it binds the resulting module into sys.modules, under both the old and new names. Thus, the above import will result in entries in sys.modules for 'StringIO' and 'stringio', and both will point to the exact same module object.
Note that no way to disable the remapping machinery is proposed, short of moving all the .mv files away or programmatically removing them in some custom start up code. In Python 3.0, the remappings will be eliminated, leaving only the "new" names.
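The lookup described above can be sketched as a plain function (import_with_remap is purely illustrative; the actual mechanism is a PEP 302 import hook installed by the oldlib package, not a wrapper the programmer calls):

```python
import importlib
import sys

def import_with_remap(oldname, remappings):
    """Try the normal import first; on failure consult the remappings,
    import the new name, and bind the module under both names."""
    try:
        return importlib.import_module(oldname)
    except ImportError:
        newname = remappings[oldname]
        module = importlib.import_module(newname)
        # Both the old and the new name point at the same module object.
        sys.modules[oldname] = module
        sys.modules[newname] = module
        return module

# 'OldStringIO' does not exist, so the remapper falls back to 'io'
# (a stand-in mapping chosen here purely for demonstration).
mod = import_with_remap('OldStringIO', {'OldStringIO': 'io'})
assert sys.modules['OldStringIO'] is sys.modules['io'] is mod
```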
Programmatic Interface
Several methods are added to the sys.stdlib_remapper object, which third party packages can use to register their own remappings. Note however that in all cases, there is one and only one mapping from an old name to a new name. If two .mv files contain different mappings for an old name, or if a programmatic call is made with an old name that is already remapped, the previous mapping is lost. This will not affect any already imported modules.
The following methods are available on the sys.stdlib_remapper object:
- read_mv_file(filename) -- Read the given file and register all remappings found in the file.
- read_directory_mv_files(dirname, suffix='.mv') -- List the given directory, reading all files in that directory that have the matching suffix (.mv by default). For each parsed file, register all the remappings found in that file.
- set_mapping(oldname, newname) -- Register a new mapping from an old module name to a new module name. Both must be the full dotted-path name to the module. newname may be None in which case any existing mapping for oldname will be removed (it is not an error if there is no existing mapping).
- get_mapping(oldname, default=None) -- Return any registered newname for the given oldname. If there is no registered remapping, default is returned.
Open Issues
Should there be a command line switch and/or environment variable to disable all remappings?
Should remappings occur recursively?
Should we automatically parse package directories for .mv files when the package's __init__.py is loaded? This would allow packages to easily include .mv files for their own remappings. Compare what the email package currently has to do if we place its .mv file in the email package instead of in the oldlib package:
# Expose old names
import os, sys
sys.stdlib_remapper.read_directory_mv_files(os.path.dirname(__file__))

I think we should automatically read a package's directory for any .mv files it might contain.
Reference Implementation
A reference implementation, in the form of a patch against the current (as of this writing) state of the Python 2.6 svn trunk, is available as SourceForge patch #1675334 [4]. Note that this patch includes a rename of cStringIO to cstringio, but this is primarily for illustrative and unit testing purposes. Should the patch be accepted, we might want to split this change off into other PEP 3108 changes.
References
| [1] | PEP 3108, Standard Library Reorganization, Cannon (http://www.python.org/dev/peps/pep-3108) |
| [2] | PEP 8, Style Guide for Python Code, GvR, Warsaw (http://www.python.org/dev/peps/pep-0008) |
| [3] | PEP 302, New Import Hooks, JvR, Moore (http://www.python.org/dev/peps/pep-0302) |
| [4] | Reference implementation (http://bugs.python.org/issue1675334) |
Copyright
This document has been placed in the public domain.
pep-0365 Adding the pkg_resources module
| PEP: | 365 |
|---|---|
| Title: | Adding the pkg_resources module |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Phillip J. Eby <pje at telecommunity.com> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 30-Apr-2007 |
| Post-History: | 30-Apr-2007 |
Abstract
This PEP proposes adding an enhanced version of the pkg_resources module to the standard library.
pkg_resources is a module used to find and manage Python package/version dependencies and access bundled files and resources, including those inside of zipped .egg files. Currently, pkg_resources is only available through installing the entire setuptools distribution, but it does not depend on any other part of setuptools; in effect, it comprises the entire runtime support library for Python Eggs, and is independently useful.
In addition, with one feature addition, this module could support easy bootstrap installation of several Python package management tools, including setuptools, workingenv, and zc.buildout.
Proposal
Rather than proposing to include setuptools in the standard library, this PEP proposes only that pkg_resources be added to the standard library for Python 2.6 and 3.0. pkg_resources is considerably more stable than the rest of setuptools, with virtually no new features being added in the last 12 months.
However, this PEP also proposes that a new feature be added to pkg_resources, before being added to the stdlib. Specifically, it should be possible to do something like:
python -m pkg_resources SomePackage==1.2
to request downloading and installation of SomePackage from PyPI. This feature would not be a replacement for easy_install; instead, it would rely on SomePackage having pure-Python .egg files listed for download via the PyPI XML-RPC API, and the eggs would be placed in the $PYTHON_EGG_CACHE directory, where they would not be importable by default. (And no scripts would be installed.) However, if the downloaded egg contains installation bootstrap code, it will be given a chance to run.
These restrictions would allow the code to be extremely simple, yet still powerful enough to support users downloading package management tools such as setuptools, workingenv and zc.buildout, simply by supplying the tool's name on the command line.
Rationale
Many users have requested that setuptools be included in the standard library, to save users needing to go through the awkward process of bootstrapping it. However, most of the bootstrapping complexity comes from the fact that setuptools-installed code cannot use the pkg_resources runtime module unless setuptools is already installed. Thus, installing setuptools requires (in a sense) that setuptools already be installed.
Other Python package management tools, such as workingenv and zc.buildout, have similar bootstrapping issues, since they both make use of setuptools, but also want to provide users with something approaching a "one-step install". The complexity of creating bootstrap utilities for these and any other such tools that arise in the future is greatly reduced if pkg_resources is already present and is also able to download pre-packaged eggs from PyPI.
(It would also mean that setuptools would not need to be installed in order to simply use eggs, as opposed to building them.)
Finally, in addition to providing access to eggs built via setuptools or other packaging tools, it should be noted that since Python 2.5, the distutils have installed package metadata (aka PKG-INFO) files that can be read by pkg_resources to identify what distributions are already on sys.path. In environments where Python packages are installed using system package tools (like RPM), the pkg_resources module provides an API for detecting what versions of what packages are installed, even if those packages were installed via the distutils instead of setuptools.
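Since PKG-INFO files use the RFC 822 header format, the metadata that pkg_resources relies on can be read with the standard library alone. A rough sketch (the sample metadata below is made up for illustration; real files are written by the distutils at install time):

```python
from email.parser import Parser

# A made-up PKG-INFO body; real files sit next to the installed
# package on sys.path.
PKG_INFO = """\
Metadata-Version: 1.0
Name: SomePackage
Version: 1.2
Summary: An example distribution
"""

# PKG-INFO is plain RFC 822 headers, so the email parser handles it.
meta = Parser().parsestr(PKG_INFO)
print(meta["Name"], meta["Version"])  # SomePackage 1.2
```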
Implementation and Documentation
The pkg_resources implementation is maintained in the Python SVN repository under /sandbox/trunk/setuptools/; see pkg_resources.py and pkg_resources.txt. Documentation for the egg format(s) supported by pkg_resources can be found in doc/formats.txt. HTML versions of these documents are available at:
- http://peak.telecommunity.com/DevCenter/PkgResources and
- http://peak.telecommunity.com/DevCenter/EggFormats
(These HTML versions are for setuptools 0.6; they may not reflect all of the changes found in the Subversion trunk's .txt versions.)
Copyright
This document has been placed in the public domain.
pep-0366 Main module explicit relative imports
| PEP: | 366 |
|---|---|
| Title: | Main module explicit relative imports |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Nick Coghlan <ncoghlan at gmail.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 1-May-2007 |
| Python-Version: | 2.6, 3.0 |
| Post-History: | 1-May-2007, 4-Jul-2007, 7-Jul-2007, 23-Nov-2007 |
Contents
Abstract
This PEP proposes a backwards compatible mechanism that permits the use of explicit relative imports from executable modules within packages. Such imports currently fail due to an awkward interaction between PEP 328 and PEP 338.
By adding a new module level attribute, this PEP allows relative imports to work automatically if the module is executed using the -m switch. A small amount of boilerplate in the module itself will allow the relative imports to work when the file is executed by name.
Guido accepted the PEP in November 2007 [5].
Proposed Change
The major proposed change is the introduction of a new module level attribute, __package__. When it is present, relative imports will be based on this attribute rather than the module __name__ attribute.
As with the current __name__ attribute, setting __package__ will be the responsibility of the PEP 302 loader used to import a module. Loaders which use imp.new_module() to create the module object will have the new attribute set automatically to None. When the import system encounters an explicit relative import in a module without __package__ set (or with it set to None), it will calculate and store the correct value (__name__.rpartition('.')[0] for normal modules and __name__ for package initialisation modules). If __package__ has already been set then the import system will use it in preference to recalculating the package name from the __name__ and __path__ attributes.
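The calculation described above amounts to the following (a sketch; the helper name is illustrative, and "is_package" stands in for the check the import system performs using __path__):

```python
def calc_package(name, is_package):
    # Package __init__ modules use their own name as the package;
    # ordinary modules use everything before the last dot
    # (which is the empty string for top-level modules).
    if is_package:
        return name
    return name.rpartition(".")[0]

print(calc_package("pkg.sub.mod", False))  # pkg.sub
print(calc_package("pkg.sub", True))       # pkg.sub
print(calc_package("toplevel", False))     # (empty string)
```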
The runpy module will explicitly set the new attribute, basing it off the name used to locate the module to be executed rather than the name used to set the module's __name__ attribute. This will allow relative imports to work correctly from main modules executed with the -m switch.
When the main module is specified by its filename, the __package__ attribute will be set to None. To allow relative imports when the module is executed directly, boilerplate similar to the following would be needed before the first relative import statement:
if __name__ == "__main__" and __package__ is None:
__package__ = "expected.package.name"
Note that this boilerplate is sufficient only if the top level package is already accessible via sys.path. Additional code that manipulates sys.path would be needed in order for direct execution to work without the top level package already being importable.
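One possible form of that additional code (a sketch only; the helper name and paths are illustrative, not part of the PEP) computes the directory above the top-level package from the module's __file__ and prepends it to sys.path:

```python
import os
import sys

def add_package_root(file_path, package_name):
    """Put the directory containing the top-level package on sys.path
    so that `package_name` becomes importable (illustrative helper)."""
    root = os.path.dirname(os.path.abspath(file_path))
    for _ in package_name.split("."):  # one level up per package component
        root = os.path.dirname(root)
    if root not in sys.path:
        sys.path.insert(0, root)
    return root

# For a module at /src/expected/package/mod.py whose __package__ should
# be "expected.package", the package root is /src:
print(add_package_root("/src/expected/package/mod.py", "expected.package"))
```

In a real script this would be called with __file__ and run only under the `if __name__ == "__main__" and __package__ is None:` guard shown above.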
This approach also has the same disadvantage as the use of absolute imports of sibling modules - if the script is moved to a different package or subpackage, the boilerplate will need to be updated manually. It has the advantage that this change need only be made once per file, regardless of the number of relative imports.
Note that setting __package__ to the empty string explicitly is permitted, and has the effect of disabling all relative imports from that module (since the import machinery will consider it to be a top level module in that case). This means that tools like runpy do not need to provide special case handling for top level modules when setting __package__.
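The "no parent package" behaviour can be observed directly with importlib, which resolves relative names against an explicit package argument (a sketch; the exact exception type raised for an empty package has varied across Python versions, so both candidates are caught):

```python
import importlib

failed = False
try:
    # A relative import resolved against an empty package name: the
    # import machinery treats the importer as top level and refuses.
    importlib.import_module(".os", package="")
except (ImportError, TypeError):
    failed = True
print(failed)  # True
```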
Rationale for Change
The current inability to use explicit relative imports from the main module is the subject of at least one open SF bug report (#1510172) [1], and has most likely been a factor in at least a few queries on comp.lang.python (such as Alan Isaac's question in [2]).
This PEP is intended to provide a solution which permits explicit relative imports from main modules, without incurring any significant costs during interpreter startup or normal module import.
The section in PEP 338 on relative imports and the main module provides further details and background on this problem.
Reference Implementation
Rev 47142 in SVN implemented an early variant of this proposal which stored the main module's real module name in the __module_name__ attribute. It was reverted because Python 2.5 was already in beta by that time.
Patch 1487 [4] is the proposed implementation for this PEP.
Alternative Proposals
PEP 3122 proposed addressing this problem by changing the way the main module is identified. That's a significant compatibility cost to incur to fix something that is a pretty minor bug in the overall scheme of things, and the PEP was rejected [3].
The advantage of the proposal in this PEP is that its only impact on normal code is the small amount of time needed to set the extra attribute when importing a module. Relative imports themselves should be sped up fractionally, as the package name is cached in the module globals, rather than having to be worked out again for each relative import.
References
| [1] | Absolute/relative import not working? (http://www.python.org/sf/1510172) |
| [2] | c.l.p. question about modules and relative imports (http://groups.google.com/group/comp.lang.python/browse_thread/thread/c44c769a72ca69fa/) |
| [3] | Guido's rejection of PEP 3122 (http://mail.python.org/pipermail/python-3000/2007-April/006793.html) |
| [4] | PEP 366 implementation patch (http://bugs.python.org/issue1487) |
| [5] | Acceptance of the PEP (http://mail.python.org/pipermail/python-dev/2007-November/075475.html) |
Copyright
This document has been placed in the public domain.
pep-0367 New Super
| PEP: | 367 |
|---|---|
| Title: | New Super |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Calvin Spealman <ironfroggy at gmail.com>, Tim Delaney <timothy.c.delaney at gmail.com> |
| Status: | Superseded |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 28-Apr-2007 |
| Python-Version: | 2.6 |
| Post-History: | 28-Apr-2007, 29-Apr-2007 (1), 29-Apr-2007 (2), 14-May-2007 |
Contents
Numbering Note
This PEP has been renumbered to PEP 3135. The text below is the last version submitted under the old number.
Abstract
This PEP proposes syntactic sugar for use of the super type to automatically construct instances of the super type binding to the class that a method was defined in, and the instance (or class object for classmethods) that the method is currently acting upon.
The premise of the new super usage suggested is as follows:
super.foo(1, 2)
to replace the old:
super(Foo, self).foo(1, 2)
and the current __builtin__.super be aliased to __builtin__.__super__ (with __builtin__.super to be removed in Python 3.0).
It is further proposed that assignment to super become a SyntaxError, similar to the behaviour of None.
Rationale
The current usage of super requires explicitly passing both the class and the instance it must operate from, breaking the DRY (Don't Repeat Yourself) rule. This hinders any change in class name, and is often considered a wart by many.
Specification
Within the specification section, some special terminology will be used to distinguish similar and closely related concepts. "super type" will refer to the actual builtin type named "super". A "super instance" is simply an instance of the super type, which is associated with a class and possibly with an instance of that class.
Because the new super semantics are not backwards compatible with Python 2.5, the new semantics will require a __future__ import:
from __future__ import new_super
The current __builtin__.super will be aliased to __builtin__.__super__. This will occur regardless of whether the new super semantics are active. It is not possible to simply rename __builtin__.super, as that would affect modules that do not use the new super semantics. In Python 3.0 it is proposed that the name __builtin__.super will be removed.
In place of the old usage of super, calls to the next class in the MRO (method resolution order) can be made without explicitly creating a super instance (although doing so will still be supported via __super__). Every function will have an implicit local named super. This name behaves identically to a normal local, including use by inner functions via a cell, with the following exceptions:
- Assigning to the name super will raise a SyntaxError at compile time;
- Calling a static method or normal function that accesses the name super will raise a TypeError at runtime.
Every function that uses the name super, or has an inner function that uses the name super, will include a preamble that performs the equivalent of:
super = __builtin__.__super__(<class>, <instance>)
where <class> is the class that the method was defined in, and <instance> is the first parameter of the method (normally self for instance methods, and cls for class methods). For static methods and normal functions, <class> will be None, resulting in a TypeError being raised during the preamble.
Note: The relationship between super and __super__ is similar to that between import and __import__.
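The import analogy holds today and can be demonstrated directly: the import statement is sugar for a call to the __import__ builtin, just as the proposed super name would be sugar for a __super__(...) call.

```python
# What the `import os` statement does under the hood:
os_via_builtin = __import__("os")
import os

print(os_via_builtin is os)  # True: both name the same module object
```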
Much of this was discussed in the thread of the python-dev list, "Fixing super anyone?" [1].
Open Issues
Determining the class object to use
The exact mechanism for associating the method with the defining class is not specified in this PEP, and should be chosen for maximum performance. For CPython, it is suggested that the class instance be held in a C-level variable on the function object which is bound to one of NULL (not part of a class), Py_None (static method) or a class object (instance or class method).
Should super actually become a keyword?
With this proposal, super would become a keyword to the same extent that None is a keyword. Further restricting the super name might simplify the implementation, but some are against actually making super a keyword. The simplest solution is often the correct one, and that may well mean not adding additional keywords to the language when they are not needed. Still, it may solve other open issues.
Closed Issues
super used with __call__ attributes
It was considered that instantiating super instances in the classic way might be a problem: calling the super type would look up the __call__ attribute and thus try to perform an automatic super lookup to the next class in the MRO. However, this was found to be false, because calling an object only looks up the __call__ method directly on the object's type. The following example shows this in action.
class A(object):
def __call__(self):
return '__call__'
def __getattribute__(self, attr):
if attr == '__call__':
return lambda: '__getattribute__'
a = A()
assert a() == '__call__'
assert a.__call__() == '__getattribute__'
In any case, with the renaming of __builtin__.super to __builtin__.__super__ this issue goes away entirely.
Reference Implementation
It is impossible to implement the above specification entirely in Python. This reference implementation has the following differences from the specification:
- New super semantics are implemented using bytecode hacking.
- Assignment to super is not a SyntaxError. Also see point #4.
- Classes must either use the metaclass autosuper_meta or inherit from the base class autosuper to acquire the new super semantics.
- super is not an implicit local variable. In particular, for inner functions to be able to use the super instance, there must be an assignment of the form super = super in the method.
The reference implementation assumes that it is being run on Python 2.5+.
#!/usr/bin/env python
#
# autosuper.py
from array import array
import dis
import new
import types
import __builtin__
__builtin__.__super__ = __builtin__.super
del __builtin__.super
# We need these for modifying bytecode
from opcode import opmap, HAVE_ARGUMENT, EXTENDED_ARG
LOAD_GLOBAL = opmap['LOAD_GLOBAL']
LOAD_NAME = opmap['LOAD_NAME']
LOAD_CONST = opmap['LOAD_CONST']
LOAD_FAST = opmap['LOAD_FAST']
LOAD_ATTR = opmap['LOAD_ATTR']
STORE_FAST = opmap['STORE_FAST']
LOAD_DEREF = opmap['LOAD_DEREF']
STORE_DEREF = opmap['STORE_DEREF']
CALL_FUNCTION = opmap['CALL_FUNCTION']
STORE_GLOBAL = opmap['STORE_GLOBAL']
DUP_TOP = opmap['DUP_TOP']
POP_TOP = opmap['POP_TOP']
NOP = opmap['NOP']
JUMP_FORWARD = opmap['JUMP_FORWARD']
ABSOLUTE_TARGET = dis.hasjabs
def _oparg(code, opcode_pos):
return code[opcode_pos+1] + (code[opcode_pos+2] << 8)
def _bind_autosuper(func, cls):
co = func.func_code
name = func.func_name
newcode = array('B', co.co_code)
codelen = len(newcode)
newconsts = list(co.co_consts)
newvarnames = list(co.co_varnames)
# Check if the global 'super' keyword is already present
try:
sn_pos = list(co.co_names).index('super')
except ValueError:
sn_pos = None
# Check if the varname 'super' keyword is already present
try:
sv_pos = newvarnames.index('super')
except ValueError:
sv_pos = None
# Check if the callvar 'super' keyword is already present
try:
sc_pos = list(co.co_cellvars).index('super')
except ValueError:
sc_pos = None
# If 'super' isn't used anywhere in the function, we don't have anything to do
if sn_pos is None and sv_pos is None and sc_pos is None:
return func
c_pos = None
s_pos = None
n_pos = None
# Check if the 'cls_name' and 'super' objects are already in the constants
for pos, o in enumerate(newconsts):
if o is cls:
c_pos = pos
if o is __super__:
s_pos = pos
if o == name:
n_pos = pos
# Add in any missing objects to constants and varnames
if c_pos is None:
c_pos = len(newconsts)
newconsts.append(cls)
if n_pos is None:
n_pos = len(newconsts)
newconsts.append(name)
if s_pos is None:
s_pos = len(newconsts)
newconsts.append(__super__)
if sv_pos is None:
sv_pos = len(newvarnames)
newvarnames.append('super')
# This goes at the start of the function. It is:
#
# super = __super__(cls, self)
#
# If 'super' is a cell variable, we store to both the
# local and cell variables (i.e. STORE_FAST and STORE_DEREF).
#
preamble = [
LOAD_CONST, s_pos & 0xFF, s_pos >> 8,
LOAD_CONST, c_pos & 0xFF, c_pos >> 8,
LOAD_FAST, 0, 0,
CALL_FUNCTION, 2, 0,
]
if sc_pos is None:
# 'super' is not a cell variable - we can just use the local variable
preamble += [
STORE_FAST, sv_pos & 0xFF, sv_pos >> 8,
]
else:
# If 'super' is a cell variable, we need to handle LOAD_DEREF.
preamble += [
DUP_TOP,
STORE_FAST, sv_pos & 0xFF, sv_pos >> 8,
STORE_DEREF, sc_pos & 0xFF, sc_pos >> 8,
]
preamble = array('B', preamble)
# Bytecode for loading the local 'super' variable.
load_super = array('B', [
LOAD_FAST, sv_pos & 0xFF, sv_pos >> 8,
])
preamble_len = len(preamble)
need_preamble = False
i = 0
while i < codelen:
opcode = newcode[i]
need_load = False
remove_store = False
if opcode == EXTENDED_ARG:
raise TypeError("Cannot use 'super' in function with EXTENDED_ARG opcode")
# If the opcode is an absolute target it needs to be adjusted
# to take into account the preamble.
elif opcode in ABSOLUTE_TARGET:
oparg = _oparg(newcode, i) + preamble_len
newcode[i+1] = oparg & 0xFF
newcode[i+2] = oparg >> 8
# If LOAD_GLOBAL(super) or LOAD_NAME(super) then we want to change it into
# LOAD_FAST(super)
elif (opcode == LOAD_GLOBAL or opcode == LOAD_NAME) and _oparg(newcode, i) == sn_pos:
need_preamble = need_load = True
# If LOAD_FAST(super) then we just need to add the preamble
elif opcode == LOAD_FAST and _oparg(newcode, i) == sv_pos:
need_preamble = need_load = True
# If LOAD_DEREF(super) then we change it into LOAD_FAST(super) because
# it's slightly faster.
elif opcode == LOAD_DEREF and _oparg(newcode, i) == sc_pos:
need_preamble = need_load = True
if need_load:
newcode[i:i+3] = load_super
i += 1
if opcode >= HAVE_ARGUMENT:
i += 2
# No changes needed - get out.
if not need_preamble:
return func
# Our preamble will have 3 things on the stack
co_stacksize = max(3, co.co_stacksize)
# Conceptually, our preamble is on the `def` line.
co_lnotab = array('B', co.co_lnotab)
if co_lnotab:
co_lnotab[0] += preamble_len
co_lnotab = co_lnotab.tostring()
# Our code consists of the preamble and the modified code.
codestr = (preamble + newcode).tostring()
codeobj = new.code(co.co_argcount, len(newvarnames), co_stacksize,
co.co_flags, codestr, tuple(newconsts), co.co_names,
tuple(newvarnames), co.co_filename, co.co_name,
co.co_firstlineno, co_lnotab, co.co_freevars,
co.co_cellvars)
func.func_code = codeobj
func.func_class = cls
return func
class autosuper_meta(type):
def __init__(cls, name, bases, clsdict):
UnboundMethodType = types.UnboundMethodType
for v in vars(cls):
o = getattr(cls, v)
if isinstance(o, UnboundMethodType):
_bind_autosuper(o.im_func, cls)
class autosuper(object):
__metaclass__ = autosuper_meta
if __name__ == '__main__':
class A(autosuper):
def f(self):
return 'A'
class B(A):
def f(self):
return 'B' + super.f()
class C(A):
def f(self):
def inner():
return 'C' + super.f()
# Needed to put 'super' into a cell
super = super
return inner()
class D(B, C):
def f(self, arg=None):
var = None
return 'D' + super.f()
assert D().f() == 'DBCA'
Disassembly of B.f and C.f reveals the different preambles used when super is simply a local variable compared to when it is used by an inner function.
>>> dis.dis(B.f)
214 0 LOAD_CONST 4 (<type 'super'>)
3 LOAD_CONST 2 (<class '__main__.B'>)
6 LOAD_FAST 0 (self)
9 CALL_FUNCTION 2
12 STORE_FAST 1 (super)
215 15 LOAD_CONST 1 ('B')
18 LOAD_FAST 1 (super)
21 LOAD_ATTR 1 (f)
24 CALL_FUNCTION 0
27 BINARY_ADD
28 RETURN_VALUE
>>> dis.dis(C.f)
218 0 LOAD_CONST 4 (<type 'super'>)
3 LOAD_CONST 2 (<class '__main__.C'>)
6 LOAD_FAST 0 (self)
9 CALL_FUNCTION 2
12 DUP_TOP
13 STORE_FAST 1 (super)
16 STORE_DEREF 0 (super)
219 19 LOAD_CLOSURE 0 (super)
22 LOAD_CONST 1 (<code object inner at 00C160A0, file "autosuper.py", line 219>)
25 MAKE_CLOSURE 0
28 STORE_FAST 2 (inner)
223 31 LOAD_FAST 1 (super)
34 STORE_DEREF 0 (super)
224 37 LOAD_FAST 2 (inner)
40 CALL_FUNCTION 0
43 RETURN_VALUE
Note that in the final implementation, the preamble would not be part of the bytecode of the method, but would occur immediately following unpacking of parameters.
Alternative Proposals
No Changes
Although it's always attractive to just keep things as they are, people have sought a change in super usage for some time, for the good reasons mentioned previously.
- Decoupling from the class name (which might not even be bound to the right class anymore!)
- Simpler looking, cleaner super calls would be better
Dynamic attribute on super type
This proposal adds a dynamic attribute lookup to the super type, which automatically determines the proper class and instance parameters. Each super attribute lookup identifies these parameters and performs the super lookup on the instance, as the current super implementation does with the explicit invocation of a super instance upon a class and instance.
This proposal relies on sys._getframe(), which is not appropriate for anything except a prototype implementation.
super(__this_class__, self)
This is nearly an anti-proposal, as it basically relies on the acceptance of the __this_class__ PEP, which proposes a special name that would always be bound to the class within which it is used. If that is accepted, __this_class__ could simply be used instead of the class' name explicitly, solving the name binding issues [2].
self.__super__.foo(*args)
The __super__ attribute is mentioned in several places in this PEP, and could be a candidate for the complete solution, used explicitly instead of any direct super usage. However, double-underscore names are usually an internal detail, and are generally kept out of everyday code.
super(self, *args) or __super__(self, *args)
This solution only solves the problem of the type indication, does not handle differently named super methods, and is explicit about the name of the instance. It is less flexible because it cannot be applied to other method names in cases where that is needed. One use case this fails is where a base class has a factory classmethod and a subclass has two factory classmethods, both of which need to make proper super calls to the one in the base class.
super.foo(self, *args)
This variation actually eliminates the problems with locating the proper instance, and if any of the alternatives were pushed into the spotlight, I would want it to be this one.
super or super()
This proposal leaves no room for different names, signatures, or application to other classes or instances. A way to allow some similar use alongside the normal proposal would be preferable, encouraging good design of multiple inheritance trees and compatible methods.
super(*p, **kw)
There has been the proposal that directly calling super(*p, **kw) would be equivalent to calling the method on the super object with the same name as the method currently being executed i.e. the following two methods would be equivalent:
def f(self, *p, **kw):
super.f(*p, **kw)
def f(self, *p, **kw):
super(*p, **kw)
There is strong sentiment for and against this, but implementation and style concerns are obvious. Guido has suggested that this should be excluded from this PEP on the principle of KISS (Keep It Simple Stupid).
History
- 29-Apr-2007 - Changed title from "Super As A Keyword" to "New Super"
- Updated much of the language and added a terminology section for clarification in confusing places.
- Added reference implementation and history sections.
- 06-May-2007 - Updated by Tim Delaney to reflect discussions on the python-3000 and python-dev mailing lists.
References
| [1] | Fixing super anyone? (http://mail.python.org/pipermail/python-3000/2007-April/006667.html) |
| [2] | PEP 3130: Access to Module/Class/Function Currently Being Defined (this) (http://mail.python.org/pipermail/python-ideas/2007-April/000542.html) |
Copyright
This document has been placed in the public domain.
pep-0368 Standard image protocol and class
| PEP: | 368 |
|---|---|
| Title: | Standard image protocol and class |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Lino Mastrodomenico <l.mastrodomenico at gmail.com> |
| Status: | Deferred |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 28-Jun-2007 |
| Python-Version: | 2.6, 3.0 |
| Post-History: |
Contents
Abstract
The current situation of image storage and manipulation in the Python world is extremely fragmented: almost every library that uses image objects has implemented its own image class, incompatible with everyone else's and often not very pythonic. A basic RGB image class exists in the standard library (Tkinter.PhotoImage), but is pretty much unusable, and unused, for anything except Tkinter programming.
This fragmentation not only takes up valuable space in the developers' minds, but also makes the exchange of images between different libraries (needed in relatively common use cases) slower and more complex than it needs to be.
This PEP proposes to improve the situation by defining a simple and pythonic image protocol/interface that can be hopefully accepted and implemented by existing image classes inside and outside the standard library without breaking backward compatibility with their existing user bases. In practice this is a definition of how a minimal image-like object should look and act (in a similar way to the read() and write() methods in file-like objects).
The inclusion in the standard library of a class that provides basic image manipulation functionality and implements the new protocol is also proposed, together with a mixin class that helps adding support for the protocol to existing image classes.
PEP Deferral
Further exploration of the concepts covered in this PEP has been deferred for lack of a current champion interested in promoting the goals of the PEP and collecting and incorporating feedback, and with sufficient available time to do so effectively.
Rationale
A good way to have high quality modules ready for inclusion in the Python standard library is to simply wait for natural selection among competing external libraries to provide a clear winner with useful functionality and a big user base. Then the de-facto standard can be officially sanctioned by including it in the standard library.
Unfortunately this approach hasn't worked well for the creation of a dominant image class in the Python world: almost every third-party library that requires an image object creates its own class incompatible with the ones from other libraries. This is a real problem because it's entirely reasonable for a program to create and manipulate an image using, e.g., PIL (the Python Imaging Library) and then display it using wxPython or pygame. But these libraries have different and incompatible image classes, and the usual solution is to manually "export" an image from the source to a (width, height, bytes_string) tuple and "import" it creating a new instance in the target format. This approach works, but is both uglier and slower than it needs to be.
Another "solution" that has sometimes been used is the creation of specific adapters and/or converters from one class to another (e.g. PIL offers the ImageTk module for converting PIL images to a class compatible with the Tkinter one). But this approach doesn't scale well with the number of libraries involved and it's still annoying for the user: if I have a perfectly good image object, why should I convert it before passing it to the next method? Why can't the method simply accept my image as-is?
The problem isn't by any stretch limited to the three mentioned libraries and has probably multiple causes, including two that IMO are very important to understand before solving it:
- in today's computing world an image is a basic type not strictly tied to a specific domain. This is why there will never be a clear winner between the image classes from the three libraries mentioned above (PIL, wxPython and pygame): they cover different domains and don't really compete with each other;
- the Python standard library has never provided a good image class that can be adopted or imitated by third-party modules. Tkinter.PhotoImage provides basic RGB functionality, but it's by far the slowest and ugliest of the bunch and it can be instantiated only after the Tkinter root window has been created.
This PEP tries to improve this situation in four ways:
- It defines a simple and pythonic image protocol/interface (both on the Python and the C side) that can be hopefully accepted and implemented by existing image classes inside and outside the standard library without breaking backward compatibility with their existing user bases.
- It proposes the inclusion in the standard library of three new classes:
- ImageMixin provides almost everything necessary to implement the new protocol; its main purpose is to make it as simple as possible for existing libraries to support this interface, in some cases as simply as adding it to the list of base classes and making minor additions to the constructor.
- Image is a subclass of ImageMixin and will add a constructor that can resize and/or convert an image between different pixel formats. This is intended to provide a fast and efficient default implementation of the new protocol.
- ImageSize is a minor helper class. See below for details.
- Tkinter.PhotoImage will implement the new protocol (mostly through the ImageMixin class) and all the Tkinter methods that can receive an image will be modified to accept any object that implements the interface. As an aside, the author of this PEP will collaborate with the developers of the most common external libraries to achieve the same goal (supporting the protocol in their classes and accepting any class that implements it).
- New PyImage_* functions will be added to the CPython C API: they implement the C side of the protocol and accept as first parameter any object that supports it, even if it isn't an instance of the Image/ImageMixin classes.
The main effects for the end user will be a simplification of the interchange of images between different libraries (if everything goes well, any Python library will accept images from any other library) and the out-of-the-box availability of the new Image class. The new class is intended to cover simple but common use cases like cropping and/or resizing a photograph to the desired size and passing it to an appropriate widget for displaying it in a window, or darkening a texture and passing it to a 3D library.
The Image class is not intended to replace or compete with PIL, Pythonmagick or NumPy, even if it provides a (very small) subset of the functionality of these three libraries. In particular PIL offers very rich image manipulation features with dozens of classes, filters, transformations and file formats. The inclusion of PIL (or something similar) in the standard library may, or may not, be a worthy goal but it's completely outside the scope of this PEP.
Specification
The imageop module is used as the default location for the new classes and objects because it has for a long time hosted functions that provided a somewhat similar functionality, but a new module may be created if preferred (e.g. a new "image" or "media" module; the latter may eventually include other multimedia classes).
MODES is a new module level constant: it is a set of the pixel formats supported by the Image class. Any image object that implements the new protocol is guaranteed to be formatted in one of these modes, but libraries that accept images are allowed to support only a subset of them.
These modes are in turn also available as module level constants (e.g. imageop.RGB).
The following table is a summary of the modes currently supported and their properties:
| Name | Component names | Bits per component | Subsampling | Valid intervals |
|---|---|---|---|---|
| L | l (lowercase L) | 8 | no | full range |
| L16 | l | 16 | no | full range |
| L32 | l | 32 | no | full range |
| LA | l, a | 8 | no | full range |
| LA32 | l, a | 16 | no | full range |
| RGB | r, g, b | 8 | no | full range |
| RGB48 | r, g, b | 16 | no | full range |
| RGBA | r, g, b, a | 8 | no | full range |
| RGBA64 | r, g, b, a | 16 | no | full range |
| YV12 | y, cr, cb | 8 | 1, 2, 2 | 16-235, 16-240, 16-240 |
| JPEG_YV12 | y, cr, cb | 8 | 1, 2, 2 | full range |
| CMYK | c, m, y, k | 8 | no | full range |
| CMYK64 | c, m, y, k | 16 | no | full range |
When the name of a mode ends with a number, it represents the average number of bits per pixel. All the other modes simply use a byte per component per pixel.
No palette modes or modes with less than 8 bits per component are supported. Welcome to the 21st century.
Here's a quick description of the modes and the rationale for their inclusion; there are four groups of modes:
grayscale (L* modes): they are heavily used in scientific computing (those people may also need a very high dynamic range and precision, hence L32, the only mode with 32 bits per component) and sometimes it can be useful to consider a single component of a color image as a grayscale image (this is used by the individual planes of the planar images, see YV12 below); the name of the component ('l', lowercase letter L) stands for luminance, the second optional component ('a') is the alpha value and represents the opacity of the pixels: alpha = 0 means full transparency, alpha = 255/65535 represents a fully opaque pixel;
RGB* modes: the garden variety color images. The optional alpha component has the same meaning as in grayscale modes;
YCbCr, a.k.a. YUV (*YV12 modes). These modes are planar (i.e. the values of all the pixel for each component are stored in a consecutive memory area, instead of the usual arrangement where all the components of a pixel reside in consecutive bytes) and use a 1, 2, 2 (a.k.a. 4:2:0) subsampling (i.e. each pixel has its own Y value, but the Cb and Cr components are shared between groups of 2x2 adjacent pixels) because this is the format that's by far the most common for YCbCr images. Please note that the V (Cr) plane is stored before the U (Cb) plane.
YV12 is commonly used for MPEG2 (including DVDs), MPEG4 (both ASP/DivX and AVC/H.264) and Theora video frames. Valid values for Y are in range(16, 236), and valid values for Cb and Cr are in range(16, 241). JPEG_YV12 is similar to YV12, but the three components can have the full range of 256 values. It's the native format used by almost all JPEG/JFIF files and by MJPEG video frames. The "strangeness" of these two wrt all the other supported modes derives from the fact that they are widely used that way by a lot of existing libraries and applications; this is also the reason why they are included (that, and the fact that they can't be losslessly converted to RGB, because YCbCr is a bigger color space); the funny 4:2:0 planar arrangement of the pixel values is relatively easy to support because in most cases the three planes can be considered three separate grayscale images;
CMYK* modes (cyan, magenta, yellow and black) are subtractive color modes, used for printing color images on dead trees. Professional designers love to pretend that they can't live without them, so here they are.
Python API
See the examples below.
In Python 2.x, all the new classes defined here are new-style classes.
Mode Objects
The mode objects offer a number of attributes and methods that can be used for implementing generic algorithms that work on different types of images:
components
The number of components per pixel (e.g. 4 for an RGBA image).
component_names
A tuple of strings; see the column "Component names" in the above table.
bits_per_component
8, 16 or 32; see "Bits per component" in the above table.
bytes_per_pixel
components * bits_per_component // 8, only available for non planar modes (see below).
planar
Boolean; True if the image components reside each in a separate plane. Currently this happens if and only if the mode uses subsampling.
subsampling
A tuple that for each component in the mode contains a tuple of two integers that represent the amount of downsampling in the horizontal and vertical direction, respectively. In practice it's ((1, 1), (2, 2), (2, 2)) for YV12 and JPEG_YV12 and ((1, 1),) * components for everything else.
x_divisor
max(x for x, y in subsampling); the width of an image that uses this mode must be divisible by this value.
y_divisor
max(y for x, y in subsampling); the height of an image that uses this mode must be divisible by this value.
intervals
A tuple that for each component in the mode contains a tuple of two integers: the minimum and maximum valid value for the component. Its value is ((16, 235), (16, 240), (16, 240)) for YV12 and ((0, 2 ** bits_per_component - 1),) * components for everything else.
get_length(iterable[integer]) -> int
The parameter must be an iterable that contains two integers: the width and height of an image; it returns the number of bytes needed to store an image of these dimensions with this mode.
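As an illustration of how the mode attributes above tie together, here is a hypothetical sketch of how get_length could be computed from a mode's subsampling and bits_per_component (mode_get_length is an invented stand-in, not part of the proposal):

```python
# Hypothetical sketch of get_length(): the buffer holds one plane per
# component, each downsampled according to the mode's subsampling.

def mode_get_length(size, subsampling, bits_per_component):
    width, height = size
    total = 0
    for xsub, ysub in subsampling:
        # each component plane stores width//xsub * height//ysub values
        total += (width // xsub) * (height // ysub) * (bits_per_component // 8)
    return total

# RGB: three 8-bit planes at full resolution -> 3 bytes per pixel
print(mode_get_length((6, 10), ((1, 1),) * 3, 8))             # 180

# YV12: a full-resolution Y plane plus quarter-resolution Cr and Cb
# planes -> 12 bits (1.5 bytes) per pixel on average, hence the name
print(mode_get_length((6, 10), ((1, 1), (2, 2), (2, 2)), 8))  # 90
```

Note how the YV12 figure also explains why the width and height of a subsampled image must be divisible by the mode's x_divisor and y_divisor.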
Implementation detail: the modes are instances of a subclass of str and have a value equal to their name (e.g. imageop.RGB == 'RGB') except for L32 that has value 'I'. This is only intended for backward compatibility with existing PIL users; new code that uses the image protocol proposed here should not rely on this detail.
Image Protocol
Any object that supports the image protocol must provide the following methods and attributes:
mode
The format and the arrangement of the pixels in this image; it's one of the constants in the MODES set.
size
An instance of the ImageSize class; it's a named tuple of two integers: the width and the height of the image in pixels; both of them must be >= 1 and can also be accessed as the width and height attributes of size.
buffer
A sequence of integers between 0 and 255; they are the actual bytes used for storing the image data (i.e. modifying their values affects the image pixels and vice versa); the data has a row-major/C-contiguous order without padding and without any special memory alignment, even when there are more than 8 bits per component. The only supported methods are __len__, __getitem__/__setitem__ (with both integers and slice indexes) and __iter__; on the C side it implements the buffer protocol.
This is a pretty low level interface to the image and the user is responsible for using the correct (native) byte order for modes with more than 8 bits per component and the correct value ranges for YV12 images. A buffer may or may not keep a reference to its image, but it's still safe (if useless) to use the buffer even after the corresponding image has been destroyed by the garbage collector (this will require changes to the image class of wxPython and possibly other libraries). Implementation detail: this can be an array('B'), a bytes() object or a specialized fixed-length type.
info
A dict object that can contain arbitrary metadata associated with the image (e.g. DPI, gamma, ICC profile, exposure time...); the interpretation of this data is beyond the scope of this PEP and probably depends on the library used to create and/or to save the image; if a method of the image returns a new image, it can copy or adapt metadata from its own info attribute (the ImageMixin implementation always creates a new image with an empty info dictionary).
components, component_names, bits_per_component, bytes_per_pixel, planar, subsampling, x_divisor, y_divisor, intervals
Shortcuts for the corresponding mode.* attributes.
map(function[, function...]) -> None
For every pixel in the image, maps each component through the corresponding function. If only one function is passed, it is used repeatedly for each component. This method modifies the image in place and is usually very fast (most of the time the functions are called only a small number of times, possibly only once for simple functions without branches), but it imposes a number of restrictions on the function(s) passed:
- it must accept a single integer argument and return a number (map will round the result to the nearest integer and clip it to range(0, 2 ** bits_per_component), if necessary);
- it must not try to intercept any BaseException, Exception or any unknown subclass of Exception raised by any operation on the argument (implementations may try to optimize the speed by passing funny objects, so even a simple "if n == 10:" may raise an exception: simply ignore it, map will take care of it); catching any other exception is fine;
- it should be side-effect free and its result should not depend on values (other than the argument) that may change during a single invocation of map.
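To make the rounding and clipping semantics concrete, here is a naive pure-Python sketch of what map does (the real method would be heavily optimized; map_components and the flat-list image model are illustrative only):

```python
# Naive sketch of the map() semantics: apply each function to its
# component, round the result to the nearest integer and clip it to
# the valid range for the mode.

def map_components(buffer, functions, bits_per_component=8):
    top = 2 ** bits_per_component - 1
    n = len(functions)
    for i, value in enumerate(buffer):
        # when a single function is passed it is reused for every component
        result = int(round(functions[i % n](value)))
        buffer[i] = min(max(result, 0), top)

pixel = [10, 250, 128]                       # one RGB pixel
map_components(pixel, [lambda v: v * 1.1])   # brighten by 10%
print(pixel)                                 # [11, 255, 141] (250 was clipped)
```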
rotate90() -> image
rotate180() -> image
rotate270() -> image
Return a copy of the image rotated 90, 180 or 270 degrees counterclockwise around its center.
clip() -> None
Saturates invalid component values in YV12 images to the minimum or maximum allowed (see mode.intervals); for all other image modes this method does nothing and is very fast. Libraries that save/export YV12 images are encouraged to always call this method, since intermediate operations (e.g. the map method) may assign values outside the valid intervals to pixels.
split() -> tuple[image]
Returns a tuple of L, L16 or L32 images corresponding to the individual components in the image.
Planar images also support attributes with the same names defined in component_names: they contain grayscale (mode L) images that offer a view on the pixel values for the corresponding component; any change to the subimages is immediately reflected in the parent image and vice versa (their buffers refer to the same memory location).
Non-planar images offer the following additional methods:
pixels() -> iterator[pixel]
Returns an iterator that iterates over all the pixels in the image, starting from the top line and scanning each line from left to right. See below for a description of the pixel objects.
__iter__() -> iterator[line]
Returns an iterator that iterates over all the lines in the image, from top to bottom. See below for a description of the line objects.
__len__() -> int
Returns the number of lines in the image (size.height).
__getitem__(integer) -> line
Returns the line at the specified (y) position.
__getitem__(tuple[integer]) -> pixel
The parameter must be a tuple of two integers; they are interpreted respectively as x and y coordinates in the image (0, 0 is the top left corner) and a pixel object is returned.
__getitem__(slice | tuple[integer | slice]) -> image
The parameter must be a slice or a tuple that contains two slices or an integer and a slice; the selected area of the image is copied and a new image is returned; image[x:y:z] is equivalent to image[:, x:y:z].
__setitem__(tuple[integer], integer | iterable[integer]) -> None
Modifies the pixel at the specified position; image[x, y] = integer is a shortcut for image[x, y] = (integer,) for images with a single component.
__setitem__(slice | tuple[integer | slice], image) -> None
Selects an area in the same way as the corresponding form of the __getitem__ method and assigns to it a copy of the pixels from the image in the second argument, that must have exactly the same mode as this image and the same size as the specified area; the alpha component, if present, is simply copied and doesn't affect the other components of the image (i.e. no alpha compositing is performed).
The mode, size and buffer (including the address in memory of the buffer) never change after an image is created.
It is expected that, if PEP 3118 is accepted, all the image objects will support the new buffer protocol, however this is beyond the scope of this PEP.
Image and ImageMixin Classes
The ImageMixin class implements all the methods and attributes described above except mode, size, buffer and info. Image is a subclass of ImageMixin that adds support for these four attributes and offers the following constructor (please note that the constructor is not part of the image protocol):
__init__(mode, size, color, source)
mode must be one of the constants in the MODES set, size is a sequence of two integers (width and height of the new image); color is a sequence of integers, one for each component of the image, used to initialize all the pixels to the same value; source can be a sequence of integers of the appropriate size and format that is copied as-is in the buffer of the new image or an existing image; in Python 2.x source can also be an instance of str and is interpreted as a sequence of bytes. color and source are mutually exclusive and if they are both omitted the image is initialized to transparent black (all the bytes in the buffer have value 16 in the YV12 mode, 255 in the CMYK* modes and 0 for everything else). If source is present and is an image, mode and/or size can be omitted; if they are specified and are different from the source mode and/or size, the source image is converted.
The exact algorithms used for resizing and doing color space conversions may differ between Python versions and implementations, but they always give high quality results (e.g.: a cubic spline interpolation can be used for upsampling and an antialias filter can be used for downsampling images); any combination of mode conversion is supported, but the algorithm used for conversions to and from the CMYK* modes is pretty naïve: if you have the exact color profiles of your devices you may want to use a good color management tool such as LittleCMS. The new image has an empty info dict.
Line Objects
The line objects (returned, e.g., when iterating over an image) support the following attributes and methods:
mode
The mode of the image from where this line comes.
__iter__() -> iterator[pixel]
Returns an iterator that iterates over all the pixels in the line, from left to right. See below for a description of the pixel objects.
__len__() -> int
Returns the number of pixels in the line (the image width).
__getitem__(integer) -> pixel
Returns the pixel at the specified (x) position.
__getitem__(slice) -> image
The selected part of the line is copied and a new image is returned; the new image will always have height 1.
__setitem__(integer, integer | iterable[integer]) -> None
Modifies the pixel at the specified position; line[x] = integer is a shortcut for line[x] = (integer,) for images with a single component.
__setitem__(slice, image) -> None
Selects a part of the line and assigns to it a copy of the pixels from the image in the second argument, that must have height 1, a width equal to the specified slice and the same mode as this line; the alpha component, if present, is simply copied and doesn't affect the other components of the image (i.e. no alpha compositing is performed).
Pixel Objects
The pixel objects (returned, e.g., when iterating over a line) support the following attributes and methods:
mode
The mode of the image from where this pixel comes.
value
A tuple of integers, one for each component. Any iterable of the correct length can be assigned to value (it will be automagically converted to a tuple), but you can't assign to it an integer, even if the mode has only a single component: use, e.g., pixel.l = 123 instead.
r, g, b, a, l, c, m, y, k
The integer values of each component; only those applicable for the current mode (in mode.component_names) will be available.
__iter__() -> iterator[int]
__len__() -> int
__getitem__(integer) -> int
__setitem__(integer, integer) -> None
These four methods emulate a fixed length list of integers, one for each pixel component.
ImageSize Class
ImageSize is a named tuple, a class identical to tuple except that:
- its constructor only accepts two integers, width and height; they are converted in the constructor using their __index__() methods, so all the ImageSize objects are guaranteed to contain only int (or possibly long, in Python 2.x) instances;
- it has a width and a height property that are equivalent to the first and the second number in the tuple, respectively;
- the string returned by its __repr__ method is 'imageop.ImageSize(width=%d, height=%d)' % (width, height).
ImageSize is not usually instantiated by end-users, but can be used when creating a new class that implements the image protocol, since the size attribute must be an ImageSize instance.
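A minimal sketch of the semantics described above (the real class would live in imageop; this version uses Python 3 syntax):

```python
import operator
from collections import namedtuple

class ImageSize(namedtuple('ImageSize', 'width height')):
    __slots__ = ()

    def __new__(cls, width, height):
        # operator.index() calls __index__(), rejecting floats and
        # other non-integral values with a TypeError
        return super().__new__(cls, operator.index(width),
                               operator.index(height))

    def __repr__(self):
        return 'imageop.ImageSize(width=%d, height=%d)' % self

size = ImageSize(640, 480)
print(size.width, size[1])   # 640 480
print(size)                  # imageop.ImageSize(width=640, height=480)
```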
C API
The available image modes are visible at the C level as PyImage_* constants of type PyObject * (e.g.: PyImage_RGB is imageop.RGB).
The following functions offer a C-friendly interface to mode and image objects (all the functions return NULL or -1 on failure):
int PyImageMode_Check(PyObject *obj)
Returns true if the object obj is a valid image mode.
These functions are equivalent to their corresponding Python attributes or methods.
int PyImage_Check(PyObject *obj)
Returns true if the object obj is an Image object or an instance of a subtype of the Image type; see also PyObject_CheckImage below.
int PyImage_CheckExact(PyObject *obj)
Returns true if the object obj is an Image object, but not an instance of a subtype of the Image type.
Returns a new Image instance, initialized to transparent black (see Image.__init__ above for the details).
Returns a new Image instance, initialized with the contents of the image object rescaled and converted to the specified mode, if necessary.
Returns a new Image instance, initialized with the contents of the buffer object.
int PyObject_CheckImage(PyObject *obj)
Returns true if the object obj implements a sufficient subset of the image protocol to be accepted by the functions defined below, even if its class is not a subclass of ImageMixin and/or Image. Currently it simply checks for the existence and correctness of the attributes mode, size and buffer.
These functions are equivalent to their corresponding Python attributes or methods; the image memory can be accessed only with the GIL and a reference to the image or its buffer held, and extra care should be taken for modes with more than 8 bits per component: the data is stored in native byte order and may not be aligned on 2- or 4-byte boundaries.
Examples
A few examples of common operations with the new Image class and protocol:
# create a new black RGB image of 6x9 pixels
rgb_image = imageop.Image(imageop.RGB, (6, 9))
# same as above, but initialize the image to bright red
rgb_image = imageop.Image(imageop.RGB, (6, 9), color=(255, 0, 0))
# convert the image to YCbCr
yuv_image = imageop.Image(imageop.JPEG_YV12, source=rgb_image)
# read the value of a pixel and split it into three ints
r, g, b = rgb_image[x, y]
# modify the magenta component of a pixel in a CMYK image
cmyk_image[x, y].m = 13
# modify the Y (luma) component of a pixel in a *YV12 image and
# its corresponding subsampled Cr (red chroma)
yuv_image.y[x, y] = 42
yuv_image.cr[x // 2, y // 2] = 54
# iterate over an image
for line in rgb_image:
for pixel in line:
# swap red and blue, and set green to 0
pixel.value = pixel.b, 0, pixel.r
# find the maximum value of the red component in the image
max_red = max(pixel.r for pixel in rgb_image.pixels())
# count the number of colors in the image
num_of_colors = len(set(tuple(pixel) for pixel in image.pixels()))
# copy a block of 4x2 pixels near the upper right corner of an
# image and paste it into the lower left corner of the same image
image[:4, -2:] = image[-6:-2, 1:3]
# create a copy of the image, except that the new image can have a
# different (usually empty) info dict
new_image = image[:]
# create a mirrored copy of the image, with the left and right
# sides flipped
flipped_image = image[::-1, :]
# downsample an image to half its original size using a fast, low
# quality operation and a slower, high quality one:
low_quality_image = image[::2, ::2]
new_size = image.size.width // 2, image.size.height // 2
high_quality_image = imageop.Image(size=new_size, source=image)
# direct buffer access
rgb_image[0, 0] = r, g, b
assert tuple(rgb_image.buffer[:3]) == (r, g, b)
Backwards Compatibility
There are three areas touched by this PEP where backwards compatibility should be considered:
- Python 2.6: new classes and objects are added to the imageop module without touching the existing module contents; new methods and attributes will be added to Tkinter.PhotoImage and its __getitem__ and __setitem__ methods will be modified to accept integers, tuples and slices (currently they only accept strings). All the changes provide a superset of the existing functionality, so no major compatibility issues are expected.
- Python 3.0: the legacy contents of the imageop module will be deleted, according to PEP 3108; everything defined in this proposal will work like in Python 2.x with the exception of the usual 2.x/3.0 differences (e.g. support for long integers and for interpreting str instances as sequences of bytes will be dropped).
- external libraries: the names and the semantics of the standard image methods and attributes are carefully chosen to allow some external libraries that manipulate images (including at least PIL, wxPython and pygame) to implement the new protocol in their image classes without breaking compatibility with existing code. The only blatant conflicts between the image protocol and NumPy arrays are the value of the size attribute and the coordinates order in the image[x, y] expression.
Reference Implementation
If this PEP is accepted, the author will provide a reference implementation of the new classes in pure Python (that can run in CPython, PyPy, Jython and IronPython) and a second one optimized for speed in Python and C, suitable for inclusion in the CPython standard library. The author will also submit the required Tkinter patches. Both a Python 2.x and a Python 3.0 version of all the code will be available (the two versions are expected to be very similar, and the Python 3.0 one will probably be generated almost completely automatically).
Acknowledgments
The implementation of this PEP, if accepted, is sponsored by Google through the Google Summer of Code program.
Copyright
This document has been placed in the public domain.
pep-0369 Post import hooks
| PEP: | 369 |
|---|---|
| Title: | Post import hooks |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Christian Heimes <christian at python.org> |
| Status: | Withdrawn |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 02-Jan-2008 |
| Python-Version: | 2.6, 3.0 |
| Post-History: | 02-Dec-2012 |
Contents
Withdrawal Notice
This PEP has been withdrawn by its author, as much of the detailed design is no longer valid following the migration to importlib in Python 3.3.
Abstract
This PEP proposes enhancements for the import machinery to add post import hooks. It is intended primarily to support the wider use of abstract base classes that is expected in Python 3.0.
The PEP originally started as a combined PEP for lazy imports and post import hooks. After some discussion on the python-dev mailing list, the PEP was split into two separate PEPs. [1]
Rationale
Python has no API to hook into the import machinery and execute code after a module is successfully loaded. The import hooks of PEP 302 are about finding and loading modules, but they were not designed to act as post import hooks.
Use cases
A use case for a post import hook is mentioned in Nick Coghlan's initial posting [2] about callbacks on module import. It was found during the development of Python 3.0 and its ABCs. We wanted to register classes like decimal.Decimal with an ABC, but the module should not be imported on every interpreter startup. Nick came up with this example:
@imp.when_imported('decimal')
def register(decimal):
Inexact.register(decimal.Decimal)
The function register is registered as callback for the module named 'decimal'. When decimal is imported the function is called with the module object as argument.
While this particular example isn't necessary in practice, (as decimal.Decimal will inherit from the appropriate abstract Number base class in 2.6 and 3.0), it still illustrates the principle.
Existing implementations
PJE's peak.util.imports [3] implements post load hooks. My implementation shares a lot with his and it's partly based on his ideas.
Post import hook implementation
Post import hooks are called after a module has been loaded. The hooks are callables which take one argument, the module instance. They are registered by the dotted name of the module, e.g. 'os' or 'os.path'.
The callables are stored in the dict sys.post_import_hooks, which is a mapping from names (as strings) to lists of callables or None.
States
No hook was registered
sys.post_import_hooks contains no entry for the module
A hook is registered and the module is not loaded yet
The import hook registry contains an entry sys.post_import_hooks["name"] = [hook1]
A module is successfully loaded
The import machinery checks if sys.post_import_hooks contains post import hooks for the newly loaded module. If hooks are found then they are called in the order they were registered, with the module instance as first argument. The processing of the hooks is stopped when a hook raises an exception. At the end the entry for the module name is set to None, even when an error has occurred.
Additionally the new __notified__ slot of the module object is set to True in order to prevent infinite recursion when the notification method is called inside a hook. For objects that don't subclass PyModule, a new attribute is added instead.
A module can't be loaded
The import hooks are neither called nor removed from the registry. It may be possible to load the module later.
A hook is registered but the module is already loaded
The hook is fired immediately.
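The states above can be sketched in pure Python (register_post_import_hook and the module-level post_import_hooks dict are simplified stand-ins for the proposed imp function and sys.post_import_hooks):

```python
import sys

post_import_hooks = {}   # stand-in for sys.post_import_hooks

def register_post_import_hook(hook, name):
    module = sys.modules.get(name)
    if module is not None:
        hook(module)     # module already loaded: fire immediately
    else:
        # queue the hook until the import machinery loads the module
        post_import_hooks.setdefault(name, []).append(hook)

seen = []
register_post_import_hook(lambda mod: seen.append(mod.__name__), 'sys')
print(seen)                      # ['sys'] -- sys is always loaded
register_post_import_hook(lambda mod: None, 'some_unimported_module')
print(list(post_import_hooks))   # ['some_unimported_module']
```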
Invariants
The import hook system guarantees certain invariants. XXX
Sample Python implementation
A Python implementation may look like this (post_import_hooks stands for sys.post_import_hooks; error handling is still marked XXX):
import sys

def notify(name):
    try:
        module = sys.modules[name]
    except KeyError:
        raise ImportError("Module %s has not been imported" % (name,))
    if getattr(module, '__notified__', False):
        return
    try:
        module.__notified__ = True
        if '.' in name:
            # notify the parent package first
            notify(name[:name.rfind('.')])
        for callback in post_import_hooks.get(name) or ():
            callback(module)
    finally:
        post_import_hooks[name] = None
XXX
C API
New C API functions
- PyObject* PyImport_GetPostImportHooks(void)
- Returns the dict sys.post_import_hooks or NULL
- PyObject* PyImport_NotifyLoadedByModule(PyObject *module)
- Notify the post import system that a module was requested. Returns a borrowed reference to the same module object, or NULL if an error has occurred. The function calls only the hooks for the module itself and not its parents. The function must be called with the import lock acquired.
- PyObject* PyImport_NotifyLoadedByName(const char *name)
- PyImport_NotifyLoadedByName("a.b.c") calls PyImport_NotifyLoadedByModule() for a, a.b and a.b.c in that particular order. The modules are retrieved from sys.modules. If a module can't be retrieved, an exception is raised otherwise the a borrowed reference to modname is returned. The hook calls always start with the prime parent module. The caller of PyImport_NotifyLoadedByName() must hold the import lock!
- PyObject* PyImport_RegisterPostImportHook(PyObject *callable, PyObject *mod_name)
- Register a new hook callable for the module mod_name
- int PyModule_GetNotified(PyObject *module)
- Returns the status of the __notified__ slot / attribute.
- int PyModule_SetNotified(PyObject *module, int status)
- Set the status of the __notified__ slot / attribute.
The PyImport_NotifyLoadedByModule() method is called inside import_submodule(). The import system makes sure that the import lock is acquired and the hooks for the parent modules are already called.
Python API
The import hook registry and two new API methods are exposed through the sys and imp module.
- sys.post_import_hooks
The dict contains the post import hooks:
{"name" : [hook1, hook2], ...}- imp.register_post_import_hook(hook: "callable", name: str)
- Register a new hook hook for the module name
- imp.notify_module_loaded(module: "module instance") -> module
- Notify the system that a module has been loaded. The method is provided for compatibility with existing lazy / deferred import extensions.
- module.__notified__
- A slot of a module instance. XXX
The when_imported function decorator is also in the imp module, which is equivalent to:
def when_imported(name):
def register(hook):
register_post_import_hook(hook, name)
return register
- imp.when_imported(name) -> decorator function
- for @when_imported(name) def hook(module): pass
Open issues
The when_imported decorator hasn't been written.
The code contains several XXX comments. They are mostly about error handling in edge cases.
Backwards Compatibility
The new features and API don't conflict with the old import system of Python and don't cause any backward compatibility issues for most software. However, systems like PEAK and Zope which implement their own lazy import magic need to follow some rules.
The post import hooks are carefully designed to cooperate with existing deferred and lazy import systems. The PEP author suggests that they replace their own on-load hooks with the new hook API. Alternative lazy or deferred imports will still work, but the implementations must call the imp.notify_module_loaded function.
Reference Implementation
A reference implementation is already written and is available in the py3k-importhook branch. [4] It still requires some cleanups, documentation updates and additional unit tests.
Acknowledgments
Nick Coghlan, for proof reading and the initial discussion
Phillip J. Eby, for his implementation in PEAK and help with my own implementation
Copyright
This document has been placed in the public domain.
References
| [1] | PEP: Lazy module imports and post import hook http://permalink.gmane.org/gmane.comp.python.devel/90949 |
| [2] | Interest in PEP for callbacks on module import http://permalink.gmane.org/gmane.comp.python.python-3000.devel/11126 |
| [3] | peak.utils.imports http://svn.eby-sarna.com/Importing/peak/util/imports.py?view=markup |
| [4] | py3k-importhook branch http://svn.python.org/view/python/branches/py3k-importhook/ |
pep-0370 Per user site-packages directory
| PEP: | 370 |
|---|---|
| Title: | Per user site-packages directory |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Christian Heimes <christian at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 11-Jan-2008 |
| Python-Version: | 2.6, 3.0 |
| Post-History: |
Contents
Abstract
This PEP proposes a new per user site-packages directory to allow users to install Python packages locally in their home directories.
Rationale
Current Python versions don't have a unified way to install packages into the home directory of a user (except for Mac Framework builds). Users are either forced to ask the system administrator to install or update a package for them or to use one of the many workarounds like Virtual Python [1], Working Env [2] or Virtual Env [3].
It's not the goal of the PEP to replace the tools or to implement isolated installations of Python. It only implements the most common use case of an additional site-packages directory for each user.
The feature can't be implemented using the environment variable PYTHONPATH. The env var just inserts a new directory at the beginning of sys.path, but it doesn't process the pth files in the directory. A full-blown site-packages path is required by several applications and by Python eggs.
Specification
site directory (site-packages)
A directory in sys.path. In contrast to ordinary directories, the pth files in a site directory are processed as well.
user site directory
A site directory inside the user's home directory. A user site directory is specific to a Python version. The path contains the version number (major and minor only).
- Unix (including Mac OS X)
- ~/.local/lib/python2.6/site-packages
- Windows
- %APPDATA%/Python/Python26/site-packages
user data directory
Usually the parent directory of the user site directory. It's meant for Python version specific data like config files, docs, images and translations.
- Unix (including Mac)
- ~/.local/lib/python2.6
- Windows
- %APPDATA%/Python/Python26
user base directory
It's located inside the user's home directory. The user site and user config directories are inside the base directory. On some systems the directory may be shared with third-party apps.
- Unix (including Mac)
- ~/.local
- Windows
- %APPDATA%/Python
user script directory
A directory for binaries and scripts. [10] It's shared across Python versions and is the destination directory for scripts.
- Unix (including Mac)
- ~/.local/bin
- Windows
- %APPDATA%/Python/Scripts
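The path rules above can be sketched as a small helper. This is illustrative only; the real logic later shipped in site.py (e.g. as site.getusersitepackages()).

```python
import os
import sys

def user_site_dir():
    # Per-user site-packages path as specified above (sketch only).
    major, minor = sys.version_info[:2]
    if os.name == "nt":
        # Windows: %APPDATA%/Python/PythonXY/site-packages
        base = os.environ.get("APPDATA", os.path.expanduser("~"))
        return os.path.join(base, "Python",
                            "Python%d%d" % (major, minor), "site-packages")
    # Unix (including Mac): ~/.local/lib/pythonX.Y/site-packages
    return os.path.expanduser(
        "~/.local/lib/python%d.%d/site-packages" % (major, minor))

print(user_site_dir())
```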
Windows Notes
On Windows the Application Data directory (aka APPDATA) was chosen because it is the designated place for application data. Microsoft recommends that software not write to USERPROFILE [5], and My Documents is not suited for application data either. [8] The code doesn't query the Win32 API; instead it uses the environment variable %APPDATA%.
The application data directory is part of the roaming profile. In networks with domain logins the application data may be copied from and to a central server. This can slow down log-in and log-off. Users can keep the data on the server by, e.g., setting PYTHONUSERBASE to the value "%HOMEDRIVE%%HOMEPATH%\Application Data". Users should consult their local administrator for more information. [13]
Unix Notes
On Unix ~/.local was chosen over ~/.python because the directory is already used by several other programs, in analogy to /usr/local. [7] [11]
Mac OS X Notes
On Mac OS X Python uses ~/.local directory as well. [12] Framework builds of Python include ~/Library/Python/2.6/site-packages as an additional search path.
Implementation
The site module gets a new method adduserpackage() which adds the appropriate directory to the search path. The directory is not added if it doesn't exist when Python is started. However, the locations of the user site directory and user base directory are stored in internal variables for distutils.
The user site directory is added before the system site directories but after Python's search paths and PYTHONPATH. This setup allows the user to install a different version of a package than the system administrator, but it prevents the user from accidentally overwriting a stdlib module. Stdlib modules can still be overwritten with PYTHONPATH.
For security reasons the user site directory is not added to sys.path when the effective user id or group id is not equal to the process uid/gid [9]. It's an additional barrier against code injection into suid apps. However, Python suid scripts must always use the -E and -s options, or users can sneak in their own code.
The user site directory can be suppressed with a new option -s or the environment variable PYTHONNOUSERSITE. The feature can be disabled globally by setting site.ENABLE_USER_SITE to the value False. It must be set by editing site.py. It can't be altered in sitecustomize.py or later.
The path to the user base directory can be overwritten with the environment variable PYTHONUSERBASE. The default location is used when PYTHONUSERBASE is not set or empty.
distutils.command.install (setup.py install) gets a new argument --user to install packages in the user site directory. The required directories are created on demand.
distutils.command.build_ext (setup.py build_ext) gets a new argument --user which adds the include/ and lib/ directories in the user base directory to the search paths for header files and libraries. It also adds the lib/ directory to the rpath.
The site module gets two arguments --user-base and --user-site to print the path to the user base or user site directory to the standard output. The feature is intended for scripting, e.g. ./configure --prefix $(python2.5 -m site --user-base)
distutils.sysconfig will get methods to access the private variables of site. (not yet implemented)
The Windows updater needs to be updated, too. It should create a menu item which opens the user site directory in a new Explorer window.
Reference Implementation
A reference implementation is available in the bug tracker. [4]
Copyright
This document has been placed in the public domain.
References
| [1] | Virtual Python http://peak.telecommunity.com/DevCenter/EasyInstall#creating-a-virtual-python |
| [2] | Working Env http://pypi.python.org/pypi/workingenv.py http://blog.ianbicking.org/workingenv-revisited.html |
| [3] | Virtual Env http://pypi.python.org/pypi/virtualenv |
| [4] | reference implementation http://bugs.python.org/issue1799 http://svn.python.org/view/sandbox/trunk/pep370 |
| [5] | MSDN: CSIDL http://msdn2.microsoft.com/en-us/library/bb762494.aspx |
| [6] | Initial suggestion for a per user site-packages directory http://permalink.gmane.org/gmane.comp.python.devel/90902 |
| [7] | Suggestion of ~/.local/ http://permalink.gmane.org/gmane.comp.python.devel/90925 |
| [8] | APPDATA discussion http://permalink.gmane.org/gmane.comp.python.devel/90932 |
| [9] | Security concerns and -s option http://permalink.gmane.org/gmane.comp.python.devel/91063 |
| [10] | Discussion about the bin directory http://permalink.gmane.org/gmane.comp.python.devel/91095 |
| [11] | freedesktop.org XGD basedir specs mentions ~/.local http://www.freedesktop.org/wiki/Specifications/basedir-spec |
| [12] | ~/.local for Mac and usercustomize file http://permalink.gmane.org/gmane.comp.python.devel/91167 |
| [13] | Roaming profile on Windows http://permalink.gmane.org/gmane.comp.python.devel/91187 |
pep-0371 Addition of the multiprocessing package to the standard library
| PEP: | 371 |
|---|---|
| Title: | Addition of the multiprocessing package to the standard library |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Jesse Noller <jnoller at gmail.com>, Richard Oudkerk <r.m.oudkerk at googlemail.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 06-May-2008 |
| Python-Version: | 2.6 / 3.0 |
| Post-History: |
Abstract
This PEP proposes the inclusion of the pyProcessing [1] package
into the Python standard library, renamed to "multiprocessing".
The processing package mimics the standard library threading
module functionality to provide a process-based approach to
threaded programming allowing end-users to dispatch multiple
tasks that effectively side-step the global interpreter lock.
The package also provides server and client functionality
(processing.Manager) to provide remote sharing and management of
objects and tasks so that applications may not only leverage
multiple cores on the local machine, but also distribute objects
and tasks across a cluster of networked machines.
While the distributed capabilities of the package are beneficial,
the primary focus of this PEP is the core threading-like API and
capabilities of the package.
Rationale
The current CPython interpreter implements the Global Interpreter
Lock (GIL) and barring work in Python 3000 or other versions
currently planned [2], the GIL will remain as-is within the
CPython interpreter for the foreseeable future. While the GIL
itself enables clean and easy to maintain C code for the
interpreter and extensions base, it is frequently an issue for
those Python programmers who are leveraging multi-core machines.
The GIL itself prevents more than a single thread from running
within the interpreter at any given point in time, effectively
removing Python's ability to take advantage of multi-processor
systems.
The pyprocessing package offers a method to side-step the GIL
allowing applications within CPython to take advantage of
multi-core architectures without asking users to completely change
their programming paradigm (i.e.: dropping threaded programming
for another "concurrent" approach - Twisted, Actors, etc).
The Processing package offers CPython a "known API" which mirrors
albeit in a PEP 8 compliant manner, that of the threading API,
with known semantics and easy scalability.
In the future, the package might not be as relevant should the
CPython interpreter enable "true" threading, however for some
applications, forking an OS process may sometimes be more
desirable than using lightweight threads, especially on those
platforms where process creation is fast and optimized.
For example, a simple threaded application:
from threading import Thread as worker

def afunc(number):
    print number * 3

t = worker(target=afunc, args=(4,))
t.start()
t.join()
The pyprocessing package mirrored the API so well, that with a
simple change of the import to:
from processing import process as worker
The code would now execute through the processing.process class.
Obviously, with the renaming of the API to PEP 8 compliance there
would be additional renaming which would need to occur within
user applications, however minor.
This type of compatibility means that, with a minor (in most cases)
change in code, users' applications will be able to leverage all
cores and processors on a given machine for parallel execution.
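With the package as it finally shipped (renamed to multiprocessing, with
PEP 8 names), the thread example above maps directly onto a Process. A
Queue carries the result back, since the child runs in its own address
space; this is a minimal Python 3 sketch, not code from the original
package.

```python
from multiprocessing import Process, Queue

def afunc(number, out):
    # runs in a separate OS process, outside the parent's GIL
    out.put(number * 3)

def run():
    q = Queue()
    p = Process(target=afunc, args=(4, q))
    p.start()
    result = q.get()  # read before join to avoid blocking on the pipe
    p.join()
    return result

if __name__ == "__main__":
    print(run())
```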
In many cases the pyprocessing package is even faster than the
normal threading approach for I/O bound programs. This, of course,
takes into account that the pyprocessing package is in optimized C
code, while the threading module is not.
The "Distributed" Problem
In the discussion on Python-Dev about the inclusion of this
package [3] there was confusion that conflated the intentions of
this PEP with an attempt to solve the "Distributed" problem -
frequently
comparing the functionality of this package with other solutions
like MPI-based communication [4], CORBA, or other distributed
object approaches [5].
The "distributed" problem is large and varied. Each programmer
working within this domain has either very strong opinions about
their favorite module/method or a highly customized problem for
which no existing solution works.
The acceptance of this package does not preclude or recommend that
programmers working on the "distributed" problem not examine other
solutions for their problem domain. The intent of including this
package is to provide entry-level capabilities for local
concurrency and the basic support to spread that concurrency
across a network of machines - although the two are not tightly
coupled, the pyprocessing package could in fact, be used in
conjunction with any of the other solutions including MPI/etc.
If necessary - it is possible to completely decouple the local
concurrency abilities of the package from the
network-capable/shared aspects of the package. Without serious
concerns or cause however, the author of this PEP does not
recommend that approach.
Performance Comparison
As we all know - there are "lies, damned lies, and benchmarks".
These speed comparisons, while aimed at showcasing the performance
of the pyprocessing package, are by no means comprehensive or
applicable to all possible use cases or environments. Especially
for those platforms with sluggish process forking timing.
All benchmarks were run using the following:
* 4 Core Intel Xeon CPU @ 3.00GHz
* 16 GB of RAM
* Python 2.5.2 compiled on Gentoo Linux (kernel 2.6.18.6)
* pyProcessing 0.52
All of the code for this can be downloaded from:
http://jessenoller.com/code/bench-src.tgz
The basic method of execution for these benchmarks is in the
run_benchmarks.py script, which is simply a wrapper to execute a
target function through a single threaded (linear), multi-threaded
(via threading), and multi-process (via pyprocessing) function for
a static number of iterations with increasing numbers of execution
loops and/or threads.
The run_benchmarks.py script executes each function 100 times,
picking the best run of that 100 iterations via the timeit module.
First, to identify the overhead of spawning the workers, we
execute a function which is simply a pass statement (empty):
cmd: python run_benchmarks.py empty_func.py
Importing empty_func
Starting tests ...
non_threaded (1 iters) 0.000001 seconds
threaded (1 threads) 0.000796 seconds
processes (1 procs) 0.000714 seconds
non_threaded (2 iters) 0.000002 seconds
threaded (2 threads) 0.001963 seconds
processes (2 procs) 0.001466 seconds
non_threaded (4 iters) 0.000002 seconds
threaded (4 threads) 0.003986 seconds
processes (4 procs) 0.002701 seconds
non_threaded (8 iters) 0.000003 seconds
threaded (8 threads) 0.007990 seconds
processes (8 procs) 0.005512 seconds
As you can see, spawning processes via the pyprocessing package is
faster than building and then executing the threaded version of
the code.
The second test calculates 50000 Fibonacci numbers inside of each
thread (isolated and shared nothing):
cmd: python run_benchmarks.py fibonacci.py
Importing fibonacci
Starting tests ...
non_threaded (1 iters) 0.195548 seconds
threaded (1 threads) 0.197909 seconds
processes (1 procs) 0.201175 seconds
non_threaded (2 iters) 0.397540 seconds
threaded (2 threads) 0.397637 seconds
processes (2 procs) 0.204265 seconds
non_threaded (4 iters) 0.795333 seconds
threaded (4 threads) 0.797262 seconds
processes (4 procs) 0.206990 seconds
non_threaded (8 iters) 1.591680 seconds
threaded (8 threads) 1.596824 seconds
processes (8 procs) 0.417899 seconds
The third test calculates the sum of all primes below 100000,
again sharing nothing.
cmd: run_benchmarks.py crunch_primes.py
Importing crunch_primes
Starting tests ...
non_threaded (1 iters) 0.495157 seconds
threaded (1 threads) 0.522320 seconds
processes (1 procs) 0.523757 seconds
non_threaded (2 iters) 1.052048 seconds
threaded (2 threads) 1.154726 seconds
processes (2 procs) 0.524603 seconds
non_threaded (4 iters) 2.104733 seconds
threaded (4 threads) 2.455215 seconds
processes (4 procs) 0.530688 seconds
non_threaded (8 iters) 4.217455 seconds
threaded (8 threads) 5.109192 seconds
processes (8 procs) 1.077939 seconds
The reason why tests two and three focused on pure numeric
crunching is to showcase how the current threading implementation
does hinder non-I/O applications. Obviously, these tests could be
improved to use a queue for coordination of results and chunks of
work but that is not required to show the performance of the
package and core processing.process module.
The next test is an I/O bound test. This is normally where we see
a steep improvement in the threading module approach versus a
single-threaded approach. In this case, each worker is opening a
descriptor to lorem.txt, randomly seeking within it and writing
lines to /dev/null:
cmd: python run_benchmarks.py file_io.py
Importing file_io
Starting tests ...
non_threaded (1 iters) 0.057750 seconds
threaded (1 threads) 0.089992 seconds
processes (1 procs) 0.090817 seconds
non_threaded (2 iters) 0.180256 seconds
threaded (2 threads) 0.329961 seconds
processes (2 procs) 0.096683 seconds
non_threaded (4 iters) 0.370841 seconds
threaded (4 threads) 1.103678 seconds
processes (4 procs) 0.101535 seconds
non_threaded (8 iters) 0.749571 seconds
threaded (8 threads) 2.437204 seconds
processes (8 procs) 0.203438 seconds
As you can see, pyprocessing is still faster on this I/O operation
than using multiple threads. And using multiple threads is slower
than the single threaded execution itself.
Finally, we will run a socket-based test to show network I/O
performance. This function grabs a URL from a server on the LAN:
a simple error page from Tomcat. It gets the page 100 times. The
network is otherwise quiet, over a 10G connection:
cmd: python run_benchmarks.py url_get.py
Importing url_get
Starting tests ...
non_threaded (1 iters) 0.124774 seconds
threaded (1 threads) 0.120478 seconds
processes (1 procs) 0.121404 seconds
non_threaded (2 iters) 0.239574 seconds
threaded (2 threads) 0.146138 seconds
processes (2 procs) 0.138366 seconds
non_threaded (4 iters) 0.479159 seconds
threaded (4 threads) 0.200985 seconds
processes (4 procs) 0.188847 seconds
non_threaded (8 iters) 0.960621 seconds
threaded (8 threads) 0.659298 seconds
processes (8 procs) 0.298625 seconds
We finally see threaded performance surpass that of
single-threaded execution, but the pyprocessing package is still
faster when increasing the number of workers. If you stay with
one or two threads/workers, then the timing between threads and
pyprocessing is fairly close.
One item of note, however, is that there is implicit overhead
within the pyprocessing package's Queue implementation due to
object serialization.
Alec Thomas provided a short example based on the
run_benchmarks.py script to demonstrate this overhead versus the
default Queue implementation:
cmd: run_bench_queue.py
non_threaded (1 iters) 0.010546 seconds
threaded (1 threads) 0.015164 seconds
processes (1 procs) 0.066167 seconds
non_threaded (2 iters) 0.020768 seconds
threaded (2 threads) 0.041635 seconds
processes (2 procs) 0.084270 seconds
non_threaded (4 iters) 0.041718 seconds
threaded (4 threads) 0.086394 seconds
processes (4 procs) 0.144176 seconds
non_threaded (8 iters) 0.083488 seconds
threaded (8 threads) 0.184254 seconds
processes (8 procs) 0.302999 seconds
Additional benchmarks can be found in the pyprocessing package's
source distribution's examples/ directory. The examples will be
included in the package's documentation.
Maintenance
Richard M. Oudkerk - the author of the pyprocessing package has
agreed to maintain the package within Python SVN. Jesse Noller
has volunteered to also help maintain/document and test the
package.
API Naming
While the package's API is designed to closely mimic that of the
threading and Queue modules as of Python 2.x, those modules are not
PEP 8 compliant. It has been decided that instead of adding the package
"as is" and therefore perpetuating the non-PEP 8 compliant naming, we
will rename all APIs, classes, etc to be fully PEP 8 compliant.
This change does affect the ease of drop-in replacement for those using
the threading module, but that is an acceptable side effect in the view
of the authors, especially given that the threading module's own API
will change.
Issue 3042 in the tracker proposes that for Python 2.6 there will be
two APIs for the threading module - the current one, and the PEP 8
compliant one. Warnings about the upcoming removal of the original
Java-style API will be issued when -3 is invoked.
In Python 3000, the threading API will become PEP 8 compliant, which
means that the multiprocessing module and the threading module will
again have matching APIs.
Timing/Schedule
Some concerns have been raised about the timing/lateness of this
PEP for the 2.6 and 3.0 releases this year, however it is felt by
both the authors and others that the functionality this package
offers surpasses the risk of inclusion.
However, taking into account the desire not to destabilize
Python-core, some refactoring of pyprocessing's code "into"
Python-core can be withheld until the next 2.x/3.x releases. This
means that the actual risk to Python-core is minimal, and largely
constrained to the actual package itself.
Open Issues
* Confirm no "default" remote connection capabilities, if needed
enable the remote security mechanisms by default for those
classes which offer remote capabilities.
* Some of the API (Queue methods qsize(), task_done() and join())
either need to be added, or the reason for their exclusion needs
to be identified and documented clearly.
Closed Issues
* The PyGILState bug patch submitted in issue 1683 by roudkerk
must be applied for the package unit tests to work.
* Existing documentation has to be moved to ReST formatting.
* Reliance on ctypes: The pyprocessing package's reliance on
ctypes prevents the package from functioning on platforms where
ctypes is not supported. This is not a restriction of this
package, but rather of ctypes.
* DONE: Rename top-level package from "pyprocessing" to
"multiprocessing".
* DONE: Also note that the default behavior of process spawning
does not make it compatible with use within IDLE as-is, this
will be examined as a bug-fix or "setExecutable" enhancement.
* DONE: Add in "multiprocessing.setExecutable()" method to override the
default behavior of the package to spawn processes using the
current executable name rather than the Python interpreter. Note
that Mark Hammond has suggested a factory-style interface for
this[7].
References
[1] PyProcessing home page
http://pyprocessing.berlios.de/
[2] See Adam Olsen's "safe threading" project
http://code.google.com/p/python-safethread/
[3] See: Addition of "pyprocessing" module to standard lib.
http://mail.python.org/pipermail/python-dev/2008-May/079417.html
[4] http://mpi4py.scipy.org/
[5] See "Cluster Computing"
http://wiki.python.org/moin/ParallelProcessing
[6] The original run_benchmark.py code was published in Python
Magazine in December 2007: "Python Threads and the Global
Interpreter Lock" by Jesse Noller. It has been modified for
this PEP.
[7] http://groups.google.com/group/python-dev2/msg/54cf06d15cbcbc34
[8] Addition Python-Dev discussion
http://mail.python.org/pipermail/python-dev/2008-June/080011.html
Copyright
This document has been placed in the public domain.
pep-0372 Adding an ordered dictionary to collections
| PEP: | 372 |
|---|---|
| Title: | Adding an ordered dictionary to collections |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Armin Ronacher <armin.ronacher at active-4.com> Raymond Hettinger <python at rcn.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 15-Jun-2008 |
| Python-Version: | 2.7, 3.1 |
| Post-History: |
Contents
Abstract
This PEP proposes an ordered dictionary as a new data structure for the collections module, called "OrderedDict" in this PEP. The proposed API incorporates the experiences gained from working with similar implementations that exist in various real-world applications and other programming languages.
Patch
A working Py3.1 patch including tests and documentation is at:
OrderedDict patch
The check-in was in revisions: 70101 and 70102
Rationale
In current Python versions, the widely used built-in dict type does not specify an order for the key/value pairs stored. This makes it hard to use dictionaries as data storage for some specific use cases.
Some dynamic programming languages like PHP and Ruby 1.9 guarantee a certain order on iteration. In those languages, and existing Python ordered-dict implementations, the ordering of items is defined by the time of insertion of the key. New keys are appended at the end, but keys that are overwritten are not moved to the end.
The following example shows the behavior for simple assignments:
>>> d = OrderedDict()
>>> d['parrot'] = 'dead'
>>> d['penguin'] = 'exploded'
>>> d.items()
[('parrot', 'dead'), ('penguin', 'exploded')]
That the ordering is preserved makes an OrderedDict useful for a couple of situations:
XML/HTML processing libraries currently drop the ordering of attributes, use a list instead of a dict which makes filtering cumbersome, or implement their own ordered dictionary. This affects ElementTree, html5lib, Genshi and many more libraries.
There are many ordered dict implementations in various libraries and applications, most of them subtly incompatible with each other. Furthermore, subclassing dict is a non-trivial task and many implementations don't override all the methods properly which can lead to unexpected results.
Additionally, many ordered dicts are implemented in an inefficient way, making many operations more complex than they have to be.
PEP 3115 allows metaclasses to change the mapping object used for the class body. An ordered dict could be used to create ordered member declarations similar to C structs. This could be useful, for example, for future ctypes releases as well as ORMs that define database tables as classes, like the one the Django framework ships. Django currently uses an ugly hack to restore the ordering of members in database models.
The RawConfigParser class accepts a dict_type argument that allows an application to set the type of dictionary used internally. The motivation for this addition was expressly to allow users to provide an ordered dictionary. [1]
Code ported from other programming languages such as PHP often depends on an ordered dict. Having an implementation of an ordering-preserving dictionary in the standard library could ease the transition and improve the compatibility of different libraries.
Ordered Dict API
The ordered dict API would be mostly compatible with dict and existing ordered dicts. Note: this PEP refers to the 2.7 and 3.0 dictionary API as described in the collections.Mapping abstract base class.
The constructor and update() both accept iterables of tuples as well as mappings like a dict does. Unlike a regular dictionary, the insertion order is preserved.
>>> d = OrderedDict([('a', 'b'), ('c', 'd')])
>>> d.update({'foo': 'bar'})
>>> d
collections.OrderedDict([('a', 'b'), ('c', 'd'), ('foo', 'bar')])
If ordered dicts are updated from regular dicts, the ordering of new keys is of course undefined.
All iteration methods as well as keys(), values() and items() return the values ordered by the time the key was first inserted:
>>> d['spam'] = 'eggs'
>>> d.keys()
['a', 'c', 'foo', 'spam']
>>> d.values()
['b', 'd', 'bar', 'eggs']
>>> d.items()
[('a', 'b'), ('c', 'd'), ('foo', 'bar'), ('spam', 'eggs')]
New methods not available on dict:
- OrderedDict.__reversed__()
- Supports reverse iteration by key.
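With the OrderedDict that eventually landed in collections, reverse iteration works as described:

```python
from collections import OrderedDict

d = OrderedDict([('a', 'b'), ('c', 'd'), ('foo', 'bar')])
# __reversed__ yields the keys in reverse insertion order
print(list(reversed(d)))  # ['foo', 'c', 'a']
```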
Questions and Answers
What happens if an existing key is reassigned?
The key is not moved but assigned a new value in place. This is consistent with existing implementations.
What happens if keys appear multiple times in the list passed to the constructor?
The same as for regular dicts -- the latter item overrides the former. This has the side-effect that the position of the first key is used because only the value is actually overwritten:
>>> OrderedDict([('a', 1), ('b', 2), ('a', 3)])
collections.OrderedDict([('a', 3), ('b', 2)])
This behavior is consistent with existing implementations in Python, the PHP array and the hashmap in Ruby 1.9.
Is the ordered dict a dict subclass? Why?
Yes. Like defaultdict, an ordered dictionary subclasses dict. Being a dict subclass makes some of the methods faster (like __getitem__ and __len__). More importantly, being a dict subclass lets ordered dictionaries be usable with tools like json that insist on having dict inputs by testing isinstance(d, dict).
Do any limitations arise from subclassing dict?
Yes. Since the API for dicts is different in Py2.x and Py3.x, the OrderedDict API must also be different. So, the Py2.7 version will need to override iterkeys, itervalues, and iteritems.
Does OrderedDict.popitem() return a particular key/value pair?
Yes. It pops off the most recently inserted new key and its corresponding value. This corresponds to the usual LIFO behavior exhibited by traditional push/pop pairs. It is semantically equivalent to k = list(od)[-1]; v = od[k]; del od[k]; return (k, v). The actual implementation is more efficient and pops directly from a sorted list of keys.
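The LIFO semantics can be checked against the shipped collections.OrderedDict:

```python
from collections import OrderedDict

od = OrderedDict([('a', 1), ('b', 2), ('c', 3)])
assert od.popitem() == ('c', 3)  # most recently inserted pair
od['a'] = 99                     # reassignment does not move 'a'
assert od.popitem() == ('b', 2)  # 'b' is still the last key
```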
Does OrderedDict support indexing, slicing, and whatnot?
As a matter of fact, OrderedDict does not implement the Sequence interface. Rather, it is a MutableMapping that remembers the order of key insertion. The only sequence-like addition is support for reversed.
A further advantage of not allowing indexing is that it leaves open the possibility of a fast C implementation using linked lists.
Does OrderedDict support alternate sort orders such as alphabetical?
No. Those wanting different sort orders really need to be using another technique. The OrderedDict is all about recording insertion order. If any other order is of interest, then another structure (like an in-memory dbm) is likely a better fit.
How well does OrderedDict work with the json module, PyYAML, and ConfigParser?
For json, the good news is that json's encoder respects OrderedDict's iteration order:
>>> items = [('one', 1), ('two', 2), ('three', 3), ('four', 4), ('five', 5)]
>>> json.dumps(OrderedDict(items))
'{"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}'
In Py2.6, the object_hook for json decoders passes in an already built dictionary so order is lost before the object hook sees it. This problem is being fixed for Python 2.7/3.1 by adding a new hook that preserves order (see http://bugs.python.org/issue5381 ). With the new hook, order can be preserved:
>>> jtext = '{"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}'
>>> json.loads(jtext, object_pairs_hook=OrderedDict)
OrderedDict({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5})
For PyYAML, a full round-trip is problem free:
>>> ytext = yaml.dump(OrderedDict(items))
>>> print ytext
!!python/object/apply:collections.OrderedDict
- - [one, 1]
  - [two, 2]
  - [three, 3]
  - [four, 4]
  - [five, 5]
>>> yaml.load(ytext)
OrderedDict({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5})
For the ConfigParser module, round-tripping is also problem free. Custom dicts were added in Py2.6 specifically to support ordered dictionaries:
>>> config = ConfigParser(dict_type=OrderedDict)
>>> config.read('myconfig.ini')
>>> config.remove_option('Log', 'error')
>>> config.write(open('myconfig.ini', 'w'))
How does OrderedDict handle equality testing?
Comparing two ordered dictionaries implies that the test will be order-sensitive, so that list(od1.items()) == list(od2.items()).
When ordered dicts are compared with other Mappings, their order-insensitive comparison is used. This allows ordered dictionaries to be substituted anywhere regular dictionaries are used.
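These rules are easy to verify with the shipped collections.OrderedDict:

```python
from collections import OrderedDict

od1 = OrderedDict([('a', 1), ('b', 2)])
od2 = OrderedDict([('b', 2), ('a', 1)])

assert od1 != od2                # OrderedDict vs OrderedDict: order-sensitive
assert od1 == {'a': 1, 'b': 2}   # vs a plain dict: order-insensitive
assert od2 == {'a': 1, 'b': 2}
```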
How will the __repr__ format maintain order during a repr/eval round-trip?
OrderedDict([('a', 1), ('b', 2)])
What are the trade-offs of the possible underlying data structures?
- Keeping a sorted list of keys is fast for all operations except __delitem__() which becomes an O(n) exercise. This data structure leads to very simple code and little wasted space.
- Keeping a separate dictionary to record insertion sequence numbers makes the code a little more complex. All of the basic operations are O(1), but the constant factor is increased for __setitem__() and __delitem__(), so every use case pays this overhead (since all buildup goes through __setitem__()). Also, the first traversal incurs a one-time O(n log n) sorting cost. The storage costs are double those of the sorted-list-of-keys approach.
- A version written in C could use a linked list. The code would be more complex than the other two approaches but it would conserve space and would keep the same big-oh performance as regular dictionaries. It is the fastest and most space efficient.
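For illustration only (SeqOrderedDict is a hypothetical toy class, not the proposed implementation), the second approach above can be sketched by pairing a regular dict with per-key insertion sequence numbers:

```python
import itertools

class SeqOrderedDict(dict):
    """Toy ordered dict: a regular dict plus insertion sequence numbers.

    Illustrates the trade-off described above: O(1) basic operations
    with a higher constant factor, plus an O(n log n) sort on traversal.
    """
    def __init__(self):
        dict.__init__(self)
        self._seq = {}                     # key -> insertion counter
        self._counter = itertools.count()

    def __setitem__(self, key, value):
        if key not in self:                # existing keys keep their slot
            self._seq[key] = next(self._counter)
        dict.__setitem__(self, key, value)

    def __delitem__(self, key):
        dict.__delitem__(self, key)
        del self._seq[key]

    def __iter__(self):
        # Traversal pays the O(n log n) sorting cost.
        return iter(sorted(dict.keys(self), key=self._seq.__getitem__))

d = SeqOrderedDict()
for k, v in [('one', 1), ('two', 2), ('three', 3)]:
    d[k] = v
print(list(d))   # ['one', 'two', 'three']
```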
Reference Implementation
An implementation with tests and documentation is at:
OrderedDict patch
The proposed version has several merits:
- Strict compliance with the MutableMapping API and no new methods so that the learning curve is near zero. It is simply a dictionary that remembers insertion order.
- Generally good performance. The big-oh times are the same as regular dictionaries except that key deletion is O(n).
Other implementations of ordered dicts in various Python projects or standalone libraries, that inspired the API proposed here, are:
- odict in Python [2]
- odict in Babel [3]
- OrderedDict in Django [4]
- The odict module [5]
- ordereddict [6] (a C implementation of the odict module)
- StableDict [7]
- Armin Rigo's OrderedDict [8]
Future Directions
With the availability of an ordered dict in the standard library, other libraries may take advantage of that. For example, ElementTree could return odicts in the future that retain the attribute ordering of the source file.
References
| [1] | http://bugs.python.org/issue1371075 |
| [2] | http://dev.pocoo.org/hg/sandbox/raw-file/tip/odict.py |
| [3] | http://babel.edgewall.org/browser/trunk/babel/util.py?rev=374#L178 |
| [4] | http://code.djangoproject.com/browser/django/trunk/django/utils/datastructures.py?rev=7140#L53 |
| [5] | http://www.voidspace.org.uk/python/odict.html |
| [6] | http://www.xs4all.nl/~anthon/Python/ordereddict/ |
| [7] | http://pypi.python.org/pypi/StableDict/0.2 |
| [8] | http://codespeak.net/svn/user/arigo/hack/pyfuse/OrderedDict.py |
Copyright
This document has been placed in the public domain.
pep-0373 Python 2.7 Release Schedule
| PEP: | 373 |
|---|---|
| Title: | Python 2.7 Release Schedule |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Benjamin Peterson <benjamin at python.org> |
| Status: | Active |
| Type: | Informational |
| Content-Type: | text/x-rst |
| Created: | 3-Nov-2008 |
| Python-Version: | 2.7 |
Contents
Abstract
This document describes the development and release schedule for Python 2.7. The schedule primarily concerns itself with PEP-sized items. Small features may be added up to and including the first beta release. Bugs may be fixed until the final release.
Update
The End Of Life date (EOL, sunset date) for Python 2.7 has been moved five years into the future, to 2020. This decision was made to clarify the status of Python 2.7 and relieve worries for those users who cannot yet migrate to Python 3. See also PEP 466.
This declaration does not guarantee that bugfix releases will be made on a regular basis, but it should enable volunteers who want to contribute bugfixes for Python 2.7 and it should satisfy vendors who still have to support Python 2 for years to come.
There will be no Python 2.8 (see PEP 404).
Release Manager and Crew
| Position | Name |
|---|---|
| 2.7 Release Manager | Benjamin Peterson |
| Windows installers | Steve Dower |
| Mac installers | Ned Deily |
Maintenance releases
Being the last of the 2.x series, 2.7 will have an extended period of maintenance. The current plan is to support it for at least 10 years from the initial 2.7 release. This means there will be bugfix releases until 2020.
Planned future release dates:
- 2.7.10rc1 2015-05-09
- 2.7.10 2015-05-23
- 2.7.11 December, 2015
- beyond this date, releases as needed
Dates of previous maintenance releases:
- 2.7.1 2010-11-27
- 2.7.2 2011-07-21
- 2.7.3rc1 2012-02-23
- 2.7.3rc2 2012-03-15
- 2.7.3 2012-04-09
- 2.7.4rc1 2013-03-23
- 2.7.4 2013-04-06
- 2.7.5 2013-05-12
- 2.7.6rc1 2013-10-26
- 2.7.6 2013-11-10
- 2.7.7rc1 2014-05-17
- 2.7.7 2014-05-31
- 2.7.8 2014-06-30
- 2.7.9rc1 2014-11-26
- 2.7.9 2014-12-10
2.7.0 Release Schedule
The release schedule for 2.7.0 was:
- 2.7 alpha 1 2009-12-05
- 2.7 alpha 2 2010-01-09
- 2.7 alpha 3 2010-02-06
- 2.7 alpha 4 2010-03-06
- 2.7 beta 1 2010-04-03
- 2.7 beta 2 2010-05-08
- 2.7 rc1 2010-06-05
- 2.7 rc2 2010-06-19
- 2.7 final 2010-07-03
Possible features for 2.7
Nothing here. [Note that a moratorium on core language changes is in effect.]
References
None yet!
Copyright
This document has been placed in the public domain.
pep-0374 Choosing a distributed VCS for the Python project
| PEP: | 374 |
|---|---|
| Title: | Choosing a distributed VCS for the Python project |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Brett Cannon <brett at python.org>, Stephen J. Turnbull <stephen at xemacs.org>, Alexandre Vassalotti <alexandre at peadrop.com>, Barry Warsaw <barry at python.org>, Dirkjan Ochtman <dirkjan at ochtman.nl> |
| Status: | Final |
| Type: | Process |
| Content-Type: | text/x-rst |
| Created: | 07-Nov-2008 |
| Post-History: | 07-Nov-2008, 22-Jan-2009 |
Contents
Rationale
Python has been using a centralized version control system (VCS; first CVS, now Subversion) for years to great effect. Having a master copy of the official version of Python provides people with a single place to always get the official Python source code. It has also allowed for the storage of the history of the language, mostly for help with development, but also for posterity. And of course the V in VCS is very helpful when developing.
But a centralized version control system has its drawbacks. First and foremost, in order to have the benefits of version control with Python in a seamless fashion, one must be a "core developer" (i.e. someone with commit privileges on the master copy of Python). People who are not core developers but who wish to work with Python's revision tree, e.g. anyone writing a patch for Python or creating a custom version, do not have direct tool support for revisions. This can be quite a limitation, since these non-core developers cannot easily do basic tasks such as reverting changes to a previously saved state, creating branches, publishing one's changes with full revision history, etc. For non-core developers, the last safe tree state is one the Python developers happen to set, and this prevents safe development. This second-class citizenship is a hindrance to people who wish to contribute to Python with a patch of any complexity and want a way to incrementally save their progress to make their development lives easier.
There is also the issue of having to be online to be able to commit one's work. Because centralized VCSs keep a central copy that stores all revisions, one must have Internet access in order for their revisions to be stored; no Net, no commit. This can be annoying if you happen to be traveling and lack any Internet. There is also the situation of someone wishing to contribute to Python but having a bad Internet connection where committing is time-consuming and expensive and it might work out better to do it in a single step.
Another drawback to a centralized VCS is that a common use case is for a developer to revise patches in response to review comments. This is more difficult with a centralized model because there's no place to contain intermediate work. It's either all checked in or none of it is checked in. In a centralized VCS, it's also very difficult to track changes to the trunk as they are committed while you're working on your feature or bug fix branch. This increases the risk that such branches will grow stale or outdated, or that merging them into the trunk will generate too many conflicts to be easily resolved.
Lastly, there is the issue of maintenance of Python. At any one time there is at least one major version of Python under development (at the time of this writing there are two). For each major version of Python under development there is at least the maintenance version of the last minor version and the in-development minor version (e.g. with 2.6 just released, that means that both 2.6 and 2.7 are being worked on). Once a release is done, a branch is created between the code bases where changes in one version do not (but could) belong in the other version. As of right now there is no natural support for this branch in time in central VCSs; you must use tools that simulate the branching. Tracking merges is similarly painful for developers, as revisions often need to be merged between four active branches (e.g. 2.6 maintenance, 3.0 maintenance, 2.7 development, 3.1 development). In this case, VCSs such as Subversion only handle this through arcane third party tools.
Distributed VCSs (DVCSs) solve all of these problems. While one can keep a master copy of a revision tree, anyone is free to copy that tree for their own use. This gives everyone the power to commit changes to their copy, online or offline. It also more naturally ties into the idea of branching in the history of a revision tree for maintenance and the development of new features bound for Python. DVCSs also provide a great many additional features that centralized VCSs don't or can't provide.
This PEP explores the possibility of changing Python's use of Subversion to any of the currently popular DVCSs, in order to gain the benefits outlined above. This PEP does not guarantee that a switch to a DVCS will occur at the conclusion of this PEP. It is quite possible that no clear winner will be found and that svn will continue to be used. If this happens, this PEP will be revisited and revised in the future as the state of DVCSs evolves.
Terminology
Agreeing on a common terminology is surprisingly difficult, primarily because each VCS uses these terms when describing subtly different tasks, objects, and concepts. Where possible, we try to provide a generic definition of the concepts, but you should consult the individual system's glossaries for details. Here are some basic references for terminology, from some of the standard web-based references on each VCS. You can also refer to glossaries for each DVCS:
- Subversion : http://svnbook.red-bean.com/en/1.5/svn.basic.html
- Bazaar : http://bazaar-vcs.org/BzrGlossary
- Mercurial : http://www.selenic.com/mercurial/wiki/index.cgi/UnderstandingMercurial
- git : http://book.git-scm.com/1_the_git_object_model.html
- branch
- A line of development; a collection of revisions, ordered by time.
- checkout/working copy/working tree
- A tree of code the developer can edit, linked to a branch.
- index
- A "staging area" where a revision is built (unique to git).
- repository
- A collection of revisions, organized into branches.
- clone
- A complete copy of a branch or repository.
- commit
- To record a revision in a repository.
- merge
- Applying all the changes and history from one branch/repository to another.
- pull
- To update a checkout/clone from the original branch/repository, which can be remote or local.
- push/publish
- To copy a revision, and all revisions it depends on, from one repository to another.
- cherry-pick
- To merge one or more specific revisions from one branch to another, possibly in a different repository, possibly without its dependent revisions.
- rebase
- To "detach" a branch, and move it to a new branch point; move commits to the beginning of a branch instead of where they happened in time.
Typical Workflow
At the moment, the typical workflow for a Python core developer is:
- Edit code in a checkout until it is stable enough to commit/push.
- Commit to the master repository.
It is a rather simple workflow, but it has drawbacks. For one, because any work that involves the repository takes time over the network, commits/pushes tend not to be as atomic as possible. There is also no especially cheap way to create new checkouts beyond a recursive copy of the checkout directory.
A DVCS would lead to a workflow more like this:
- Branch off of a local clone of the master repository.
- Edit code, committing in atomic pieces.
- Merge the branch into the mainline, and
- Push all commits to the master repository.
While there are more possible steps, the workflow is much more independent of the master repository than is currently possible. By being able to commit locally at the speed of your disk, a core developer is able to do atomic commits much more frequently, minimizing having commits that do multiple things to the code. Also by using a branch, the changes are isolated (if desired) from other changes being made by other developers. Because branches are cheap, it is easy to create and maintain many smaller branches that address one specific issue, e.g. one bug or one new feature. More sophisticated features of DVCSs allow the developer to more easily track long running development branches as the official mainline progresses.
Contenders
| Name | Short Name | Version | 2.x Trunk Mirror | 3.x Trunk Mirror |
|---|---|---|---|---|
| Bazaar [1] | bzr | 1.12 | http://code.python.org/python/trunk | http://code.python.org/python/3.0 |
| Mercurial [2] | hg | 1.2.0 | http://code.python.org/hg/trunk/ | http://code.python.org/hg/branches/py3k/ |
| git [3] | N/A | 1.6.1 | git://code.python.org/python/trunk | git://code.python.org/python/branches/py3k |
This PEP does not consider darcs, arch, or monotone. The main problem with these DVCSs is that they are simply not popular enough to bother supporting when they do not provide some very compelling features that the other DVCSs provide. Arch and darcs also have significant performance problems which seem unlikely to be addressed in the near future.
Interoperability
For those who have already decided which DVCSs they want to use, and are willing to maintain local mirrors themselves, all three DVCSs support interchange via the git "fast-import" changeset format. git does so natively, of course, and native support for Bazaar is under active development, and getting good early reviews as of mid-February 2009. Mercurial has idiosyncratic support for importing via its hg convert command, and third-party fast-import support [4] is available for exporting. Also, the Tailor [5] tool supports automatic maintenance of mirrors based on an official repository in any of the candidate formats with a local mirror in any format.
Usage Scenarios
Probably the best way to help decide on whether/which DVCS should replace Subversion is to see what it takes to perform some real-world usage scenarios that developers (core and non-core) have to work with. Each usage scenario outlines what it is, a bullet list of what the basic steps are (which can vary slightly per VCS), and how to perform the usage scenario in the various VCSs (including Subversion).
Each VCS had a single author in charge of writing implementations for each scenario (unless otherwise noted).
| Name | VCS |
|---|---|
| Brett | svn |
| Barry | bzr |
| Alexandre | hg |
| Stephen | git |
Initial Setup
Some DVCSs have some perks if you do some initial setup upfront. This section covers what can be done before any of the usage scenarios are run in order to take better advantage of the tools.
All of the DVCSs support configuring your project identification. Unlike the centralized systems, they use your email address to identify your commits. (Access control is generally done by mechanisms external to the DVCS, such as ssh or console login). This identity may be associated with a full name.
All of the DVCSs will query the system to get some approximation to this information, but that may not be what you want. They also support setting this information on a per-user basis, and on a per- project basis. Convenience commands to set these attributes vary, but all allow direct editing of configuration files.
Some VCSs support end-of-line (EOL) conversions on checkout/checkin.
svn
None required, but it is recommended you follow the guidelines in the dev FAQ.
bzr
No setup is required, but for much quicker and space-efficient local branching, you should create a shared repository to hold all your Python branches. A shared repository is really just a parent directory containing a .bzr directory. When bzr commits a revision, it searches from the local directory on up the file system for a .bzr directory to hold the revision. By sharing revisions across multiple branches, you cut down on the amount of disk space used. Do this:
    cd ~/projects
    bzr init-repo python
    cd python
Now, all your Python branches should be created inside of ~/projects/python.
There are also some settings you can put in your ~/.bzr/bazaar.conf and ~/.bzr/locations.conf file to set up defaults for interacting with Python code. None of them are required, although some are recommended. E.g. I would suggest gpg signing all commits, but that might be too high a barrier for developers. Also, you can set up default push locations depending on where you want to push branches by default. If you have write access to the master branches, that push location could be code.python.org. Otherwise, it might be a free Bazaar code hosting service such as Launchpad. If Bazaar is chosen, we should decide what the policies and recommendations are.
At a minimum, I would set up your email address:
bzr whoami "Firstname Lastname <email.address@example.com>"
As with hg and git below, there are ways to set your email address (or really, just about any parameter) on a per-repository basis. You do this with settings in your $HOME/.bazaar/locations.conf file, which has an ini-style format as does the other DVCSs. See the Bazaar documentation for details, which mostly aren't relevant for this discussion.
hg
Minimally, you should set your user name. To do so, create the file .hgrc in your home directory and add the following:
    [ui]
    username = Firstname Lastname <email.address@example.com>
If you are using Windows and your tools do not support Unix-style newlines, you can enable automatic newline translation by adding to your configuration:
    [extensions]
    win32text =
These options can also be set locally to a given repository by customizing <repo>/.hg/hgrc, instead of ~/.hgrc.
git
None needed. However, git supports a number of features that can smooth your work, with a little preparation. git supports setting defaults at the workspace, user, and system levels. The system level is out of scope of this PEP. The user configuration file is $HOME/.gitconfig on Unix-like systems, and the workspace configuration file is $REPOSITORY/.git/config.
You can use the git-config tool to set preferences for user.name and user.email either globally (for your system login account) or locally (to a given git working copy), or you can edit the configuration files (which have the same format as shown in the Mercurial section above):

    # my full name doesn't change
    # note "--global" flag means per user
    # (system-wide configuration is set with "--system")
    git config --global user.name 'Firstname Lastname'
    # but use my Pythonic email address
    cd /path/to/python/repository
    git config user.email email.address@python.example.com
If you are using Windows, you probably want to set the core.autocrlf and core.safecrlf preferences to true using git-config:

    # check out files with CRLF line endings rather than Unix-style LF only
    git config --global core.autocrlf true
    # scream if a transformation would be ambiguous
    # (eg, a working file contains both naked LF and CRLF)
    # and check them back in with the reverse transformation
    git config --global core.safecrlf true
Although the repository will usually contain a .gitignore file specifying file names that rarely if ever should be registered in the VCS, you may have personal conventions (e.g., always editing log messages in a temporary file named ".msg") that you may wish to specify:

    # tell git where my personal ignores are
    git config --global core.excludesfile ~/.gitignore
    # I use .msg for my long commit logs, and Emacs makes backups in
    # files ending with ~
    # these are globs, not regular expressions
    echo '*~' >> ~/.gitignore
    echo '.msg' >> ~/.gitignore
If you use multiple branches, as with the other VCSes, you can save a lot of space by putting all objects in a common object store. This also can save download time, if the origins of the branches were in different repositories, because objects are shared across branches in your repository even if they were not present in the upstream repositories. git is very space- and time-efficient and applies a number of optimizations automatically, so this configuration is optional. (Examples are omitted.)
One-Off Checkout
As a non-core developer, I want to create and publish a one-off patch that fixes a bug, so that a core developer can review it for inclusion in the mainline.
- Checkout/branch/clone trunk.
- Edit some code.
- Generate a patch (based on what is best supported by the VCS, e.g. branch history).
- Receive reviewer comments and address the issues.
- Generate a second patch for the core developer to commit.
svn
    svn checkout http://svn.python.org/projects/python/trunk
    cd trunk
    # Edit some code.
    echo "The cake is a lie!" > README
    # Since svn lacks support for local commits, we fake it with patches.
    svn diff >> commit-1.diff
    svn diff >> patch-1.diff
    # Upload the patch-1 to bugs.python.org.
    # Receive reviewer comments.
    # Edit some code.
    echo "The cake is real!" > README
    # Since svn lacks support for local commits, we fake it with patches.
    svn diff >> commit-2.diff
    svn diff >> patch-2.diff
    # Upload patch-2 to bugs.python.org
bzr
    bzr branch http://code.python.org/python/trunk
    cd trunk
    # Edit some code.
    bzr commit -m 'Stuff I did'
    bzr send -o bundle
    # Upload bundle to bugs.python.org
    # Receive reviewer comments
    # Edit some code
    bzr commit -m 'Respond to reviewer comments'
    bzr send -o bundle
    # Upload updated bundle to bugs.python.org
The bundle file is like a super-patch. It can be read by patch(1), but it contains additional metadata so that it can be fed to bzr merge to produce a fully usable branch complete with history. See the Patch Review section below.
hg
    hg clone http://code.python.org/hg/trunk
    cd trunk
    # Edit some code.
    hg commit -m "Stuff I did"
    hg outgoing -p > fixes.patch
    # Upload patch to bugs.python.org
    # Receive reviewer comments
    # Edit some code
    hg commit -m "Address reviewer comments."
    hg outgoing -p > additional-fixes.patch
    # Upload patch to bugs.python.org
While hg outgoing does not have a flag for it, most Mercurial commands support git's extended patch format through a --git option. This can be set in one's .hgrc file so that all commands that generate a patch use the extended format.
git
The patches could be created with git diff master > stuff-i-did.patch, too, but git format-patch | git am knows some tricks (empty files, renames, etc) that ordinary patch can't handle. git grabs "Stuff I did" out of the commit message to create the file name 0001-Stuff-I-did.patch. See Patch Review below for a description of the git-format-patch format.
    # Get the mainline code.
    git clone git://code.python.org/python/trunk
    cd trunk
    # Edit some code.
    git commit -a -m 'Stuff I did.'
    # Create patch for my changes (i.e., relative to master).
    git format-patch master
    git tag stuff-v1
    # Upload 0001-Stuff-I-did.patch to bugs.python.org.
    # Time passes ... receive reviewer comments.
    # Edit more code.
    git commit -a -m 'Address reviewer comments.'
    # Make an add-on patch to apply on top of the original.
    git format-patch stuff-v1
    # Upload 0001-Address-reviewer-comments.patch to bugs.python.org.
Backing Out Changes
As a core developer, I want to undo a change that was not ready for inclusion in the mainline.
- Back out the unwanted change.
- Push patch to server.
svn
    # Assume the change to revert is in revision 40
    svn merge -c -40 .
    # Resolve conflicts, if any.
    svn commit -m "Reverted revision 40"
bzr
    # Assume the change to revert is in revision 40
    bzr merge -r 40..39
    # Resolve conflicts, if any.
    bzr commit -m "Reverted revision 40"
Note that if the change you want to revert is the last one that was made, you can just use bzr uncommit.
hg
    # Assume the change to revert is in revision 9150dd9c6d30
    hg backout --merge -r 9150dd9c6d30
    # Resolve conflicts, if any.
    hg commit -m "Reverted changeset 9150dd9c6d30"
    hg push
Note, you can use "hg rollback" and "hg strip" to revert changes you committed in your local repository, but did not yet push to other repositories.
git
    # Assume the change to revert is the grandfather of a revision tagged "newhotness".
    git revert newhotness~2
    # Resolve conflicts if any. If there are no conflicts, the commit
    # will be done automatically by "git revert", which prompts for a log.
    git commit -m "Reverted changeset 9150dd9c6d30."
    git push
Patch Review
As a core developer, I want to review patches submitted by other people, so that I can make sure that only approved changes are added to Python.
Core developers have to review patches as submitted by other people. This requires applying the patch, testing it, and then tossing away the changes. The assumption can be made that a core developer already has a checkout/branch/clone of the trunk.
- Branch off of trunk.
- Apply patch w/o any comments as generated by the patch submitter.
- Push patch to server.
- Delete now-useless branch.
svn
Subversion does not fit this development style very well, as there is no such thing as a "branch" as defined in this PEP. Instead, a developer either needs to create another checkout for testing a patch or create a branch on the server. Up to this point, core developers have not taken the "branch on the server" approach to dealing with individual patches. For this scenario, the assumption is that the developer creates a local checkout of the trunk to work with:

    cp -r trunk issue0000
    cd issue0000
    patch -p0 < __patch__
    # Review patch.
    svn commit -m "Some patch."
    cd ..
    rm -r issue0000
Another option is to only have a single checkout running at any one time and use svn diff along with svn revert -R to store away independent changes you may have made.
bzr
    bzr branch trunk issueNNNN
    # Download `patch` bundle from Roundup
    bzr merge patch
    # Review patch
    bzr commit -m'Patch NNNN by So N. So' --fixes python:NNNN
    bzr push bzr+ssh://me@code.python.org/trunk
    rm -rf ../issueNNNN
Alternatively, since you're probably going to commit these changes to the trunk, you could just do a checkout. That would give you a local working tree while the branch (i.e. all revisions) would continue to live on the server. This is similar to the svn model and might allow you to more quickly review the patch. There's no need for the push in this case:

    bzr checkout trunk issueNNNN
    # Download `patch` bundle from Roundup
    bzr merge patch
    # Review patch
    bzr commit -m'Patch NNNN by So N. So' --fixes python:NNNN
    rm -rf ../issueNNNN
hg
    hg clone trunk issue0000
    cd issue0000
    # If the patch was generated using hg export, the user name of the
    # submitter is automatically recorded. Otherwise,
    # use hg import --no-commit submitted.diff and commit with
    # hg commit -u "Firstname Lastname <email.address@example.com>"
    hg import submitted.diff
    # Review patch.
    hg push ssh://alexandre@code.python.org/hg/trunk/
git
We assume a patch created by git-format-patch. This is a Unix mbox file containing one or more patches, each formatted as an RFC 2822 message. git-am interprets each message as a commit as follows. The author of the patch is taken from the From: header, the date from the Date header. The commit log is created by concatenating the content of the subject line, a blank line, and the message body up to the start of the patch:

    cd trunk
    # Create a branch in case we don't like the patch.
    # This checkout takes zero time, since the workspace is left in
    # the same state as the master branch.
    git checkout -b patch-review
    # Download patch from bugs.python.org to submitted.patch.
    git am < submitted.patch
    # Review and approve patch.
    # Merge into master and push.
    git checkout master
    git merge patch-review
    git push
Backport
As a core developer, I want to apply a patch to 2.6, 2.7, 3.0, and 3.1 so that I can fix a problem in all four versions.
Thanks to always having the cutting-edge and the latest release version under development, Python currently has four branches being worked on simultaneously. That makes it important for a change to propagate easily through various branches.
svn
Because of Python's use of svnmerge, changes start with the trunk (2.7) and then get merged to the release version of 2.6. To get the change into the 3.x series, the change is merged into 3.1, fixed up, and then merged into 3.0 (2.7 -> 2.6; 2.7 -> 3.1 -> 3.0).
This is in contrast to a port-forward strategy where the patch would have been added to 2.6 and then pulled forward into newer versions (2.6 -> 2.7 -> 3.0 -> 3.1).
    # Assume patch applied to 2.7 in revision 0000.
    cd release26-maint
    svnmerge merge -r 0000
    # Resolve merge conflicts and make sure patch works.
    svn commit -F svnmerge-commit-message.txt  # revision 0001.
    cd ../py3k
    svnmerge merge -r 0000
    # Same as for 2.6, except Misc/NEWS changes are reverted.
    svn revert Misc/NEWS
    svn commit -F svnmerge-commit-message.txt  # revision 0002.
    cd ../release30-maint
    svnmerge merge -r 0002
    svn commit -F svnmerge-commit-message.txt  # revision 0003.
bzr
Bazaar is pretty straightforward here, since it supports cherry picking revisions manually. In the example below, we could have given a revision id instead of a revision number, but that's usually not necessary. Martin Pool suggests "We'd generally recommend doing the fix first in the oldest supported branch, and then merging it forward to the later releases.":
    # Assume patch applied to 2.7 in revision 0000
    cd release26-maint
    bzr merge ../trunk -c 0000
    # Resolve conflicts and make sure patch works
    bzr commit -m 'Back port patch NNNN'
    bzr push bzr+ssh://me@code.python.org/trunk
    cd ../py3k
    bzr merge ../trunk -r 0000
    # Same as for 2.6 except Misc/NEWS changes are reverted
    bzr revert Misc/NEWS
    bzr commit -m 'Forward port patch NNNN'
    bzr push bzr+ssh://me@code.python.org/py3k
hg
Mercurial, like other DVCSs, does not provide good support for the current workflow used by Python core developers to backport patches. Right now, bug fixes are first applied to the development mainline (i.e., trunk), then back-ported to the maintenance branches and forward-ported, as necessary, to the py3k branch. This workflow requires the ability to cherry-pick individual changes. Mercurial's transplant extension provides this ability. Here is an example of the scenario using this workflow:
    cd release26-maint
    # Assume patch applied to 2.7 in revision 0000
    hg transplant -s ../trunk 0000
    # Resolve conflicts, if any.
    cd ../py3k
    hg pull ../trunk
    hg merge
    hg revert Misc/NEWS
    hg commit -m "Merged trunk"
    hg push
In the above example, transplant acts much like the current svnmerge command. When transplant is invoked without the revision, the command launches an interactive loop useful for transplanting multiple changes. Another useful feature is the --filter option which can be used to modify changesets programmatically (e.g., it could be used for removing changes to Misc/NEWS automatically).
Alternatively to the traditional workflow, we could avoid transplanting changesets by committing bug fixes to the oldest supported release, then merge these fixes upward to the more recent branches.
    cd release25-maint
    hg import fix_some_bug.diff
    # Review patch and run test suite. Revert if failure.
    hg push
    cd ../release26-maint
    hg pull ../release25-maint
    hg merge
    # Resolve conflicts, if any. Then, review patch and run test suite.
    hg commit -m "Merged patches from release25-maint."
    hg push
    cd ../trunk
    hg pull ../release26-maint
    hg merge
    # Resolve conflicts, if any, then review.
    hg commit -m "Merged patches from release26-maint."
    hg push
Although this approach makes the history non-linear and slightly more difficult to follow, it encourages fixing bugs across all supported releases. Furthermore, it scales better when there are many changes to backport, because we do not need to hunt for the specific revision IDs to merge.
git
In git I would have a workspace which contains all of the relevant master repository branches. git cherry-pick doesn't work across repositories; you need to have the branches in the same repository.
    # Assume patch applied to 2.7 in revision release27~3 (4th patch back from tip).
    cd integration
    git checkout release26
    git cherry-pick release27~3
    # If there are conflicts, resolve them, and commit those changes.
    # git commit -a -m "Resolve conflicts."
    # Run test suite. If fixes are necessary, record as a separate commit.
    # git commit -a -m "Fix code causing test failures."
    git checkout master
    git cherry-pick release27~3
    # Do any conflict resolution and test failure fixups.
    # Revert Misc/NEWS changes.
    git checkout HEAD^ -- Misc/NEWS
    git commit -m 'Revert cherry-picked Misc/NEWS changes.' Misc/NEWS
    # Push both ports.
    git push release26 master
If you are regularly merging (rather than cherry-picking) from a given branch, then you can block a given commit from being accidentally merged in the future by merging, then reverting it. This does not prevent a cherry-pick from pulling in the unwanted patch, and this technique requires blocking everything that you don't want merged. I'm not sure if this differs from svn on this point.
    cd trunk
    # Merge in the alpha tested code.
    git merge experimental-branch
    # We don't want the 3rd-to-last commit from the experimental-branch,
    # and we don't want it to ever be merged.
    # The notation "^N" means Nth parent of the current commit. Thus HEAD^2^1^1
    # means the first parent of the first parent of the second parent of HEAD.
    git revert HEAD^2^1^1
    # Propagate the merge and the prohibition to the public repository.
    git push
Coordinated Development of a New Feature
Sometimes core developers end up working on a major feature with several developers. As a core developer, I want to be able to publish feature branches to a common public location so that I can collaborate with other developers.
This requires creating a branch on a server that other developers can access. All of the DVCSs support creating new repositories on hosts where the developer is already able to commit, with appropriate configuration of the repository host. This is similar in concept to the existing sandbox in svn, although details of repository initialization may differ.
For non-core developers, there are various more-or-less public-access repository-hosting services. Bazaar has Launchpad [6], Mercurial has bitbucket.org [7], and git has GitHub [8]. All also have easy-to-use CGI interfaces for developers who maintain their own servers.
- Branch trunk.
- Pull from branch on the server.
- Pull from trunk.
- Push merge to trunk.
svn
    # Create branch.
    svn copy svn+ssh://pythondev@svn.python.org/python/trunk svn+ssh://pythondev@svn.python.org/python/branches/NewHotness
    svn checkout svn+ssh://pythondev@svn.python.org/python/branches/NewHotness
    cd NewHotness
    svnmerge init
    svn commit -m "Initialize svnmerge."
    # Pull in changes from other developers.
    svn update
    # Pull in trunk and merge to the branch.
    svnmerge merge
    svn commit -F svnmerge-commit-message.txt
This scenario is incomplete as the decision for what DVCS to go with was made before the work was complete.
Separation of Issue Dependencies
Sometimes, while working on an issue, it becomes apparent that the problem being worked on is actually a compound issue of various smaller issues. Being able to take the current work and then begin working on a separate issue is very helpful to separate out issues into individual units of work instead of compounding them into a single, large unit.
- Create a branch A (e.g. urllib has a bug).
- Edit some code.
- Create a new branch B that branch A depends on (e.g. the urllib bug exposes a socket bug).
- Edit some code in branch B.
- Commit branch B.
- Edit some code in branch A.
- Commit branch A.
- Clean up.
svn
To make up for its lack of cheap branching, svn has a changelist option that associates a file with a single changelist. This is not as powerful as being able to associate changes at the commit level, and there is no way to express dependencies between changelists.
    cp -r trunk issue0000
    cd issue0000
    # Edit some code.
    echo "The cake is a lie!" > README
    svn changelist A README
    # Edit some other code.
    echo "I own Python!" > LICENSE
    svn changelist B LICENSE
    svn ci -m "Tell it how it is." --changelist B
    # Edit changelist A some more.
    svn ci -m "Speak the truth." --changelist A
    cd ..
    rm -rf issue0000
bzr
Here's an approach that uses bzr shelf (now a standard part of bzr) to squirrel away some changes temporarily while you take a detour to fix the socket bugs.
    bzr branch trunk bug-0000
    cd bug-0000
    # Edit some code. Dang, we need to fix the socket module.
    bzr shelve --all
    # Edit some code.
    bzr commit -m "Socket module fixes"
    # Detour over, now resume fixing urllib
    bzr unshelve
    # Edit some code
Another approach uses the loom plugin. Looms can greatly simplify working on dependent branches because they automatically take care of the stacking dependencies for you. Imagine looms as a stack of dependent branches (called "threads" in loom parlance), with easy ways to move up and down the stack of threads, merge changes up the stack to descendant threads, create diffs between threads, etc. Occasionally, you may need or want to export your loom threads into separate branches, either for review or commit. Higher threads incorporate all the changes in the lower threads, automatically.
    bzr branch trunk bug-0000
    cd bug-0000
    bzr loomify --base trunk
    bzr create-thread fix-urllib
    # Edit some code. Dang, we need to fix the socket module first.
    bzr commit -m "Checkpointing my work so far"
    bzr down-thread
    bzr create-thread fix-socket
    # Edit some code
    bzr commit -m "Socket module fixes"
    bzr up-thread
    # Manually resolve conflicts if necessary
    bzr commit -m 'Merge in socket fixes'
    # Edit me some more code
    bzr commit -m "Now that socket is fixed, complete the urllib fixes"
    bzr record done
For bonus points, let's say someone else fixes the socket module in exactly the same way you just did. Perhaps this person even grabbed your fix-socket thread and applied just that to the trunk. You'd like to be able to merge their changes into your loom and delete your now-redundant fix-socket thread.
    bzr down-thread trunk
    # Get all new revisions to the trunk. If you've done things
    # correctly, this will succeed without conflict.
    bzr pull
    bzr up-thread
    # See? The fix-socket thread is now identical to the trunk
    bzr commit -m 'Merge in trunk changes'
    bzr diff -r thread: | wc -l
    # returns 0
    bzr combine-thread
    bzr up-thread
    # Resolve any conflicts
    bzr commit -m 'Merge trunk'
    # Now our top-thread has an up-to-date trunk and just the urllib fix.
hg
One approach is to use the shelve extension; this extension is not included with Mercurial, but it is easy to install. With shelve, you can select changes to put temporarily aside.
    hg clone trunk issue0000
    cd issue0000
    # Edit some code (e.g. urllib).
    hg shelve
    # Select changes to put aside
    # Edit some other code (e.g. socket).
    hg commit
    hg unshelve
    # Complete initial fix.
    hg commit
    cd ../trunk
    hg pull ../issue0000
    hg merge
    hg commit
    rm -rf ../issue0000
There are several other ways to approach this scenario with Mercurial. Alexander Solovyov presented a few alternative approaches [9] on Mercurial's mailing list.
git
    cd trunk
    # Edit some code in urllib.
    # Discover a bug in socket, want to fix that first.
    # So save away our current work.
    git stash
    # Edit some code, commit some changes.
    git commit -a -m "Completed fix of socket."
    # Restore the in-progress work on urllib.
    git stash apply
    # Edit me some more code, commit some more fixes.
    git commit -a -m "Complete urllib fixes."
    # And push both patches to the public repository.
    git push
Bonus points: suppose you took your time, and someone else fixes socket in the same way you just did, and landed that in the trunk. In that case, your push will fail because your branch is not up-to-date. If the fix was a one-liner, there's a very good chance that it's exactly the same, character for character. git would notice that, and you are done; git will silently merge them.
Suppose we're not so lucky:
    # Update your branch.
    git pull git://code.python.org/public/trunk master
    # git has fetched all the necessary data, but reports that the
    # merge failed. We discover the nearly-duplicated patch.
    # Neither our version of the master branch nor the workspace has
    # been touched. Revert our socket patch and pull again:
    git revert HEAD^
    git pull git://code.python.org/public/trunk master
Like Bazaar and Mercurial, git has extensions to manage stacks of patches. You can use the original Quilt by Andrew Morton, or there is StGit ("stacked git") which integrates patch-tracking for large sets of patches into the VCS in a way similar to Mercurial Queues or Bazaar looms.
Doing a Python Release
How does PEP 101 change when using a DVCS?
bzr
It will change, but not substantially so. When doing the maintenance branch, we'll just push to the new location instead of doing an svn cp. Tags are totally different, since in svn they are directory copies, but in bzr (and I'm guessing hg), they are just symbolic names for revisions on a particular branch. The release.py script will have to change to use bzr commands instead. It's possible that, because DVCSs (in particular, bzr) handle cherry-picking and merging well enough, we'll be able to create the maintenance branches sooner. It would be a useful exercise to try to do a release off the bzr/hg mirrors.
hg
Clearly, details specific to Subversion in PEP 101 and in the release script will need to be updated. In particular, the release-tagging and maintenance-branch creation processes will have to be modified to use Mercurial's features; this will simplify and streamline certain aspects of the release process. For example, tagging and re-tagging a release will become a trivial operation, since a tag in Mercurial is simply a symbolic name for a given revision.
git
It will change, but not substantially so. When doing the maintenance branch, we'll just git push to the new location instead of doing an svn cp. Tags are totally different, since in svn they are directory copies, but in git they are just symbolic names for revisions, as are branches. (The difference between a tag and a branch is that tags refer to a particular commit, and will never change unless you use git tag -f to force them to move. The checked-out branch, on the other hand, is automatically updated by git commit.) The release.py script will have to change to use git commands instead. With git I would create a (local) maintenance branch as soon as the release engineer is chosen. Then I'd "git pull" until I didn't like a patch, when it would be "git pull; git revert ugly-patch", until it started to look like the sensible thing is to fork off, and start doing "git cherry-pick" on the good patches.
Platform/Tool Support
Operating Systems
| DVCS | Windows | OS X | UNIX |
|---|---|---|---|
| bzr | yes (installer) w/ tortoise | yes (installer, fink or MacPorts) | yes (various package formats) |
| hg | yes (third-party installer) w/ tortoise | yes (third-party installer, fink or MacPorts) | yes (various package formats) |
| git | yes (third-party installer) | yes (third-party installer, fink or MacPorts) | yes (.deb or .rpm) |
As the above table shows, all three DVCSs are available on all three major OS platforms. But it also shows that Bazaar is the only DVCS that directly supports Windows with a binary installer, while Mercurial and git require you to rely on third parties for binaries. Both bzr and hg have a tortoise version while git does not.
Bazaar and Mercurial also have the benefit of being written in pure Python, with optional extensions available for performance.
CRLF -> LF Support
- bzr
- My understanding is that support for this is being worked on as I type, landing in a version RSN. I will try to dig up details.
- hg
- Supported via the win32text extension.
- git
- I can't say from personal experience, but it looks like there's pretty good support via the core.autocrlf and core.safecrlf configuration attributes.
Case-insensitive filesystem support
- bzr
- Should be OK. I share branches between Linux and OS X all the time. I've done case changes (e.g. bzr mv Mailman mailman) and as long as I did it on Linux (obviously), when I pulled in the changes on OS X everything was hunky dory.
- hg
- Mercurial uses a case safe repository mechanism and detects case folding collisions.
- git
Since OS X preserves case, you can do case changes there too. git does not have a problem with renames in either direction. However, case-insensitive filesystem support is usually taken to mean complaining about collisions on case-sensitive file systems. git does not do that.
Tools
In terms of code review tools such as Review Board [10] and Rietveld [11], the former supports all three while the latter supports hg and git but not bzr. Bazaar does not yet have an online review board, but it has several ways to manage email based reviews and trunk merging. There's Bundle Buggy [12], Patch Queue Manager [13] (PQM), and Launchpad's code reviews.
All three have some web site online that provides basic hosting support for people who want to put a repository online. Bazaar has Launchpad, Mercurial has bitbucket.org, and git has GitHub. Google Code also has instructions on how to use git with the service, both to hold a repository and how to act as a read-only mirror.
All three also appear to be supported by Buildbot [14].
Usage On Top Of Subversion
| DVCS | svn support |
|---|---|
| bzr | bzr-svn [15] (third-party) |
| hg | multiple third-parties |
| git | git-svn [16] |
All three DVCSs have svn support, although git is the only one to come with that support out-of-the-box.
Server Support
| DVCS | Web page interface |
|---|---|
| bzr | loggerhead [17] |
| hg | hgweb [18] |
| git | gitweb [19] |
All three DVCSs support various hooks on the client and server side for e.g. pre/post-commit verifications.
Development
All three projects are under active development. Git seems to be on a monthly release schedule. Bazaar is on a time-based monthly release schedule. Mercurial is on a 4-month, timed release schedule.
Special Features
bzr
Martin Pool adds: "bzr has a stable Python scripting interface, with a distinction between public and private interfaces and a deprecation window for APIs that are changing. Some plugins are listed in https://edge.launchpad.net/bazaar and http://bazaar-vcs.org/Documentation".
hg
Alexander Solovyov comments:
Mercurial has easy to use extensive API with hooks for main events and ability to extend commands. Also there is the mq (mercurial queues) extension, distributed with Mercurial, which simplifies work with patches.
git
git has a cvsserver mode, i.e., you can check out a tree from git using CVS. You can even commit to the tree, but features like merging are absent, and branches are handled as CVS modules, which is likely to shock a veteran CVS user.
Tests/Impressions
As I (Brett Cannon) am left with the task of making the final decision of which DVCS, if any, to go with, and not my co-authors, I felt it only fair to write down what tests I ran and my impressions as I evaluated the various tools, so as to be as transparent as possible.
Barrier to Entry
The amount of time and effort it takes to get a checkout of Python's repository is critical. If the difficulty or time is too great then a person wishing to contribute to Python may very well give up. That cannot be allowed to happen.
I measured checking out the 2.x trunk as if I were a non-core developer. Timings were done using the time command in zsh, and space was calculated with du -c -h.
| DVCS | San Francisco | Vancouver | Space |
|---|---|---|---|
| svn | 1:04 | 2:59 | 139 M |
| bzr | 10:45 | 16:04 | 276 M |
| hg | 2:30 | 5:24 | 171 M |
| git | 2:54 | 5:28 | 134 M |
When comparing these numbers to svn, it is important to realize that it is not a 1:1 comparison. Svn does not pull down the entire revision history like all of the DVCSs do. That means svn can perform an initial checkout much faster than the DVCSs, purely because it has less information to download over the network.
Performance of basic information functionality
To see how the tools did for performing a command that required querying the history, the log for the README file was timed.
| DVCS | Time |
|---|---|
| bzr | 4.5 s |
| hg | 1.1 s |
| git | 1.5 s |
One thing of note during this test was that it took longer with git than with the other two tools to figure out how to get the log without it using a pager. While the pager is a nice touch in general, having it turn on automatically cost some time (it turns out the main git command has a --no-pager flag to disable use of the pager).
Figuring out what command to use from built-in help
I ended up trying to find out what the command was to see what URL the repository was cloned from. To do this I used nothing more than the help provided by the tool itself or its man pages.
Bzr was the easiest: bzr info. Running bzr help didn't show what I wanted, but mentioned bzr help commands. That list had the command with a description that made sense.
Git was the second easiest. The command git help didn't show much and did not have a way of listing all commands. That is when I viewed the man page. Reading through the various commands I discovered git remote. The command itself spit out nothing more than origin. Trying git remote origin said it was an error and printed out the command usage. That is when I noticed git remote show. Running git remote show origin gave me the information I wanted.
For hg, I never found the information I wanted on my own. It turns out I wanted hg paths, but that was not obvious from the description of "show definition of symbolic path names" as printed by hg help (it should be noted that reporting this in the PEP led the Mercurial developers to clarify the wording, making the use of the hg paths command clearer).
Updating a checkout
To see how long it takes to update an outdated repository I timed both updating a repository 700 commits behind and 50 commits behind (three weeks stale and 1 week stale, respectively).
| DVCS | 700 commits | 50 commits |
|---|---|---|
| bzr | 39 s | 7 s |
| hg | 17 s | 3 s |
| git | N/A | 4 s |
Note
Git lacks a value for the 700 commits scenario as it does not seem to allow checking out a repository at a specific revision.
Git deserves special mention for its output from git pull. It not only lists the delta change information for each file but also color-codes the information.
Decision
At PyCon 2009 the decision was made to go with Mercurial.
Why Mercurial over Subversion
While svn has served the development team well, it must be admitted that svn does not serve the needs of non-committers as well as a DVCS does. Because svn only provides its features, such as version control and branching, to people with commit privileges on the repository, it can be a hindrance for people who lack commit privileges. DVCSs have no such limitation, as anyone can create a local branch of Python and perform their own local commits without the burden that comes with cloning the entire svn repository. Allowing anyone to have the same workflow as the core developers was the key reason to switch from svn to hg.
Orthogonal to the benefit of allowing anyone to easily commit locally to their own branches is fast, offline operation. Because hg stores all data locally, there is no need to send requests to a remote server; commands work off the local disk. This improves response times tremendously. It also allows for offline usage when one lacks an Internet connection. But this benefit is minor and considered simply a side-effect benefit rather than a driving factor for switching off of Subversion.
Why Mercurial over other DVCSs
Git was not chosen for three key reasons (see the PyCon 2009 lightning talk where Brett Cannon lists these exact reasons; talk started at 3:45). First, git's Windows support is the weakest out of the three DVCSs being considered which is unacceptable as Python needs to support development on any platform it runs on. Since Python runs on Windows and some people do develop on the platform it needs solid support. And while git's support is improving, as of this moment it is the weakest by a large enough margin to warrant considering it a problem.
Second, and just as important as the first issue, is that the Python core developers liked git the least out of the three DVCS options by a wide margin. If you look at the following table you will see the results of a survey taken of the core developers and how by a large margin git is the least favorite version control system.
| DVCS | ++ | equal | -- | Uninformed |
|---|---|---|---|---|
| git | 5 | 1 | 8 | 13 |
| bzr | 10 | 3 | 2 | 12 |
| hg | 15 | 1 | 1 | 10 |
Lastly, all things being equal (which they are not as shown by the previous two issues), it is preferable to use and support a tool written in Python and not one written in C and shell. We are pragmatic enough to not choose a tool simply because it is written in Python, but we do see the usefulness in promoting tools that do use it when it is reasonable to do so as it is in this case.
As for why Mercurial was chosen over Bazaar, it came down to popularity. As the core developer survey shows, hg was preferred over bzr. But the community also appears to prefer hg as was shown at PyCon after git's removal from consideration was announced. Many people came up to Brett and said in various ways that they wanted hg to be chosen. While no one said they did not want bzr chosen, no one said they did either.
Based on all of this information, Guido and Brett decided Mercurial was to be the next version control system for Python.
Transition Plan
PEP 385 outlines the transition from svn to hg.
References
| [1] | http://bazaar-vcs.org/ |
| [2] | http://www.selenic.com/mercurial/ |
| [3] | http://www.git-scm.com/ |
| [4] | http://repo.or.cz/r/fast-export.git/.git/description |
| [5] | http://progetti.arstecnica.it/tailor/ |
| [6] | http://www.launchpad.net/ |
| [7] | http://www.bitbucket.org/ |
| [8] | http://www.github.com/ |
| [9] | http://selenic.com/pipermail/mercurial/2009-January/023710.html |
| [10] | http://www.review-board.org/ |
| [11] | http://code.google.com/p/rietveld/ |
| [12] | http://code.aaronbentley.com/bundlebuggy/ |
| [13] | http://bazaar-vcs.org/PatchQueueManager |
| [14] | http://buildbot.net |
| [15] | http://bazaar-vcs.org/BzrForeignBranches/Subversion |
| [16] | http://www.kernel.org/pub/software/scm/git/docs/git-svn.html |
| [17] | https://launchpad.net/loggerhead |
| [18] | http://www.selenic.com/mercurial/wiki/index.cgi/HgWebDirStepByStep |
| [19] | http://git.or.cz/gitwiki/Gitweb |
Copyright
This document has been placed in the public domain.
pep-0375 Python 3.1 Release Schedule
| PEP: | 375 |
|---|---|
| Title: | Python 3.1 Release Schedule |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Benjamin Peterson <benjamin at python.org> |
| Status: | Final |
| Type: | Informational |
| Content-Type: | text/x-rst |
| Created: | 8-Feb-2009 |
| Python-Version: | 3.1 |
Contents
Abstract
This document describes the development and release schedule for Python 3.1. The schedule primarily concerns itself with PEP-sized items. Small features may be added up to and including the first beta release. Bugs may be fixed until the final release.
Release Manager and Crew
| Position | Name |
|---|---|
| 3.1 Release Manager | Benjamin Peterson |
| Windows installers | Martin v. Loewis |
| Mac installers | Ronald Oussoren |
Release Schedule
- 3.1a1 March 7, 2009
- 3.1a2 April 4, 2009
- 3.1b1 May 6, 2009
- 3.1rc1 May 30, 2009
- 3.1rc2 June 13, 2009
- 3.1 final June 27, 2009
Maintenance Releases
3.1 is no longer maintained. 3.1 received security fixes until June 2012.
Previous maintenance releases are:
- v3.1.1rc1 2009-08-13
- v3.1.1 2009-08-16
- v3.1.2rc1 2010-03-06
- v3.1.2 2010-03-20
- v3.1.3rc1 2010-11-13
- v3.1.3 2010-11-27
- v3.1.4rc1 2011-05-29
- v3.1.4 2011-06-11
- v3.1.5rc1 2012-02-23
- v3.1.5rc2 2012-03-15
- v3.1.5 2012-04-06
Copyright
This document has been placed in the public domain.
pep-0376 Database of Installed Python Distributions
| PEP: | 376 |
|---|---|
| Title: | Database of Installed Python Distributions |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Tarek Ziadé <tarek at ziade.org> |
| Status: | Accepted |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 22-Feb-2009 |
| Python-Version: | 2.7, 3.2 |
| Post-History: | |
Contents
Abstract
The goal of this PEP is to provide a standard infrastructure to manage project distributions installed on a system, so all tools that are installing or removing projects are interoperable.
To achieve this goal, the PEP proposes a new format to describe installed distributions on a system. It also describes a reference implementation for the standard library.
In the past an attempt was made to create an installation database (see PEP 262 [3]).
Combined with PEP 345, the current proposal supersedes PEP 262.
Rationale
There are two problems right now in the way distributions are installed in Python:
- There are too many ways to do it and this makes interoperation difficult.
- There is no API to get information on installed distributions.
How distributions are installed
Right now, when a distribution is installed in Python, every element can be installed in a different directory.
For instance, Distutils installs the pure Python code in the purelib directory, which is lib/python2.6/site-packages for unix-like systems and Mac OS X, or Lib\site-packages under Python's installation directory for Windows.
Additionally, the install_egg_info subcommand of the Distutils install command adds an .egg-info file for the project into the purelib directory.
For example, for the docutils distribution, which contains one package, an extra module, and executable scripts, three elements are installed in site-packages:
- docutils: The docutils package.
- roman.py: An extra module used by docutils.
- docutils-0.5-py2.6.egg-info: A file containing the distribution metadata as described in PEP 314 [4]. This file corresponds to the file called PKG-INFO, built by the sdist command.
Some executable scripts, such as rst2html.py, are also added in the bin directory of the Python installation.
Another project called setuptools [5] has two other formats to install distributions, called EggFormats [8]:
- a self-contained .egg directory, that contains all the distribution files and the distribution metadata in a file called PKG-INFO in a subdirectory called EGG-INFO. setuptools creates other files in that directory that can be considered as complementary metadata.
- an .egg-info directory installed in site-packages, that contains the same files EGG-INFO has in the .egg format.
The first format is automatically used when you install a distribution that uses the setuptools.setup function in its setup.py file, instead of the distutils.core.setup one.
setuptools also adds a reference to the distribution into an easy-install.pth file.
Last, the setuptools project provides an executable script called easy_install [6] that installs all distributions, including distutils-based ones in self-contained .egg directories.
If you want to have standalone .egg-info directories for your distributions, i.e. the second setuptools format, you have to force it when you work with a setuptools-based distribution or with the easy_install script. You can force it by using the --single-version-externally-managed option or the --root option. This will make the setuptools project install the project like distutils does.
This option is used by:
Uninstall information
Distutils doesn't provide an uninstall command. If you want to uninstall a distribution, you have to be a power user and remove the various elements that were installed, and then look over the .pth file to clean them if necessary.
And the process differs depending on the tools you have used to install the distribution and if the distribution's setup.py uses Distutils or Setuptools.
Under some circumstances, you might not be able to know for sure that you have removed everything, or that you didn't break another distribution by removing a file that is shared among several distributions.
But there's a common behavior: when you install a distribution, files are copied in your system. And it's possible to keep track of these files for later removal.
Moreover, the Pip project has gained an uninstall feature lately. It records all installed files, using the record option of the install command.
What this PEP proposes
To address those issues, this PEP proposes a few changes:
- A new .dist-info structure using a directory, inspired by one format of the EggFormats standard from setuptools.
- New APIs in pkgutil to be able to query the information of installed distributions.
- An uninstall function and an uninstall script in Distutils.
One .dist-info directory per installed distribution
This PEP proposes an installation format inspired by one of the options in the EggFormats standard, the one that uses a distinct directory located in the site-packages directory.
This distinct directory is named as follows:
name + '-' + version + '.dist-info'
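For instance (the helper function below is illustrative, not part of the proposal), for docutils 0.5 the naming rule yields:

```python
def dist_info_dirname(name, version):
    # name + '-' + version + '.dist-info', per the naming rule above.
    return name + '-' + version + '.dist-info'

print(dist_info_dirname('docutils', '0.5'))  # docutils-0.5.dist-info
```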
This .dist-info directory can contain these files:
- METADATA: contains metadata, as described in PEP 345, PEP 314 and PEP 241.
- RECORD: records the list of installed files.
- INSTALLER: records the name of the tool used to install the project.
- REQUESTED: the presence of this file indicates that the project installation was explicitly requested (i.e., not installed as a dependency).
The METADATA, RECORD and INSTALLER files are mandatory, while REQUESTED may be missing.
This proposal will not impact Python itself because the metadata files are not used anywhere yet in the standard library besides Distutils.
It will impact the setuptools and pip projects but, given the fact that they already work with a directory that contains a PKG-INFO file, the change will have no deep consequences.
RECORD
A RECORD file is added inside the .dist-info directory at installation time when installing a source distribution using the install command. Notice that when installing a binary distribution created with bdist command or a bdist-based command, the RECORD file will be installed as well since these commands use the install command to create binary distributions.
The RECORD file holds the list of installed files. These correspond to the files listed by the record option of the install command, and will be generated by default. This allows the implementation of an uninstallation feature, as explained later in this PEP. The install command also provides an option to prevent the RECORD file from being written and this option should be used when creating system packages.
Third-party installation tools also should not overwrite or delete files that are not in a RECORD file without prompting or warning.
The RECORD file is inspired by the FILES section of PEP 262 [3].
The RECORD file is a CSV file, composed of records, one line per installed file. The csv module is used to read the file, with these options:
- field delimiter: ,
- quoting char: "
- line terminator: os.linesep (so \r\n or \n)
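As a sketch of those options in use (the paths and hash value below are made up for illustration), writing and re-reading a couple of records with the csv module:

```python
import csv
import io
import os

rows = [
    ['lib/python2.6/site-packages/roman.py', 'md5=7YhfNczihNjOY0FXlupwBg', '234'],
    ['lib/odd, name.py', '', ''],  # a comma in the path forces quoting
]

buf = io.StringIO()
writer = csv.writer(buf, delimiter=',', quotechar='"',
                    lineterminator=os.linesep)
writer.writerows(rows)
print(buf.getvalue())

# Reading the records back; csv removes the quoting automatically.
restored = list(csv.reader(io.StringIO(buf.getvalue()),
                           delimiter=',', quotechar='"'))
print(restored)
```

Note that lineterminator only affects writing; the csv reader recognizes both \r\n and \n line endings regardless of platform.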
When a distribution is installed, files can be installed under:
- the base location: path defined by the --install-lib option, which defaults to the site-packages directory.
- the installation prefix: path defined by the --prefix option, which defaults to sys.prefix.
- any other path on the system.
Each record is composed of three elements:
the file's path
- a '/'-separated path, relative to the base location, if the file is under the base location.
- a '/'-separated path, relative to the base location, if the file is under the installation prefix AND if the base location is a subpath of the installation prefix.
- an absolute path, using the local platform separator
a hash of the file's contents. Notice that pyc and pyo generated files don't have any hash because they are automatically produced from py files. So checking the hash of the corresponding py file is enough to decide if the file and its associated pyc or pyo files have changed.
The hash is either the empty string or the hash algorithm as named in hashlib.algorithms_guaranteed, followed by the equals character =, followed by the urlsafe-base64-nopad encoding of the digest (base64.urlsafe_b64encode(digest) with trailing = removed).
the file's size in bytes
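A sketch of computing such a hash entry with hashlib and base64 (record_hash is an illustrative name, not part of the proposed API):

```python
import base64
import hashlib

def record_hash(data, algorithm="sha256"):
    """Build a RECORD hash field: '<algorithm>=<urlsafe-b64 digest, no pad>'."""
    assert algorithm in hashlib.algorithms_guaranteed
    digest = hashlib.new(algorithm, data).digest()
    # urlsafe base64 with the trailing '=' padding removed, as specified above
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=")
    return "%s=%s" % (algorithm, encoded.decode("ascii"))

entry = record_hash(b"print('hello')\n")
```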
The csv module is used to generate this file, so the field separator is ",". Any "," character found within a field is escaped automatically by csv.
When the file is read, the U option is used so that universal newline support (see PEP 278 [10]) is activated, avoiding any trouble when reading a file produced on a platform that uses a different line terminator.
Here's an example of a RECORD file (extract):
lib/python2.6/site-packages/docutils/__init__.py,md5=nWt-Dge1eug4iAgqLS_uWg,9544
lib/python2.6/site-packages/docutils/__init__.pyc,,
lib/python2.6/site-packages/docutils/core.py,md5=X90C_JLIcC78PL74iuhPnA,66188
lib/python2.6/site-packages/docutils/core.pyc,,
lib/python2.6/site-packages/roman.py,md5=7YhfNczihNjOY0FXlupwBg,234
lib/python2.6/site-packages/roman.pyc,,
/usr/local/bin/rst2html.py,md5=g22D3amDLJP-FhBzCi7EvA,234
/usr/local/bin/rst2html.pyc,,
lib/python2.6/site-packages/docutils-0.5.dist-info/METADATA,md5=ovJyUNzXdArGfmVyb0onyA,195
lib/python2.6/site-packages/docutils-0.5.dist-info/RECORD,,
Notice that the RECORD file can't contain a hash of itself; it is listed with empty hash and size fields.
A project that installs a config.ini file in /etc/myapp will be added like this:
/etc/myapp/config.ini,md5=gLfd6IANquzGLhOkW4Mfgg,9544
On Windows, the drive letter is added to absolute paths, so a file copied to c:\etc\myapp will be recorded as:
c:\etc\myapp\config.ini,md5=gLfd6IANquzGLhOkW4Mfgg,9544
INSTALLER
The install command has a new option called installer. This option is the name of the tool used to invoke the installation. It's a normalized lower-case string matching [a-z0-9_\-\.].
$ python setup.py install --installer=pkg-system
It defaults to distutils if not provided.
When a distribution is installed, the INSTALLER file is generated in the .dist-info directory with this value, to keep track of who installed the distribution. The file is a single-line text file.
REQUESTED
Some install tools automatically detect unfulfilled dependencies and install them. In these cases, it is useful to track which distributions were installed purely as a dependency, so if their dependent distribution is later uninstalled, the user can be alerted of the orphaned dependency.
If a distribution is installed by direct user request (the usual case), a file REQUESTED is added to the .dist-info directory of the installed distribution. The REQUESTED file may be empty, or may contain a marker comment line beginning with the "#" character.
If an install tool installs a distribution automatically, as a dependency of another distribution, the REQUESTED file should not be created.
The install command of distutils by default creates the REQUESTED file. It accepts --requested and --no-requested options to explicitly specify whether the file is created.
If a distribution that was already installed on the system as a dependency is later installed by name, the distutils install command will create the REQUESTED file in the .dist-info directory of the existing installation.
Implementation details
New functions and classes in pkgutil
To use the .dist-info directory content, we need to add a set of APIs to the standard library. The best place to put these APIs is pkgutil.
Functions
The new functions added in the pkgutil module are:
distinfo_dirname(name, version) -> directory name
name is converted to a standard distribution name by replacing any runs of non-alphanumeric characters with a single '-'.
version is converted to a standard version string. Spaces become dots, and all other non-alphanumeric characters (except dots) become dashes, with runs of multiple dashes condensed to a single dash.
Both attributes are then converted into their filename-escaped form, i.e. any '-' characters are replaced with '_' other than the one in 'dist-info' and the one separating the name from the version number.
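A sketch of these two conversions (my reading of the rules above, validated against the worked examples later in this PEP):

```python
import re

def distinfo_dirname(name, version):
    """Sketch of the pkgutil.distinfo_dirname function as specified above."""
    # Runs of non-alphanumeric characters in the name become a single '-'.
    name = re.sub(r"[^A-Za-z0-9]+", "-", name)
    # In the version, spaces become dots, then other non-alphanumeric
    # characters (except dots) become dashes, runs condensed to one dash.
    version = re.sub(r"[^A-Za-z0-9.]+", "-", version.replace(" ", "."))
    # Filename-escaped form: remaining '-' characters are replaced by '_'
    # (other than the separator and the one in 'dist-info').
    return "%s-%s.dist-info" % (name.replace("-", "_"), version.replace("-", "_"))
```

For example, distinfo_dirname('python-ldap', '2.5 a---5') gives 'python_ldap-2.5.a_5.dist-info', matching the Examples section below.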
get_distributions() -> iterator of Distribution instances.
Provides an iterator that looks for .dist-info directories in sys.path and returns Distribution instances for each one of them.
get_distribution(name) -> Distribution or None.
Scans all elements in sys.path and looks for all directories ending with .dist-info. Returns a Distribution corresponding to the .dist-info directory that contains a METADATA file that matches name for the name metadata. This function only returns the first result found, since no more than one result is expected. If the directory is not found, returns None.
obsoletes_distribution(name, version=None) -> iterator of Distribution instances.
Iterates over all distributions to find which distributions obsolete name. If a version is provided, it will be used to filter the results.
provides_distribution(name, version=None) -> iterator of Distribution instances.
Iterates over all distributions to find which distributions provide name. If a version is provided, it will be used to filter the results.
get_file_users(path) -> iterator of Distribution instances.
Iterates over all distributions to find out which distributions use path. path can be a local absolute path or a relative '/'-separated path.
A local absolute path is an absolute path in which occurrences of '/' have been replaced by the system separator given by os.sep.
Distribution class
A new class called Distribution is created with the path of the .dist-info directory provided to the constructor. It reads the metadata contained in METADATA when it is instantiated.
Distribution(path) -> instance
Creates a Distribution instance for the given path.
Distribution provides the following attributes:
- name: The name of the distribution.
- metadata: A DistributionMetadata instance loaded with the distribution's METADATA file.
- requested: A boolean that indicates whether the REQUESTED metadata file is present (in other words, whether the distribution was installed by user request).
And the following methods:
get_installed_files(local=False) -> iterator of (path, hash, size)
Iterates over the RECORD entries and returns a tuple (path, hash, size) for each line. If local is True, the path is transformed into a local absolute path. Otherwise the raw value from RECORD is returned.
A local absolute path is an absolute path in which occurrences of '/' have been replaced by the system separator given by os.sep.
uses(path) -> Boolean
Returns True if path is listed in RECORD. path can be a local absolute path or a relative '/'-separated path.
get_distinfo_file(path, binary=False) -> file object
Returns a file instance for the file pointed to by path, located under the .dist-info directory.
path has to be a '/'-separated path relative to the .dist-info directory or an absolute path.
If path is an absolute path and doesn't start with the .dist-info directory path, a DistutilsError is raised.
If binary is True, opens the file in read-only binary mode (rb), otherwise opens it in read-only mode (r).
get_distinfo_files(local=False) -> iterator of paths
Iterates over the RECORD entries and returns paths for each line if the path is pointing to a file located in the .dist-info directory or one of its subdirectories.
If local is True, each path is transformed into a local absolute path. Otherwise the raw value from RECORD is returned.
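The RECORD-related parts of this class can be sketched as follows (a simplified stand-in for illustration, not the prototype implementation; metadata loading and the get_distinfo_file methods are omitted):

```python
import csv
import os

class Distribution:
    """Simplified sketch of the proposed pkgutil.Distribution class."""

    def __init__(self, path):
        self.path = path  # the .dist-info directory
        self.name = os.path.basename(path).split("-")[0]
        self.requested = os.path.exists(os.path.join(path, "REQUESTED"))

    def get_installed_files(self, local=False):
        with open(os.path.join(self.path, "RECORD"), newline="") as f:
            for row in csv.reader(f):
                if not row:
                    continue
                path, hash_, size = (row + ["", ""])[:3]
                if local:
                    # Sketch: swap separators only; the real API also turns
                    # relative entries into local absolute paths.
                    path = path.replace("/", os.sep)
                yield path, hash_, int(size) if size else None

    def uses(self, path):
        return any(p == path for p, _, _ in self.get_installed_files())
```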
Notice that the API is organized in five classes that work with directories and Zip files (so it works with files included in Zip files, see PEP 273 for more details [9]). These classes are described in the documentation of the prototype implementation for interested readers [13].
Examples
Let's use some of the new APIs with our docutils example:
>>> from pkgutil import get_distribution, get_file_users, distinfo_dirname
>>> dist = get_distribution('docutils')
>>> dist.name
'docutils'
>>> dist.metadata.version
'0.5'
>>> distinfo_dirname('docutils', '0.5')
'docutils-0.5.dist-info'
>>> distinfo_dirname('python-ldap', '2.5')
'python_ldap-2.5.dist-info'
>>> distinfo_dirname('python-ldap', '2.5 a---5')
'python_ldap-2.5.a_5.dist-info'
>>> for path, hash, size in dist.get_installed_files():
... print '%s %s %d' % (path, hash, size)
...
python2.6/site-packages/docutils/__init__.py b690274f621402dda63bf11ba5373bf2 9544
python2.6/site-packages/docutils/core.py 9c4b84aff68aa55f2e9bf70481b94333 66188
python2.6/site-packages/roman.py a4b84aff68aa55f2e9bf70481b943D3 234
/usr/local/bin/rst2html.py a4b84aff68aa55f2e9bf70481b943D3 234
python2.6/site-packages/docutils-0.5.dist-info/METADATA 6fe57de576d749536082d8e205b77748 195
python2.6/site-packages/docutils-0.5.dist-info/RECORD
>>> dist.uses('docutils/core.py')
True
>>> dist.uses('/usr/local/bin/rst2html.py')
True
>>> dist.get_distinfo_file('METADATA')
<open file at ...>
>>> dist.requested
True
New functions in Distutils
Distutils already provides a very basic way to install a distribution, which is running the install command over the setup.py script of the distribution.
Distutils2 [2] will provide a very basic uninstall function, added in distutils2.util, that takes the name of the distribution to uninstall as its argument. uninstall uses the APIs described earlier and removes all unique files, as long as their hash didn't change. Then it removes any empty directories left behind.
uninstall returns a list of uninstalled files:
>>> from distutils2.util import uninstall
>>> uninstall('docutils')
['/opt/local/lib/python2.6/site-packages/docutils/core.py',
...
'/opt/local/lib/python2.6/site-packages/docutils/__init__.py']
If the distribution is not found, a DistutilsUninstallError is raised.
Filtering
To make it a reference API for third-party projects that wish to control how uninstall works, a second, callable argument can be passed. It's called for each file that is about to be removed: if the callable returns True, the file is removed; if it returns False, it's left alone.
Examples:
>>> def _remove_and_log(path):
... logging.info('Removing %s' % path)
... return True
...
>>> uninstall('docutils', _remove_and_log)
>>> def _dry_run(path):
... logging.info('Removing %s (dry run)' % path)
... return False
...
>>> uninstall('docutils', _dry_run)
Of course, a third-party tool can use lower-level pkgutil APIs to implement its own uninstall feature.
Installer marker
As explained earlier in this PEP, the install command adds an INSTALLER file in the .dist-info directory with the name of the installer.
To avoid removing distributions that were installed by another packaging system, the uninstall function takes an extra argument installer which defaults to distutils2.
When called, uninstall checks that the INSTALLER file matches this argument. If not, it raises a DistutilsUninstallError:
>>> uninstall('docutils')
Traceback (most recent call last):
...
DistutilsUninstallError: docutils was installed by 'cool-pkg-manager'
>>> uninstall('docutils', installer='cool-pkg-manager')
This allows a third-party application to use the uninstall function and strongly suggest that no other program remove a distribution it has previously installed. This is useful when a third-party program that relies on Distutils APIs performs extra steps on the system at installation time that it has to undo at uninstallation time.
Adding an Uninstall script
An uninstall script is added in Distutils2 and is used like this:
$ python -m distutils2.uninstall projectname
Notice that the script doesn't check whether the removal of a distribution breaks another distribution. It does make sure, by using the uninstall function, that all the files it removes are not used by any other distribution.
Also note that this uninstall script pays no attention to the REQUESTED metadata; that is provided only for use by external tools to provide more advanced dependency management.
Backward compatibility and roadmap
These changes don't introduce any compatibility problems since they will be implemented in:
- pkgutil in new functions
- distutils2
The plan is to include the functionality outlined in this PEP in pkgutil for Python 3.2, and in Distutils2.
Distutils2 will also contain a backport of the new pkgutil, and can be used from Python 2.4 onward.
Distributions installed using existing, pre-standardization formats do not have the necessary metadata available for the new API, and thus will be ignored. Third-party tools may of course continue to support previous formats in addition to the new format, in order to ease the transition.
References
| [1] | http://docs.python.org/distutils |
| [2] | http://hg.python.org/distutils2 |
| [3] | (1, 2, 3) http://www.python.org/dev/peps/pep-0262 |
| [4] | http://www.python.org/dev/peps/pep-0314 |
| [5] | http://peak.telecommunity.com/DevCenter/setuptools |
| [6] | http://peak.telecommunity.com/DevCenter/EasyInstall |
| [7] | http://pypi.python.org/pypi/pip |
| [8] | http://peak.telecommunity.com/DevCenter/EggFormats |
| [9] | http://www.python.org/dev/peps/pep-0273 |
| [10] | http://www.python.org/dev/peps/pep-0278 |
| [11] | http://fedoraproject.org/wiki/Packaging/Python/Eggs#Providing_Eggs_using_Setuptools |
| [12] | http://wiki.debian.org/DebianPython/NewPolicy |
| [13] | http://bitbucket.org/tarek/pep376/ |
Acknowledgements
Jim Fulton, Ian Bicking, Phillip Eby, Rafael Villar Burke, and many people at Pycon and Distutils-SIG.
Copyright
This document has been placed in the public domain.
pep-0377 Allow __enter__() methods to skip the statement body
| PEP: | 377 |
|---|---|
| Title: | Allow __enter__() methods to skip the statement body |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Nick Coghlan <ncoghlan at gmail.com> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 8-Mar-2009 |
| Python-Version: | 2.7, 3.1 |
| Post-History: | 8-Mar-2009 |
Contents
Abstract
This PEP proposes a backwards compatible mechanism that allows __enter__() methods to skip the body of the associated with statement. The lack of this ability currently means the contextlib.contextmanager decorator is unable to fulfil its specification of being able to turn arbitrary code into a context manager by moving it into a generator function with a yield in the appropriate location. One symptom of this is that contextlib.nested will currently raise RuntimeError in situations where writing out the corresponding nested with statements would not [1].
The proposed change is to introduce a new flow control exception SkipStatement, and skip the execution of the with statement body if __enter__() raises this exception.
PEP Rejection
This PEP was rejected by Guido [4] as it imposes too great an increase in complexity without a proportional increase in expressiveness and correctness. In the absence of compelling use cases that need the more complex semantics proposed by this PEP the existing behaviour is considered acceptable.
Proposed Change
The semantics of the with statement will be changed to include a new try/except/else block around the call to __enter__(). If SkipStatement is raised by the __enter__() method, then the main section of the with statement (now located in the else clause) will not be executed. To avoid leaving the names in any as clause unbound in this case, a new StatementSkipped singleton (similar to the existing NotImplemented singleton) will be assigned to all names that appear in the as clause.
The components of the with statement remain as described in PEP 343 [2]:
with EXPR as VAR:
BLOCK
After the modification, the with statement semantics would be as follows:
mgr = (EXPR)
exit = mgr.__exit__ # Not calling it yet
try:
value = mgr.__enter__()
except SkipStatement:
VAR = StatementSkipped
# Only if "as VAR" is present and
# VAR is a single name
# If VAR is a tuple of names, then StatementSkipped
# will be assigned to each name in the tuple
else:
exc = True
try:
try:
VAR = value # Only if "as VAR" is present
BLOCK
except:
# The exceptional case is handled here
exc = False
if not exit(*sys.exc_info()):
raise
# The exception is swallowed if exit() returns true
finally:
# The normal and non-local-goto cases are handled here
if exc:
exit(None, None, None)
With the above change in place for the with statement semantics, contextlib.contextmanager() will then be modified to raise SkipStatement instead of RuntimeError when the underlying generator doesn't yield.
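The RuntimeError in question is easy to reproduce with current contextlib (maybe_skip is a made-up example name):

```python
import contextlib

@contextlib.contextmanager
def maybe_skip(run_body):
    # When run_body is false the generator finishes without yielding,
    # so __enter__() has no way to skip the body of the with statement:
    # contextlib raises RuntimeError instead.
    if run_body:
        yield

raised = False
try:
    with maybe_skip(False):
        pass  # never reached
except RuntimeError:
    raised = True
assert raised
```

Under the proposal, this is exactly the situation where SkipStatement would be raised instead, silently skipping the statement body.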
Rationale for Change
Currently, some apparently innocuous context managers may raise RuntimeError when executed. This occurs when the context manager's __enter__() method encounters a situation where the written-out version of the code corresponding to the context manager would skip the code that is now the body of the with statement. Since the __enter__() method has no mechanism available to signal this to the interpreter, it is instead forced to raise an exception that not only skips the body of the with statement, but also jumps over all code until the nearest exception handler. This goes against one of the design goals of the with statement, which was to be able to factor out arbitrary common exception handling code into a single context manager by putting it into a generator function and replacing the variant part of the code with a yield statement.
Specifically, the following examples behave differently if cmB().__enter__() raises an exception which cmA().__exit__() then handles and suppresses:
with cmA():
with cmB():
do_stuff()
# This will resume here without executing "do_stuff()"
@contextlib.contextmanager
def combined():
with cmA():
with cmB():
yield
with combined():
do_stuff()
# This will raise a RuntimeError complaining that the context
# manager's underlying generator didn't yield
with contextlib.nested(cmA(), cmB()):
do_stuff()
# This will raise the same RuntimeError as the contextmanager()
# example (unsurprising, given that the nested() implementation
# uses contextmanager())
# The following class based version shows that the issue isn't
# specific to contextlib.contextmanager() (it also shows how
# much simpler it is to write context managers as generators
# instead of as classes!)
class CM(object):
def __init__(self):
self.cmA = None
self.cmB = None
def __enter__(self):
if self.cmA is not None:
raise RuntimeError("Can't re-use this CM")
self.cmA = cmA()
self.cmA.__enter__()
try:
self.cmB = cmB()
self.cmB.__enter__()
except:
self.cmA.__exit__(*sys.exc_info())
# Can't suppress in __enter__(), so must raise
raise
def __exit__(self, *args):
suppress = False
try:
if self.cmB is not None:
suppress = self.cmB.__exit__(*args)
except:
suppress = self.cmA.__exit__(*sys.exc_info())
if not suppress:
# Exception has changed, so reraise explicitly
raise
else:
if suppress:
# cmB already suppressed the exception,
# so don't pass it to cmA
suppress = self.cmA.__exit__(None, None, None)
else:
suppress = self.cmA.__exit__(*args)
return suppress
With the proposed semantic change in place, the contextlib based examples above would then "just work", but the class based version would need a small adjustment to take advantage of the new semantics:
class CM(object):
def __init__(self):
self.cmA = None
self.cmB = None
def __enter__(self):
if self.cmA is not None:
raise RuntimeError("Can't re-use this CM")
self.cmA = cmA()
self.cmA.__enter__()
try:
self.cmB = cmB()
self.cmB.__enter__()
except:
if self.cmA.__exit__(*sys.exc_info()):
# Suppress the exception, but don't run
# the body of the with statement either
raise SkipStatement
raise
def __exit__(self, *args):
suppress = False
try:
if self.cmB is not None:
suppress = self.cmB.__exit__(*args)
except:
suppress = self.cmA.__exit__(*sys.exc_info())
if not suppress:
# Exception has changed, so reraise explicitly
raise
else:
if suppress:
# cmB already suppressed the exception,
# so don't pass it to cmA
suppress = self.cmA.__exit__(None, None, None)
else:
suppress = self.cmA.__exit__(*args)
return suppress
There is currently a tentative suggestion [3] to add import-style syntax to the with statement to allow multiple context managers to be included in a single with statement without needing to use contextlib.nested. In that case the compiler has the option of simply emitting multiple with statements at the AST level, thus allowing the semantics of actual nested with statements to be reproduced accurately. However, such a change would highlight rather than alleviate the problem the current PEP aims to address: it would not be possible to use contextlib.contextmanager to reliably factor out such with statements, as they would exhibit exactly the same semantic differences as are seen with the combined() context manager in the above example.
Performance Impact
Implementing the new semantics makes it necessary to store the references to the __enter__ and __exit__ methods in temporary variables instead of on the stack. This results in a slight regression in with statement speed relative to Python 2.6/3.1. However, implementing a custom SETUP_WITH opcode would negate any differences between the two approaches (as well as dramatically improving speed by eliminating more than a dozen unnecessary trips around the eval loop).
Reference Implementation
Patch attached to Issue 5251 [1]. That patch uses only existing opcodes (i.e. no SETUP_WITH).
Acknowledgements
James William Pye both raised the issue and suggested the basic outline of the solution described in this PEP.
References
| [1] | (1, 2) Issue 5251: contextlib.nested inconsistent with nested with statements (http://bugs.python.org/issue5251) |
| [2] | PEP 343: The "with" Statement (http://www.python.org/dev/peps/pep-0343/) |
| [3] | Import-style syntax to reduce indentation of nested with statements (http://mail.python.org/pipermail/python-ideas/2009-March/003188.html) |
| [4] | Guido's rejection of the PEP (http://mail.python.org/pipermail/python-dev/2009-March/087263.html) |
Copyright
This document has been placed in the public domain.
pep-0378 Format Specifier for Thousands Separator
| PEP: | 378 |
|---|---|
| Title: | Format Specifier for Thousands Separator |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Raymond Hettinger <python at rcn.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 12-Mar-2009 |
| Python-Version: | 2.7 and 3.1 |
| Post-History: | 12-Mar-2009 |
Contents
Motivation
Provide a simple, non-locale aware way to format a number with a thousands separator.
Adding thousands separators is one of the simplest ways to humanize a program's output, improving its professional appearance and readability.
In the finance world, output with thousands separators is the norm. Finance users and non-professional programmers find the locale approach to be frustrating, arcane and non-obvious.
The locale module presents two other challenges. First, it is a global setting and not suitable for multi-threaded apps that need to serve up requests in multiple locales. Second, the name of a relevant locale (such as "de_DE") can vary from platform to platform or may not be defined at all. The docs for the locale module describe these and many other challenges [1] in detail.
It is not the goal to replace the locale module, to perform internationalization tasks, or accommodate every possible convention. Such tasks are better suited to robust tools like Babel [2]. Instead, the goal is to make a common, everyday task easier for many users.
Main Proposal (from Nick Coghlan, originally called Proposal I)
A comma will be added to the format() specifier mini-language:
[[fill]align][sign][#][0][width][,][.precision][type]
The ',' option indicates that commas should be included in the output as a thousands separator. As with locales which do not use a period as the decimal point, locales which use a different convention for digit separation will need to use the locale module to obtain appropriate formatting.
The proposal works well with floats, ints, and decimals. It also allows easy substitution for other separators. For example:
format(n, "6,d").replace(",", "_")
This technique is completely general but it is awkward in the one case where the commas and periods need to be swapped:
format(n, "6,f").replace(",", "X").replace(".", ",").replace("X", ".")
The width argument means the total length including the commas and decimal point:
format(1234, "08,d")     -->  '0001,234'
format(1234.5, "08,.1f") -->  '01,234.5'
The ',' option is defined as shown above for types 'd', 'e', 'f', 'g', 'E', 'G', '%', 'F' and ''. To allow future extensions, it is undefined for other types: binary, octal, hex, character, etc.
This proposal has the virtue of being simpler than the alternative proposal but is much less flexible and meets the needs of fewer users right out of the box. It is expected that some other solution will arise for specifying alternative separators.
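Since this main proposal was ultimately accepted (the PEP's status is Final), the behaviour described above can be checked directly in any modern Python:

```python
# The ',' option in the format specifier mini-language, plus the
# separator-substitution trick suggested above.
assert format(1234567, ",d") == "1,234,567"
assert format(1234.5, "08,.1f") == "01,234.5"
assert format(1234567, ",d").replace(",", "_") == "1_234_567"
```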
Current Version of the Mini-Language
- Python 2.6 docs [3]
- PEP 3101 Advanced String Formatting
Research into what Other Languages Do
Scanning the web, I've found that thousands separators are usually one of COMMA, DOT, SPACE, APOSTROPHE or UNDERSCORE.
C-Sharp [4] provides both styles (picture formatting and type specifiers). The type specifier approach is locale aware. The picture formatting only offers a COMMA as a thousands separator:
String.Format("{0:n}", 12400) ==> "12,400"
String.Format("{0:0,0}", 12400) ==> "12,400"
Common Lisp [5] uses a COLON before the ~D decimal type specifier to emit a COMMA as a thousands separator. The general form of ~D is ~mincol,padchar,commachar,commaintervalD. The padchar defaults to SPACE. The commachar defaults to COMMA. The commainterval defaults to three.
(format nil "~:D" 229345007) => "229,345,007"
- The ADA language [6] allows UNDERSCORES in its numeric literals.
Visual Basic and its brethren (like MS Excel [7]) use a completely different style and have ultra-flexible custom format specifiers like:
"_($* #,##0_)".
COBOL [8] uses picture clauses like:
PICTURE $***,**9.99CR
Java offers a Decimal.Format Class [9] that uses picture patterns (one for positive numbers and an optional one for negatives) such as: "#,##0.00;(#,##0.00)". It allows arbitrary groupings including hundreds and ten-thousands and uneven groupings. The special pattern characters are non-localized (using a DOT for a decimal separator and a COMMA for a grouping separator). The user can supply an alternate set of symbols using the formatter's DecimalFormatSymbols object.
Alternative Proposal (from Eric Smith, originally called Proposal II)
Make both the thousands separator and decimal separator user specifiable but not locale aware. For simplicity, limit the choices to a COMMA, DOT, SPACE, APOSTROPHE or UNDERSCORE. The SPACE can be either U+0020 or U+00A0.
Whenever a separator is followed by a precision, it is a decimal separator and an optional separator preceding it is a thousands separator. When the precision is absent, a lone specifier means a thousands separator:
[[fill]align][sign][#][0][width][tsep][dsep precision][type]
Examples:
format(1234, "8.1f")   -->  '  1234.0'
format(1234, "8,1f")   -->  '  1234,0'
format(1234, "8.,1f")  -->  ' 1.234,0'
format(1234, "8 ,f")   -->  ' 1 234,0'
format(1234, "8d")     -->  '    1234'
format(1234, "8,d")    -->  '   1,234'
format(1234, "8_d")    -->  '   1_234'
This proposal meets most needs, but it comes at the expense of taking a bit more effort to parse. Not every possible convention is covered, but at least one of the options (spaces or underscores) should be readable, understandable, and useful to folks from many diverse backgrounds.
As shown in the examples, the width argument means the total length including the thousands separators and decimal separators.
No change is proposed for the locale module.
The thousands separator is defined as shown above for types 'd', 'e', 'f', 'g', '%', 'E', 'G' and 'F'. To allow future extensions, it is undefined for other types: binary, octal, hex, character, etc.
The drawback to this alternative proposal is the difficulty of mentally parsing whether a single separator is a thousands separator or decimal separator. Perhaps it is too arcane to link the decimal separator with the precision specifier.
Commentary
- Some commenters do not like the idea of format strings at all and find them to be unreadable. Suggested alternatives include the COBOL style PICTURE approach or a convenience function with keyword arguments for every possible combination.
- Some newsgroup respondents think there is no place for any scripts that are not internationalized and that it is a step backwards to provide a simple way to hardwire a particular choice (thus reducing incentive to use a locale sensitive approach).
- Another thought is that embedding some particular convention in individual format strings makes it hard to change that convention later. No workable alternative was suggested but the general idea is to set the convention once and have it apply everywhere (others commented that locale already provides a way to do this).
- There are some precedents for grouping digits in the fractional part of a floating point number, but this PEP does not venture into that territory. Only digits to the left of the decimal point are grouped. This does not preclude future extensions; it just focuses on a single, generally useful extension to the formatting language.
- James Knight observed that Indian/Pakistani numbering systems group by hundreds. Ben Finney noted that Chinese group by ten-thousands. Eric Smith pointed out that these are already handled by the "n" specifier in the locale module (albeit only for integers). This PEP does not attempt to support all of those possibilities. It focuses on a single, relatively common grouping convention that offers a quick way to improve readability in many (though not all) contexts.
References
| [1] | http://www.python.org/doc/2.6.1/library/locale.html#background-details-hints-tips-and-caveats |
| [2] | http://babel.edgewall.org/ |
| [3] | http://www.python.org/doc/2.6.1/library/string.html#formatstrings |
| [4] | http://blog.stevex.net/index.php/string-formatting-in-csharp/ |
| [5] | http://www.cs.cmu.edu/Groups/AI/html/cltl/clm/node200.html |
| [6] | http://archive.adaic.com/standards/83lrm/html/lrm-02-04.html |
| [7] | http://www.brainbell.com/tutorials/ms-office/excel/Create_Custom_Number_Formats.htm |
| [8] | http://en.wikipedia.org/wiki/Cobol#Syntactic_features |
| [9] | http://java.sun.com/javase/6/docs/api/java/text/DecimalFormat.html |
Copyright
This document has been placed in the public domain.
pep-0379 Adding an Assignment Expression
| PEP: | 379 |
|---|---|
| Title: | Adding an Assignment Expression |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Jervis Whitley <jervisau at gmail.com> |
| Status: | Withdrawn |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 14-Mar-2009 |
| Python-Version: | 2.7, 3.2 |
| Post-History: |
Abstract
This PEP adds a new assignment expression to the Python language
to make it possible to assign the result of an expression in
almost any place. The new expression will allow the assignment of
the result of an expression at first use (in a comparison for
example).
Motivation and Summary
Issue1714448 "if something as x:" [1] describes a feature to allow
assignment of the result of an expression in an if statement to a
name. It suggested that the 'as' syntax could be borrowed for this
purpose. Often it is not the expression itself that is
interesting, but rather one of the terms that make up the
expression. To be clear, something like this:
if (f_result() == [1, 2, 3]) as res:
seems awfully limited, when this:
if (f_result() as res) == [1, 2, 3]:
is probably the desired result.
Use Cases
See the Examples section near the end.
Specification
A new expression is proposed with the (nominal) syntax:
EXPR -> VAR
This single expression does the following:
- Evaluate the value of EXPR, an arbitrary expression;
- Assign the result to VAR, a single assignment target; and
- Leave the result of EXPR on the Top of Stack (TOS)
Here '->' (RARROW) has been used to illustrate the concept that
the result of EXPR is assigned to VAR.
The translation of the proposed syntax is:
VAR = (EXPR)
(EXPR)
The assignment target can be either an attribute, a subscript or
name:
f() -> name[0]    # where 'name' exists previously.
f() -> name.attr  # again, 'name' exists prior to this expression.
f() -> name
This expression should be available anywhere that an expression is
currently accepted.
All exceptions that are currently raised during invalid
assignments will continue to be raised when using the assignment
expression. For example, a NameError will be raised in examples
1 and 2 above if 'name' is not previously defined, and an
IndexError if index 0 is out of range.
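The proposed syntax was never implemented, but its assign-and-pass-through semantics can be sketched with a small helper; the Namespace class and arrow() function below are hypothetical, purely for illustration:

```python
class Namespace:
    """A bare object to assign attributes onto (illustration only)."""

def arrow(value, ns, name):
    """Emulate 'value -> ns.name': bind the value, then return it."""
    setattr(ns, name, value)
    return value

ns = Namespace()
# Roughly what "if (f_result() -> res) == [1, 2, 3]:" would mean:
if arrow([1, 2, 3], ns, "res") == [1, 2, 3]:
    print(ns.res)  # the assigned result remains available
```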
Examples from the Standard Library
The following two examples were chosen after a brief search
through the standard library, specifically both are from ast.py
which happened to be open at the time of the search.
Original:
def walk(node):
    from collections import deque
    todo = deque([node])
    while todo:
        node = todo.popleft()
        todo.extend(iter_child_nodes(node))
        yield node
Using assignment expression:
def walk(node):
    from collections import deque
    todo = deque([node])
    while todo:
        todo.extend(iter_child_nodes(todo.popleft() -> node))
        yield node
Original:
def get_docstring(node, clean=True):
    if not isinstance(node, (FunctionDef, ClassDef, Module)):
        raise TypeError("%r can't have docstrings"
                        % node.__class__.__name__)
    if node.body and isinstance(node.body[0], Expr) and \
       isinstance(node.body[0].value, Str):
        if clean:
            import inspect
            return inspect.cleandoc(node.body[0].value.s)
        return node.body[0].value.s
Using assignment expression:
def get_docstring(node, clean=True):
    if not isinstance(node, (FunctionDef, ClassDef, Module)):
        raise TypeError("%r can't have docstrings"
                        % node.__class__.__name__)
    if node.body -> body and isinstance(body[0] -> elem, Expr) and \
       isinstance(elem.value -> value, Str):
        if clean:
            import inspect
            return inspect.cleandoc(value.s)
        return value.s
Examples
The examples shown below highlight some of the desirable features
of the assignment expression, and some of the possible corner
cases.
1. Assignment in an if statement for use later.
def expensive():
    import time; time.sleep(1)
    return 'spam'

if expensive() -> res in ('spam', 'eggs'):
    dosomething(res)
2. Assignment in a while loop clause.
while len(expensive() -> res) == 4:
    dosomething(res)
3. Keep the iterator object from the for loop.
for ch in expensive() -> res:
    sell_on_internet(res)
4. Corner case.
for ch -> please_dont in expensive():
    pass  # who would want to do this? Not I.
References
[1] Issue1714448 "if something as x:", k0wax
http://bugs.python.org/issue1714448
Copyright
This document has been placed in the public domain.
pep-0380 Syntax for Delegating to a Subgenerator
| PEP: | 380 |
|---|---|
| Title: | Syntax for Delegating to a Subgenerator |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Gregory Ewing <greg.ewing at canterbury.ac.nz> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 13-Feb-2009 |
| Python-Version: | 3.3 |
| Post-History: | |
| Resolution: | http://mail.python.org/pipermail/python-dev/2011-June/112010.html |
Contents
Abstract
A syntax is proposed for a generator to delegate part of its operations to another generator. This allows a section of code containing 'yield' to be factored out and placed in another generator. Additionally, the subgenerator is allowed to return with a value, and the value is made available to the delegating generator.
The new syntax also opens up some opportunities for optimisation when one generator re-yields values produced by another.
PEP Acceptance
Guido officially accepted the PEP [1] on 26th June, 2011.
Motivation
A Python generator is a form of coroutine, but has the limitation that it can only yield to its immediate caller. This means that a piece of code containing a yield cannot be factored out and put into a separate function in the same way as other code. Performing such a factoring causes the called function to itself become a generator, and it is necessary to explicitly iterate over this second generator and re-yield any values that it produces.
If yielding of values is the only concern, this can be performed without much difficulty using a loop such as
for v in g:
    yield v
However, if the subgenerator is to interact properly with the caller in the case of calls to send(), throw() and close(), things become considerably more difficult. As will be seen later, the necessary code is very complicated, and it is tricky to handle all the corner cases correctly.
A new syntax will be proposed to address this issue. In the simplest use cases, it will be equivalent to the above for-loop, but it will also handle the full range of generator behaviour, and allow generator code to be refactored in a simple and straightforward way.
Proposal
The following new expression syntax will be allowed in the body of a generator:
yield from <expr>
where <expr> is an expression evaluating to an iterable, from which an iterator is extracted. The iterator is run to exhaustion, during which time it yields and receives values directly to or from the caller of the generator containing the yield from expression (the "delegating generator").
Furthermore, when the iterator is another generator, the subgenerator is allowed to execute a return statement with a value, and that value becomes the value of the yield from expression.
The full semantics of the yield from expression can be described in terms of the generator protocol as follows:
- Any values that the iterator yields are passed directly to the caller.
- Any values sent to the delegating generator using send() are passed directly to the iterator. If the sent value is None, the iterator's __next__() method is called. If the sent value is not None, the iterator's send() method is called. If the call raises StopIteration, the delegating generator is resumed. Any other exception is propagated to the delegating generator.
- Exceptions other than GeneratorExit thrown into the delegating generator are passed to the throw() method of the iterator. If the call raises StopIteration, the delegating generator is resumed. Any other exception is propagated to the delegating generator.
- If a GeneratorExit exception is thrown into the delegating generator, or the close() method of the delegating generator is called, then the close() method of the iterator is called if it has one. If this call results in an exception, it is propagated to the delegating generator. Otherwise, GeneratorExit is raised in the delegating generator.
- The value of the yield from expression is the first argument to the StopIteration exception raised by the iterator when it terminates.
- return expr in a generator causes StopIteration(expr) to be raised upon exit from the generator.
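These rules can be seen in a minimal runnable example (Python 3.3+): values yielded by the subgenerator pass straight through to the caller, send() reaches the subgenerator directly, and the subgenerator's return value becomes the value of the yield from expression.

```python
def subgen():
    received = yield "first"   # a send() from the caller lands here
    yield received             # passed straight back out
    return "sub result"        # becomes the value of 'yield from'

def delegator():
    result = yield from subgen()
    yield "got: " + result

g = delegator()
print(next(g))       # first
print(g.send("hi"))  # hi
print(next(g))       # got: sub result
```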
Enhancements to StopIteration
For convenience, the StopIteration exception will be given a value attribute that holds its first argument, or None if there are no arguments.
Formal Semantics
Python 3 syntax is used in this section.
The statement
RESULT = yield from EXPR
is semantically equivalent to
_i = iter(EXPR)
try:
    _y = next(_i)
except StopIteration as _e:
    _r = _e.value
else:
    while 1:
        try:
            _s = yield _y
        except GeneratorExit as _e:
            try:
                _m = _i.close
            except AttributeError:
                pass
            else:
                _m()
            raise _e
        except BaseException as _e:
            _x = sys.exc_info()
            try:
                _m = _i.throw
            except AttributeError:
                raise _e
            else:
                try:
                    _y = _m(*_x)
                except StopIteration as _e:
                    _r = _e.value
                    break
        else:
            try:
                if _s is None:
                    _y = next(_i)
                else:
                    _y = _i.send(_s)
            except StopIteration as _e:
                _r = _e.value
                break
RESULT = _r
In a generator, the statement
return value
is semantically equivalent to
raise StopIteration(value)
except that, as currently, the exception cannot be caught by except clauses within the returning generator.
The StopIteration exception behaves as though defined thusly:
class StopIteration(Exception):

    def __init__(self, *args):
        if len(args) > 0:
            self.value = args[0]
        else:
            self.value = None
        Exception.__init__(self, *args)
Rationale
The Refactoring Principle
The rationale behind most of the semantics presented above stems from the desire to be able to refactor generator code. It should be possible to take a section of code containing one or more yield expressions, move it into a separate function (using the usual techniques to deal with references to variables in the surrounding scope, etc.), and call the new function using a yield from expression.
The behaviour of the resulting compound generator should be, as far as reasonably practicable, the same as the original unfactored generator in all situations, including calls to __next__(), send(), throw() and close().
The semantics in the case of subiterators other than generators have been chosen as a reasonable generalization of the generator case.
The proposed semantics have the following limitations with regard to refactoring:
- A block of code that catches GeneratorExit without subsequently re-raising it cannot be factored out while retaining exactly the same behaviour.
- Factored code may not behave the same way as unfactored code if a StopIteration exception is thrown into the delegating generator.
With use cases for these being rare to non-existent, it was not considered worth the extra complexity required to support them.
Finalization
There was some debate as to whether explicitly finalizing the delegating generator by calling its close() method while it is suspended at a yield from should also finalize the subiterator. An argument against doing so is that it would result in premature finalization of the subiterator if references to it exist elsewhere.
Consideration of non-refcounting Python implementations led to the decision that this explicit finalization should be performed, so that explicitly closing a factored generator has the same effect as doing so to an unfactored one in all Python implementations.
The assumption made is that, in the majority of use cases, the subiterator will not be shared. The rare case of a shared subiterator can be accommodated by means of a wrapper that blocks throw() and close() calls, or by using a means other than yield from to call the subiterator.
Generators as Threads
A motivation for generators being able to return values concerns the use of generators to implement lightweight threads. When using generators in that way, it is reasonable to want to spread the computation performed by the lightweight thread over many functions. One would like to be able to call a subgenerator as though it were an ordinary function, passing it parameters and receiving a returned value.
Using the proposed syntax, a statement such as
y = f(x)
where f is an ordinary function, can be transformed into a delegation call
y = yield from g(x)
where g is a generator. One can reason about the behaviour of the resulting code by thinking of g as an ordinary function that can be suspended using a yield statement.
When using generators as threads in this way, typically one is not interested in the values being passed in or out of the yields. However, there are use cases for this as well, where the thread is seen as a producer or consumer of items. The yield from expression allows the logic of the thread to be spread over as many functions as desired, with the production or consumption of items occurring in any subfunction, and the items are automatically routed to or from their ultimate source or destination.
Concerning throw() and close(), it is reasonable to expect that if an exception is thrown into the thread from outside, it should first be raised in the innermost generator where the thread is suspended, and propagate outwards from there; and that if the thread is terminated from outside by calling close(), the chain of active generators should be finalised from the innermost outwards.
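This style can be sketched with a trivial scheduler driving one lightweight thread to completion; the names below (step_work, task, run) are illustrative, with run() playing the role of the thread framework:

```python
def step_work(n):
    """A subtask: suspend n times, then return a result."""
    for _ in range(n):
        yield              # suspension point for the scheduler
    return n * 10

def task():
    a = yield from step_work(2)   # called like an ordinary function
    b = yield from step_work(1)
    return a + b

def run(gen):
    """Drive a generator-thread to completion, collecting its result."""
    try:
        while True:
            next(gen)
    except StopIteration as e:
        return e.value

print(run(task()))  # 30
```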
Syntax
The particular syntax proposed has been chosen as suggestive of its meaning, while not introducing any new keywords and clearly standing out as being different from a plain yield.
Optimisations
Using a specialised syntax opens up possibilities for optimisation when there is a long chain of generators. Such chains can arise, for instance, when recursively traversing a tree structure. The overhead of passing __next__() calls and yielded values down and up the chain can cause what ought to be an O(n) operation to become, in the worst case, O(n**2).
A possible strategy is to add a slot to generator objects to hold a generator being delegated to. When a __next__() or send() call is made on the generator, this slot is checked first, and if it is nonempty, the generator that it references is resumed instead. If it raises StopIteration, the slot is cleared and the main generator is resumed.
This would reduce the delegation overhead to a chain of C function calls involving no Python code execution. A possible enhancement would be to traverse the whole chain of generators in a loop and directly resume the one at the end, although the handling of StopIteration is then more complicated.
Use of StopIteration to return values
There are a variety of ways that the return value from the generator could be passed back. Some alternatives include storing it as an attribute of the generator-iterator object, or returning it as the value of the close() call to the subgenerator. However, the proposed mechanism is attractive for a couple of reasons:
- Using a generalization of the StopIteration exception makes it easy for other kinds of iterators to participate in the protocol without having to grow an extra attribute or a close() method.
- It simplifies the implementation, because the point at which the return value from the subgenerator becomes available is the same point at which the exception is raised. Delaying until any later time would require storing the return value somewhere.
Rejected Ideas
Some ideas were discussed but rejected.
Suggestion: There should be some way to prevent the initial call to __next__(), or substitute it with a send() call with a specified value, the intention being to support the use of generators wrapped so that the initial __next__() is performed automatically.
Resolution: Outside the scope of the proposal. Such generators should not be used with yield from.
Suggestion: If closing a subiterator raises StopIteration with a value, return that value from the close() call to the delegating generator.
The motivation for this feature is so that the end of a stream of values being sent to a generator can be signalled by closing the generator. The generator would catch GeneratorExit, finish its computation and return a result, which would then become the return value of the close() call.
Resolution: This usage of close() and GeneratorExit would be incompatible with their current role as a bail-out and clean-up mechanism. It would require that when closing a delegating generator, after the subgenerator is closed, the delegating generator be resumed instead of re-raising GeneratorExit. But this is not acceptable, because it would fail to ensure that the delegating generator is finalised properly in the case where close() is being called for cleanup purposes.
Signalling the end of values to a consumer is better addressed by other means, such as sending in a sentinel value or throwing in an exception agreed upon by the producer and consumer. The consumer can then detect the sentinel or exception and respond by finishing its computation and returning normally. Such a scheme behaves correctly in the presence of delegation.
Suggestion: If close() is not to return a value, then raise an exception if StopIteration with a non-None value occurs.
Resolution: No clear reason to do so. Ignoring a return value is not considered an error anywhere else in Python.
Criticisms
Under this proposal, the value of a yield from expression would be derived in a very different way from that of an ordinary yield expression. This suggests that some other syntax not containing the word yield might be more appropriate, but no acceptable alternative has so far been proposed. Rejected alternatives include call, delegate and gcall.
It has been suggested that some mechanism other than return in the subgenerator should be used to establish the value returned by the yield from expression. However, this would interfere with the goal of being able to think of the subgenerator as a suspendable function, since it would not be able to return values in the same way as other functions.
The use of an exception to pass the return value has been criticised as an "abuse of exceptions", without any concrete justification of this claim. In any case, this is only one suggested implementation; another mechanism could be used without losing any essential features of the proposal.
It has been suggested that a different exception, such as GeneratorReturn, should be used instead of StopIteration to return a value. However, no convincing practical reason for this has been put forward, and the addition of a value attribute to StopIteration mitigates any difficulties in extracting a return value from a StopIteration exception that may or may not have one. Also, using a different exception would mean that, unlike ordinary functions, 'return' without a value in a generator would not be equivalent to 'return None'.
Alternative Proposals
Proposals along similar lines have been made before, some using the syntax yield * instead of yield from. While yield * is more concise, it could be argued that it looks too similar to an ordinary yield and the difference might be overlooked when reading code.
To the author's knowledge, previous proposals have focused only on yielding values, and thereby suffered from the criticism that the two-line for-loop they replace is not sufficiently tiresome to write to justify a new syntax. By dealing with the full generator protocol, this proposal provides considerably more benefit.
Additional Material
Some examples of the use of the proposed syntax are available, and also a prototype implementation based on the first optimisation outlined above.
Examples and Implementation [2]
A version of the implementation updated for Python 3.3 is available from tracker issue #11682 [3]
References
| [1] | http://mail.python.org/pipermail/python-dev/2011-June/112010.html |
| [2] | http://www.cosc.canterbury.ac.nz/greg.ewing/python/yield-from/ |
| [3] | http://bugs.python.org/issue11682 |
Copyright
This document has been placed in the public domain.
pep-0381 Mirroring infrastructure for PyPI
| PEP: | 381 |
|---|---|
| Title: | Mirroring infrastructure for PyPI |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Tarek Ziadé <tarek at ziade.org>, Martin v. Löwis <martin at v.loewis.de> |
| Status: | Draft |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 21-March-2009 |
| Python-Version: | N.A. |
| Post-History: |
Contents
Abstract
This PEP describes a mirroring infrastructure for PyPI.
Rationale
PyPI hosts over 6000 projects and is used on a daily basis by people to build applications. Systems like easy_install and zc.buildout in particular make intensive use of PyPI.
For people making intensive use of PyPI, it can act as a single point of failure. People have started to set up mirrors, both private and public. Those mirrors are active mirrors, meaning that they poll PyPI to stay synchronized.
In order to make the system more reliable, this PEP describes:
- the mirror listing and registering at PyPI
- the pages a public mirror should maintain. These pages will be used by PyPI, in order to get hit counts and the last modified date.
- how a mirror should synchronize with PyPI
- how a client can implement a fail-over mechanism
Mirror listing and registering
People who want to mirror PyPI make a proposal on catalog-SIG. When a mirror is proposed on the mailing list, it is manually added to the mirror list in the PyPI application, after it has been checked for compliance with the mirroring rules.
The mirror list is provided as a list of host names of the form
X.pypi.python.org
The values of X are the sequence a,b,c,...,aa,ab,... a.pypi.python.org is the master server; the mirrors start with b. A CNAME record last.pypi.python.org points to the last host name. Mirror operators should use a static address, and report planned changes to that address in advance to distutils-sig.
The new mirror also appears at http://pypi.python.org/mirrors which is a human-readable page that gives the list of mirrors. This page also explains how to register a new mirror.
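The prefix sequence can be expanded programmatically; the function below is an illustrative sketch using the hostname pattern specified above:

```python
import itertools
import string

def mirror_hosts(last, domain="pypi.python.org"):
    """Yield mirror host names b.<domain>, c.<domain>, ... up to and
    including <last>.<domain>, following the a, b, ..., z, aa, ab, ...
    sequence; 'a' is the master server and is skipped."""
    for n in itertools.count(1):
        for letters in itertools.product(string.ascii_lowercase, repeat=n):
            name = "".join(letters)
            if name != "a":
                yield "%s.%s" % (name, domain)
            if name == last:
                return

print(list(mirror_hosts("d")))
# ['b.pypi.python.org', 'c.pypi.python.org', 'd.pypi.python.org']
```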
Statistics page
PyPI provides statistics on downloads at /stats. This page is calculated daily by PyPI, by reading all mirrors' local stats and summing them.
The stats are presented in daily or monthly files, under /stats/days and /stats/months. Each file is a bzip2 file named using these formats:
- YYYY-MM-DD.bz2 for daily files
- YYYY-MM.bz2 for monthly files
Examples:
- /stats/days/2008-11-06.bz2
- /stats/days/2008-11-07.bz2
- /stats/days/2008-11-08.bz2
- /stats/months/2008-11.bz2
- /stats/months/2008-10.bz2
Mirror Authenticity
With a distributed mirroring system, clients may want to verify that the mirrored copies are authentic. There are multiple threats to consider:
- the central index may get compromised
- the central index is assumed to be trusted, but the mirrors might be tampered with.
- a man-in-the-middle between the central index and the end user, or between a mirror and the end user, might tamper with datagrams.
This specification only deals with the second threat. Some provisions are made to detect man-in-the-middle attacks. To detect the first attack, package authors need to sign their packages using PGP keys, so that users can verify that the package comes from the author they trust.
The central index provides a DSA key at the URL /serverkey, in the PEM format as generated by "openssl dsa -pubout" (i.e. RFC 3280 SubjectPublicKeyInfo, with the algorithm 1.3.14.3.2.12). This URL must not be mirrored, and clients must fetch the official serverkey from PyPI directly, or use the copy that came with the PyPI client software. Mirrors should still download the key, to detect a key rollover.
For each package, a mirrored signature is provided at /serversig/<package>. This is the DSA signature of the parallel URL /simple/<package>, in DER form, using SHA-1 with DSA (i.e. as a RFC 3279 Dsa-Sig-Value, created by algorithm 1.2.840.10040.4.3)
Clients using a mirror need to perform the following steps to verify a package:
- download the /simple page, and compute its SHA-1 hash
- compute the DSA signature of that hash
- download the corresponding /serversig, and compare it (byte-for-byte) with the value computed in step 2.
- compute and verify (against the /simple page) the MD5 hashes of all files they download from the mirror.
An implementation of the verification algorithm is available from https://svn.python.org/packages/trunk/pypi/tools/verify.py
Verification is not needed when downloading from the central index, and should be avoided there to reduce computation overhead.
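The steps above can be sketched as follows. The standard library has no DSA support, so the signature check is represented by a caller-supplied verify_dsa callable (a stand-in for a real crypto library); the second helper covers the MD5 check in step 4:

```python
import hashlib

def verify_simple_page(simple_html, serversig, verify_dsa):
    """Steps 1-3: hash the /simple page with SHA-1 and check the
    mirrored /serversig bytes against it. verify_dsa(digest, sig)
    is a stand-in for a real DSA verification routine."""
    digest = hashlib.sha1(simple_html).digest()
    return verify_dsa(digest, serversig)

def verify_download(data, expected_md5_hex):
    """Step 4: check a downloaded file against the MD5 hash
    advertised on the /simple page."""
    return hashlib.md5(data).hexdigest() == expected_md5_hex
```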
About once a year, the key will be replaced with a new one. Mirrors will have to re-fetch all /serversig pages. Clients using mirrors need to find a trusted copy of the new server key. One way to obtain one is to download it from https://pypi.python.org/serverkey. To detect man-in-the-middle attacks, clients need to verify the SSL server certificate, which will be signed by the CACert authority.
Special pages a mirror needs to provide
A mirror is a subset copy of PyPI, and provides the same structure by copying it:
- simple: rest version of the package index
- packages: packages, stored by Python version, and letters
- serversig: signatures for the simple pages
It also needs to provide two specific elements:
- last-modified
- local-stats
Last modified date
CPAN uses a freshness date system whereby each mirror's last synchronisation date is made available.
For PyPI, each mirror needs to maintain a URL with simple text content that represents its last synchronisation date.
The date is provided in GMT time, using the ISO 8601 format [3]. Each mirror is responsible for maintaining its own last modified date.
This page must be located at /last-modified and must be a text/plain page.
Local statistics
Each mirror is responsible for counting all the downloads made through it. This is used by PyPI to sum up all downloads and display the grand total.
These statistics are in CSV-like form, with a header in the first line. They need to obey PEP 305 [1]. Basically, they should be readable by Python's csv module.
The fields in this file are:
- package: the distutils id of the package.
- filename: the filename that has been downloaded.
- useragent: the User-Agent of the client that has downloaded the package.
- count: the number of downloads.
The content will look like this:
# package,filename,useragent,count
zc.buildout,zc.buildout-1.6.0.tgz,MyAgent,142
...
The counting starts the day the mirror is launched, and there is one file per day, compressed using the bzip2 format. Each file is named after the day it covers; for example, 2008-11-06.bz2 is the file for the 6th of November 2008.
They are then provided in a folder called days. For example:
- /local-stats/days/2008-11-06.bz2
- /local-stats/days/2008-11-07.bz2
- /local-stats/days/2008-11-08.bz2
This page must be located at /local-stats.
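Reading these files back is straightforward with the standard library; the sketch below sums a daily file into per-file totals, assuming the '# '-prefixed header line shown above:

```python
import bz2
import csv

def read_local_stats(path):
    """Sum a daily local-stats file (e.g. 2008-11-06.bz2) into a
    (package, filename) -> download-count mapping, adding counts
    across user agents."""
    totals = {}
    with bz2.open(path, "rt", newline="") as f:
        # The header line is prefixed with '# ', which csv would
        # otherwise treat as part of the first field name.
        header = f.readline().lstrip("# ").strip().split(",")
        for row in csv.DictReader(f, fieldnames=header):
            key = (row["package"], row["filename"])
            totals[key] = totals.get(key, 0) + int(row["count"])
    return totals
```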
How a mirror should synchronize with PyPI
A mirroring protocol called Simple Index was described and implemented by Martin v. Loewis and Jim Fulton, based on how easy_install works. This section synthesizes it and gives a few relevant links, plus a short note about the User-Agent header.
The mirroring protocol
Mirrors must reduce the amount of data transferred between the central server and the mirror. To achieve that, they MUST use the changelog() PyPI XML-RPC call, and only refetch the packages that have been changed since the last time. For each package P, they MUST copy the documents /simple/P/ and /serversig/P. If a package is deleted on the central server, they MUST delete the package and all associated files. To detect modification of package files, they MAY cache the file's ETag, and MAY request skipping it using the If-None-Match header.
Each mirroring tool MUST identify itself using a descriptive User-Agent header.
The pep381client package [2] provides an application that respects this protocol to browse PyPI.
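The first step of the sync loop described above can be sketched as collecting changed package names via the changelog() call. The entry layout (name, version, timestamp, action) follows PyPI's historical XML-RPC API; `index` would typically be an xmlrpc.client.ServerProxy:

```python
import xmlrpc.client

def changed_package_names(index, since):
    """Return the names of packages changed since `since` (a Unix
    timestamp), to be refetched as /simple/P/ and /serversig/P.
    `index` is an xmlrpc.client.ServerProxy for the central server,
    or any object with a compatible changelog() method."""
    names = set()
    for name, version, timestamp, action in index.changelog(since):
        names.add(name)
    return names
```

In a real mirror, the collected names drive the refetch of /simple/P/ and /serversig/P and the deletion of removed packages.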
User-agent request header
In order to be able to differentiate actions taken by clients over PyPI, a specific user agent name should be provided by all mirroring software.
This is also true for all clients like:
XXX user agent registering mechanism at PyPI ?
How a client can use PyPI and its mirrors
Clients that are browsing PyPI should be able to use alternative mirrors, by getting the list of the mirrors using last.pypi.python.org.
Code example:
>>> import socket
>>> socket.gethostbyname_ex('last.pypi.python.org')[0]
'h.pypi.python.org'
The clients so far that could use this mechanism:
- setuptools
- zc.buildout (through setuptools)
- pip
Fail-over mechanism
Clients that are browsing PyPI should be able to use a fail-over mechanism when PyPI or the used mirror is not responding.
It is up to the client to decide which mirror should be used, perhaps by looking at its geographical location and its responsiveness.
This PEP does not describe how this fail-over mechanism should work, but it is strongly encouraged that the clients try to use the nearest mirror.
The clients so far that could use this mechanism:
- setuptools
- zc.buildout (through setuptools)
- pip
Extra package indexes
It is obvious that some packages will not be uploaded to PyPI, either because they are private or because the project maintainer runs their own server where people can get the project's packages. However, it is strongly encouraged that a public package index follow the PyPI and Distutils protocols.
In other words, the register and upload command should be compatible with any package index server out there.
Software that is compatible with PyPI and Distutils so far includes PloneSoftwareCenter [7] and EggBasket [8].
An extra package index is not a mirror of PyPI, but can have some mirrors itself.
Merging several indexes
When a client needs to get some packages from several distinct indexes, it should be able to use each one of them as a potential source of packages. Different indexes should be defined as a sorted list for the client to look for a package.
Each independent index can of course provide a list of its mirrors.
XXX define how to get the hostname for the mirrors of an arbitrary index.
That permits all combinations at the client level, for a reliable packaging system with all levels of privacy.
It is up to the client to deal with the merging.
References
| [1] | http://www.python.org/dev/peps/pep-0305/#id19 |
| [2] | http://pypi.python.org/pypi/pep381client |
| [3] | http://en.wikipedia.org/wiki/ISO_8601 |
| [4] | http://pypi.python.org/pypi/zc.buildout |
| [5] | http://pypi.python.org/pypi/setuptools |
| [6] | http://pypi.python.org/pypi/pip |
| [7] | http://plone.org/products/plonesoftwarecenter |
| [8] | http://www.chrisarndt.de/projects/eggbasket |
Acknowledgments
Georg Brandl.
Copyright
This document has been placed in the public domain.
pep-0382 Namespace Packages
| PEP: | 382 |
|---|---|
| Title: | Namespace Packages |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Martin v. Löwis <martin at v.loewis.de> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 02-Apr-2009 |
| Python-Version: | 3.2 |
| Post-History: |
Contents
Rejection Notice
On the first day of sprints at US PyCon 2012 we had a long and fruitful discussion about PEP 382 and PEP 402. We ended up rejecting both, but a new PEP will be written to carry on in the spirit of PEP 402. Martin von Löwis wrote up a summary: [2].
Abstract
Namespace packages are a mechanism for splitting a single Python package across multiple directories on disk. In current Python versions, an algorithm to compute the package's __path__ must be formulated. With the enhancement proposed here, the import machinery itself will construct the list of directories that make up the package. An implementation of this PEP is available at [1].
Terminology
Within this PEP, the term package refers to Python packages as defined by Python's import statement. The term distribution refers to separately installable sets of Python modules as stored in the Python package index, and installed by distutils or setuptools. The term vendor package refers to groups of files installed by an operating system's packaging mechanism (e.g. Debian or Redhat packages install on Linux systems).
The term portion refers to a set of files in a single directory (possibly stored in a zip file) that contribute to a namespace package.
Namespace packages today
Python currently provides the pkgutil.extend_path to denote a package as a namespace package. The recommended way of using it is to put:
from pkgutil import extend_path
__path__ = extend_path(__path__, __name__)
in the package's __init__.py. Every distribution needs to provide the same contents in its __init__.py, so that extend_path is invoked independently of which portion of the package gets imported first. As a consequence, the package's __init__.py cannot practically define any names, since which portion's __init__.py runs first depends on the order of the package fragments on sys.path. As a special feature, extend_path reads files named <packagename>.pkg, which allow additional portions to be declared.
setuptools provides a similar function pkg_resources.declare_namespace that is used in the form:
import pkg_resources
pkg_resources.declare_namespace(__name__)
In the portion's __init__.py, no assignment to __path__ is necessary, as declare_namespace modifies the package __path__ through sys.modules. As a special feature, declare_namespace also supports zip files, and registers the package name internally so that future additions to sys.path by setuptools can properly add additional portions to each package.
setuptools allows declaring namespace packages in a distribution's setup.py, so that distribution developers don't need to put the magic __path__ modification into __init__.py themselves.
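The effect of the extend_path boilerplate can be sketched with a small self-contained demo. Note this exercises the existing stdlib API, not the mechanism proposed by this PEP; the directory and package names below are invented for illustration.

```python
import os
import sys
import tempfile
from pkgutil import extend_path

# Two sys.path entries, each holding a portion of the same package
# (names "site-a", "site-b", and "demo_pkg" are made up for the demo).
base = tempfile.mkdtemp()
d1 = os.path.join(base, "site-a")
d2 = os.path.join(base, "site-b")
for d in (d1, d2):
    os.makedirs(os.path.join(d, "demo_pkg"))
    open(os.path.join(d, "demo_pkg", "__init__.py"), "w").close()

sys.path[:0] = [d1, d2]

# This is what the boilerplate line in each portion's __init__.py computes:
# every sys.path entry contributing a demo_pkg directory ends up in __path__.
merged = extend_path([], "demo_pkg")
print(merged)
```

Because every portion runs the same boilerplate, whichever portion is imported first produces the same merged __path__.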
Rationale
The current imperative approach to namespace packages has led to multiple slightly-incompatible mechanisms for providing namespace packages. For example, pkgutil supports *.pkg files; setuptools doesn't. Likewise, setuptools supports inspecting zip files, and supports adding portions to its _namespace_packages variable, whereas pkgutil doesn't.
In addition, the current approach causes problems for system vendors. Vendor packages typically must not provide overlapping files, and an attempt to install a vendor package that has a file already on disk will fail or cause unpredictable behavior. As vendors might choose to package distributions such that they will all end up in a single directory for the namespace package, all portions would contribute conflicting __init__.py files.
Specification
Rather than using an imperative mechanism for importing packages, a declarative approach is proposed here: A directory whose name ends with .pyp (for Python package) contains a portion of a package.
The import statement is extended so that it computes the package's __path__ attribute for a package named P as consisting of, optionally, a single directory named P containing a file __init__.py, plus all directories named P.pyp, in the order in which they are found in the parent package's __path__ (or sys.path). If either of these is found, the search for additional portions of the package continues.
A directory may contain a package in both the P/__init__.py form and the P.pyp form.
No other change to the import mechanism is made; the search for modules (including __init__.py) will continue to stop at the first module encountered. In summary, the process of importing a package foo works like this:
- sys.path is searched for directories foo or foo.pyp, or a file foo.<ext>. If a file is found and no directory, it is treated as a module, and imported.
- If a directory foo is found, a check is made whether it contains __init__.py. If so, the location of the __init__.py is remembered. Otherwise, the directory is skipped. Once an __init__.py is found, further directories called foo are skipped.
- For both directories foo and foo.pyp, the directories are added to the package's __path__.
- If an __init__ module was found, it is imported, with __path__ initialized to the path computed from all .pyp directories.
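The steps above can be sketched as a standalone helper. This is a simplified illustration of the proposed algorithm, not the actual import machinery; the function and directory names are invented.

```python
import os
import tempfile

def compute_pep382_path(name, search_path):
    """Simplified sketch of the proposed __path__ computation: at most
    one directory `name` containing __init__.py, plus every directory
    named `name + ".pyp"`, in search-path order."""
    init_location = None
    package_path = []
    for entry in search_path:
        regular = os.path.join(entry, name)
        # Once an __init__.py is found, further `name` directories are skipped.
        if init_location is None and os.path.isfile(
                os.path.join(regular, "__init__.py")):
            init_location = os.path.join(regular, "__init__.py")
            package_path.append(regular)
        portion = os.path.join(entry, name + ".pyp")
        if os.path.isdir(portion):
            package_path.append(portion)
    return init_location, package_path

# Demo layout: entry1 holds a portion foo.pyp, entry2 holds foo/__init__.py.
base = tempfile.mkdtemp()
e1 = os.path.join(base, "entry1")
e2 = os.path.join(base, "entry2")
os.makedirs(os.path.join(e1, "foo.pyp"))
os.makedirs(os.path.join(e2, "foo"))
open(os.path.join(e2, "foo", "__init__.py"), "w").close()

init, path = compute_pep382_path("foo", [e1, e2])
print(init)
print(path)
```

Both the .pyp portion and the regular package directory end up on the computed __path__, in the order the entries are searched.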
Impact on Import Hooks
Both loaders and finders as defined in PEP 302 will need to be changed to support namespace packages. Failure to conform to the protocol below might cause a package not to be recognized as a namespace package; loaders and finders not supporting this protocol must raise AttributeError when the functions below are accessed.
Finders need to support looking for *.pyp directories in step 1 of the above algorithm. To do so, a finder used as a path hook must support a method:
finder.find_package_portion(fullname)
This method will be called in the same manner as find_module, and it must return a string to be added to the package's __path__. If the finder doesn't find a portion of the package, it shall return None. Raising AttributeError from the above call will be treated as non-conformance with this PEP, and the exception will be ignored. All other exceptions are reported.
A finder may report both success from find_module and from find_package_portion, allowing for both a package containing an __init__.py and a portion of the same package.
All strings returned from find_package_portion, along with all path names of .pyp directories are added to the new package's __path__.
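A finder supporting the proposed protocol might look like the following sketch. This is hypothetical: find_package_portion was never added to Python (the PEP was rejected, as noted above), and the class and directory names are invented.

```python
import os
import tempfile

class PypPortionFinder:
    """Sketch of a path-hook finder implementing the proposed
    find_package_portion() protocol described above (hypothetical;
    this method does not exist in any released Python)."""

    def __init__(self, directory):
        # One finder instance per sys.path entry, as with PEP 302 path hooks.
        self.directory = directory

    def find_package_portion(self, fullname):
        # Called like find_module(); returns a string to be added to the
        # package's __path__, or None if this entry holds no portion.
        tail = fullname.rpartition(".")[2]
        portion = os.path.join(self.directory, tail + ".pyp")
        if os.path.isdir(portion):
            return portion
        return None

# Demo: a path entry containing a portion foo.pyp, but nothing for bar.
entry = tempfile.mkdtemp()
os.makedirs(os.path.join(entry, "foo.pyp"))
finder = PypPortionFinder(entry)
print(finder.find_package_portion("foo"))
print(finder.find_package_portion("bar"))
```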
Discussion
Original versions of this specification proposed the addition of *.pth files, similar to the way those files are used on sys.path. With a wildcard marker (*), a package could indicate that the entire path is derived by looking at the parent path, searching for properly-named subdirectories.
People then observed that support for the full .pth syntax is inappropriate, and the .pth files were changed to be mere marker files, indicating that a directory is a package. Peter Tröger suggested that .pth is an unsuitable file extension, as all file extensions related to Python should start with .py. Therefore, the marker file was renamed to .pyp.
Dinu Gherman then observed that a marker file is not necessary, and that a directory extension could well serve as the marker itself. This is what this PEP currently proposes.
Phillip Eby designed PEP 402 as an alternative approach to this PEP, after comparing Python's package syntax with that found in other languages. PEP 402 proposes not to use a marker file at all. At the discussion at PyCon DE 2011, people remarked that having an explicit declaration of a directory as contributing to a package is a desirable property, rather than an obstacle. In particular, Jython developers noticed that Jython could easily mistake a directory that is a Java package for a Python package, if there were no need to declare Python packages.
Packages can stop filling out the namespace package's __init__.py. As a consequence, extend_path and declare_namespace become obsolete.
Namespace packages can start providing non-trivial __init__.py implementations; to do so, it is recommended that a single distribution provides a portion with just the namespace package's __init__.py (and potentially other modules that belong to the namespace package proper).
The mechanism is mostly compatible with the existing namespace mechanisms. extend_path will be adjusted to this specification; any other mechanism might cause portions to get added twice to __path__.
References
| [1] | PEP 382 branch (http://hg.python.org/features/pep-382-2#pep-382) |
| [2] | Namespace Packages resolution (http://mail.python.org/pipermail/import-sig/2012-March/000421.html) |
Copyright
This document has been placed in the public domain.
pep-0383 Non-decodable Bytes in System Character Interfaces
| PEP: | 383 |
|---|---|
| Title: | Non-decodable Bytes in System Character Interfaces |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Martin v. Löwis <martin at v.loewis.de> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 22-Apr-2009 |
| Python-Version: | 3.1 |
| Post-History: |
Abstract
File names, environment variables, and command line arguments are defined as being character data in POSIX; the C APIs however allow passing arbitrary bytes - whether these conform to a certain encoding or not. This PEP proposes a means of dealing with such irregularities by embedding the bytes in character strings in such a way that allows recreation of the original byte string.
Rationale
The C char type is a data type that is commonly used to represent both character data and bytes. Certain POSIX interfaces are specified and widely understood as operating on character data, however, the system call interfaces make no assumption on the encoding of these data, and pass them on as-is. With Python 3, character strings use a Unicode-based internal representation, making it difficult to ignore the encoding of byte strings in the same way that the C interfaces can ignore the encoding.
On the other hand, Microsoft Windows NT has corrected the original design limitation of Unix, and made it explicit in its system interfaces that these data (file names, environment variables, command line arguments) are indeed character data, by providing a Unicode-based API (keeping a C-char-based one for backwards compatibility).
For Python 3, one proposed solution is to provide two sets of APIs: a byte-oriented one, and a character-oriented one, where the character-oriented one would be limited in not being able to represent all data accurately. Unfortunately, on Windows the situation is exactly the opposite: the byte-oriented interface cannot represent all data; only the character-oriented API can. As a consequence, libraries and applications that want to support all user data in a cross-platform manner have to accept a mish-mash of bytes and characters, exactly in the way that caused endless trouble for Python 2.x.
With this PEP, a uniform treatment of these data as characters becomes possible. The uniformity is achieved by using specific encoding algorithms, meaning that the data can be converted back to bytes on POSIX systems only if the same encoding is used.
Being able to treat such strings uniformly will allow application writers to abstract from details specific to the operating system, and reduces the risk of one API failing when the other API would have worked.
Specification
On Windows, Python uses the wide character APIs to access character-oriented APIs, allowing direct conversion of the environmental data to Python str objects ([1]).
On POSIX systems, Python currently applies the locale's encoding to convert the byte data to Unicode, failing for characters that cannot be decoded. With this PEP, non-decodable bytes >= 128 will be represented as lone surrogate codes U+DC80..U+DCFF. Bytes below 128 will produce exceptions; see the discussion below.
To convert non-decodable bytes, a new error handler ([2]) "surrogateescape" is introduced, which produces these surrogates. On encoding, the error handler converts the surrogate back to the corresponding byte. This error handler will be used in any API that receives or produces file names, command line arguments, or environment variables.
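A minimal round trip through the handler (available since Python 3.1) shows the mechanism:

```python
# A byte string that is not valid UTF-8, e.g. a Latin-1 encoded file name.
raw = b"caf\xe9"

# Non-decodable bytes >= 128 become lone surrogates U+DC80..U+DCFF.
text = raw.decode("utf-8", "surrogateescape")
print(ascii(text))  # 'caf\udce9'

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", "surrogateescape")
print(restored)  # b'caf\xe9'
```

The same byte string encoded with the strict handler would instead raise UnicodeEncodeError, which is the limitation discussed below.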
The error handler interface is extended to allow the encode error handler to return byte strings immediately, in addition to returning Unicode strings which then get encoded again (also see the discussion below).
Byte-oriented interfaces that already exist in Python 3.0 are not affected by this specification. They are neither enhanced nor deprecated.
External libraries that operate on file names (such as GUI file choosers) should also encode them according to the PEP.
Discussion
This surrogateescape encoding is based on Markus Kuhn's idea that he called UTF-8b [3].
While providing a uniform API to non-decodable bytes, this interface has the limitation that the chosen representation only "works" if the data are converted back to bytes with the surrogateescape error handler. Encoding the data with the locale's encoding and the (default) strict error handler will raise an exception; encoding them with UTF-8 will produce nonsensical data.
Data obtained from other sources may conflict with data produced by this PEP. Dealing with such conflicts is out of scope of the PEP.
This PEP allows the possibility of "smuggling" bytes in character strings. This would be a security risk if the bytes are security-critical when interpreted as characters on a target system, such as path name separators. For this reason, the PEP rejects smuggling bytes below 128. If the target system uses EBCDIC, such smuggled bytes may still be a security risk, allowing smuggling of e.g. square brackets or the backslash. Python currently does not support EBCDIC, so this should not be a problem in practice. Anybody porting Python to an EBCDIC system might want to adjust the error handlers, or come up with other approaches to address the security risks.
Encodings that are not compatible with ASCII are not supported by this specification; bytes in the ASCII range that fail to decode will cause an exception. It is widely agreed that such encodings should not be used as locale charsets.
For most applications, we assume that they eventually pass data received from a system interface back into the same system interfaces. For example, an application invoking os.listdir() will likely pass the result strings back into APIs like os.stat() or open(), which then encodes them back into their original byte representation. Applications that need to process the original byte strings can obtain them by encoding the character strings with the file system encoding, passing "surrogateescape" as the error handler name. For example, a function that works like os.listdir, except for accepting and returning bytes, would be written as:
import os
import sys

def listdir_b(dirname):
    fse = sys.getfilesystemencoding()
    dirname = dirname.decode(fse, "surrogateescape")
    for fn in os.listdir(dirname):
        # fn is now a str object
        yield fn.encode(fse, "surrogateescape")
The extension to the encode error handler interface proposed by this PEP is necessary to implement the 'surrogateescape' error handler, because there are required byte sequences which cannot be generated from replacement Unicode. However, the encode error handler interface presently requires replacement Unicode to be provided in lieu of the non-encodable Unicode from the source string, and then promptly encodes that replacement Unicode. For some error handlers, such as the 'surrogateescape' handler proposed here, it is simpler and more efficient to let the error handler provide a pre-encoded replacement byte string, rather than forcing it to calculate Unicode from which the encoder would create the desired bytes.
A few alternative approaches have been proposed:
- create a new string subclass that supports embedded bytes
- use different escape schemes, such as escaping with a NUL character, or mapping to infrequent characters.
Of these proposals, the approach of escaping each byte XX with the sequence U+0000 U+00XX has the disadvantage that encoding to UTF-8 will introduce a NUL byte in the UTF-8 sequence. As a consequence, C libraries may interpret this as a string termination, even though the string continues. In particular, the gtk libraries will truncate text in this case; other libraries may show similar problems.
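The problem with the rejected NUL-escaping scheme can be shown directly. Here byte 0xE9 is smuggled as the hypothetical sequence U+0000 U+00E9:

```python
# Hypothetical escape scheme (rejected): byte 0xE9 as U+0000 U+00E9.
escaped = "caf\u0000\u00e9"
utf8 = escaped.encode("utf-8")
print(utf8)  # b'caf\x00\xc3\xa9' -- contains an embedded NUL byte

# A C library treating this as a NUL-terminated string sees only "caf",
# silently truncating the rest.
assert b"\x00" in utf8
```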
References
| [1] | PEP 277 "Unicode file name support for Windows NT" http://www.python.org/dev/peps/pep-0277/ |
| [2] | PEP 293 "Codec Error Handling Callbacks" http://www.python.org/dev/peps/pep-0293/ |
| [3] | UTF-8b http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html |
Copyright
This document has been placed in the public domain.
pep-0384 Defining a Stable ABI
| PEP: | 384 |
|---|---|
| Title: | Defining a Stable ABI |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Martin v. Löwis <martin at v.loewis.de> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 17-May-2009 |
| Python-Version: | 3.2 |
| Post-History: |
Contents
Abstract
Currently, each feature release introduces a new name for the Python DLL on Windows, and may cause incompatibilities for extension modules on Unix. This PEP proposes to define a stable set of API functions which are guaranteed to be available for the lifetime of Python 3, and which will also remain binary-compatible across versions. Extension modules and applications embedding Python can work with different feature releases as long as they restrict themselves to this stable ABI.
Rationale
The primary source of ABI incompatibility is changes to the layout of in-memory structures. For example, the way in which string interning works, and the data type used to represent the size of an object, have changed during the life of Python 2.x. As a consequence, extension modules that directly access fields of strings, lists, or tuples would break if their code were loaded into a newer version of the interpreter without recompilation: offsets of fields may have changed, making the extension modules access the wrong data.
In some cases, the incompatibilities only affect internal objects of the interpreter, such as frame or code objects. For example, the way line numbers are represented has changed in the 2.x lifetime, as has the way in which local variables are stored (due to the introduction of closures). Even though most applications probably never used these objects, changing them required changing PYTHON_API_VERSION.
On Linux, changes to the ABI are often not much of a problem: the system will provide a default Python installation, and many extension modules are already provided pre-compiled for that version. If additional modules are needed, or additional Python versions, users can typically compile them themselves on the system, resulting in modules that use the right ABI.
On Windows, multiple simultaneous installations of different Python versions are common, and extension modules are compiled by their authors, not by end users. To reduce the risk of ABI incompatibilities, Python currently introduces a new DLL name pythonXY.dll for each feature release, whether or not ABI incompatibilities actually exist.
With this PEP, it will be possible to reduce the dependency of binary extension modules on a specific Python feature release, and applications embedding Python can be made to work with different releases.
Specification
The ABI specification falls into two parts: an API specification, specifying what functions (and function groups) are available for use with the ABI, and a linkage specification, specifying what libraries to link with. The actual ABI (layout of structures in memory, function calling conventions) is not specified, but implied by the compiler. For selected platforms, a specific ABI is recommended.
During evolution of Python, new ABI functions will be added. Applications using them will then have a requirement on a minimum version of Python; this PEP provides no mechanism for such applications to fall back when the Python library is too old.
Terminology
Applications and extension modules that want to use this ABI are collectively referred to as "applications" from here on.
Header Files and Preprocessor Definitions
Applications shall only include the header file Python.h (before including any system headers), or, optionally, include pyconfig.h, and then Python.h.
During the compilation of applications, the preprocessor macro Py_LIMITED_API must be defined. Doing so will hide all definitions that are not part of the ABI.
Structures
Only the following structures and structure fields are accessible to applications:
- PyObject (ob_refcnt, ob_type)
- PyVarObject (ob_base, ob_size)
- PyMethodDef (ml_name, ml_meth, ml_flags, ml_doc)
- PyMemberDef (name, type, offset, flags, doc)
- PyGetSetDef (name, get, set, doc, closure)
- PyModuleDefBase (ob_base, m_init, m_index, m_copy)
- PyModuleDef (m_base, m_name, m_doc, m_size, m_methods, m_traverse, m_clear, m_free)
- PyStructSequence_Field (name, doc)
- PyStructSequence_Desc (name, doc, fields, sequence)
- PyType_Slot (see below)
- PyType_Spec (see below)
The accessor macros to these fields (Py_REFCNT, Py_TYPE, Py_SIZE) are also available to applications.
The following types are available, but opaque (i.e. incomplete):
- PyThreadState
- PyInterpreterState
- struct _frame
- struct symtable
- struct _node
- PyWeakReference
- PyLongObject
- PyTypeObject
Type Objects
The structure of type objects is not available to applications; declaring "static" type objects is no longer possible (for applications using this ABI). Instead, type objects are created dynamically. To allow easy creation of types (in particular, to be able to fill out function pointers easily), the following structures and functions are available:
typedef struct{
int slot; /* slot id, see below */
void *pfunc; /* function pointer */
} PyType_Slot;
typedef struct{
const char* name;
const char* doc;
int basicsize;
int itemsize;
int flags;
PyType_Slot *slots; /* terminated by slot==0. */
} PyType_Spec;
PyObject* PyType_FromSpec(PyType_Spec*);
To specify a slot, a unique slot id must be provided. New Python versions may introduce new slot ids, but slot ids will never be recycled. Slots may get deprecated, but continue to be supported throughout Python 3.x.
The slot ids are named like the field names of the structures that hold the pointers in Python 3.1, with an added Py_ prefix (i.e. Py_tp_dealloc instead of just tp_dealloc):
- tp_dealloc, tp_getattr, tp_setattr, tp_repr, tp_hash, tp_call, tp_str, tp_getattro, tp_setattro, tp_doc, tp_traverse, tp_clear, tp_richcompare, tp_iter, tp_iternext, tp_methods, tp_base, tp_descr_get, tp_descr_set, tp_init, tp_alloc, tp_new, tp_is_gc, tp_bases, tp_del
- nb_add nb_subtract nb_multiply nb_remainder nb_divmod nb_power nb_negative nb_positive nb_absolute nb_bool nb_invert nb_lshift nb_rshift nb_and nb_xor nb_or nb_int nb_float nb_inplace_add nb_inplace_subtract nb_inplace_multiply nb_inplace_remainder nb_inplace_power nb_inplace_lshift nb_inplace_rshift nb_inplace_and nb_inplace_xor nb_inplace_or nb_floor_divide nb_true_divide nb_inplace_floor_divide nb_inplace_true_divide nb_index
- sq_length sq_concat sq_repeat sq_item sq_ass_item sq_contains sq_inplace_concat sq_inplace_repeat
- mp_length mp_subscript mp_ass_subscript
The following fields cannot be set during type definition: - tp_dict tp_mro tp_cache tp_subclasses tp_weaklist tp_print - tp_weaklistoffset tp_dictoffset
typedefs
In addition to the typedefs for structs listed above, the following typedefs are available. Their inclusion in the ABI means that the underlying type must not change on a platform (even though it may differ across platforms).
- Py_uintptr_t Py_intptr_t Py_ssize_t
- unaryfunc binaryfunc ternaryfunc inquiry lenfunc ssizeargfunc ssizessizeargfunc ssizeobjargproc ssizessizeobjargproc objobjargproc objobjproc visitproc traverseproc destructor getattrfunc getattrofunc setattrfunc setattrofunc reprfunc hashfunc richcmpfunc getiterfunc iternextfunc descrgetfunc descrsetfunc initproc newfunc allocfunc
- PyCFunction PyCFunctionWithKeywords PyNoArgsFunction PyCapsule_Destructor
- getter setter
- PyOS_sighandler_t
- PyGILState_STATE
- Py_UCS4
Most notably, Py_UNICODE is not available as a typedef, since the same Python version may use different definitions of it on the same platform (depending on whether it uses narrow or wide code units). Applications that need to access the contents of a Unicode string can convert it to wchar_t.
Functions and function-like Macros
All functions starting with _Py are not available to applications (see exceptions below). Also, all functions that expect parameter types that are unavailable to applications are excluded from the ABI, such as PyAST_FromNode (which expects a node*).
All other functions are available, unless excluded below.
Function-like macros (in particular, field access macros) remain available to applications, but are replaced by function calls (unless their definition only refers to features of the ABI, such as the various _Check macros).
ABI function declarations will not change their parameters or return types. If a change to the signature becomes necessary, a new function will be introduced. If the new function is source-compatible (e.g. if just the return type changes), an alias macro may get added to redirect calls to the new function when the application is recompiled.
If continued provision of the old function is not possible, it may get deprecated, then removed, in accordance with PEP 7, causing applications that use that function to break.
Excluded Functions
Functions declared in the following header files are not part of the ABI:
- bytes_methods.h
- cellobject.h
- classobject.h
- code.h
- compile.h
- datetime.h
- dtoa.h
- frameobject.h
- funcobject.h
- genobject.h
- longintrepr.h
- parsetok.h
- pyarena.h
- pyatomic.h
- pyctype.h
- pydebug.h
- pytime.h
- symtable.h
- token.h
- ucnhash.h
In addition, functions expecting FILE* are not part of the ABI, to avoid depending on a specific version of the Microsoft C runtime DLL on Windows.
Module and type initializer functions are not available (PyByteArray_Init, PyByteArray_Fini, PyBytes_Fini, PyCFunction_Fini, PyDict_Fini, PyFloat_ClearFreeList, PyFloat_Fini, PyFrame_Fini, PyList_Fini, PyMethod_Fini, PyOS_FiniInterrupts, PySet_Fini, PyTuple_Fini).
Several functions dealing with interpreter implementation details are not available:
- PyInterpreterState_Head, PyInterpreterState_Next, PyInterpreterState_ThreadHead, PyThreadState_Next
- Py_SubversionRevision, Py_SubversionShortBranch
PyStructSequence_InitType is not available, as it requires the caller to provide a static type object.
Py_FatalError will be moved from pydebug.h into some other header file (e.g. pyerrors.h).
The exact list of functions being available is given in the Windows module definition file for python3.dll [1].
Global Variables
Global variables representing types and exceptions are available to applications. In addition, selected global variables referenced in macros (such as Py_True and Py_False) are available.
A complete list of global variable definitions is given in the python3.def file [1]; those declared DATA denote variables.
Other Macros
All macros defining symbolic constants are available to applications; the numeric values will not change.
In addition, the following macros are available:
- Py_BEGIN_ALLOW_THREADS, Py_BLOCK_THREADS, Py_UNBLOCK_THREADS, Py_END_ALLOW_THREADS
The Buffer Interface
The buffer interface (type Py_buffer, type slots bf_getbuffer and bf_releasebuffer, etc) has been omitted from the ABI, since the stability of the Py_buffer structure is not clear at this time. Inclusion in the ABI can be considered in future releases.
Signature Changes
A number of functions currently expect a specific struct, even though callers typically have PyObject* available. These have been changed to expect PyObject* as the parameter; this will cause warnings in applications that currently explicitly cast to the parameter type. These functions are PySlice_GetIndices, PySlice_GetIndicesEx, PyUnicode_AsWideChar, and PyEval_EvalCode.
Linkage
On Windows, applications shall link with python3.dll; an import library python3.lib will be available. This DLL will redirect all of its API functions through /export linker options to the full interpreter DLL, i.e. python3y.dll.
On Unix systems, the ABI is typically provided by the python executable itself. PyModule_Create is changed to pass 3 as the API version if the extension module was compiled with Py_LIMITED_API; the version check for the API version will accept either 3 or the current PYTHON_API_VERSION as conforming. If Python is compiled as a shared library, it is installed as both libpython3.so, and libpython3.y.so; applications conforming to this PEP should then link to the former (extension modules can continue to link with no libpython shared object, but rather rely on runtime linking). The ABI version is symbolically available as PYTHON_ABI_VERSION.
Also on Unix, the PEP 3149 tag abi<PYTHON_ABI_VERSION> is accepted in file names of extension modules. No checking is performed that files named in this way are actually restricted to the limited API, and no support for building such files will be added to distutils due to the distutils code freeze.
Implementation Strategy
This PEP will be implemented in a branch [2], allowing users to check whether their modules conform to the ABI. To avoid users having to rewrite their type definitions, a script to convert C source code containing type definitions will be provided [3].
References
| [1] | (1, 2) "python3 module definition file": http://svn.python.org/projects/python/branches/pep-0384/PC/python3.def |
| [2] | "PEP 384 branch": http://svn.python.org/projects/python/branches/pep-0384/ |
| [3] | "ABI type conversion script": http://svn.python.org/projects/python/branches/pep-0384/Tools/scripts/abitype.py |
Copyright
This document has been placed in the public domain.
pep-0385 Migrating from Subversion to Mercurial
| PEP: | 385 |
|---|---|
| Title: | Migrating from Subversion to Mercurial |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Dirkjan Ochtman <dirkjan at ochtman.nl>, Antoine Pitrou <solipsis at pitrou.net>, Georg Brandl <georg at python.org> |
| Status: | Final |
| Type: | Process |
| Content-Type: | text/x-rst |
| Created: | 25-May-2009 |
Contents
Motivation
After having decided to switch to the Mercurial DVCS, the actual migration still has to be performed. In the case of an important piece of infrastructure like the version control system for a large, distributed project like Python, this is a significant effort. This PEP is an attempt to describe the steps that must be taken for further discussion. It's somewhat similar to PEP 347 [3], which discussed the migration to SVN.
To make the most of hg, we would like to make a high-fidelity conversion, such that (a) as much of the svn metadata as possible is retained, and (b) all metadata is converted to formats that are common in Mercurial. This way, tools written for Mercurial can be optimally used. In order to do this, we want to use the hgsubversion [4] software to do an initial conversion. This hg extension is focused on providing high-quality conversion from Subversion to Mercurial for use in two-way correspondence, meaning it doesn't throw away as much available metadata as other solutions.
Such a conversion also seems like a good time to reconsider the contents of the repository and determine if some things are still valuable. In this spirit, the following sections also propose discarding some of the older metadata.
Timeline
The current schedule for conversion milestones:
2011-02-24: availability of a test repo at hg.python.org
Test commits will be allowed (and encouraged) from all committers to the Subversion repository. The test repository and all test commits will be removed once the final conversion is done. The server-side hooks will be installed for the test repository, in order to test buildbot, diff-email and whitespace checking integration.
2011-03-05: final conversion (tentative)
Commits to the Subversion branches now maintained in Mercurial will be blocked. Developers should refrain from pushing to the Mercurial repositories until all infrastructure is confirmed to work after the switchover to the new repository.
Transition plan
Branch strategy
Mercurial has two basic ways of using branches: cloned branches, where each branch is kept in a separate repository, and named branches, where each revision keeps metadata to note on which branch it belongs. The former makes it easier to distinguish branches, at the expense of requiring more disk space on the client. The latter makes it a little easier to switch between branches, but all branch names are a persistent part of history. [1]
Differences between named branches and cloned branches:
- Tags in a different (maintenance) clone aren't available in the local clone
- Clones with named branches will be larger, since they contain more data
We propose to use named branches for release branches and adopt cloned branches for feature branches.
History management
In order to minimize the loss of information due to the conversion, we propose to provide several repositories as a conversion result:
A repository trimmed to the mainline trunk (and py3k), as well as past and present maintenance branches -- this is called the "working" repo and is where development continues. This repository has all the history needed for development work, including annotating source files with changes back up to 1990 and other common history-digging operations.
The default branch in that repo is what is known as py3k in Subversion, while the Subversion trunk lives on with the branch name legacy-trunk; however in Mercurial this branch will be closed. Release branches are named after their major.minor version, e.g. 3.2.
A repository with the full, unedited conversion of the Subversion repository (actually, its /python subdirectory) -- this is called the "historic" or "archive" repo and will be offered as a read-only resource. [2]
One more repository per active feature branch; "active" means that at least one core developer asks for the branch to be provided. Each such repository will contain both the feature branch and all ancestor changesets from mainline (coming from trunk and/or py3k in SVN).
Since all branches are present in the historic repo, they can later be extracted as separate repositories at any time should it prove to be necessary.
The final revision map between SVN revision numbers, Mercurial changesets and SVN branch names will be made available in a file stored in the Misc directory. Its format is as following:
[...]
88483 e65daae6cf4499a0863cb7645109a4798c28d83e issue10276-snowleopard
88484 835cb57abffeceaff0d85c2a3aa0625458dd3e31 py3k
88485 d880f9d8492f597a030772c7485a34aadb6c4ece release32-maint
88486 0c431b8c22f5dbeb591414c154acb7890c1809df py3k
88487 82cda1f21396bbd10db8083ea20146d296cb630b release32-maint
88488 8174d00d07972d6f109ed57efca8273a4d59302c release27-maint
[...]
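As a sketch, a tool could load such a map into a lookup table, assuming the whitespace-separated "revision hash branch" layout shown above; parse_revision_map is a hypothetical helper, not part of the migration tools:

```python
def parse_revision_map(text):
    """Map SVN revision numbers to (hg changeset, branch) pairs."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3 or not parts[0].isdigit():
            continue  # skip blank lines and the '[...]' ellipsis markers
        revno, node, branch = parts
        mapping[int(revno)] = (node, branch)
    return mapping

sample = """\
88483 e65daae6cf4499a0863cb7645109a4798c28d83e issue10276-snowleopard
88484 835cb57abffeceaff0d85c2a3aa0625458dd3e31 py3k
"""
revmap = parse_revision_map(sample)
print(revmap[88484])
```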
Converting tags
The SVN tags directory contains a lot of old stuff. Some of these are not, in fact, full tags, but contain only a smaller subset of the repository. All release tags will be kept; other tags will be included based on requests from the developer community. We propose to make the tag naming scheme consistent, in this style: v3.2.1a2.
Author map
In order to provide user names the way they are common in hg (in the 'First Last <user@example.org>' format), we need an author map to map cvs and svn user names to real names and their email addresses. We have a complete version of such a map in the migration tools repository (not publicly accessible to avoid leaking addresses to harvesters). The email addresses in it might be out of date; that's bound to happen, although it would be nice to try and have as many people as possible review it for addresses that are out of date. The current version also still seems to contain some encoding problems.
Generating .hgignore
The .hgignore file can be used in Mercurial repositories to help ignore files that are not eligible for version control. It does this by employing several possible forms of pattern matching. The current Python repository already includes a rudimentary .hgignore file to help with using the hg mirrors.
Since the current Python repository already includes a .hgignore file (for use with hg mirrors), we'll just use that. Generating full history of the file was debated but deemed impractical (because it's relatively hard with fairly little gain, since ignoring is less important for older revisions).
Repository size
A bare conversion result of the current Python repository weighs 1.9 GB; although this is smaller than the Subversion repository (2.7 GB), it is still too large for everyday use.
The size becomes more manageable through the trimming applied to the working repository, and through a process called "revlog reordering" that optimizes the layout of Mercurial's internal storage.
With all optimizations done, the working repository is around 180 MB on disk. The amount of data transferred over the network when cloning is estimated to be around 80 MB.
Other repositories
There are a number of other projects hosted in svn.python.org's "projects" repository. The "peps" directory will be converted along with the main Python one. Richard Tew has indicated that he'd like the Stackless repository to also be converted. What other projects in the svn.python.org repository should be converted?
There's now an initial stab at converting the Jython repository. The current tip of hgsubversion unfortunately fails at some point. Pending investigation.
Other repositories that would like to be converted to Mercurial can announce themselves to me after the main Python migration is done, and I'll take care of their needs.
Infrastructure
hg-ssh
Developers should access the repositories through ssh, similar to the current setup. Public keys can be used to grant people access to a shared hg@ account. A hgwebdir instance also has been set up at hg.python.org for easy browsing and read-only access. It is configured so that developers can trivially start new clones (for longer-term features that profit from development in a separate repository).
Also, direct creation of public repositories is allowed for core developers, although it is not yet decided which naming scheme will be enforced:
$ hg init ssh://hg@hg.python.org/sandbox/mywork
repo created, public URL is http://hg.python.org/sandbox/mywork
Hooks
A number of hooks are currently in use. The hg equivalents for these should be developed and deployed. The following hooks are being used:
- check whitespace: a hook to reject commits in case the whitespace doesn't match the rules for the Python codebase. In a changegroup, only the tip is checked (this allows cleanup commits for changes pulled from third-party repos). We can also offer a whitespace hook for use with client-side repositories that people can use; it could either warn about whitespace issues and/or truncate trailing whitespace from changed lines.
- push mails: Emails will include diffs for each changeset pushed to the public repository, including the username which pushed the changesets (this is not necessarily the same as the author recorded in the changesets).
- buildbots: the python.org build master will be notified of each changeset pushed to the cpython repository, and will trigger an appropriate build on every build slave for the branch in which the changeset occurs.
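The client-side whitespace check described in the first bullet could boil down to something like the following sketch; find_trailing_whitespace and the diff-scanning approach are illustrative assumptions, not the deployed hook:

```python
def find_trailing_whitespace(diff_text):
    """Return the diff line numbers of added lines with trailing whitespace."""
    offenders = []
    for lineno, line in enumerate(diff_text.splitlines(), start=1):
        # Only inspect lines the changeset adds ('+' prefix, but not the
        # '+++' file header line).
        if line.startswith('+') and not line.startswith('+++'):
            if line != line.rstrip():
                offenders.append(lineno)
    return offenders

diff = "+++ b/example.py\n+clean line\n+dirty line  \n"
print(find_trailing_whitespace(diff))
```

A real hook would additionally decide whether to warn or to strip the whitespace, as the bullet suggests.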
The hooks repository [5] contains ports of these server-side hooks to Mercurial, as well as a couple of additional ones:
- check branch heads: a hook to reject pushes which create a new head on an existing branch. The pusher then has to merge the excess heads and try pushing again.
- check branches: a hook to reject all changesets not on an allowed named branch. This hook's whitelist will have to be updated when we want to create new maintenance branches.
- check line endings: a hook, based on the eol extension [6], to reject all changesets committing files with the wrong line endings. The commits then have to be stripped and redone, possibly with the eol extension [6] enabled on the committer's computer.
One additional hook could be beneficial:
- check contributors: in the current setup, all changesets bear the username of committers, who must have signed the contributor agreement. We might want to use a hook to check whether the committer is a contributor if we keep a list of registered contributors. The hook could then warn users who push a group of revisions containing changesets from unknown contributors.
End-of-line conversions
Discussion about the lack of end-of-line conversion support in Mercurial, which was provided initially by the win32text extension [7], led to the development of the new eol extension [6] that supports a versioned management of line-ending conventions on a file-by-file basis, akin to Subversion's svn:eol-style properties. This information is kept in a versioned file called .hgeol, and such a file has already been checked into the Subversion repository.
A hook also exists on the server side to reject any changeset introducing inconsistent newline data (see above).
hgwebdir
A more or less stock hgwebdir installation should be set up. We might want to come up with a style to match the Python website.
A small WSGI application has been written that can look up Subversion revisions and redirect to the appropriate hgweb page for the given changeset, regardless in which repository the converted revision ended up (since one big Subversion repository is converted into several Mercurial repositories). It can also look up Mercurial changesets by their hexadecimal ID.
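A minimal sketch of such a lookup application is shown below; the REVMAP table and the hg.python.org URL layout used here are hypothetical stand-ins for the real revision map, not the deployed script:

```python
# Hypothetical lookup table: SVN revision -> (target repo, hg changeset).
REVMAP = {
    '88484': ('cpython', '835cb57abffe'),
}

def lookup_app(environ, start_response):
    """Redirect /<svn-revision> to the hgweb page of the converted changeset."""
    rev = environ.get('PATH_INFO', '/').lstrip('/')
    entry = REVMAP.get(rev)
    if entry is None:
        start_response('404 Not Found', [('Content-Type', 'text/plain')])
        return [b'unknown revision']
    repo, node = entry
    location = 'http://hg.python.org/%s/rev/%s' % (repo, node)
    start_response('302 Found', [('Location', location)])
    return [b'']
```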
roundup
By pointing Roundup to the URL of the lookup script mentioned above, links to SVN revisions will continue to work, and links to Mercurial changesets can be created as well, without having to give repository and changeset ID.
After migration
Where to get code
After migration, the hgwebdir will live at hg.python.org. This is an accepted standard for many organizations, and an easy parallel to svn.python.org. The working repo might live at http://hg.python.org/cpython/, for example, with the archive repo at http://hg.python.org/cpython-archive/. For write access, developers will have to use ssh, which could be ssh://hg@hg.python.org/cpython/.
code.python.org was also proposed as the hostname. We think that using the VCS name in the hostname is good because it prevents confusion: it should be clear that you can't use svn or bzr for hg.python.org.
hgwebdir can already provide tarballs for every changeset. This obviates the need for daily snapshots; we can just point users to tip.tar.gz instead, meaning they will get the latest. If desired, we could even use buildbot results to point to the last good changeset.
Python-specific documentation
hg comes with good built-in documentation (available through hg help) and a wiki [10] that's full of useful information and recipes, not to mention a popular book [11] (readable online).
In addition to that, the recently overhauled Python Developer's Guide [8] already has a branch with instructions for Mercurial instead of Subversion; an online build of this branch [9] is also available.
Proposed workflow
We propose two workflows for the migration of patches between several branches.
For migration within 2.x or 3.x branches, we propose a patch always gets committed to the oldest branch where it applies first. Then, the resulting changeset can be merged using hg merge to all newer branches within that series (2.x or 3.x). If it does not apply as-is to the newer branch, hg revert can be used to easily revert to the new-branch-native head, patch in some alternative version of the patch (or none, if it's not applicable), then commit the merge. The premise here is that all changesets from an older branch within the series are eventually merged to all newer branches within the series.
The upshot is that this provides for the most painless merging procedure. This means that in the general case, people have to think about the oldest branch to which the patch should be applied before actually applying it. Usually, that is one of only two branches: the latest maintenance branch and the trunk, except for security fixes applicable to older branches in security-fix-only mode.
For merging bug fixes from the 3.x to the 2.7 maintenance branch (2.6 and 2.5 are in security-fix-only mode and their maintenance will continue in the Subversion repository), changesets should be transplanted (not merged) in some other way. The transplant extension, import/export and bundle/unbundle work equally well here.
Choosing this approach allows 3.x not to carry all of the 2.x history-since-it-was-branched, meaning the clone is not as big and the merges not as complicated.
The future of Subversion
What happens to the Subversion repositories after the migration? Since the svn server contains a bunch of repositories, not just the CPython one, it will probably live on for a while, as not every project may want to migrate, or other projects may take longer to do so. To prevent people from staying behind, we may want to move migrated projects from the repository to a new, read-only repository with a new name.
Build identification
Python currently provides the sys.subversion tuple to allow Python code to find out exactly what version of Python it's running against. The current version looks something like this:
- ('CPython', 'tags/r262', '71600')
- ('CPython', 'trunk', '73128M')
Another value is returned from Py_GetBuildInfo() in the C API, and available to Python code as part of sys.version:
- 'r262:71600, Jun 2 2009, 09:58:33'
- 'trunk:73128M, Jun 2 2009, 01:24:14'
I propose that the revision identifier will be the short version of hg's revision hash, for example 'dd3ebf81af43', augmented with '+' (instead of 'M') if the working directory from which it was built was modified. This mirrors the output of the hg id command, which is intended for this kind of usage. The sys.subversion value will also be renamed to sys.mercurial to reflect the change in VCS.
For the tag/branch identifier, I propose that hg will check for tags on the currently checked out revision, use the tag if there is one ('tip' doesn't count), and use the branch name otherwise. sys.subversion becomes
- ('CPython', 'v2.6.2', 'dd3ebf81af43')
- ('CPython', 'default', 'af694c6a888c+')
and the build info string becomes
- 'v2.6.2:dd3ebf81af43, Jun 2 2009, 09:58:33'
- 'default:af694c6a888c+, Jun 2 2009, 01:24:14'
This reflects that the default branch in hg is called 'default' instead of Subversion's 'trunk', and reflects the proposed new tag format.
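Assembling the proposed identifiers from their parts might look like the following sketch; format_build_info is a hypothetical helper for illustration, not an actual CPython API:

```python
def format_build_info(branch_or_tag, short_hash, modified, date, time):
    """Build a string in the proposed 'v2.6.2:dd3ebf81af43, Jun 2 2009, 09:58:33' form."""
    # '+' marks a build from a modified working directory (replacing svn's 'M').
    node = short_hash + ('+' if modified else '')
    return '%s:%s, %s, %s' % (branch_or_tag, node, date, time)

print(format_build_info('default', 'af694c6a888c', True, 'Jun 2 2009', '01:24:14'))
```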
Mercurial also makes it possible to find out the latest tag and the number of changesets separating the current changeset from that tag, allowing for a descriptive version string:
$ hg parent --template "{latesttag}+{latesttagdistance}-{node|short}\n"
v3.2+37-4b5d0d260e72
$ hg up 2.7
3316 files updated, 0 files merged, 379 files removed, 0 files unresolved
$ hg parent --template "{latesttag}+{latesttagdistance}-{node|short}\n"
v2.7.1+216-9619d21d8198
Footnotes
| [1] | The Mercurial book discourages the use of named branches, but it is, in this respect, somewhat outdated. Named branches have gotten much easier to use since that comment was written, due to improvements in hg. |
| [2] | Since the initial working repo is a subset of the archive repo, it would also be feasible to pull changes from the working repo into the archive repo periodically. |
References
| [3] | http://www.python.org/dev/peps/pep-0347/ |
| [4] | http://bitbucket.org/durin42/hgsubversion/ |
| [5] | http://hg.python.org/hooks/ |
| [6] | http://mercurial.selenic.com/wiki/EolExtension |
| [7] | http://mercurial.selenic.com/wiki/Win32TextExtension |
| [8] | http://docs.python.org/devguide/ |
| [9] | http://potrou.net/hgdevguide/ |
| [10] | http://mercurial.selenic.com/wiki/ |
| [11] | http://hgbook.red-bean.com/ |
Copyright
This document has been placed in the public domain.
pep-0386 Changing the version comparison module in Distutils
| PEP: | 386 |
|---|---|
| Title: | Changing the version comparison module in Distutils |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Tarek ZiadĂŠ <tarek at ziade.org> |
| Status: | Superseded |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 4-June-2009 |
| Superseded-By: | 440 |
Contents
Abstract
Note: This PEP has been superseded by the version identification and dependency specification scheme defined in PEP 440.
This PEP proposed a new version comparison scheme for Distutils.
Motivation
In Python there are no real restrictions yet on how a project should manage its versions, and how they should be incremented.
Distutils provides a version distribution meta-data field, but it is freeform, and current users, such as PyPI, usually consider the latest version pushed as the latest one, regardless of the expected semantics.
Distutils will soon extend its capabilities to allow distributions to express a dependency on other distributions through the Requires-Dist metadata field (see PEP 345) and it will optionally allow use of that field to restrict the dependency to a set of compatible versions. Notice that this field is replacing Requires that was expressing dependencies on modules and packages.
The Requires-Dist field will allow a distribution to define a dependency on another package and optionally restrict this dependency to a set of compatible versions, so one may write:
Requires-Dist: zope.interface (>3.5.0)
This means that the distribution requires zope.interface with a version greater than 3.5.0.
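Splitting such a value into its parts could be sketched as follows; the regex and parse_requires_dist are simplified illustrative assumptions, not the actual metadata grammar from PEP 345:

```python
import re

# Simplified pattern for "name (OP version)" predicates, e.g.
# "zope.interface (>3.5.0)". Real Requires-Dist values can be richer.
PREDICATE = re.compile(r'^\s*([\w.-]+)\s*\(\s*(<=?|>=?|==|!=)\s*([\w.]+)\s*\)\s*$')

def parse_requires_dist(value):
    """Split a predicate into (distribution name, operator, version)."""
    match = PREDICATE.match(value)
    if match is None:
        raise ValueError('cannot parse %r' % value)
    return match.group(1), match.group(2), match.group(3)

print(parse_requires_dist('zope.interface (>3.5.0)'))
```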
This also means that Python projects will need to follow the same convention as the tool that will be used to install them, so they are able to compare versions.
That is why this PEP proposes, for the sake of interoperability, a standard schema to express version information and its comparison semantics.
Furthermore, this will make OS packagers' work easier when repackaging standards compliant distributions, because as of now it can be difficult to decide how two distribution versions compare.
Requisites and current status
It is not in the scope of this PEP to provide a universal versioning schema intended to support all or even most of existing versioning schemas. There will always be competing grammars, either mandated by distro or project policies or by historical reasons that we cannot expect to change.
The proposed schema should be able to express the usual versioning semantics, so it's possible to parse any alternative versioning schema and transform it into a compliant one. This is how OS packagers usually deal with the existing version schemas and is a preferable alternative to supporting an arbitrary set of versioning schemas.
Conformance to usual practice and conventions, as well as simplicity, are a plus, to ease frictionless adoption and painless transition. Practicality beats purity, sometimes.
Projects have very different versioning needs, but the following are widely considered important semantics:
- it should be possible to express more than one versioning level (usually this is expressed as major and minor revision and, sometimes, also a micro revision).
- a significant number of projects need special meaning versions for "pre-releases" (such as "alpha", "beta", "rc"), and these have widely used aliases ("a" stands for "alpha", "b" for "beta" and "c" for "rc"). And these pre-release versions make it impossible to use a simple alphanumerical ordering of the version string components. (Example: 3.1a1 < 3.1)
- some projects also need "post-releases" of regular versions, mainly for installer work which can't be clearly expressed otherwise.
- development versions allow packagers of unreleased work to avoid version clash with later regular releases.
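The pre-release point above is easy to demonstrate: under plain string ordering, '3.1' sorts before '3.1a1', although the alpha pre-release should precede the final release:

```python
# Naive lexicographic ordering gets pre-releases wrong: since '3.1' is a
# prefix of '3.1a1', the final release wrongly sorts first.
versions = sorted(['3.1a1', '3.1'])
print(versions)
```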
For people that want to go further and use a tool to manage their version numbers, the two major ones are:
Distutils
Distutils currently provides a StrictVersion and a LooseVersion class that can be used to manage versions.
The LooseVersion class is quite lax. From Distutils doc:
Version numbering for anarchists and software realists.
Implements the standard interface for version number classes as
described above. A version number consists of a series of numbers,
separated by either periods or strings of letters. When comparing
version numbers, the numeric components will be compared
numerically, and the alphabetic components lexically. The following
are all valid version numbers, in no particular order:
1.5.1
1.5.2b2
161
3.10a
8.02
3.4j
1996.07.12
3.2.pl0
3.1.1.6
2g6
11g
0.960923
2.2beta29
1.13++
5.5.kw
2.0b1pl0
In fact, there is no such thing as an invalid version number under
this scheme; the rules for comparison are simple and predictable,
but may not always give the results you want (for some definition
of "want").
This class makes any version string valid, and provides an algorithm to sort them numerically then lexically. It means that anything can be used to version your project:
>>> from distutils.version import LooseVersion as V
>>> v1 = V('FunkyVersion')
>>> v2 = V('GroovieVersion')
>>> v1 > v2
False
The problem with this is that while it allows expressing any nesting level, it doesn't allow giving special meaning to versions (pre- and post-releases as well as development versions), as expressed in requisites 2, 3 and 4.
The StrictVersion class is more strict. From the doc:
Version numbering for meticulous retentive and software idealists.
Implements the standard interface for version number classes as
described above. A version number consists of two or three
dot-separated numeric components, with an optional "pre-release" tag
on the end. The pre-release tag consists of the letter 'a' or 'b'
followed by a number. If the numeric components of two version
numbers are equal, then one with a pre-release tag will always
be deemed earlier (lesser) than one without.
The following are valid version numbers (shown in the order that
would be obtained by sorting according to the supplied cmp function):
0.4 0.4.0 (these two are equivalent)
0.4.1
0.5a1
0.5b3
0.5
0.9.6
1.0
1.0.4a3
1.0.4b1
1.0.4
The following are examples of invalid version numbers:
1
2.7.2.2
1.3.a4
1.3pl1
1.3c4
This class enforces a few rules, and makes a decent tool to work with version numbers:
>>> from distutils.version import StrictVersion as V
>>> v2 = V('GroovieVersion')
Traceback (most recent call last):
...
ValueError: invalid version number 'GroovieVersion'
>>> v2 = V('1.1')
>>> v3 = V('1.3')
>>> v2 < v3
True
It adds pre-release versions, and some structure, but lacks a few semantic elements to make it usable, such as development releases or post-release tags, as expressed in requisites 3 and 4.
Also, note that Distutils version classes have been present for years but are not really used in the community.
Setuptools
Setuptools provides another version comparison tool [3] which does not enforce any rules for the version, but tries to provide a better algorithm to convert the strings to sortable keys, with a parse_version function.
From the doc:
Convert a version string to a chronologically-sortable key

This is a rough cross between Distutils' StrictVersion and LooseVersion; if you give it versions that would work with StrictVersion, then it behaves the same; otherwise it acts like a slightly-smarter LooseVersion. It is *possible* to create pathological version coding schemes that will fool this parser, but they should be very rare in practice.

The returned value will be a tuple of strings. Numeric portions of the version are padded to 8 digits so they will compare numerically, but without relying on how numbers compare relative to strings. Dots are dropped, but dashes are retained. Trailing zeros between alpha segments or dashes are suppressed, so that e.g. "2.4.0" is considered the same as "2.4". Alphanumeric parts are lower-cased.

The algorithm assumes that strings like "-" and any alpha string that alphabetically follows "final" represents a "patch level". So, "2.4-1" is assumed to be a branch or patch of "2.4", and therefore "2.4.1" is considered newer than "2.4-1", which in turn is newer than "2.4".

Strings like "a", "b", "c", "alpha", "beta", "candidate" and so on (that come before "final" alphabetically) are assumed to be pre-release versions, so that the version "2.4" is considered newer than "2.4a1".

Finally, to handle miscellaneous cases, the strings "pre", "preview", and "rc" are treated as if they were "c", i.e. as though they were release candidates, and therefore are not as new as a version string that does not contain them, and "dev" is replaced with an '@' so that it sorts lower than any other pre-release tag.
In other words, parse_version will return a tuple for each version string, that is compatible with StrictVersion but also accept arbitrary version and deal with them so they can be compared:
>>> from pkg_resources import parse_version as V
>>> V('1.2')
('00000001', '00000002', '*final')
>>> V('1.2b2')
('00000001', '00000002', '*b', '00000002', '*final')
>>> V('FunkyVersion')
('*funkyversion', '*final')
In this schema practicality takes priority over purity, but as a result it doesn't enforce any policy and leads to very complex semantics due to the lack of a clear standard. It just tries to adapt to widely used conventions.
Caveats of existing systems
The major problem with the described version comparison tools is that they are too permissive and, at the same time, aren't capable of expressing some of the required semantics. Many of the versions on PyPI [4] are obviously not useful versions, which makes it difficult for users to grok the versioning that a particular package was using and to provide tools on top of PyPI.
Distutils classes are not really used in Python projects, but the Setuptools function is quite widespread because it's used by tools like easy_install [6], pip [5] or zc.buildout [7] to install dependencies of a given project.
While Setuptools does provide a mechanism for comparing/sorting versions, it is much preferable if the versioning spec is such that a human can make a reasonable attempt at that sorting without having to run it against some code.
Also, there's a problem with the use of dates as the "major" version number (e.g. a version string "20090421") with RPMs: it means that any attempt to switch to a more typical "major.minor..." version scheme is problematic, because it will always sort less than "20090421".
Last, the meaning of - is specific to Setuptools, while it is avoided in some packaging systems like the one used by Debian or Ubuntu.
The new versioning algorithm
During Pycon, members of the Python, Ubuntu and Fedora community worked on a version standard that would be acceptable for everyone.
It's currently called verlib and a prototype lives at [10].
The pseudo-format supported is:
N.N[.N]+[{a|b|c|rc}N[.N]+][.postN][.devN]
The real regular expression is:
expr = r"""^
(?P<version>\d+\.\d+) # minimum 'N.N'
(?P<extraversion>(?:\.\d+)*) # any number of extra '.N' segments
(?:
(?P<prerel>[abc]|rc) # 'a' = alpha, 'b' = beta
# 'c' or 'rc' = release candidate
(?P<prerelversion>\d+(?:\.\d+)*)
)?
(?P<postdev>(\.post(?P<post>\d+))?(\.dev(?P<dev>\d+))?)?
$"""
Some examples probably make it clearer:
>>> from verlib import NormalizedVersion as V
>>> (V('1.0a1')
... < V('1.0a2.dev456')
... < V('1.0a2')
... < V('1.0a2.1.dev456')
... < V('1.0a2.1')
... < V('1.0b1.dev456')
... < V('1.0b2')
... < V('1.0b2.post345')
... < V('1.0c1.dev456')
... < V('1.0c1')
... < V('1.0.dev456')
... < V('1.0')
... < V('1.0.post456.dev34')
... < V('1.0.post456'))
True
The trailing .dev123 is for pre-releases. The .post123 is for post-releases -- which apparently are used by a number of projects out there (e.g. Twisted [8]). For example after a 1.2.0 release there might be a 1.2.0-r678 release. We used post instead of r because the r is ambiguous as to whether it indicates a pre- or post-release.
.post456.dev34 indicates a dev marker for a post release, that sorts before a .post456 marker. This can be used to do development versions of post releases.
Pre-releases can use a for "alpha", b for "beta" and c for "release candidate". rc is an alternative notation for "release candidate" that is added to make the version scheme compatible with Python's own version scheme. rc sorts after c:
>>> from verlib import NormalizedVersion as V
>>> (V('1.0a1')
... < V('1.0a2')
... < V('1.0b3')
... < V('1.0c1')
... < V('1.0rc2')
... < V('1.0'))
True
Note that c is the preferred marker for third party projects.
verlib provides a NormalizedVersion class and a suggest_normalized_version function.
NormalizedVersion
The NormalizedVersion class is used to hold a version and to compare it with others. It takes a string as an argument, that contains the representation of the version:
>>> from verlib import NormalizedVersion
>>> version = NormalizedVersion('1.0')
The version can be represented as a string:
>>> str(version)
'1.0'
Or compared with others:
>>> NormalizedVersion('1.0') > NormalizedVersion('0.9')
True
>>> NormalizedVersion('1.0') < NormalizedVersion('1.1')
True
A class method called from_parts is available if you want to create an instance by providing the parts that compose the version.
Examples
>>> version = NormalizedVersion.from_parts((1, 0))
>>> str(version)
'1.0'
>>> version = NormalizedVersion.from_parts((1, 0), ('c', 4))
>>> str(version)
'1.0c4'
>>> version = NormalizedVersion.from_parts((1, 0), ('c', 4), ('dev', 34))
>>> str(version)
'1.0c4.dev34'
suggest_normalized_version
suggest_normalized_version is a function that suggests a normalized version close to the given version string. If you have a version string that isn't normalized (i.e. NormalizedVersion doesn't like it) then you might be able to get an equivalent (or close) normalized version from this function.
This does a number of simple normalizations to the given string, based on an observation of versions currently in use on PyPI.
Given a dump of those versions on January 6th, 2010, the function gave these results for the 8821 distributions then on PyPI:
- 7822 (88.67%) already match NormalizedVersion without any change
- 717 (8.13%) match when using this suggestion method
- 282 (3.20%) don't match at all.
The 3.20% of projects that are incompatible with NormalizedVersion and cannot be transformed into a compatible form mostly use date-based version schemes, versions with custom markers, or dummy versions. Examples:
- working proof of concept
- 1 (first draft)
- unreleased.unofficialdev
- 0.1.alphadev
- 2008-03-29_r219
- etc.
When a tool needs to work with versions, a strategy is to use suggest_normalized_version on the version string. If this function returns None, the provided version is not close enough to the standard scheme. If it returns a version that slightly differs from the original, it is a suggested normalized version. Finally, if it returns the same string, the version already matches the scheme.
Here's an example of usage:
>>> from verlib import suggest_normalized_version, NormalizedVersion
>>> import warnings
>>> def validate_version(version):
... rversion = suggest_normalized_version(version)
... if rversion is None:
... raise ValueError('Cannot work with "%s"' % version)
... if rversion != version:
... warnings.warn('"%s" is not a normalized version.\n'
... 'It has been transformed into "%s" '
... 'for interoperability.' % (version, rversion))
... return NormalizedVersion(rversion)
...
>>> validate_version('2.4-rc1')
__main__:8: UserWarning: "2.4-rc1" is not a normalized version.
It has been transformed into "2.4c1" for interoperability.
NormalizedVersion('2.4c1')
>>> validate_version('2.4c1')
NormalizedVersion('2.4c1')
>>> validate_version('foo')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 4, in validate_version
ValueError: Cannot work with "foo"
Roadmap
Distutils will deprecate its existing versions class in favor of NormalizedVersion. The verlib module presented in this PEP will be renamed to version and placed into the distutils package.
References
| [1] | http://docs.python.org/distutils |
| [2] | http://peak.telecommunity.com/DevCenter/setuptools |
| [3] | http://peak.telecommunity.com/DevCenter/setuptools#specifying-your-project-s-version |
| [4] | http://pypi.python.org/pypi |
| [5] | http://pypi.python.org/pypi/pip |
| [6] | http://peak.telecommunity.com/DevCenter/EasyInstall |
| [7] | http://pypi.python.org/pypi/zc.buildout |
| [8] | http://twistedmatrix.com/trac/ |
| [9] | http://peak.telecommunity.com/DevCenter/setuptools |
| [10] | http://bitbucket.org/tarek/distutilsversion/ |
Acknowledgments
Trent Mick, Matthias Klose, Phillip Eby, David Lyon, and many people at Pycon and Distutils-SIG.
Copyright
This document has been placed in the public domain.
pep-0387 Backwards Compatibility Policy
| PEP: | 387 |
|---|---|
| Title: | Backwards Compatibility Policy |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Benjamin Peterson <benjamin at python.org> |
| Status: | Draft |
| Type: | Process |
| Content-Type: | text/x-rst |
| Created: | 18-Jun-2009 |
| Post-History: | 19-Jun-2009 |
Contents
Abstract
This PEP outlines Python's backwards compatibility policy.
Rationale
As one of the most used programming languages today [1], the Python core language and its standard library play a critical role in thousands of applications and libraries. This is fantastic; it is probably one of a language designer's most wishful dreams. However, it means the development team must be very careful not to break this existing 3rd party code with new releases.
Backwards Compatibility Rules
This policy applies to all public APIs. These include:
- Syntax and behavior of these constructs as defined by the reference manual
- The C-API
- Function, class, module, attribute, and method names and types.
- Given a set of arguments, the return value, side effects, and raised exceptions of a function. This does not preclude changes from reasonable bug fixes.
- The position and expected types of arguments and returned values.
- Behavior of classes with regards to subclasses: the conditions under which overridden methods are called.
Others are explicitly not part of the public API. They can change or be removed at any time in any way. These include:
- Function, class, module, attribute, method, and C-API names and types that are prefixed by "_" (except special names). The contents of these are also not subject to the policy.
- Inheritance patterns of internal classes.
- Test suites. (Anything in the Lib/test directory or test subdirectories of packages.)
This is the basic policy for backwards compatibility:
- Unless it is going through the deprecation process below, the behavior of an API must not change between any two consecutive releases.
- Similarly a feature cannot be removed without notice between any two consecutive releases.
- Addition of a feature which breaks 3rd party libraries or applications should have a large benefit to breakage ratio, and/or the incompatibility should be trivial to fix in broken code. For example, adding a stdlib module with the same name as a third party package is not acceptable. Adding a method or attribute that conflicts with 3rd party code through inheritance, however, is likely reasonable.
Making Incompatible Changes
It's a fact: design mistakes happen. Thus it is important to be able to change APIs or remove misguided features. This is accomplished through a gradual process over several releases:
- Discuss the change. Depending on the size of the incompatibility, this could be on the bug tracker, python-dev, python-list, or the appropriate SIG. A PEP or similar document may be written. Hopefully users of the affected API will pipe up to comment.
- Add a warning [2]. If behavior is changing, the API may gain a new function or method to perform the new behavior; old usage should raise the warning. If an API is being removed, simply warn whenever it is entered. DeprecationWarning is the usual warning category to use, but PendingDeprecationWarning may be used in special cases where the old and new versions of the API will coexist for many releases.
- Wait for a release of whichever branch contains the warning.
- See if there's any feedback. Users not involved in the original discussions may comment now after seeing the warning. Perhaps reconsider.
- The behavior change or feature removal may now be made default or permanent in the next release. Remove the old version and warning.
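The warning step above can be sketched in code; the function names here are hypothetical, chosen only for illustration:

```python
import warnings

def new_function(x):
    """The replacement API."""
    return x * 2

def old_function(x):
    """Deprecated spelling; delegates to the replacement."""
    warnings.warn(
        "old_function() is deprecated; use new_function() instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this shim
    )
    return new_function(x)
```

Test suites can surface such deprecations early by turning them into errors with warnings.simplefilter('error', DeprecationWarning).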
References
| [1] | TIOBE Programming Community Index http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html |
| [2] | The warnings module |
Copyright
This document has been placed in the public domain.
pep-0389 argparse - New Command Line Parsing Module
| PEP: | 389 |
|---|---|
| Title: | argparse - New Command Line Parsing Module |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Steven Bethard <steven.bethard at gmail.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 25-Sep-2009 |
| Python-Version: | 2.7 and 3.2 |
| Post-History: | 27-Sep-2009, 24-Oct-2009 |
Contents
- Acceptance
- Abstract
- Motivation
- Why aren't getopt and optparse enough?
- Why isn't the functionality just being added to optparse?
- Deprecation of optparse
- Updates to getopt documentation
- Deferred: string formatting
- Rejected: getopt compatibility methods
- Out of Scope: Various Feature Requests
- Discussion: sys.stderr and sys.exit
- References
- Copyright
Acceptance
This PEP was approved by Guido on python-dev on February 21, 2010 [17].
Abstract
This PEP proposes inclusion of the argparse [1] module in the Python standard library in Python 2.7 and 3.2.
Motivation
The argparse module is a command line parsing library which provides more functionality than the existing command line parsing modules in the standard library, getopt [2] and optparse [3]. It includes support for positional arguments (not just options), subcommands, required options, options syntaxes like "/f" and "+rgb", zero-or-more and one-or-more style arguments, and many other features the other two lack.
The argparse module is also already a popular third-party replacement for these modules. It is used in projects like IPython (the Scipy Python shell) [4], is included in Debian testing and unstable [5], and since 2007 has had various requests for its inclusion in the standard library [6] [7] [8]. This popularity suggests it may be a valuable addition to the Python libraries.
Why aren't getopt and optparse enough?
One argument against adding argparse is that there are "already two different option parsing modules in the standard library" [9]. The following is a list of features provided by argparse but not present in getopt or optparse:
- While it is true there are two option parsing libraries, there are no full command line parsing libraries -- both getopt and optparse support only options and have no support for positional arguments. The argparse module handles both, and as a result, is able to generate better help messages, avoiding redundancies like the usage= string usually required by optparse.
- The argparse module values practicality over purity. Thus, argparse allows required options and customization of which characters are used to identify options, while optparse explicitly states "the phrase 'required option' is self-contradictory" and that the option syntaxes -pf, -file, +f, +rgb, /f and /file "are not supported by optparse, and they never will be".
- The argparse module allows options to accept a variable number of arguments using nargs='?', nargs='*' or nargs='+'. The optparse module provides an untested recipe for some part of this functionality [10] but admits that "things get hairy when you want an option to take a variable number of arguments."
- The argparse module supports subcommands, where a main command line parser dispatches to other command line parsers depending on the command line arguments. This is a common pattern in command line interfaces, e.g. svn co and svn up.
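The features above can be combined in a short sketch; the "vcs" tool and its arguments are invented for illustration:

```python
import argparse

# A hypothetical "vcs" front end: positional arguments,
# zero-or-more nargs, and subcommands -- the features that
# getopt and optparse lack.
parser = argparse.ArgumentParser(prog='vcs')
subparsers = parser.add_subparsers(dest='command')

co = subparsers.add_parser('co')           # like "svn co"
co.add_argument('url')                     # positional argument
co.add_argument('-r', '--revision')

up = subparsers.add_parser('up')           # like "svn up"
up.add_argument('paths', nargs='*')        # zero-or-more positionals

args = parser.parse_args(['co', 'http://example.org/repo', '-r', '42'])
```

After parsing, args.command is 'co', args.url is the repository URL, and args.revision is '42'; argparse also derives the usage and help text from these declarations.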
Why isn't the functionality just being added to optparse?
Clearly all the above features offer improvements over what is available through optparse. A reasonable question then is why these features are not simply provided as patches to optparse, instead of introducing an entirely new module. In fact, the original development of argparse intended to do just that, but because of various fairly constraining design decisions of optparse, this wasn't really possible. Some of the problems included:
The optparse module exposes the internals of its parsing algorithm. In particular, parser.largs and parser.rargs are guaranteed to be available to callbacks [11]. This makes it extremely difficult to improve the parsing algorithm as was necessary in argparse for proper handling of positional arguments and variable length arguments. For example, nargs='+' in argparse is matched using regular expressions and thus has no notion of things like parser.largs.
The optparse extension APIs are extremely complex. For example, just to use a simple custom string-to-object conversion function, you have to subclass Option, hack class attributes, and then specify your custom option type to the parser, like this:
class MyOption(Option):
    TYPES = Option.TYPES + ("mytype",)
    TYPE_CHECKER = copy(Option.TYPE_CHECKER)
    TYPE_CHECKER["mytype"] = check_mytype

parser = optparse.OptionParser(option_class=MyOption)
parser.add_option("-m", type="mytype")

For comparison, argparse simply allows conversion functions to be used as type= arguments directly, e.g.:
parser = argparse.ArgumentParser()
parser.add_argument("-m", type=check_mytype)

But given the baroque customization APIs of optparse, it is unclear how such a feature should interact with those APIs, and it is quite possible that introducing the simple argparse API would break existing custom Option code.
Both optparse and argparse parse command line arguments and assign them as attributes to an object returned by parse_args. However, the optparse module guarantees that the take_action method of custom actions will always be passed a values object which provides an ensure_value method [12], while the argparse module allows attributes to be assigned to any object, e.g.:
foo_object = ...
parser.parse_args(namespace=foo_object)
foo_object.some_attribute_parsed_from_command_line
Modifying optparse to allow any object to be passed in would be difficult because simply passing the foo_object around instead of a Values instance will break existing custom actions that depend on the ensure_value method.
Because of issues like these, which made it unreasonably difficult for argparse to stay compatible with the optparse APIs, argparse was developed as an independent module. Given these issues, merging all the argparse features into optparse with no backwards incompatibilities seems unlikely.
Deprecation of optparse
Because all of optparse's features are available in argparse, the optparse module will be deprecated. However, because of the widespread use of optparse, the deprecation strategy contains only documentation changes and warnings that will not be visible by default:
Python 2.7+ and 3.2+ -- The following note will be added to the optparse documentation:
The optparse module is deprecated and will not be developed further; development will continue with the argparse module.
Python 2.7+ -- If the Python 3 compatibility flag, -3, is provided at the command line, then importing optparse will issue a DeprecationWarning. Otherwise no warnings will be issued.
Python 3.2+ -- Importing optparse will issue a PendingDeprecationWarning, which is not displayed by default.
Note that no removal date is proposed for optparse.
Updates to getopt documentation
The getopt module will not be deprecated. However, its documentation will be updated to point to argparse in a couple of places. At the top of the module, the following note will be added:
The getopt module is a parser for command line options whose API is designed to be familiar to users of the C getopt function. Users who are unfamiliar with the C getopt function or who would like to write less code and get better help and error messages should consider using the argparse module instead.
Additionally, after the final getopt example, the following note will be added:
Note that an equivalent command line interface could be produced with less code by using the argparse module:
import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-o', '--output')
    parser.add_argument('-v', dest='verbose', action='store_true')
    args = parser.parse_args()
    # ... do something with args.output ...
    # ... do something with args.verbose ...
Deferred: string formatting
The argparse module supports Python from 2.3 up through 3.2 and as a result relies on traditional %(foo)s style string formatting. It has been suggested that it might be better to use the new style {foo} string formatting [13]. There was some discussion about how best to do this for modules in the standard library [14] and several people are developing functions for automatically converting %-formatting to {}-formatting [15] [16]. When one of these is added to the standard library, argparse will use them to support both formatting styles.
Rejected: getopt compatibility methods
Previously, when this PEP was suggesting the deprecation of getopt as well as optparse, there was some talk of adding a method like:
ArgumentParser.add_getopt_arguments(options[, long_options])
However, this method will not be added for a number of reasons:
- The getopt module is not being deprecated, so there is less need.
- This method would not actually ease the transition for any getopt users who were already maintaining usage messages, because the API above gives no way of adding help messages to the arguments.
- Some users of getopt consider it very important that only a single function call is necessary. The API above does not satisfy this requirement because both ArgumentParser() and parse_args() must also be called.
Out of Scope: Various Feature Requests
Several feature requests for argparse were made in the discussion of this PEP:
- Support argument defaults from environment variables
- Support argument defaults from configuration files
- Support "foo --help subcommand" in addition to the currently supported "foo subcommand --help"
These are all reasonable feature requests for the argparse module, but are out of the scope of this PEP, and have been redirected to the argparse issue tracker.
Discussion: sys.stderr and sys.exit
There were some concerns that argparse by default always writes to sys.stderr and always calls sys.exit when invalid arguments are provided. This is the desired behavior for the vast majority of argparse use cases which revolve around simple command line interfaces. However, in some cases, it may be desirable to keep argparse from exiting, or to have it write its messages to something other than sys.stderr. These use cases can be supported by subclassing ArgumentParser and overriding the exit or _print_message methods. The latter is an undocumented implementation detail, but could be officially exposed if this turns out to be a common need.
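A caller that prefers an exception over a process exit might subclass the parser along these lines; this is a sketch of the customization route, not part of the PEP's proposal, and it overrides error(), which by default prints usage to sys.stderr and then exits:

```python
import argparse

class NonExitingParser(argparse.ArgumentParser):
    """Raise instead of exiting, so the caller decides how to report."""
    def error(self, message):
        raise ValueError(message)

parser = NonExitingParser(prog='demo')
parser.add_argument('--count', type=int)

try:
    parser.parse_args(['--count', 'not-a-number'])
except ValueError:
    pass  # invalid arguments no longer terminate the process
```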
References
| [1] | argparse (http://code.google.com/p/argparse/) |
| [2] | getopt (http://docs.python.org/library/getopt.html) |
| [3] | optparse (http://docs.python.org/library/optparse.html) |
| [4] | argparse in IPython (http://mail.scipy.org/pipermail/ipython-dev/2009-April/005102.html) |
| [5] | argparse in Debian (http://packages.debian.org/search?keywords=argparse) |
| [6] | 2007-01-03 request for argparse in the standard library (http://mail.python.org/pipermail/python-list/2007-January/472276.html) |
| [7] | 2009-06-09 request for argparse in the standard library (http://bugs.python.org/issue6247) |
| [8] | 2009-09-10 request for argparse in the standard library (http://mail.python.org/pipermail/stdlib-sig/2009-September/000342.html) |
| [9] | Fredrik Lundh response to [6] (http://mail.python.org/pipermail/python-list/2007-January/1086892.html) |
| [10] | optparse variable args (http://docs.python.org/library/optparse.html#callback-example-6-variable-arguments) |
| [11] | parser.largs and parser.rargs (http://docs.python.org/library/optparse.html#how-callbacks-are-called) |
| [12] | take_action values argument (http://docs.python.org/library/optparse.html#adding-new-actions) |
| [13] | use {}-formatting instead of %-formatting (http://bugs.python.org/msg89279) |
| [14] | transitioning from % to {} formatting (http://mail.python.org/pipermail/python-dev/2009-September/092326.html) |
| [15] | Vinay Sajip's %-to-{} converter (http://gist.github.com/200936) |
| [16] | Benjamin Peterson's %-to-{} converter (http://bazaar.launchpad.net/~gutworth/+junk/mod2format/files) |
| [17] | Guido's approval (http://mail.python.org/pipermail/python-dev/2010-February/097839.html) |
Copyright
This document has been placed in the public domain.
pep-0390 Static metadata for Distutils
| PEP: | 390 |
|---|---|
| Title: | Static metadata for Distutils |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Tarek ZiadĂŠ <tarek at ziade.org> |
| BDFL-Delegate: | Nick Coghlan |
| Discussions-To: | <distutils-sig at python.org> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 10-Oct-2009 |
| Python-Version: | 2.7 and 3.2 |
| Post-History: | |
| Resolution: | http://mail.python.org/pipermail/distutils-sig/2013-April/020597.html |
Contents
Abstract
This PEP describes a new section and a new format for the setup.cfg file, that allows describing the Metadata of a package without using setup.py.
Rejection Notice
As distutils2 is no longer going to be incorporated into the standard library, this PEP was rejected by Nick Coghlan in late April, 2013.
A replacement PEP based on PEP 426 (metadata 2.0) will be created that defines the minimum amount of information needed to generate an sdist archive given a source tarball or VCS checkout.
Rationale
Today, if you want to list all the Metadata of a distribution (see PEP 314) that is not installed, you need to use the setup.py command line interface.
So, basically, you download it, and run:
$ python setup.py --name
Distribute
$ python setup.py --version
0.6.4
Here name and version are metadata fields. This works fine, but as soon as the developers add more code to setup.py, this feature might break or, in the worst case, might do unwanted things on the target system.
Moreover, when an OS packager wants to get the metadata of a distribution he is re-packaging, he may have trouble understanding the setup.py file he's working with.
So the rationale of this PEP is to provide a way to declare the metadata in a static configuration file alongside setup.py that doesn't require any third party code to run.
Adding a metadata section in setup.cfg
The first thing we want to introduce is a [metadata] section, in the setup.cfg file, that may contain any field from the Metadata:
[metadata]
name = Distribute
version = 0.6.4
The setup.cfg file is used to avoid adding yet another configuration file to work with in Distutils.
This file is already read by Distutils when a command is executed, and if the metadata section is found, it will be used to fill the metadata fields. If an option that corresponds to a Metadata field is given to setup(), it will override the value that was possibly present in setup.cfg.
Notice that setup.py is still used and can be required to define some options that are not part of the Metadata fields. For instance, the sdist command can use options like packages or scripts.
Multi-line values
Some Metadata fields can have multiple values. To keep setup.cfg compatible with ConfigParser and the RFC 822 LONG HEADER FIELDS (see section 3.1.1), these are expressed with ,-separated values:
requires = pywin32, bar > 1.0, foo
When this variable is read, the values are parsed and transformed into a list: ['pywin32', 'bar > 1.0', 'foo'].
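The parsing rule can be sketched in a few lines; the helper name is illustrative, not part of the proposed Distutils API:

```python
# Split on commas and strip surrounding whitespace,
# discarding any empty entries.
def parse_multivalue(raw):
    return [value.strip() for value in raw.split(',') if value.strip()]

parse_multivalue("pywin32, bar > 1.0, foo")  # ['pywin32', 'bar > 1.0', 'foo']
```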
Context-dependent sections
The metadata section will also be able to use context-dependent sections.
A context-dependent section is a section with a condition about the execution environment. Here are some examples:
[metadata]
name = Distribute
version = 0.6.4

[metadata:sys_platform == 'win32']
requires = pywin32, bar > 1.0
obsoletes = pywin31

[metadata:os_machine == 'i386']
requires = foo

[metadata:python_version == '2.4' or python_version == '2.5']
requires = bar

[metadata:'linux' in sys_platform]
requires = baz
Every [metadata:condition] section will be used only if the condition is met when the file is read. The background motivation for these context-dependent sections is to be able to define requirements that vary depending on the platform the distribution might be installed on (see PEP 314).
The micro-language behind this is the simplest possible: it compares only strings, with the == and in operators (and their opposites), and with the ability to combine expressions. It is also easy for non-Pythoneers to understand.
The pseudo-grammar is
EXPR [in|==|!=|not in] EXPR [or|and] ...
where EXPR belongs to any of those:
- python_version = '%s.%s' % (sys.version_info[0], sys.version_info[1])
- os_name = os.name
- sys_platform = sys.platform
- platform_version = platform.version()
- platform_machine = platform.machine()
- a free string, like 2.4, or win32
Notice that in is restricted to strings, meaning that it is not possible to use other sequences like tuples or lists on the right side.
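Condition evaluation could be sketched as follows; this is an illustration only, not the actual Distutils implementation, and it leans on a restricted eval because the grammar allows only string comparisons combined with and/or:

```python
import os
import platform
import sys

def default_environment():
    """The execution-environment names listed above."""
    return {
        'python_version': '%s.%s' % (sys.version_info[0], sys.version_info[1]),
        'os_name': os.name,
        'sys_platform': sys.platform,
        'platform_version': platform.version(),
        'platform_machine': platform.machine(),
    }

def evaluate_condition(condition, environment=None):
    env = environment if environment is not None else default_environment()
    # Empty builtins: only the environment names are resolvable.
    return bool(eval(condition, {'__builtins__': {}}, env))

evaluate_condition("sys_platform == 'win32'", {'sys_platform': 'linux2'})  # False
```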
Distutils will provide a function that is able to generate the metadata of a distribution, given a setup.cfg file, for the execution environment:
>>> from distutils.util import local_metadata
>>> local_metadata('setup.cfg')
<DistributionMetadata instance>
This means that a vanilla Python will be able to read the metadata of a package without running any third party code.
Notice that this feature is not restricted to the metadata namespace. Consequently, any other section can be extended with such context-dependent sections.
Impact on PKG-INFO generation and PEP 314
When PKG-INFO is generated by Distutils, every field that relies on a condition will have that condition written at the end of the line, after a ; separator:
Metadata-Version: 1.2
Name: distribute
Version: 0.6.4
...
Requires: pywin32, bar > 1.0; sys_platform == 'win32'
Requires: foo; os_machine == 'i386'
Requires: bar; python_version == '2.4' or python_version == '2.5'
Requires: baz; 'linux' in sys_platform
Obsoletes: pywin31; sys_platform == 'win32'
...
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Python Software Foundation License
Notice that this file can be opened with the DistributionMetadata class. This class will be able to use the micro-language using the execution environment.
Let's run it on a Python 2.5 i386 Linux:
>>> from distutils.dist import DistributionMetadata
>>> metadata = DistributionMetadata('PKG_INFO')
>>> metadata.get_requires()
['foo', 'bar', 'baz']
The execution environment can be overridden in case we want to get the metadata for another environment:
>>> env = {'python_version': '2.4',
... 'os_name': 'nt',
... 'sys_platform': 'win32',
... 'platform_version': 'MVCC++ 6.0',
... 'platform_machine': 'i386'}
...
>>> metadata = DistributionMetadata('PKG_INFO', environment=env)
>>> metadata.get_requires()
['bar > 1.0', 'foo', 'bar']
PEP 314 is changed accordingly, meaning that each field will be able to have that extra condition marker.
Compatibility
This change is based on a new metadata 1.2 format, meaning that Distutils will be able to distinguish old PKG-INFO files from new ones.
The setup.cfg file change will stay ConfigParser-compatible and will not break existing setup.cfg files.
Limitations
We are not providing < and > operators at this time, and python_version is a regular string. This implies using or operators when a section needs to be restricted to a couple of Python versions. However, if PEP 386 is accepted, python_version could be changed internally into something comparable with strings, and the < and > operators introduced.
Last, if a distribution is unable to set all metadata fields in setup.cfg, that's fine: the fields will be set to UNKNOWN when local_metadata is called. Getting UNKNOWN values will mean that it might be necessary to run the setup.py command line interface to get the whole set of metadata.
Acknowledgments
The Distutils-SIG.
Copyright
This document has been placed in the public domain.
pep-0391 Dictionary-Based Configuration For Logging
| PEP: | 391 |
|---|---|
| Title: | Dictionary-Based Configuration For Logging |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Vinay Sajip <vinay_sajip at red-dove.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 15-Oct-2009 |
| Python-Version: | 2.7, 3.2 |
| Post-History: | |
Contents
Abstract
This PEP describes a new way of configuring logging using a dictionary to hold configuration information.
Rationale
The present means for configuring Python's logging package is either by using the logging API to configure logging programmatically, or else by means of ConfigParser-based configuration files.
Programmatic configuration, while offering maximal control, fixes the configuration in Python code. This does not facilitate changing it easily at runtime, and, as a result, the ability to flexibly turn the verbosity of logging up and down for different parts of a using application is lost. This limits the usability of logging as an aid to diagnosing problems - and sometimes, logging is the only diagnostic aid available in production environments.
The ConfigParser-based configuration system is usable, but does not allow its users to configure all aspects of the logging package. For example, Filters cannot be configured using this system. Furthermore, the ConfigParser format appears to engender dislike (sometimes strong dislike) in some quarters. Though it was chosen because it was the only configuration format supported in the standard library at that time, many people regard it (or perhaps just the particular schema chosen for logging's configuration) as 'crufty' or 'ugly', in some cases apparently on purely aesthetic grounds.
Recent versions of Python include JSON support in the standard library, and this is also usable as a configuration format. In other environments, such as Google App Engine, YAML is used to configure applications, and usually the configuration of logging would be considered an integral part of the application configuration. Although the standard library does not contain YAML support at present, support for both JSON and YAML can be provided in a common way because both of these serialization formats allow deserialization to Python dictionaries.
By providing a way to configure logging by passing the configuration in a dictionary, logging will be easier to configure not only for users of JSON and/or YAML, but also for users of custom configuration methods, by providing a common format in which to describe the desired configuration.
Another drawback of the current ConfigParser-based configuration system is that it does not support incremental configuration: a new configuration completely replaces the existing configuration. Although full flexibility for incremental configuration is difficult to provide in a multi-threaded environment, the new configuration mechanism will allow the provision of limited support for incremental configuration.
Specification
The specification consists of two parts: the API and the format of the dictionary used to convey configuration information (i.e. the schema to which it must conform).
Naming
Historically, the logging package has not been PEP 8 conformant [1]. At some future time, this will be corrected by changing method and function names in the package in order to conform with PEP 8. However, in the interests of uniformity, the proposed additions to the API use a naming scheme which is consistent with the present scheme used by logging.
API
The logging.config module will have the following addition:
- A function, called dictConfig(), which takes a single argument - the dictionary holding the configuration. Exceptions will be raised if there are errors while processing the dictionary.
It will be possible to customize this API - see the section on API Customization. Incremental configuration is covered in its own section.
Dictionary Schema - Overview
Before describing the schema in detail, it is worth saying a few words about object connections, support for user-defined objects and access to external and internal objects.
Object connections
The schema is intended to describe a set of logging objects - loggers, handlers, formatters, filters - which are connected to each other in an object graph. Thus, the schema needs to represent connections between the objects. For example, say that, once configured, a particular logger has attached to it a particular handler. For the purposes of this discussion, we can say that the logger represents the source, and the handler the destination, of a connection between the two. Of course in the configured objects this is represented by the logger holding a reference to the handler. In the configuration dict, this is done by giving each destination object an id which identifies it unambiguously, and then using the id in the source object's configuration to indicate that a connection exists between the source and the destination object with that id.
So, for example, consider the following YAML snippet:
formatters:
brief:
# configuration for formatter with id 'brief' goes here
precise:
# configuration for formatter with id 'precise' goes here
handlers:
h1: #This is an id
# configuration of handler with id 'h1' goes here
formatter: brief
h2: #This is another id
# configuration of handler with id 'h2' goes here
formatter: precise
loggers:
foo.bar.baz:
# other configuration for logger 'foo.bar.baz'
handlers: [h1, h2]
(Note: YAML will be used in this document as it is a little more readable than the equivalent Python source form for the dictionary.)
The ids for loggers are the logger names which would be used programmatically to obtain a reference to those loggers, e.g. foo.bar.baz. The ids for Formatters and Filters can be any string value (such as brief, precise above) and they are transient, in that they are only meaningful for processing the configuration dictionary and used to determine connections between objects, and are not persisted anywhere when the configuration call is complete.
Handler ids are treated specially, see the section on Handler Ids, below.
The above snippet indicates that logger named foo.bar.baz should have two handlers attached to it, which are described by the handler ids h1 and h2. The formatter for h1 is that described by id brief, and the formatter for h2 is that described by id precise.
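In Python source form, the same object graph could be expressed as a dictionary and passed to the proposed dictConfig(); the formatter strings and handler classes here are illustrative, and the final schema also requires a 'version' key:

```python
import logging
import logging.config

config = {
    'version': 1,
    'formatters': {
        'brief': {'format': '%(message)s'},
        'precise': {'format': '%(asctime)s %(levelname)-8s %(name)-15s %(message)s'},
    },
    'handlers': {
        # Handler ids h1 and h2 connect loggers to handlers,
        # and handlers to formatters, via the ids above.
        'h1': {'class': 'logging.StreamHandler', 'formatter': 'brief'},
        'h2': {'class': 'logging.StreamHandler', 'formatter': 'precise'},
    },
    'loggers': {
        'foo.bar.baz': {'handlers': ['h1', 'h2'], 'level': 'DEBUG'},
    },
}

logging.config.dictConfig(config)
logger = logging.getLogger('foo.bar.baz')  # now has h1 and h2 attached
```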
User-defined objects
The schema should support user-defined objects for handlers, filters and formatters. (Loggers do not need to have different types for different instances, so there is no support - in the configuration - for user-defined logger classes.)
Objects to be configured will typically be described by dictionaries which detail their configuration. In some places, the logging system will be able to infer from the context how an object is to be instantiated, but when a user-defined object is to be instantiated, the system will not know how to do this. In order to provide complete flexibility for user-defined object instantiation, the user will need to provide a 'factory' - a callable which is called with a configuration dictionary and which returns the instantiated object. This will be signalled by an absolute import path to the factory being made available under the special key '()'. Here's a concrete example:
formatters:
brief:
format: '%(message)s'
default:
format: '%(asctime)s %(levelname)-8s %(name)-15s %(message)s'
datefmt: '%Y-%m-%d %H:%M:%S'
custom:
(): my.package.customFormatterFactory
bar: baz
spam: 99.9
answer: 42
The above YAML snippet defines three formatters. The first, with id brief, is a standard logging.Formatter instance with the specified format string. The second, with id default, has a longer format and also defines the time format explicitly, and will result in a logging.Formatter initialized with those two format strings. Shown in Python source form, the brief and default formatters have configuration sub-dictionaries:
{
'format' : '%(message)s'
}
and:
{
'format' : '%(asctime)s %(levelname)-8s %(name)-15s %(message)s',
'datefmt' : '%Y-%m-%d %H:%M:%S'
}
respectively, and as these dictionaries do not contain the special key '()', the instantiation is inferred from the context: as a result, standard logging.Formatter instances are created. The configuration sub-dictionary for the third formatter, with id custom, is:
{
'()' : 'my.package.customFormatterFactory',
'bar' : 'baz',
'spam' : 99.9,
'answer' : 42
}
and this contains the special key '()', which means that user-defined instantiation is wanted. In this case, the specified factory callable will be used. If it is an actual callable it will be used directly - otherwise, if you specify a string (as in the example) the actual callable will be located using normal import mechanisms. The callable will be called with the remaining items in the configuration sub-dictionary as keyword arguments. In the above example, the formatter with id custom will be assumed to be returned by the call:
my.package.customFormatterFactory(bar='baz', spam=99.9, answer=42)
The key '()' has been used as the special key because it is not a valid keyword parameter name, and so will not clash with the names of the keyword arguments used in the call. The '()' also serves as a mnemonic that the corresponding value is a callable.
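A minimal sketch of the '()' convention; this is illustrative only, not the real configurator, and the fallback to logging.Formatter merely mirrors the formatter examples above:

```python
import logging

def instantiate(config):
    config = dict(config)            # don't mutate the caller's dict
    factory = config.pop('()', None)
    if factory is None:
        # No '()': infer instantiation from context (a Formatter here).
        return logging.Formatter(config.get('format'), config.get('datefmt'))
    if isinstance(factory, str):
        # Locate the callable using normal import mechanisms.
        module, _, name = factory.rpartition('.')
        factory = getattr(__import__(module, fromlist=[name]), name)
    return factory(**config)         # remaining items become keyword args
```

With the custom sub-dictionary shown above, this would perform exactly the my.package.customFormatterFactory(bar='baz', spam=99.9, answer=42) call.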
Access to external objects
There are times where a configuration will need to refer to objects external to the configuration, for example sys.stderr. If the configuration dict is constructed using Python code then this is straightforward, but a problem arises when the configuration is provided via a text file (e.g. JSON, YAML). In a text file, there is no standard way to distinguish sys.stderr from the literal string 'sys.stderr'. To facilitate this distinction, the configuration system will look for certain special prefixes in string values and treat them specially. For example, if the literal string 'ext://sys.stderr' is provided as a value in the configuration, then the ext:// will be stripped off and the remainder of the value processed using normal import mechanisms.
The handling of such prefixes will be done in a way analogous to protocol handling: there will be a generic mechanism to look for prefixes which match the regular expression ^(?P<prefix>[a-z]+)://(?P<suffix>.*)$ whereby, if the prefix is recognised, the suffix is processed in a prefix-dependent manner and the result of the processing replaces the string value. If the prefix is not recognised, then the string value will be left as-is.
The implementation will provide for a set of standard prefixes such as ext:// but it will be possible to disable the mechanism completely or provide additional or different prefixes for special handling.
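The generic prefix mechanism might look like the following sketch; the `HANDLERS` table and function names are assumptions for illustration, with only the regular expression and the ext:// semantics taken from the text above:

```python
# Illustrative sketch of protocol-style prefix handling for string values.
import importlib
import re

PREFIX_RE = re.compile(r'^(?P<prefix>[a-z]+)://(?P<suffix>.*)$')

def ext_handler(suffix):
    """Resolve 'ext://' suffixes using normal import mechanisms."""
    module_name, _, attr = suffix.rpartition('.')
    return getattr(importlib.import_module(module_name), attr)

# Registry of recognised prefixes; additional or different prefixes
# could be registered here, or the mechanism disabled entirely.
HANDLERS = {'ext': ext_handler}

def convert(value):
    """Replace a string value if it carries a recognised prefix."""
    if isinstance(value, str):
        m = PREFIX_RE.match(value)
        if m and m.group('prefix') in HANDLERS:
            return HANDLERS[m.group('prefix')](m.group('suffix'))
    return value  # unrecognised prefixes are left as-is
```

With this sketch, `convert('ext://sys.stderr')` yields the `sys.stderr` object itself, while a string with an unknown prefix passes through unchanged.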
Access to internal objects
As well as external objects, there is sometimes also a need to refer to objects in the configuration. This will be done implicitly by the configuration system for things that it knows about. For example, the string value 'DEBUG' for a level in a logger or handler will automatically be converted to the value logging.DEBUG, and the handlers, filters and formatter entries will take an object id and resolve to the appropriate destination object.
However, a more generic mechanism needs to be provided for user-defined objects which are not known to the logging system. For example, consider logging.handlers.MemoryHandler, which takes a target which is another handler to delegate to. Since the system already knows about this class, in the configuration the given target only needs to be the object id of the relevant target handler, and the system will resolve the handler from the id. If, however, a user defines a my.package.MyHandler which has an alternate handler, the configuration system would not know that the alternate key referred to a handler. To cater for this, a generic resolution system will be provided which allows the user to specify:
handlers:
  file:
    # configuration of file handler goes here
  custom:
    (): my.package.MyHandler
    alternate: cfg://handlers.file
The literal string 'cfg://handlers.file' will be resolved in an analogous way to the strings with the ext:// prefix, but looking in the configuration itself rather than the import namespace. The mechanism will allow access by dot or by index, in a similar way to that provided by str.format. Thus, given the following snippet:
handlers:
  email:
    class: logging.handlers.SMTPHandler
    mailhost: localhost
    fromaddr: my_app@domain.tld
    toaddrs:
      - support_team@domain.tld
      - dev_team@domain.tld
    subject: Houston, we have a problem.
in the configuration, the string 'cfg://handlers' would resolve to the dict with key handlers, the string 'cfg://handlers.email' would resolve to the dict with key email in the handlers dict, and so on. The string 'cfg://handlers.email.toaddrs[1]' would resolve to the value 'dev_team@domain.tld' and the string 'cfg://handlers.email.toaddrs[0]' would resolve to the value 'support_team@domain.tld'. The subject value could be accessed using either 'cfg://handlers.email.subject' or, equivalently, 'cfg://handlers.email[subject]'. The latter form only needs to be used if the key contains spaces or non-alphanumeric characters. If an index value consists only of decimal digits, access will be attempted using the corresponding integer value, falling back to the string value if needed.
Given a string cfg://handlers.myhandler.mykey.123, this will resolve to config_dict['handlers']['myhandler']['mykey']['123']. If the string is specified as cfg://handlers.myhandler.mykey[123], the system will attempt to retrieve the value from config_dict['handlers']['myhandler']['mykey'][123], and fall back to config_dict['handlers']['myhandler']['mykey']['123'] if that fails.
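The path-resolution rules above (string keys for dotted parts, integer-first access with a string fallback for bracketed digit indices) can be sketched as follows; `cfg_resolve` is an illustrative name, and the 'cfg://' prefix is assumed to have been stripped already:

```python
# Sketch of cfg:// path resolution against a configuration dictionary.
import re

# Either a run of non-separator characters (dotted part) or a [bracketed]
# index; dots between parts are simply skipped by finditer.
PART = re.compile(r'(?P<dot>[^.\[\]]+)|\[(?P<idx>[^\]]*)\]')

def cfg_resolve(config, path):
    """Resolve e.g. 'handlers.email.toaddrs[1]' against config."""
    current = config
    for m in PART.finditer(path):
        if m.group('dot') is not None:
            current = current[m.group('dot')]      # dotted parts: string keys
        else:
            idx = m.group('idx')
            if idx.isdigit():
                try:
                    current = current[int(idx)]    # integer access first
                except (KeyError, IndexError, TypeError):
                    current = current[idx]         # fall back to string key
            else:
                current = current[idx]
    return current
```

For instance, against the email handler snippet above, `cfg_resolve(config, 'handlers.email.toaddrs[1]')` returns the second address in the toaddrs list.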
Handler Ids
Some specific logging configurations require the use of handler levels to achieve the desired effect. However, unlike loggers which can always be identified by their names, handlers have no persistent handles whereby levels can be changed via an incremental configuration call.
Therefore, this PEP proposes to add an optional name property to handlers. If used, this will add an entry in a dictionary which maps the name to the handler. (The entry will be removed when the handler is closed.) When an incremental configuration call is made, handlers will be looked up in this dictionary to set the handler level according to the value in the configuration. See the section on incremental configuration for more details.
In theory, such a "persistent name" facility could also be provided for Filters and Formatters. However, there is not a strong case to be made for being able to configure these incrementally. On the basis that practicality beats purity, only Handlers will be given this new name property. The id of a handler in the configuration will become its name.
The handler name lookup dictionary is for configuration use only and will not become part of the public API for the package.
Dictionary Schema - Detail
The dictionary passed to dictConfig() must contain the following keys:
- version - to be set to an integer value representing the schema version. The only valid value at present is 1, but having this key allows the schema to evolve while still preserving backwards compatibility.
All other keys are optional, but if present they will be interpreted as described below. In all cases below where a 'configuring dict' is mentioned, it will be checked for the special '()' key to see if a custom instantiation is required. If so, the mechanism described above is used to instantiate; otherwise, the context is used to determine how to instantiate.
formatters - the corresponding value will be a dict in which each key is a formatter id and each value is a dict describing how to configure the corresponding Formatter instance.
The configuring dict is searched for keys format and datefmt (with defaults of None) and these are used to construct a logging.Formatter instance.
filters - the corresponding value will be a dict in which each key is a filter id and each value is a dict describing how to configure the corresponding Filter instance.
The configuring dict is searched for key name (defaulting to the empty string) and this is used to construct a logging.Filter instance.
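As a brief aside (not from the PEP itself), the name of a logging.Filter determines which records it passes: records from the named logger and its descendants are allowed, everything else is rejected:

```python
# Illustration of logging.Filter name semantics.
import logging

f = logging.Filter('foo')

def allowed(logger_name):
    """Would a record from this logger pass the filter?"""
    record = logging.LogRecord(logger_name, logging.INFO, __file__,
                               1, 'message', None, None)
    return bool(f.filter(record))

print(allowed('foo'))      # True
print(allowed('foo.bar'))  # True (descendant of foo)
print(allowed('spam'))     # False
```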
handlers - the corresponding value will be a dict in which each key is a handler id and each value is a dict describing how to configure the corresponding Handler instance.
The configuring dict is searched for the following keys:
- class (mandatory). This is the fully qualified name of the handler class.
- level (optional). The level of the handler.
- formatter (optional). The id of the formatter for this handler.
- filters (optional). A list of ids of the filters for this handler.
All other keys are passed through as keyword arguments to the handler's constructor. For example, given the snippet:
handlers:
  console:
    class : logging.StreamHandler
    formatter: brief
    level   : INFO
    filters: [allow_foo]
    stream  : ext://sys.stdout
  file:
    class : logging.handlers.RotatingFileHandler
    formatter: precise
    filename: logconfig.log
    maxBytes: 1024
    backupCount: 3
the handler with id console is instantiated as a logging.StreamHandler, using sys.stdout as the underlying stream. The handler with id file is instantiated as a logging.handlers.RotatingFileHandler with the keyword arguments filename='logconfig.log', maxBytes=1024, backupCount=3.
loggers - the corresponding value will be a dict in which each key is a logger name and each value is a dict describing how to configure the corresponding Logger instance.
The configuring dict is searched for the following keys:
- level (optional). The level of the logger.
- propagate (optional). The propagation setting of the logger.
- filters (optional). A list of ids of the filters for this logger.
- handlers (optional). A list of ids of the handlers for this logger.
The specified loggers will be configured according to the level, propagation, filters and handlers specified.
root - this will be the configuration for the root logger. Processing of the configuration will be as for any logger, except that the propagate setting will not be applicable.
incremental - whether the configuration is to be interpreted as incremental to the existing configuration. This value defaults to False, which means that the specified configuration replaces the existing configuration with the same semantics as used by the existing fileConfig() API.
If the specified value is True, the configuration is processed as described in the section on Incremental Configuration, below.
disable_existing_loggers - whether any existing loggers are to be disabled. This setting mirrors the parameter of the same name in fileConfig(). If absent, this parameter defaults to True. This value is ignored if incremental is True.
A Working Example
The following is an actual working configuration in YAML format (except that the email addresses are bogus):
formatters:
  brief:
    format: '%(levelname)-8s: %(name)-15s: %(message)s'
  precise:
    format: '%(asctime)s %(name)-15s %(levelname)-8s %(message)s'
filters:
  allow_foo:
    name: foo
handlers:
  console:
    class : logging.StreamHandler
    formatter: brief
    level   : INFO
    stream  : ext://sys.stdout
    filters: [allow_foo]
  file:
    class : logging.handlers.RotatingFileHandler
    formatter: precise
    filename: logconfig.log
    maxBytes: 1024
    backupCount: 3
  debugfile:
    class : logging.FileHandler
    formatter: precise
    filename: logconfig-detail.log
    mode: a
  email:
    class: logging.handlers.SMTPHandler
    mailhost: localhost
    fromaddr: my_app@domain.tld
    toaddrs:
      - support_team@domain.tld
      - dev_team@domain.tld
    subject: Houston, we have a problem.
loggers:
  foo:
    level   : ERROR
    handlers: [debugfile]
  spam:
    level   : CRITICAL
    handlers: [debugfile]
    propagate: no
  bar.baz:
    level: WARNING
root:
  level   : DEBUG
  handlers : [console, file]
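For comparison, a trimmed version of this configuration can be expressed directly as a Python dict and applied with dictConfig (available as logging.config.dictConfig from Python 3.2 onward). The trimming to a console-only setup is mine, so the example needs no files or mail server:

```python
# Console-only subset of the working example, as a Python dict.
import logging
import logging.config

LOGGING = {
    'version': 1,
    'formatters': {
        'brief': {'format': '%(levelname)-8s: %(name)-15s: %(message)s'},
    },
    'handlers': {
        'console': {
            'class': 'logging.StreamHandler',
            'formatter': 'brief',
            'level': 'INFO',
            # The ext:// prefix resolves to the sys.stdout object itself.
            'stream': 'ext://sys.stdout',
        },
    },
    'root': {'level': 'DEBUG', 'handlers': ['console']},
}

logging.config.dictConfig(LOGGING)
logging.getLogger('spam').critical('configured via dictConfig')
```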
Incremental Configuration
It is difficult to provide complete flexibility for incremental configuration. For example, because objects such as filters and formatters are anonymous, once a configuration is set up, it is not possible to refer to such anonymous objects when augmenting a configuration.
Furthermore, there is not a compelling case for arbitrarily altering the object graph of loggers, handlers, filters, formatters at run-time, once a configuration is set up; the verbosity of loggers and handlers can be controlled just by setting levels (and, in the case of loggers, propagation flags). Changing the object graph arbitrarily in a safe way is problematic in a multi-threaded environment; while not impossible, the benefits are not worth the complexity it adds to the implementation.
Thus, when the incremental key of a configuration dict is present and is True, the system will ignore any formatters and filters entries completely, and process only the level settings in the handlers entries, and the level and propagate settings in the loggers and root entries.
It's certainly possible to provide incremental configuration by other means, for example making dictConfig() take an incremental keyword argument which defaults to False. The reason for suggesting that a value in the configuration dict be used is that it allows for configurations to be sent over the wire as pickled dicts to a socket listener. Thus, the logging verbosity of a long-running application can be altered over time with no need to stop and restart the application.
Note: Feedback on incremental configuration needs based on your practical experience will be particularly welcome.
API Customization
The bare-bones dictConfig() API will not be sufficient for all use cases. Provision for customization of the API will be made by providing the following:
- A class, called DictConfigurator, whose constructor is passed the dictionary used for configuration, and which has a configure() method.
- A callable, called dictConfigClass, which will (by default) be set to DictConfigurator. This is provided so that if desired, DictConfigurator can be replaced with a suitable user-defined implementation.
The dictConfig() function will call dictConfigClass passing the specified dictionary, and then call the configure() method on the returned object to actually put the configuration into effect:
def dictConfig(config):
    dictConfigClass(config).configure()
This should cater to all customization needs. For example, a subclass of DictConfigurator could call DictConfigurator.__init__() in its own __init__(), then set up custom prefixes which would be usable in the subsequent configure() call. The dictConfigClass would be bound to the subclass, and then dictConfig() could be called exactly as in the default, uncustomized state.
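As a sketch of such a customization, the following subclass registers an extra value prefix, along the lines described above. The 'env://' prefix and the converter-table manipulation are my illustration (the value_converters attribute is part of the configurator machinery as it landed in logging.config in Python 3.2, not of this PEP's public API):

```python
# Hypothetical customization: an 'env://' prefix resolving against
# environment variables, added by subclassing DictConfigurator.
import logging.config
import os

class EnvDictConfigurator(logging.config.DictConfigurator):
    def __init__(self, config):
        super().__init__(config)
        # Copy so the class-level converter table is not mutated globally.
        self.value_converters = dict(self.value_converters)
        self.value_converters['env'] = 'env_convert'

    def env_convert(self, suffix):
        # Resolve 'env://NAME' to the value of environment variable NAME.
        return os.environ.get(suffix, '')

# Bind the subclass; dictConfig() now uses it, uncustomized call syntax.
logging.config.dictConfigClass = EnvDictConfigurator
```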
Change to Socket Listener Implementation
The existing socket listener implementation will be modified as follows: when a configuration message is received, an attempt will be made to deserialize to a dictionary using the json module. If this step fails, the message will be assumed to be in the fileConfig format and processed as before. If deserialization is successful, then dictConfig() will be called to process the resulting dictionary.
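A sender for the modified listener might be sketched as follows. The send_logging_config helper and the example payload are mine; the 4-byte big-endian length prefix matches the framing already used by logging.config.listen(), and JSON is used because it is what the modified listener tries first:

```python
# Sketch: ship a dict-format configuration to a running socket listener.
import json
import logging.config
import socket
import struct

def send_logging_config(config,
                        host='localhost',
                        port=logging.config.DEFAULT_LOGGING_CONFIG_PORT):
    """Send a JSON-encoded config dict to a logging.config.listen() server."""
    payload = json.dumps(config).encode('utf-8')
    with socket.create_connection((host, port)) as sock:
        # Listener protocol: 4-byte big-endian length, then the data.
        sock.sendall(struct.pack('>L', len(payload)) + payload)

# Example payload: bump verbosity of a long-running application
# incrementally, without stopping and restarting it.
config = {
    'version': 1,
    'incremental': True,
    'handlers': {'console': {'level': 'DEBUG'}},
    'loggers': {'foo': {'level': 'DEBUG'}},
}
```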
Configuration Errors
If an error is encountered during configuration, the system will raise a ValueError, TypeError, AttributeError or ImportError with a suitably descriptive message. The following is a (possibly incomplete) list of conditions which will raise an error:
- A level which is not a string or which is a string not corresponding to an actual logging level
- A propagate value which is not a boolean
- An id which does not have a corresponding destination
- A non-existent handler id found during an incremental call
- An invalid logger name
- Inability to resolve to an internal or external object
Discussion in the community
The PEP has been announced on python-dev and python-list. While there hasn't been a huge amount of discussion, this is perhaps to be expected for a niche topic.
Discussion threads on python-dev:
- http://mail.python.org/pipermail/python-dev/2009-October/092695.html
- http://mail.python.org/pipermail/python-dev/2009-October/092782.html
- http://mail.python.org/pipermail/python-dev/2009-October/093062.html
And on python-list:
- http://mail.python.org/pipermail/python-list/2009-October/1223658.html
- http://mail.python.org/pipermail/python-list/2009-October/1224228.html
There have been some comments in favour of the proposal, no objections to the proposal as a whole, and some questions and objections about specific details. These are believed by the author to have been addressed by making changes to the PEP.
Reference implementation
A reference implementation of the changes is available as a module dictconfig.py with accompanying unit tests in test_dictconfig.py, at:
http://bitbucket.org/vinay.sajip/dictconfig
This incorporates all features other than the socket listener change.
References
| [1] | PEP 8, Style Guide for Python Code, van Rossum, Warsaw (http://www.python.org/dev/peps/pep-0008) |
Copyright
This document has been placed in the public domain.
pep-0392 Python 3.2 Release Schedule
| PEP: | 392 |
|---|---|
| Title: | Python 3.2 Release Schedule |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Georg Brandl <georg at python.org> |
| Status: | Active |
| Type: | Informational |
| Content-Type: | text/x-rst |
| Created: | 30-Dec-2009 |
| Python-Version: | 3.2 |
Contents
Abstract
This document describes the development and release schedule for the Python 3.2 series. The schedule primarily concerns itself with PEP-sized items.
Release Manager and Crew
- 3.2 Release Manager: Georg Brandl
- Windows installers: Martin v. Loewis
- Mac installers: Ronald Oussoren
- Documentation: Georg Brandl
3.2 Lifespan
3.2 will receive bugfix updates approximately every 4-6 months for approximately 18 months. After the release of 3.3.0 final (see PEP 398), a final 3.2 bugfix update will be released. After that, security updates (source only) will be released until 5 years after the release of 3.2 final, which will be February 2016.
Release Schedule
3.2 schedule
- 3.2 alpha 1: August 1, 2010
- 3.2 alpha 2: September 6, 2010
- 3.2 alpha 3: October 12, 2010
- 3.2 alpha 4: November 16, 2010
- 3.2 beta 1: December 6, 2010
(No new features beyond this point.)
- 3.2 beta 2: December 20, 2010
- 3.2 candidate 1: January 16, 2011
- 3.2 candidate 2: January 31, 2011
- 3.2 candidate 3: February 14, 2011
- 3.2 final: February 20, 2011
3.2.1 schedule
- 3.2.1 beta 1: May 8, 2011
- 3.2.1 candidate 1: May 17, 2011
- 3.2.1 candidate 2: July 3, 2011
- 3.2.1 final: July 11, 2011
3.2.2 schedule
- 3.2.2 candidate 1: August 14, 2011
- 3.2.2 final: September 4, 2011
3.2.3 schedule
- 3.2.3 candidate 1: February 25, 2012
- 3.2.3 candidate 2: March 18, 2012
- 3.2.3 final: April 11, 2012
3.2.4 schedule
- 3.2.4 candidate 1: March 23, 2013
- 3.2.4 final: April 6, 2013
3.2.5 schedule (regression fix release)
- 3.2.5 final: May 13, 2013
-- Only security releases after 3.2.5 --
3.2.6 schedule
- 3.2.6 candidate 1 (source-only release): October 4, 2014
- 3.2.6 final (source-only release): October 11, 2014
Features for 3.2
Note that PEP 3003 [1] is in effect: no changes to language syntax and no additions to the builtins may be made.
No large-scale changes have been recorded yet.
Copyright
This document has been placed in the public domain.
pep-0393 Flexible String Representation
| PEP: | 393 |
|---|---|
| Title: | Flexible String Representation |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Martin v. Löwis <martin at v.loewis.de> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 24-Jan-2010 |
| Python-Version: | 3.3 |
| Post-History: |
Contents
Abstract
The Unicode string type is changed to support multiple internal representations, depending on the character with the largest Unicode ordinal (1, 2, or 4 bytes). This will allow a space-efficient representation in common cases, but give access to full UCS-4 on all systems. For compatibility with existing APIs, several representations may exist in parallel; over time, this compatibility should be phased out. The distinction between narrow and wide Unicode builds is dropped. An implementation of this PEP is available at [1].
Rationale
There are two classes of complaints about the current implementation of the unicode type: on systems only supporting UTF-16, users complain that non-BMP characters are not properly supported. On systems using UCS-4 internally (and also sometimes on systems using UCS-2), there is a complaint that Unicode strings take up too much memory - especially compared to Python 2.x, where the same code would often use ASCII strings (i.e. ASCII-encoded byte strings). With the proposed approach, ASCII-only Unicode strings will again use only one byte per character; while still allowing efficient indexing of strings containing non-BMP characters (as strings containing them will use 4 bytes per character).
One problem with the approach is support for existing applications (e.g. extension modules). For compatibility, redundant representations may be computed. Applications are encouraged to phase out reliance on a specific internal representation if possible. As interaction with other libraries will often require some sort of internal representation, the specification chooses UTF-8 as the recommended way of exposing strings to C code.
For many strings (e.g. ASCII), multiple representations may actually share memory (e.g. the shortest form may be shared with the UTF-8 form if all characters are ASCII). With such sharing, the overhead of compatibility representations is reduced. If representations do share data, it is also possible to omit structure fields, reducing the base size of string objects.
Specification
Unicode structures are now defined as a hierarchy of structures, namely:
typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    Py_hash_t hash;
    struct {
        unsigned int interned:2;
        unsigned int kind:2;
        unsigned int compact:1;
        unsigned int ascii:1;
        unsigned int ready:1;
    } state;
    wchar_t *wstr;
} PyASCIIObject;

typedef struct {
    PyASCIIObject _base;
    Py_ssize_t utf8_length;
    char *utf8;
    Py_ssize_t wstr_length;
} PyCompactUnicodeObject;

typedef struct {
    PyCompactUnicodeObject _base;
    union {
        void *any;
        Py_UCS1 *latin1;
        Py_UCS2 *ucs2;
        Py_UCS4 *ucs4;
    } data;
} PyUnicodeObject;
Objects for which both size and maximum character are known at creation time are called "compact" unicode objects; character data immediately follow the base structure. If the maximum character is less than 128, they use the PyASCIIObject structure, and the UTF-8 data, the UTF-8 length and the wstr length are the same as the length of the ASCII data. For non-ASCII strings, the PyCompactUnicodeObject structure is used. Resizing compact objects is not supported.
Objects for which the maximum character is not given at creation time are called "legacy" objects, created through PyUnicode_FromStringAndSize(NULL, length). They use the PyUnicodeObject structure. Initially, their data is only in the wstr pointer; when PyUnicode_READY is called, the data pointer (union) is allocated. Resizing is possible as long as PyUnicode_READY has not been called.
The fields have the following interpretations:
- length: number of code points in the string (result of sq_length)
- interned: interned state (SSTATE_*) as in 3.2
- kind: form of the string
  - 00 => str is not initialized (data are in wstr)
  - 01 => 1 byte (Latin-1)
  - 10 => 2 byte (UCS-2)
  - 11 => 4 byte (UCS-4)
- compact: the object uses one of the compact representations (implies ready)
- ascii: the object uses the PyASCIIObject representation (implies compact and ready)
- ready: the canonical representation is ready to be accessed through PyUnicode_DATA and PyUnicode_GET_LENGTH. This is set either if the object is compact, or the data pointer and length have been initialized.
- wstr_length, wstr: representation in the platform's wchar_t (null-terminated). If wchar_t is 16-bit, this form may use surrogate pairs (in which case wstr_length differs from length). wstr_length differs from length only if there are surrogate pairs in the representation.
- utf8_length, utf8: UTF-8 representation (null-terminated).
- data: shortest-form representation of the unicode string. The string is null-terminated (in its respective representation).
All three representations are optional, although the data form is considered the canonical representation which can be absent only while the string is being created. If the representation is absent, the pointer is NULL, and the corresponding length field may contain arbitrary data.
The Py_UNICODE type is still supported but deprecated. It is always defined as a typedef for wchar_t, so the wstr representation can double as Py_UNICODE representation.
The data and utf8 pointers point to the same memory if the string uses only ASCII characters (using only Latin-1 is not sufficient). The data and wstr pointers point to the same memory if the string happens to fit exactly to the wchar_t type of the platform (i.e. uses some BMP-not-Latin-1 characters if sizeof(wchar_t) is 2, and uses some non-BMP characters if sizeof(wchar_t) is 4).
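The effect of the flexible representation is observable from pure Python via sys.getsizeof: on CPython 3.3+ the marginal cost per character grows from 1 to 2 to 4 bytes as the widest code point in the string grows (the helper below is mine; the byte counts hold on CPython but not necessarily on other implementations):

```python
# Observe the per-character storage cost of each representation kind.
import sys

def per_char(ch, n=1000):
    """Marginal bytes per character for strings made of copies of ch."""
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n

print(per_char('a'))           # ASCII:              1 byte per character
print(per_char('\u00e9'))      # Latin-1:            1 byte per character
print(per_char('\u0394'))      # UCS-2 (BMP):        2 bytes per character
print(per_char('\U0001f600'))  # UCS-4 (non-BMP):    4 bytes per character
```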
String Creation
The recommended way to create a Unicode object is to use the function PyUnicode_New:
PyObject* PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar);
Both parameters must denote the eventual size/range of the string. In particular, codecs using this API must compute both the number of characters and the maximum character in advance. A string is allocated according to the specified size and character range and is null-terminated; the actual characters in it may be uninitialized.
PyUnicode_FromString and PyUnicode_FromStringAndSize remain supported for processing UTF-8 input; the input is decoded, and the UTF-8 representation is not yet set for the string.
PyUnicode_FromUnicode remains supported but is deprecated. If the Py_UNICODE pointer is non-null, the data representation is set. If the pointer is NULL, a properly-sized wstr representation is allocated, which can be modified until PyUnicode_READY() is called (explicitly or implicitly). Resizing a Unicode string remains possible until it is finalized.
PyUnicode_READY() converts a string containing only a wstr representation into the canonical representation. Unless wstr and data can share the memory, the wstr representation is discarded after the conversion. The macro returns 0 on success and -1 on failure, which happens in particular if the memory allocation fails.
String Access
The canonical representation can be accessed using the two macros PyUnicode_KIND and PyUnicode_DATA. PyUnicode_KIND gives one of the values PyUnicode_WCHAR_KIND (0), PyUnicode_1BYTE_KIND (1), PyUnicode_2BYTE_KIND (2), or PyUnicode_4BYTE_KIND (3). PyUnicode_DATA gives the void pointer to the data. Access to individual characters should use PyUnicode_{READ|WRITE}[_CHAR]:
- PyUnicode_READ(kind, data, index)
- PyUnicode_WRITE(kind, data, index, value)
- PyUnicode_READ_CHAR(unicode, index)
All these macros assume that the string is in canonical form; callers need to ensure this by calling PyUnicode_READY.
A new function PyUnicode_AsUTF8 is provided to access the UTF-8 representation. It is thus identical to the existing _PyUnicode_AsString, which is removed. The function will compute the utf8 representation when first called. Since this representation will consume memory until the string object is released, applications should use the existing PyUnicode_AsUTF8String where possible (which generates a new string object every time). APIs that implicitly convert a string to a char* (such as the ParseTuple functions) will use PyUnicode_AsUTF8 to compute a conversion.
New API
This section summarizes the API additions.
Macros to access the internal representation of a Unicode object (read-only):
- PyUnicode_IS_COMPACT_ASCII(o), PyUnicode_IS_COMPACT(o), PyUnicode_IS_READY(o)
- PyUnicode_GET_LENGTH(o)
- PyUnicode_KIND(o), PyUnicode_CHARACTER_SIZE(o), PyUnicode_MAX_CHAR_VALUE(o)
- PyUnicode_DATA(o), PyUnicode_1BYTE_DATA(o), PyUnicode_2BYTE_DATA(o), PyUnicode_4BYTE_DATA(o)
Character access macros:
- PyUnicode_READ(kind, data, index), PyUnicode_READ_CHAR(o, index)
- PyUnicode_WRITE(kind, data, index, value)
Other macros:
- PyUnicode_READY(o)
- PyUnicode_CONVERT_BYTES(from_type, to_type, begin, end, to)
String creation functions:
- PyUnicode_New(size, maxchar)
- PyUnicode_FromKindAndData(kind, data, size)
- PyUnicode_Substring(o, start, end)
Character access utility functions:
- PyUnicode_GetLength(o), PyUnicode_ReadChar(o, index), PyUnicode_WriteChar(o, index, character)
- PyUnicode_CopyCharacters(to, to_start, from, from_start, how_many)
- PyUnicode_FindChar(str, ch, start, end, direction)
Representation conversion:
- PyUnicode_AsUCS4(o, buffer, buflen)
- PyUnicode_AsUCS4Copy(o)
- PyUnicode_AsUnicodeAndSize(o, size_out)
- PyUnicode_AsUTF8(o)
- PyUnicode_AsUTF8AndSize(o, size_out)
UCS4 utility functions:
- Py_UCS4_{strlen, strcpy, strcat, strncpy, strcmp, strncmp, strchr, strrchr}
Stable ABI
The following functions are added to the stable ABI (PEP 384), as they are independent of the actual representation of Unicode objects: PyUnicode_New, PyUnicode_Substring, PyUnicode_GetLength, PyUnicode_ReadChar, PyUnicode_WriteChar, PyUnicode_Find, PyUnicode_FindChar.
GDB Debugging Hooks
Tools/gdb/libpython.py contains debugging hooks that embed knowledge about the internals of CPython's data types, including PyUnicodeObject instances. It has been updated to track the change.
Deprecations, Removals, and Incompatibilities
While the Py_UNICODE representation and APIs are deprecated with this PEP, no removal of the respective APIs is scheduled. The APIs should remain available at least five years after the PEP is accepted; before they are removed, existing extension modules should be studied to find out whether a sufficient majority of the open-source code on PyPI has been ported to the new API. A reasonable motivation for using the deprecated API even in new code is code that must work on both Python 2 and Python 3.
The following macros and functions are deprecated:
- PyUnicode_FromUnicode
- PyUnicode_GET_SIZE, PyUnicode_GetSize, PyUnicode_GET_DATA_SIZE,
- PyUnicode_AS_UNICODE, PyUnicode_AsUnicode, PyUnicode_AsUnicodeAndSize
- PyUnicode_COPY, PyUnicode_FILL, PyUnicode_MATCH
- PyUnicode_Encode, PyUnicode_EncodeUTF7, PyUnicode_EncodeUTF8, PyUnicode_EncodeUTF16, PyUnicode_EncodeUTF32, PyUnicode_EncodeUnicodeEscape, PyUnicode_EncodeRawUnicodeEscape, PyUnicode_EncodeLatin1, PyUnicode_EncodeASCII, PyUnicode_EncodeCharmap, PyUnicode_TranslateCharmap, PyUnicode_EncodeMBCS, PyUnicode_EncodeDecimal, PyUnicode_TransformDecimalToASCII
- Py_UNICODE_{strlen, strcat, strcpy, strcmp, strchr, strrchr}
- PyUnicode_AsUnicodeCopy
- PyUnicode_GetMax
_PyUnicode_AsDefaultEncodedString is removed. It previously returned a borrowed reference to an UTF-8-encoded bytes object. Since the unicode object cannot anymore cache such a reference, implementing it without leaking memory is not possible. No deprecation phase is provided, since it was an API for internal use only.
Extension modules using the legacy API may inadvertently call PyUnicode_READY, by calling some API that requires that the object is ready, and then continue accessing the (now invalid) Py_UNICODE pointer. Such code will break with this PEP. The code was already flawed in 3.2, as there was no explicit guarantee that the PyUnicode_AS_UNICODE result would stay valid after an API call (due to the possibility of string resizing). Modules that face this issue need to re-fetch the Py_UNICODE pointer after API calls; doing so will continue to work correctly in earlier Python versions.
Discussion
Several concerns have been raised about the approach presented here:
It makes the implementation more complex. That's true, but considered worth it given the benefits.
The Py_UNICODE representation is not instantaneously available, slowing down applications that request it. While this is also true, applications that care about this problem can be rewritten to use the data representation.
Performance
Performance of this patch must be considered for both memory consumption and runtime efficiency. For memory consumption, the expectation is that applications that have many large strings will see a reduction in memory usage. For small strings, the effects depend on the pointer size of the system, and the size of the Py_UNICODE/wchar_t type. The following table demonstrates this for various small ASCII and Latin-1 string sizes and platforms.
| string size | 3.2, 16-bit wchar_t (32-bit) | 3.2, 16-bit wchar_t (64-bit) | 3.2, 32-bit wchar_t (32-bit) | 3.2, 32-bit wchar_t (64-bit) | This PEP, ASCII (32-bit) | This PEP, ASCII (64-bit) | This PEP, Latin-1 (32-bit) | This PEP, Latin-1 (64-bit) |
|---|---|---|---|---|---|---|---|---|
| 1 | 32 | 64 | 40 | 64 | 32 | 56 | 40 | 80 |
| 2 | 40 | 64 | 40 | 72 | 32 | 56 | 40 | 80 |
| 3 | 40 | 64 | 48 | 72 | 32 | 56 | 40 | 80 |
| 4 | 40 | 72 | 48 | 80 | 32 | 56 | 48 | 80 |
| 5 | 40 | 72 | 56 | 80 | 32 | 56 | 48 | 80 |
| 6 | 48 | 72 | 56 | 88 | 32 | 56 | 48 | 80 |
| 7 | 48 | 72 | 64 | 88 | 32 | 56 | 48 | 80 |
| 8 | 48 | 80 | 64 | 96 | 40 | 64 | 48 | 88 |
The runtime effect is significantly affected by the API being used. After porting the relevant pieces of code to the new API, the iobench, stringbench, and json benchmarks typically see slowdowns of 1% to 30%; for specific benchmarks, speedups may occur, as may significantly larger slowdowns.
In actual measurements of a Django application ([2]), significant reductions in memory usage were found. For example, the storage for Unicode objects dropped to 2,216,807 bytes, down from 6,378,540 bytes for a wide Unicode build, and from 3,694,694 bytes for a narrow Unicode build (all on a 32-bit system). This reduction comes from the prevalence of ASCII strings in this application; out of 36,000 strings (with 1,310,000 chars), 35,713 were ASCII strings (with 1,300,000 chars). The sources of these strings were not analysed further; many of them likely originate from identifiers in the library and string constants in Django's source code.
In comparison to Python 2, both Unicode and byte strings need to be accounted for. In the test application, Unicode and byte strings combined had a length of 2,046,000 units (bytes/chars) in 2.x, and 2,200,000 units in 3.x. On a 32-bit system, where the 2.x build used 32-bit wchar_t/Py_UNICODE, the 2.x test used 3,620,000 bytes, and the 3.x build 3,340,000 bytes. This reduction in 3.x under the PEP only occurs when comparing against a wide Unicode build of 2.x.
Porting Guidelines
Only a small fraction of C code is affected by this PEP, namely code that needs to look "inside" unicode strings. That code doesn't necessarily need to be ported to this API, as the existing API will continue to work correctly. In particular, modules that need to support both Python 2 and Python 3 might get too complicated when simultaneously supporting this new API and the old Unicode API.
In order to port modules to the new API, try to eliminate the use of these API elements:
- the Py_UNICODE type,
- PyUnicode_AS_UNICODE and PyUnicode_AsUnicode,
- PyUnicode_GET_SIZE and PyUnicode_GetSize, and
- PyUnicode_FromUnicode.
When iterating over an existing string, or looking at specific characters, use indexing operations rather than pointer arithmetic; indexing works well with PyUnicode_READ(_CHAR) and PyUnicode_WRITE. Use void* as the buffer type for characters to let the compiler detect invalid dereferencing operations. If you do want to use pointer arithmetic (e.g. when converting existing code), use (unsigned) char* as the buffer type, and keep the element size (1, 2, or 4) in a variable. Notice that (1<<(kind-1)) will produce the element size given a buffer kind.
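A quick illustration of the element-size relation quoted above: for a buffer kind (1, 2, or 3 in this draft's numbering), 1 << (kind - 1) yields the per-character width in bytes. This is a plain Python sketch of the arithmetic, not C API code.

```python
def element_size(kind):
    # Element size in bytes for a given buffer kind, per the text above.
    return 1 << (kind - 1)

print({kind: element_size(kind) for kind in (1, 2, 3)})
# -> {1: 1, 2: 2, 3: 4}
```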
When creating new strings, it was common in Python to start off with a heuristic buffer size, and then grow or shrink if the heuristic fails. With this PEP, this is now less practical, as you need a heuristic not only for the length of the string, but also for the maximum character.
In order to avoid heuristics, you need to make two passes over the input: once to determine the output length, and the maximum character; then allocate the target string with PyUnicode_New and iterate over the input a second time to produce the final output. While this may sound expensive, it could actually be cheaper than having to copy the result again as in the following approach.
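The first pass of the two-pass approach can be sketched at the Python level: scan the input once to determine the length and the maximum character, then decide which storage kind PyUnicode_New would pick. The function name and the kind numbering (1, 2, 3 for 1-, 2-, and 4-byte storage, per this draft) are illustrative.

```python
def plan_allocation(chars):
    """First pass of the two-pass approach: length, max character,
    and the resulting storage kind (1, 2, or 3 in this draft)."""
    length = 0
    max_char = 0
    for ch in chars:                 # single scan over the input
        length += 1
        max_char = max(max_char, ord(ch))
    if max_char < 256:
        kind = 1                     # 1 byte per char (Latin-1 range)
    elif max_char < 0x10000:
        kind = 2                     # 2 bytes per char (UCS-2 range)
    else:
        kind = 3                     # 4 bytes per char (full UCS-4)
    return length, max_char, kind
```

The second pass would then write characters into the string allocated with these parameters.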
If you take the heuristic route, avoid allocating a string meant to be resized, as resizing strings won't work for their canonical representation. Instead, allocate a separate buffer to collect the characters, and then construct a unicode object from that using PyUnicode_FromKindAndData. One option is to use Py_UCS4 as the buffer element, assuming the worst case for character ordinals. This will allow for pointer arithmetic, but may require a lot of memory. Alternatively, start with a 1-byte buffer, and increase the element size as you encounter larger characters. In any case, PyUnicode_FromKindAndData will scan over the buffer to verify the maximum character.
For common tasks, direct access to the string representation may not be necessary: PyUnicode_Find, PyUnicode_FindChar, PyUnicode_Ord, and PyUnicode_CopyCharacters help in analyzing and creating string objects, operating on indexes instead of data pointers.
References
| [1] | PEP 393 branch https://bitbucket.org/t0rsten/pep-393 |
| [2] | Django measurement results http://www.dcl.hpi.uni-potsdam.de/home/loewis/djmemprof/ |
Copyright
This document has been placed in the public domain.
pep-0394 The "python" Command on Unix-Like Systems
| PEP: | 394 |
|---|---|
| Title: | The "python" Command on Unix-Like Systems |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Kerrick Staley <mail at kerrickstaley.com>, Nick Coghlan <ncoghlan at gmail.com>, Barry Warsaw <barry at python.org> |
| Status: | Active |
| Type: | Informational |
| Content-Type: | text/x-rst |
| Created: | 02-Mar-2011 |
| Post-History: | 04-Mar-2011, 20-Jul-2011, 16-Feb-2012, 30-Sep-2014 |
| Resolution: | http://mail.python.org/pipermail/python-dev/2012-February/116594.html |
Contents
Abstract
This PEP provides a convention to ensure that Python scripts can continue to be portable across *nix systems, regardless of the default version of the Python interpreter (i.e. the version invoked by the python command).
- python2 will refer to some version of Python 2.x.
- python3 will refer to some version of Python 3.x.
- for the time being, all distributions should ensure that python refers to the same target as python2.
- however, end users should be aware that python refers to python3 on at least Arch Linux (that change is what prompted the creation of this PEP), so python should be used in the shebang line only for scripts that are source compatible with both Python 2 and 3.
- in preparation for an eventual change in the default version of Python, Python 2 only scripts should either be updated to be source compatible with Python 3 or else to use python2 in the shebang line.
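Per the caveat above, a script may keep a plain python shebang only if it runs unmodified under both major versions. A minimal sketch of such a script (the printed message is illustrative):

```python
#!/usr/bin/env python
# Source compatible with both Python 2 and Python 3, so the plain
# "python" shebang above remains portable under this PEP's advice.
from __future__ import print_function

import sys

print("Running under Python %d.%d" % sys.version_info[:2])
```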
Recommendation
- Unix-like software distributions (including systems like Mac OS X and Cygwin) should install the python2 command into the default path whenever a version of the Python 2 interpreter is installed, and the same for python3 and the Python 3 interpreter.
- When invoked, python2 should run some version of the Python 2 interpreter, and python3 should run some version of the Python 3 interpreter.
- The more general python command should be installed whenever any version of Python 2 is installed and should invoke the same version of Python as the python2 command (however, note that some distributions have already chosen to have python implement the python3 command; see the Rationale and Migration Notes below).
- The Python 2.x idle, pydoc, and python-config commands should likewise be available as idle2, pydoc2, and python2-config, with the original commands invoking these versions by default, but possibly invoking the Python 3.x versions instead if configured to do so by the system administrator.
- In order to tolerate differences across platforms, all new code that needs to invoke the Python interpreter should not specify python, but rather should specify either python2 or python3 (or the more specific python2.x and python3.x versions; see the Migration Notes). This distinction should be made in shebangs, when invoking from a shell script, when invoking via the system() call, or when invoking in any other context.
- One exception to this is scripts that are deliberately written to be source compatible with both Python 2.x and 3.x. Such scripts may continue to use python on their shebang line without affecting their portability.
- When reinvoking the interpreter from a Python script, querying sys.executable to avoid hardcoded assumptions regarding the interpreter location remains the preferred approach.
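The sys.executable recommendation can be sketched as follows: re-invoke the running interpreter through its known path rather than hardcoding any of the python/python2/python3 command names.

```python
import subprocess
import sys

# Re-run the *current* interpreter via sys.executable, sidestepping
# any distribution-specific choice of command name.
out = subprocess.check_output(
    [sys.executable, "-c", "import sys; print(sys.version_info[0])"])
major = int(out)
```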
These recommendations are the outcome of the relevant python-dev discussions in March and July 2011 ([1], [2]), February 2012 ([4]) and September 2014 ([6]).
Rationale
This recommendation is needed as, even though the majority of distributions still alias the python command to Python 2, some now alias it to Python 3 ([5]). As some of the former distributions did not provide a python2 command by default, there was previously no way for Python 2 code (or any code that invokes the Python 2 interpreter directly rather than via sys.executable) to reliably run on all Unix-like systems without modification, as the python command would invoke the wrong interpreter version on some systems, and the python2 command would fail completely on others. The recommendations in this PEP provide a very simple mechanism to restore cross-platform support, with minimal additional work required on the part of distribution maintainers.
Future Changes to this Recommendation
It is anticipated that there will eventually come a time where the third party ecosystem surrounding Python 3 is sufficiently mature for this recommendation to be updated to suggest that the python symlink refer to python3 rather than python2.
This recommendation will be periodically reviewed over the next few years, and updated when the core development team judges it appropriate. As a point of reference, regular maintenance releases for the Python 2.7 series will continue until at least 2020.
Migration Notes
This section does not contain any official recommendations from the core CPython developers. It's merely a collection of notes regarding various aspects of migrating to Python 3 as the default version of Python for a system. They will hopefully be helpful to any distributions considering making such a change.
The main barrier to a distribution switching the python command from python2 to python3 isn't breakage within the distribution, but instead breakage of private third party scripts developed by sysadmins and other users. Updating the python command to invoke python3 by default indicates that a distribution is willing to break such scripts with errors that are potentially quite confusing for users that aren't yet familiar with the backwards incompatible changes in Python 3. For example, while the change of print from a statement to a builtin function is relatively simple for automated converters to handle, the SyntaxError from attempting to use the Python 2 notation in versions of Python 3 prior to 3.4.2 is thoroughly confusing if you aren't already aware of the change:
$ python3 -c 'print "Hello, world!"'
  File "<string>", line 1
    print "Hello, world!"
                        ^
SyntaxError: invalid syntax

(In Python 3.4.2+, that generic error message has been replaced with the more explicit "SyntaxError: Missing parentheses in call to 'print'".)
Avoiding breakage of such third party scripts is the key reason this PEP recommends that python continue to refer to python2 for the time being. Until the conventions described in this PEP are more widely adopted, having python invoke python2 will remain the recommended option.
The pythonX.X (e.g. python2.6) commands exist on some systems, on which they invoke specific minor versions of the Python interpreter. It can be useful for distribution-specific packages to take advantage of these utilities if they exist, since it will prevent code breakage if the default minor version of a given major version is changed. However, scripts intending to be cross-platform should not rely on the presence of these utilities, but rather should be tested on several recent minor versions of the target major version, compensating, if necessary, for the small differences that exist between minor versions. This prevents the need for sysadmins to install many very similar versions of the interpreter.
When the pythonX.X binaries are provided by a distribution, the python2 and python3 commands should refer to one of those files rather than being provided as a separate binary file.
It is suggested that even distribution-specific packages follow the python2/python3 convention, even in code that is not intended to operate on other distributions. This will reduce problems if the distribution later decides to change the version of the Python interpreter that the python command invokes, or if a sysadmin installs a custom python command with a different major version than the distribution default. Distributions can test whether they are fully following this convention by changing the python interpreter on a test box and checking to see if anything breaks.
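The suggested convention check can be automated. The sketch below (function name is illustrative) reports which of the python2/python3 commands exist on PATH and which major version each actually invokes; a distribution test box could compare the result against its expectations.

```python
import shutil
import subprocess

def check_python_commands():
    """Return {command: major_version} for the python2/python3
    commands found on PATH -- a sketch of the convention check."""
    found = {}
    for cmd in ("python2", "python3"):
        path = shutil.which(cmd)
        if path is None:
            continue  # command not installed on this system
        out = subprocess.check_output(
            [path, "-c", "import sys; print(sys.version_info[0])"])
        found[cmd] = int(out)
    return found
```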
If the above point is adhered to and sysadmins are permitted to change the python command, then the python command should always be implemented as a link to the interpreter binary (or a link to a link) and not vice versa. That way, if a sysadmin does decide to replace the installed python file, they can do so without inadvertently deleting the previously installed binary.
If the Python 2 interpreter becomes uncommon, scripts should nevertheless continue to use the python3 convention rather than just python. This will ease transition in the event that yet another major version of Python is released.
If these conventions are adhered to, it will become the case that the python command is only executed in an interactive manner as a user convenience, or to run scripts that are source compatible with both Python 2 and Python 3.
Backwards Compatibility
A potential problem can arise if a script adhering to the python2/python3 convention is executed on a system not supporting these commands. This is mostly a non-issue, since the sysadmin can simply create these symbolic links and avoid further problems. It is a significantly more obvious breakage than the sometimes cryptic errors that can arise when attempting to execute a script containing Python 2 specific syntax with a Python 3 interpreter.
Application to the CPython Reference Interpreter
While technically a new feature, the make install and make bininstall commands in the 2.7 version of CPython were adjusted to create the following chains of symbolic links in the relevant bin directory (the final item listed in the chain is the actual installed binary, preceding items are relative symbolic links):
python -> python2 -> python2.7
python-config -> python2-config -> python2.7-config
Similar adjustments were made to the Mac OS X binary installer.
This feature first appeared in the default installation process in CPython 2.7.3.
The installation commands in the CPython 3.x series already create the appropriate symlinks. For example, CPython 3.2 creates:
python3 -> python3.2
idle3 -> idle3.2
pydoc3 -> pydoc3.2
python3-config -> python3.2-config
And CPython 3.3 creates:
python3 -> python3.3
idle3 -> idle3.3
pydoc3 -> pydoc3.3
python3-config -> python3.3-config
pysetup3 -> pysetup3.3
The implementation progress of these features in the default installers was managed on the tracker as issue #12627 ([3]).
Impact on PYTHON* Environment Variables
The choice of target for the python command implicitly affects a distribution's expected interpretation of the various Python related environment variables. The use of *.pth files in the relevant site-packages folder, the "per-user site packages" feature (see python -m site) or more flexible tools such as virtualenv are all more tolerant of the presence of multiple versions of Python on a system than the direct use of PYTHONPATH.
Exclusion of MS Windows
This PEP deliberately excludes any proposals relating to Microsoft Windows, as devising an equivalent solution for Windows was deemed too complex to handle here. PEP 397 and the related discussion on the python-dev mailing list address this issue (like this PEP, the PEP 397 launcher invokes Python 2 by default if versions of both Python 2 and 3 are installed on the system).
References
| [1] | Support the /usr/bin/python2 symlink upstream (with bonus grammar class!) (http://mail.python.org/pipermail/python-dev/2011-March/108491.html) |
| [2] | Rebooting PEP 394 (aka Support the /usr/bin/python2 symlink upstream) (http://mail.python.org/pipermail/python-dev/2011-July/112322.html) |
| [3] | Implement PEP 394 in the CPython Makefile (http://bugs.python.org/issue12627) |
| [4] | PEP 394 request for pronouncement (python2 symlink in *nix systems) (http://mail.python.org/pipermail/python-dev/2012-February/116435.html) |
| [5] | Arch Linux announcement that their "python" link now refers Python 3 (https://www.archlinux.org/news/python-is-now-python-3/) |
| [6] | PEP 394 - Clarification of what "python" command should invoke (https://mail.python.org/pipermail/python-dev/2014-September/136374.html) |
Copyright
This document has been placed in the public domain.
pep-0395 Qualified Names for Modules
| PEP: | 395 |
|---|---|
| Title: | Qualified Names for Modules |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Nick Coghlan <ncoghlan at gmail.com> |
| Status: | Withdrawn |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 4-Mar-2011 |
| Python-Version: | 3.4 |
| Post-History: | 5-Mar-2011, 19-Nov-2011 |
Contents
- PEP Withdrawal
- Abstract
- What's in a __name__?
- Traps for the Unwary
- Qualified Names for Modules
- Eliminating the Traps
- Explicit relative imports
- Reference Implementation
- References
- Copyright
PEP Withdrawal
This PEP was withdrawn by the author in December 2013, as other significant changes in the time since it was written have rendered several aspects obsolete. Most notably, PEP 420 namespace packages rendered some of the proposals related to package detection unworkable, and PEP 451 module specifications resolved the multiprocessing issues and provided a possible means to tackle the pickle compatibility issues.
A future PEP to resolve the remaining issues would still be appropriate, but it's worth starting any such effort as a fresh PEP restating the remaining problems in an updated context rather than trying to build on this one directly.
Abstract
This PEP proposes new mechanisms that eliminate some longstanding traps for the unwary when dealing with Python's import system, as well as serialisation and introspection of functions and classes.
It builds on the "Qualified Name" concept defined in PEP 3155.
Relationship with Other PEPs
Most significantly, this PEP is currently deferred as it requires significant changes in order to be made compatible with the removal of mandatory __init__.py files in PEP 420 (which has been implemented and released in Python 3.3).
This PEP builds on the "qualified name" concept introduced by PEP 3155, and also shares in that PEP's aim of fixing some ugly corner cases when dealing with serialisation of arbitrary functions and classes.
It also builds on PEP 366, which took initial tentative steps towards making explicit relative imports from the main module work correctly in at least some circumstances.
Finally, PEP 328 eliminated implicit relative imports from imported modules. This PEP proposes that the de facto implicit relative imports from main modules that are provided by the current initialisation behaviour for sys.path[0] also be eliminated.
What's in a __name__?
Over time, a module's __name__ attribute has come to be used to handle a number of different tasks.
The key use cases identified for this module attribute are:
- Flagging the main module in a program, using the if __name__ == "__main__": convention.
- As the starting point for relative imports
- To identify the location of function and class definitions within the running application
- To identify the location of classes for serialisation into pickle objects which may be shared with other interpreter instances
Traps for the Unwary
The overloading of the semantics of __name__, along with some historically associated behaviour in the initialisation of sys.path[0], has resulted in several traps for the unwary. These traps can be quite annoying in practice, as they are highly unobvious (especially to beginners) and can cause quite confusing behaviour.
Why are my imports broken?
There's a general principle that applies when modifying sys.path: never put a package directory directly on sys.path. The reason this is problematic is that every module in that directory is now potentially accessible under two different names: as a top level module (since the package directory is on sys.path) and as a submodule of the package (if the higher level directory containing the package itself is also on sys.path).
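The dual-name trap can be demonstrated in a few lines. The sketch below builds a throwaway package in a temporary directory, puts both the package directory and its parent on sys.path, and imports the same file under two names; the package and module names (examplepkg, dupmod) are purely illustrative.

```python
import os
import sys
import tempfile

# Build a tiny package on disk: root/examplepkg/{__init__.py, dupmod.py}
root = tempfile.mkdtemp()
pkg = os.path.join(root, "examplepkg")
os.mkdir(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "dupmod.py"), "w") as f:
    f.write("state = []\n")

sys.path[:0] = [root, pkg]        # parent AND package dir on sys.path

import dupmod                     # found via the package dir entry
import examplepkg.dupmod          # found via the parent dir entry

# Same file, two names, two distinct module objects with separate state.
assert examplepkg.dupmod is not dupmod
assert examplepkg.dupmod.state is not dupmod.state
```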
As an example, Django (up to and including version 1.3) is guilty of setting up exactly this situation for site-specific applications - the application ends up being accessible as both app and site.app in the module namespace, and these are actually two different copies of the module. This is a recipe for confusion if there is any meaningful mutable module level state, so this behaviour is being eliminated from the default site set up in version 1.4 (site-specific apps will always be fully qualified with the site name).
However, it's hard to blame Django for this, when the same part of Python responsible for setting __name__ = "__main__" in the main module commits the exact same error when determining the value for sys.path[0].
The impact of this can be seen relatively frequently if you follow the "python" and "import" tags on Stack Overflow. When I had the time to follow it myself, I regularly encountered people struggling to understand the behaviour of straightforward package layouts like the following (I actually use package layouts along these lines in my own projects):
project/
setup.py
example/
__init__.py
foo.py
tests/
__init__.py
test_foo.py
While I would often see it without the __init__.py files first, that's a trivial fix to explain. What's hard to explain is that all of the following ways to invoke test_foo.py probably won't work due to broken imports (either failing to find example for absolute imports, complaining about relative imports in a non-package or beyond the toplevel package for explicit relative imports, or issuing even more obscure errors if some other submodule happens to shadow the name of a top-level module, such as an example.json module that handled serialisation or an example.tests.unittest test runner):
# These commands will most likely *FAIL*, even if the code is correct

# working directory: project/example/tests
./test_foo.py
python test_foo.py
python -m example.tests.test_foo
python -c "from example.tests.test_foo import main; main()"

# working directory: project/example
tests/test_foo.py
python tests/test_foo.py
python -m example.tests.test_foo
python -c "from example.tests.test_foo import main; main()"

# working directory: project
example/tests/test_foo.py
python example/tests/test_foo.py

# working directory: project/..
project/example/tests/test_foo.py
python project/example/tests/test_foo.py

# The -m and -c approaches don't work from here either, but the failure
# to find 'example' correctly is easier to explain in this case
That's right, that long list is of all the methods of invocation that will almost certainly break if you try them, and the error messages won't make any sense if you're not already intimately familiar not only with the way Python's import system works, but also with how it gets initialised.
For a long time, the only way to get sys.path right with that kind of setup was to either set it manually in test_foo.py itself (hardly something a novice, or even many veteran, Python programmers are going to know how to do) or else to make sure to import the module instead of executing it directly:
# working directory: project
python -c "from example.tests.test_foo import main; main()"
Since the implementation of PEP 366 (which defined a mechanism that allows relative imports to work correctly when a module inside a package is executed via the -m switch), the following also works properly:
# working directory: project
python -m example.tests.test_foo
The fact that most methods of invoking Python code from the command line break when that code is inside a package, and the two that do work are highly sensitive to the current working directory is all thoroughly confusing for a beginner. I personally believe it is one of the key factors leading to the perception that Python packages are complicated and hard to get right.
This problem isn't even limited to the command line - if test_foo.py is open in Idle and you attempt to run it by pressing F5, or if you try to run it by clicking on it in a graphical filebrowser, then it will fail in just the same way it would if run directly from the command line.
There's a reason the general "no package directories on sys.path" guideline exists, and the fact that the interpreter itself doesn't follow it when determining sys.path[0] is the root cause of all sorts of grief.
In the past, this couldn't be fixed due to backwards compatibility concerns. However, scripts potentially affected by this problem will already require fixes when porting to Python 3.x (due to the elimination of implicit relative imports when importing modules normally). This provides a convenient opportunity to implement a corresponding change in the initialisation semantics for sys.path[0].
Importing the main module twice
Another venerable trap is the issue of importing __main__ twice. This occurs when the main module is also imported under its real name, effectively creating two instances of the same module under different names.
If the state stored in __main__ is significant to the correct operation of the program, or if there is top-level code in the main module that has non-idempotent side effects, then this duplication can cause obscure and surprising errors.
In a bit of a pickle
Something many users may not realise is that the pickle module sometimes relies on the __module__ attribute when serialising instances of arbitrary classes. So instances of classes defined in __main__ are pickled that way, and won't be unpickled correctly by another Python instance that only imported that module instead of running it directly. This behaviour is the underlying reason for the advice from many Python veterans to do as little as possible in the __main__ module in any application that involves any form of object serialisation and persistence.
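The reliance on the defining module's name is visible directly in the pickle stream. The sketch below uses a standard library class (fractions.Fraction) so the embedded module name is known; a class defined in __main__ would embed "__main__" instead, which another process cannot resolve back to the same source file.

```python
import pickle
from fractions import Fraction

blob = pickle.dumps(Fraction(3, 4))

# The stream stores a (module, class-name) reference rather than the
# class itself, so any unpickler must be able to import that module.
assert b"fractions" in blob
assert pickle.loads(blob) == Fraction(3, 4)
```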
Similarly, when creating a pseudo-module (see next paragraph), pickles rely on the name of the module where a class is actually defined, rather than the officially documented location for that class in the module hierarchy.
For the purposes of this PEP, a "pseudo-module" is a package designed like the Python 3.2 unittest and concurrent.futures packages. These packages are documented as if they were single modules, but are in fact internally implemented as a package. This is supposed to be an implementation detail that users and other implementations don't need to worry about, but, thanks to pickle (and serialisation in general), the details are often exposed and can effectively become part of the public API.
While this PEP focuses specifically on pickle as the principal serialisation scheme in the standard library, this issue may also affect other mechanisms that support serialisation of arbitrary class instances and rely on __module__ attributes to determine how to handle deserialisation.
Where's the source?
Some sophisticated users of the pseudo-module technique described above recognise the problem with implementation details leaking out via the pickle module, and choose to address it by altering __name__ to refer to the public location for the module before defining any functions or classes (or else by modifying the __module__ attributes of those objects after they have been defined).
This approach is effective at eliminating the leakage of information via pickling, but comes at the cost of breaking introspection for functions and classes (as their __module__ attribute now points to the wrong place).
Forkless Windows
To get around the lack of os.fork on Windows, the multiprocessing module attempts to re-execute Python with the same main module, but skipping over any code guarded by if __name__ == "__main__": checks. It does the best it can with the information it has, but is forced to make assumptions that simply aren't valid whenever the main module isn't an ordinary directly executed script or top-level module. Packages and non-top-level modules executed via the -m switch, as well as directly executed zipfiles or directories, are likely to make multiprocessing on Windows do the wrong thing (either quietly or noisily, depending on application details) when spawning a new process.
While this issue currently only affects Windows directly, it also impacts any proposals to provide Windows-style "clean process" invocation via the multiprocessing module on other platforms.
Qualified Names for Modules
To make it feasible to fix these problems once and for all, it is proposed to add a new module level attribute: __qualname__. This abbreviation of "qualified name" is taken from PEP 3155, where it is used to store the naming path to a nested class or function definition relative to the top level module.
For modules, __qualname__ will normally be the same as __name__, just as it is for top-level functions and classes in PEP 3155. However, it will differ in some situations so that the above problems can be addressed.
Specifically, whenever __name__ is modified for some other purpose (such as to denote the main module), then __qualname__ will remain unchanged, allowing code that needs it to access the original unmodified value.
If a module loader does not initialise __qualname__ itself, then the import system will add it automatically (setting it to the same value as __name__).
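The proposed semantics can be sketched with a synthetic module object (this is a model of the proposal, not current CPython behaviour): __qualname__ keeps the original module name even after __name__ is rewritten for script execution.

```python
import types

mod = types.ModuleType("example.tests.test_foo")
mod.__qualname__ = mod.__name__   # default the import system would set
mod.__name__ = "__main__"         # what running it as a script does

# Code that needs the real module identity can still recover it.
assert mod.__qualname__ == "example.tests.test_foo"
```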
Alternative Names
Two alternative names were also considered for the new attribute: "full name" (__fullname__) and "implementation name" (__implname__).
Either of those would actually be valid for the use case in this PEP. However, as a meta-issue, PEP 3155 is also adding a new attribute (for functions and classes) that is "like __name__, but different in some cases where __name__ is missing necessary information" and those terms aren't accurate for the PEP 3155 function and class use case.
PEP 3155 deliberately omits the module information, so the term "full name" is simply untrue, and "implementation name" implies that it may specify an object other than that specified by __name__, and that is never the case for PEP 3155 (in that PEP, __name__ and __qualname__ always refer to the same function or class, it's just that __name__ is insufficient to accurately identify nested functions and classes).
Since it seems needlessly inconsistent to add two new terms for attributes that only exist because backwards compatibility concerns keep us from changing the behaviour of __name__ itself, this PEP instead chose to adopt the PEP 3155 terminology.
If the relative inscrutability of "qualified name" and __qualname__ encourages interested developers to look them up at least once rather than assuming they know what they mean just from the name and guessing wrong, that's not necessarily a bad outcome.
Besides, 99% of Python developers should never need to care that these extra attributes exist - they're really an implementation detail to let us fix a few problematic behaviours exhibited by imports, pickling and introspection, not something people are going to be dealing with on a regular basis.
Eliminating the Traps
The following changes are interrelated and make the most sense when considered together. They collectively either completely eliminate the traps for the unwary noted above, or else provide straightforward mechanisms for dealing with them.
A rough draft of some of the concepts presented here was first posted on the python-ideas list ([1]), but they have evolved considerably since first being discussed in that thread. Further discussion has subsequently taken place on the import-sig mailing list ([2], [3]).
Fixing main module imports inside packages
To eliminate this trap, it is proposed that an additional filesystem check be performed when determining a suitable value for sys.path[0]. This check will look for Python's explicit package directory markers and use them to find the appropriate directory to add to sys.path.
The current algorithm for setting sys.path[0] in relevant cases is roughly as follows:
# Interactive prompt, -m switch, -c switch
sys.path.insert(0, '')

# Valid sys.path entry execution (i.e. directory and zip execution)
sys.path.insert(0, sys.argv[0])

# Direct script execution
sys.path.insert(0, os.path.dirname(sys.argv[0]))
It is proposed that this initialisation process be modified to take package details stored on the filesystem into account:
# Interactive prompt, -m switch, -c switch
in_package, path_entry, _ignored = split_path_module(os.getcwd(), '')
if in_package:
    sys.path.insert(0, path_entry)
else:
    sys.path.insert(0, '')
# Start interactive prompt or run -c command as usual
#   __main__.__qualname__ is set to "__main__"
# The -m switch uses the same sys.path[0] calculation, but:
#   modname is the argument to the -m switch
#   modname is passed to ``runpy._run_module_as_main()`` as usual
#   __main__.__qualname__ is set to modname

# Valid sys.path entry execution (i.e. directory and zip execution)
modname = "__main__"
in_package, path_entry, modname = split_path_module(sys.argv[0], modname)
sys.path.insert(0, path_entry)
# modname (possibly adjusted) is passed to ``runpy._run_module_as_main()``
# __main__.__qualname__ is set to modname

# Direct script execution
in_package, path_entry, modname = split_path_module(sys.argv[0])
sys.path.insert(0, path_entry)
if in_package:
    # Pass modname to ``runpy._run_module_as_main()``
else:
    # Run script directly
# __main__.__qualname__ is set to modname
The split_path_module() supporting function used in the above pseudo-code would have the following semantics:
def _splitmodname(fspath):
    path_entry, fname = os.path.split(fspath)
    modname = os.path.splitext(fname)[0]
    return path_entry, modname

def _is_package_dir(fspath):
    return any(os.path.exists(os.path.join(fspath, "__init__" + info[0]))
               for info in imp.get_suffixes())

def split_path_module(fspath, modname=None):
    """Given a filesystem path and a relative module name, determine an
    appropriate sys.path entry and a fully qualified module name.

    Returns a 3-tuple of (package_depth, fspath, modname). A reported
    package depth of 0 indicates that this would be a top level import.

    If no relative module name is given, it is derived from the final
    component in the supplied path with the extension stripped.
    """
    if modname is None:
        fspath, modname = _splitmodname(fspath)
    package_depth = 0
    while _is_package_dir(fspath):
        fspath, pkg = _splitmodname(fspath)
        modname = pkg + '.' + modname
        package_depth += 1
    return package_depth, fspath, modname
This PEP also proposes that the split_path_module() functionality be exposed directly to Python users via the runpy module.
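To make the intended semantics concrete, here is a self-contained sketch of the proposed behaviour, using only __init__.py as the package marker and a throwaway directory tree (split_path_module is reimplemented here from the pseudo-code rather than imported from anywhere - it is not an existing API):

```python
import os
import tempfile

def _splitmodname(fspath):
    path_entry, fname = os.path.split(fspath)
    return path_entry, os.path.splitext(fname)[0]

def _is_package_dir(fspath):
    # Simplified marker check; the full proposal consults imp.get_suffixes()
    return os.path.exists(os.path.join(fspath, "__init__.py"))

def split_path_module(fspath, modname=None):
    if modname is None:
        fspath, modname = _splitmodname(fspath)
    package_depth = 0
    while _is_package_dir(fspath):
        fspath, pkg = _splitmodname(fspath)
        modname = pkg + '.' + modname
        package_depth += 1
    return package_depth, fspath, modname

# Build project/package/tests/test_foo.py with explicit package markers
root = tempfile.mkdtemp()
tests_dir = os.path.join(root, "package", "tests")
os.makedirs(tests_dir)
for d in (os.path.join(root, "package"), tests_dir):
    open(os.path.join(d, "__init__.py"), "w").close()

depth, entry, name = split_path_module(os.path.join(tests_dir, "test_foo.py"))
print(depth, name)   # 2 package.tests.test_foo
```

The project directory itself (rather than the script's own directory) ends up as the candidate sys.path entry, which is the whole point of the check.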
With this fix in place, and the same simple package layout described earlier, all of the following commands would invoke the test suite correctly:
# working directory: project/example/tests
./test_foo.py
python test_foo.py
python -m package.tests.test_foo
python -c "from .test_foo import main; main()"
python -c "from ..tests.test_foo import main; main()"
python -c "from package.tests.test_foo import main; main()"

# working directory: project/package
tests/test_foo.py
python tests/test_foo.py
python -m package.tests.test_foo
python -c "from .tests.test_foo import main; main()"
python -c "from package.tests.test_foo import main; main()"

# working directory: project
example/tests/test_foo.py
python example/tests/test_foo.py
python -m package.tests.test_foo
python -c "from package.tests.test_foo import main; main()"

# working directory: project/..
project/example/tests/test_foo.py
python project/example/tests/test_foo.py
# The -m and -c approaches still don't work from here, but the failure
# to find 'package' correctly is pretty easy to explain in this case
With these changes, clicking Python modules in a graphical file browser should always execute them correctly, even if they live inside a package. Depending on the details of how it invokes the script, Idle would likely also be able to run test_foo.py correctly with F5, without needing any Idle specific fixes.
Optional addition: command line relative imports
With the above changes in place, it would be a fairly minor addition to allow explicit relative imports as arguments to the -m switch:
# working directory: project/example/tests
python -m .test_foo
python -m ..tests.test_foo

# working directory: project/example/
python -m .tests.test_foo
With this addition, system initialisation for the -m switch would change as follows:
# -m switch (permitting explicit relative imports)
in_package, path_entry, pkg_name = split_path_module(os.getcwd(), '')
qualname = <<arguments to -m switch>>
if qualname.startswith('.'):
    modname = qualname
    while modname.startswith('.'):
        modname = modname[1:]
        pkg_name, sep, _ignored = pkg_name.rpartition('.')
        if not sep:
            raise ImportError("Attempted relative import beyond top level package")
    qualname = pkg_name + '.' + modname
if in_package:
    sys.path.insert(0, path_entry)
else:
    sys.path.insert(0, '')
# qualname is passed to ``runpy._run_module_as_main()``
# __main__.__qualname__ is set to qualname
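The dot-stripping loop can be exercised in isolation. Note that calling split_path_module(os.getcwd(), '') with an empty relative name leaves a trailing empty component on pkg_name (e.g. "package.tests." for a shell sitting inside package/tests/), which is what makes a single leading dot refer to the current package. A behavioural sketch of the pseudo-code above (illustrative names, not a released API):

```python
def resolve_relative_modname(qualname, pkg_name):
    # pkg_name is assumed to carry a trailing empty component, e.g.
    # "package.tests." when the working directory is package/tests/
    modname = qualname
    while modname.startswith('.'):
        modname = modname[1:]
        pkg_name, sep, _ignored = pkg_name.rpartition('.')
        if not sep:
            raise ImportError(
                "Attempted relative import beyond top level package")
    return pkg_name + '.' + modname

print(resolve_relative_modname(".test_foo", "package.tests."))
# package.tests.test_foo
print(resolve_relative_modname("..tests.test_foo", "package.tests."))
# package.tests.test_foo
```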
Compatibility with PEP 382
Making this proposal compatible with the PEP 382 namespace packaging PEP is trivial. The semantics of _is_package_dir() are merely changed to be:
def _is_package_dir(fspath):
    return (fspath.endswith(".pyp") or
            any(os.path.exists(os.path.join(fspath, "__init__" + info[0]))
                for info in imp.get_suffixes()))
Incompatibility with PEP 402
PEP 402 proposes the elimination of explicit markers in the file system for Python packages. This fundamentally breaks the proposed concept of being able to take a filesystem path and a Python module name and work out an unambiguous mapping to the Python module namespace. Instead, the appropriate mapping would depend on the current values in sys.path, rendering it impossible to ever fix the problems described above with the calculation of sys.path[0] when the interpreter is initialised.
While some aspects of this PEP could probably be salvaged if PEP 402 were adopted, the core concept of making import semantics from main and other modules more consistent would no longer be feasible.
This incompatibility is discussed in more detail in the relevant import-sig threads ([2], [3]).
Potential incompatibilities with scripts stored in packages
The proposed change to sys.path[0] initialisation may break some existing code. Specifically, it will break scripts stored in package directories that rely on the implicit relative imports from __main__ in order to run correctly under Python 3.
While such scripts could be imported in Python 2 (due to implicit relative imports) it is already the case that they cannot be imported in Python 3, as implicit relative imports are no longer permitted when a module is imported.
By disallowing implicit relative imports from the main module as well, such modules won't even work as scripts with this PEP. Switching them over to explicit relative imports will then get them working again as both executable scripts and as importable modules.
To support earlier versions of Python, a script could be written to use different forms of import based on the Python version:
import sys

if __name__ == "__main__" and sys.version_info < (3, 3):
    import peer         # Implicit relative import
else:
    from . import peer  # Explicit relative import
Fixing dual imports of the main module
Given the above proposal to get __qualname__ consistently set correctly in the main module, one simple change is proposed to eliminate the problem of dual imports of the main module: the addition of a sys.meta_path hook that detects attempts to import __main__ under its real name and returns the original main module instead:
class AliasImporter:
    def __init__(self, module, alias):
        self.module = module
        self.alias = alias

    def __repr__(self):
        fmt = "{0.__class__.__name__}({0.module.__name__}, {0.alias})"
        return fmt.format(self)

    def find_module(self, fullname, path=None):
        if path is None and fullname == self.alias:
            return self
        return None

    def load_module(self, fullname):
        if fullname != self.alias:
            raise ImportError("{!r} cannot load {!r}".format(self, fullname))
        return self.module
This metapath hook would be added automatically during import system initialisation based on the following logic:
main = sys.modules["__main__"]
if main.__name__ != main.__qualname__:
    sys.meta_path.append(AliasImporter(main, main.__qualname__))
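The hook can be exercised directly, without touching the real import system (a sketch only; the find_module/load_module finder protocol used in this PEP predates importlib.abc.MetaPathFinder and was eventually removed in Python 3.12):

```python
import types

class AliasImporter:
    def __init__(self, module, alias):
        self.module = module
        self.alias = alias

    def find_module(self, fullname, path=None):
        # Only claim top-level lookups for the aliased name
        if path is None and fullname == self.alias:
            return self
        return None

    def load_module(self, fullname):
        if fullname != self.alias:
            raise ImportError("cannot load {!r}".format(fullname))
        return self.module

main = types.ModuleType("__main__")
finder = AliasImporter(main, "package.tests.test_foo")
assert finder.find_module("package.tests.test_foo") is finder
assert finder.find_module("some.other.module") is None
# Importing __main__ under its real name yields the existing module object,
# rather than a duplicate copy:
assert finder.load_module("package.tests.test_foo") is main
```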
This is probably the least important proposal in the PEP - it just closes off the last mechanism that is likely to lead to module duplication after the configuration of sys.path[0] at interpreter startup is addressed.
Fixing pickling without breaking introspection
To fix this problem, it is proposed to make use of the new module level __qualname__ attributes to determine the real module location when __name__ has been modified for any reason.
In the main module, __qualname__ will automatically be set to the main module's "real" name (as described above) by the interpreter.
Pseudo-modules that adjust __name__ to point to the public namespace will leave __qualname__ untouched, so the implementation location remains readily accessible for introspection.
If __name__ is adjusted at the top of a module, then this will automatically adjust the __module__ attribute for all functions and classes subsequently defined in that module.
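This is existing interpreter behaviour rather than something new: a class (or function) captures its __module__ from the value of __name__ in the defining namespace at definition time. A minimal demonstration:

```python
# Execute a tiny "module" whose source adjusts __name__ at the top
# before defining a class (names here are purely illustrative):
ns = {"__name__": "package._implmod"}
exec("__name__ = 'package'\nclass Widget:\n    pass\n", ns)
# The class picked up the adjusted public name automatically:
print(ns["Widget"].__module__)   # package
```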
Since multiple submodules may be set to use the same "public" namespace, functions and classes will be given a new __qualmodule__ attribute that refers to the __qualname__ of their module.
This isn't strictly necessary for functions (you could find out their module's qualified name by looking in their globals dictionary), but it is needed for classes, since they don't hold a reference to the globals of their defining module. Once a new attribute is added to classes, it is more convenient to keep the API consistent and add a new attribute to functions as well.
These changes mean that adjusting __name__ (and, either directly or indirectly, the corresponding function and class __module__ attributes) becomes the officially sanctioned way to implement a namespace as a package, while exposing the API as if it were still a single module.
All serialisation code that currently uses __name__ and __module__ attributes will then avoid exposing implementation details by default.
To correctly handle serialisation of items from the main module, the class and function definition logic will be updated to also use __qualname__ for the __module__ attribute in the case where __name__ == "__main__".
With __name__ and __module__ being officially blessed as being used for the public names of things, the introspection tools in the standard library will be updated to use __qualname__ and __qualmodule__ where appropriate. For example:
- pydoc will report both public and qualified names for modules
- inspect.getsource() (and similar tools) will use the qualified names that point to the implementation of the code
- additional pydoc and/or inspect APIs may be provided that report all modules with a given public __name__.
Fixing multiprocessing on Windows
With __qualname__ now available to tell multiprocessing the real name of the main module, it will be able to simply include it in the serialised information passed to the child process, eliminating the need for the current dubious introspection of the __file__ attribute.
For older Python versions, multiprocessing could be improved by applying the split_path_module() algorithm described above when attempting to work out how to execute the main module based on its __file__ attribute.
Explicit relative imports
This PEP proposes that __package__ be unconditionally defined in the main module as __qualname__.rpartition('.')[0]. Aside from that, it proposes that the behaviour of explicit relative imports be left alone.
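For concreteness, the proposed derivation is a one-liner over the qualified name (the module names below are purely illustrative):

```python
# __package__ as this PEP would define it in the main module:
for qualname, expected in [
    ("package.tests.test_foo", "package.tests"),  # script inside a package
    ("test_foo", ""),                             # top level module
]:
    assert qualname.rpartition('.')[0] == expected
```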
In particular, if __package__ is not set in a module when an explicit relative import occurs, the automatically cached value will continue to be derived from __name__ rather than __qualname__. This minimises any backwards incompatibilities with existing code that deliberately manipulates relative imports by adjusting __name__ rather than setting __package__ directly.
This PEP does not propose that __package__ be deprecated. While it is technically redundant following the introduction of __qualname__, it just isn't worth the hassle of deprecating it within the lifetime of Python 3.x.
Reference Implementation
None as yet.
References
| [1] | Module aliases and/or "real names" (http://mail.python.org/pipermail/python-ideas/2011-January/008983.html) |
| [2] | PEP 395 (Module aliasing) and the namespace PEPs (http://mail.python.org/pipermail/import-sig/2011-November/000382.html) |
| [3] | Updated PEP 395 (aka "Implicit Relative Imports Must Die!") (http://mail.python.org/pipermail/import-sig/2011-November/000397.html) |
| [4] | Elaboration of compatibility problems between this PEP and PEP 402 (http://mail.python.org/pipermail/import-sig/2011-November/000403.html) |
Copyright
This document has been placed in the public domain.
pep-0396 Module Version Numbers
| PEP: | 396 |
|---|---|
| Title: | Module Version Numbers |
| Version: | 65628 |
| Last-Modified: | 2008-08-10 09:59:20 -0400 (Sun, 10 Aug 2008) |
| Author: | Barry Warsaw <barry at python.org> |
| Status: | Deferred |
| Type: | Informational |
| Content-Type: | text/x-rst |
| Created: | 2011-03-16 |
| Post-History: | 2011-04-05 |
Contents
Abstract
Given that it is useful and common to specify version numbers for Python modules, and given that different ways of doing this have grown organically within the Python community, it is useful to establish standard conventions for module authors to adhere to and reference. This informational PEP describes best practices for Python module authors who want to define the version number of their Python module.
Conformance with this PEP is optional, however other Python tools (such as distutils2 [1]) may be adapted to use the conventions defined here.
PEP Deferral
Further exploration of the concepts covered in this PEP has been deferred for lack of a current champion interested in promoting the goals of the PEP and collecting and incorporating feedback, and with sufficient available time to do so effectively.
User Stories
Alice is writing a new module, called alice, which she wants to share with other Python developers. alice is a simple module and lives in one file, alice.py. Alice wants to specify a version number so that her users can tell which version they are using. Because her module lives entirely in one file, she wants to add the version number to that file.
Bob has written a module called bob which he has shared with many users. bob.py contains a version number for the convenience of his users. Bob learns about the Cheeseshop [2], and adds some simple packaging using classic distutils so that he can upload The Bob Bundle to the Cheeseshop. Because bob.py already specifies a version number which his users can access programmatically, he wants the same API to continue to work even though his users now get it from the Cheeseshop.
Carol maintains several namespace packages, each of which are independently developed and distributed. In order for her users to properly specify dependencies on the right versions of her packages, she specifies the version numbers in the namespace package's setup.py file. Because Carol wants to have to update only one version number per package, she specifies the version number in her module and has the setup.py extract the module version number when she builds the sdist archive.
David maintains a package in the standard library, and also produces standalone versions for other versions of Python. The standard library copy defines the version number in the module, and this same version number is used for the standalone distributions as well.
Rationale
Python modules, both in the standard library and available from third parties, have long included version numbers. There are established de-facto standards for describing version numbers, and many ad-hoc ways have grown organically over the years. Often, version numbers can be retrieved from a module programmatically, by importing the module and inspecting an attribute. Classic Python distutils setup() functions [3] describe a version argument where the release's version number can be specified. PEP 8 [4] describes the use of a module attribute called __version__ for recording "Subversion, CVS, or RCS" version strings using keyword expansion. In the PEP author's own email archives, the earliest example of the use of an __version__ module attribute by independent module developers dates back to 1995.
Another example of version information is the sqlite3 [5] module with its sqlite_version_info, version, and version_info attributes. It may not be immediately obvious which attribute contains a version number for the module, and which contains a version number for the underlying SQLite3 library.
This informational PEP codifies established practice, and recommends standard ways of describing module version numbers, along with some use cases for when -- and when not -- to include them. Its adoption by module authors is purely voluntary; packaging tools in the standard library will provide optional support for the standards defined herein, and other tools in the Python universe may comply as well.
Specification
- In general, modules in the standard library SHOULD NOT have version numbers. They implicitly carry the version number of the Python release they are included in.
- On a case-by-case basis, standard library modules which are also released in standalone form for other Python versions MAY include a module version number when included in the standard library, and SHOULD include a version number when packaged separately.
- When a module (or package) includes a version number, the version SHOULD be available in the __version__ attribute.
- For modules which live inside a namespace package, the module SHOULD include the __version__ attribute. The namespace package itself SHOULD NOT include its own __version__ attribute.
- The __version__ attribute's value SHOULD be a string.
- Module version numbers SHOULD conform to the normalized version format specified in PEP 386 [6].
- Module version numbers SHOULD NOT contain version control system supplied revision numbers, or any other semantically different version numbers (e.g. underlying library version number).
- The version attribute in a classic distutils setup.py file, or the PEP 345 [7] Version metadata field SHOULD be derived from the __version__ field, or vice versa.
Examples
Retrieving the version number from a third party package:
>>> import bzrlib
>>> bzrlib.__version__
'2.3.0'
Retrieving the version number from a standard library package that is also distributed as a standalone module:
>>> import email
>>> email.__version__
'5.1.0'
Version numbers for namespace packages:
>>> import flufl.i18n
>>> import flufl.enum
>>> import flufl.lock

>>> print flufl.i18n.__version__
1.0.4
>>> print flufl.enum.__version__
3.1
>>> print flufl.lock.__version__
2.1

>>> import flufl
>>> flufl.__version__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute '__version__'
>>>
Deriving
Module version numbers can appear in at least two places, and sometimes more. For example, in accordance with this PEP, they are available programmatically on the module's __version__ attribute. In a classic distutils setup.py file, the setup() function takes a version argument, while the distutils2 setup.cfg file has a version key. The version number must also get into the PEP 345 metadata, preferably when the sdist archive is built. It's desirable for module authors to only have to specify the version number once, and have all the other uses derive from this single definition.
This could be done in any number of ways, a few of which are outlined below. These are included for illustrative purposes only and are not intended to be definitive, complete, or all-encompassing. Other approaches are possible, and some included below may have limitations that prevent their use in some situations.
Let's say Elle adds this attribute to her module file elle.py:
__version__ = '3.1.1'
Classic distutils
In classic distutils, the simplest way to add the version string to the setup() function in setup.py is to do something like this:
from elle import __version__
setup(name='elle', version=__version__)
In the PEP author's experience however, this can fail in some cases, such as when the module uses automatic Python 3 conversion via the 2to3 program (because setup.py is executed by Python 3 before the elle module has been converted).
In that case, it's not much more difficult to write a little code to parse the __version__ from the file rather than importing it. Without providing too much detail, it's likely that modules such as distutils2 will provide a way to parse version strings from files. E.g.:
from distutils2 import get_version
setup(name='elle', version=get_version('elle.py'))
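Pending such a helper, the parsing approach can be sketched in a few lines (a hypothetical stand-in for illustration, not the distutils2 API):

```python
import re
import tempfile

def parse_version(path):
    # Scan the file for a simple __version__ = '...' assignment
    pattern = re.compile(r"__version__\s*=\s*['\"]([^'\"]+)['\"]")
    with open(path) as f:
        for line in f:
            m = pattern.match(line)
            if m:
                return m.group(1)
    raise ValueError("no __version__ found in %s" % path)

# Exercise it against a throwaway module file
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("__version__ = '3.1.1'\n")
print(parse_version(f.name))   # 3.1.1
```

A regex scan like this deliberately avoids importing (and hence executing) the module, which is what makes it safe under 2to3-style build steps.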
Distutils2
Because the distutils2 style setup.cfg is declarative, we can't run any code to extract the __version__ attribute, either via import or via parsing.
In consultation with the distutils-sig [9], two options are proposed. Both entail containing the version number in a file, and declaring that file in the setup.cfg. When the entire contents of the file contains the version number, the version-file key will be used:
[metadata]
version-file: version.txt
When the version number is contained within a larger file, e.g. of Python code, such that the file must be parsed to extract the version, the key version-from-file will be used:
[metadata]
version-from-file: elle.py
A parsing method similar to that described above will be performed on the file named after the colon. The exact recipe for doing this will be discussed in the appropriate distutils2 development forum.
An alternative is to only define the version number in setup.cfg and use the pkgutil module [8] to make it available programmatically. E.g. in elle.py:
from distutils2._backport import pkgutil
__version__ = pkgutil.get_distribution('elle').metadata['version']
PEP 376 metadata
PEP 376 [10] defines a standard for static metadata, but doesn't describe the process by which this metadata gets created. It is highly desirable for the derived version information to be placed into the PEP 376 .dist-info metadata at build-time rather than install-time. This way, the metadata will be available for introspection even when the code is not installed.
References
| [1] | Distutils2 documentation (http://distutils2.notmyidea.org/) |
| [2] | The Cheeseshop (Python Package Index) (http://pypi.python.org) |
| [3] | http://docs.python.org/distutils/setupscript.html |
| [4] | PEP 8, Style Guide for Python Code (http://www.python.org/dev/peps/pep-0008) |
| [5] | sqlite3 module documentation (http://docs.python.org/library/sqlite3.html) |
| [6] | PEP 386, Changing the version comparison module in Distutils (http://www.python.org/dev/peps/pep-0386/) |
| [7] | PEP 345, Metadata for Python Software Packages 1.2 (http://www.python.org/dev/peps/pep-0345/#version) |
| [8] | pkgutil - Package utilities (http://distutils2.notmyidea.org/library/pkgutil.html) |
| [9] | http://mail.python.org/pipermail/distutils-sig/2011-June/017862.html |
| [10] | PEP 376, Database of Installed Python Distributions (http://www.python.org/dev/peps/pep-0376/) |
Copyright
This document has been placed in the public domain.
pep-0397 Python launcher for Windows
| PEP: | 397 |
|---|---|
| Title: | Python launcher for Windows |
| Version: | a57419aee37d |
| Last-Modified: | 2012/06/19 15:13:49 |
| Author: | Mark Hammond <mhammond at skippinet.com.au>, Martin v. Löwis <martin at v.loewis.de> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 15-Mar-2011 |
| Post-History: | 21-July-2011, 17-May-2011, 15-Mar-2011 |
| Resolution: | http://mail.python.org/pipermail/python-dev/2012-June/120505.html |
Abstract
This PEP describes a Python launcher for the Windows platform. A
Python launcher is a single executable which uses a number of
heuristics to locate a Python executable and launch it with a
specified command line.
Rationale
Windows provides "file associations" so an executable can be associated
with an extension, allowing for scripts to be executed directly in some
contexts (eg., double-clicking the file in Windows Explorer.) Until now,
a strategy of "last installed Python wins" has been used and while not
ideal, has generally been workable due to the conservative changes in
Python 2.x releases. As Python 3.x scripts are often syntactically
incompatible with Python 2.x scripts, a different strategy must be used
to allow files with a '.py' extension to use a different executable based
on the Python version the script targets. This will be done by borrowing
the existing practices of another operating system - scripts will be able
to nominate the version of Python they need by way of a "shebang" line, as
described below.
Unix-like operating systems (referred to simply as "Unix" in this
PEP) allow scripts to be executed as if they were executable images
by examining the script for a "shebang" line which specifies the
actual executable to be used to run the script. This is described in
detail in the execve(2) man page [1] and while user documentation will
be created for this feature, for the purposes of this PEP that man
page describes a valid shebang line.
Additionally, these operating systems provide symbolic-links to
Python executables in well-known directories. For example, many
systems will have a link /usr/bin/python which references a
particular version of Python installed under the operating-system.
These symbolic links allow Python to be executed without regard for
where Python is actually installed on the machine (eg., without
requiring the path where Python is actually installed to be
referenced in the shebang line or in the PATH.) PEP 394 'The "python"
command on Unix-Like Systems' [2] describes additional conventions
for more fine-grained specification of a particular Python version.
These 2 facilities combined allow for a portable and somewhat
predictable way of both starting Python interactively and for allowing
Python scripts to execute. This PEP describes an implementation of a
launcher which can offer the same benefits for Python on the Windows
platform and therefore allows the launcher to be the executable
associated with '.py' files to support multiple Python versions
concurrently.
While this PEP offers the ability to use a shebang line which should
work on both Windows and Unix, this is not the primary motivation for
this PEP - the primary motivation is to allow a specific version to be
specified without inventing new syntax or conventions to describe
it.
Specification
This PEP specifies features of the launcher; a prototype
implementation is provided in [3], which will be distributed
together with the Windows installer of Python, but will also be
available separately (released in step with the Python
installer). New features may be added to the launcher as
long as the features prescribed here continue to work.
Installation
The launcher comes in 2 versions - one which is a console program and
one which is a "windows" (ie., GUI) program. These 2 launchers correspond
to the 'python.exe' and 'pythonw.exe' executables which currently ship
with Python. The console launcher will be named 'py.exe' and the Windows
one named 'pyw.exe'. The "windows" (ie., GUI) version of the launcher
will attempt to locate and launch pythonw.exe even if a virtual shebang
line nominates simply "python" - in fact, the trailing 'w' notation is
not supported in the virtual shebang line at all.
The launcher is installed into the Windows directory (see
discussion below) if installed by a privileged user. The
stand-alone installer asks for an alternative location of the
launcher, and adds that location to the user's PATH.
The installation in the Windows directory is a 32-bit executable
(see discussion); the standalone installer may also offer to install
64-bit versions of the launcher.
The launcher installation is registered in
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\SharedDLLs
with a reference counter.
It contains a version resource matching the version number of the
pythonXY.dll with which it is distributed. Independent
installations will overwrite older versions
of the launcher with newer versions. Stand-alone releases use
a release level of 0x10 in FIELD3 of the CPython release on which
they are based.
Once installed, the "console" version of the launcher is
associated with .py files and the "windows" version associated with .pyw
files.
The launcher is not tied to a specific version of Python - eg., a
launcher distributed with Python 3.3 should be capable of locating and
executing any Python 2.x and Python 3.x version. However, the
launcher binaries have a version resource that is the same as the
version resource in the Python binaries that they are released with.
Python Script Launching
The launcher is restricted to launching Python scripts.
It is not intended as a general-purpose script launcher or
shebang processor.
The launcher supports the syntax of shebang lines as described
in [1], including all restrictions listed.
The launcher supports shebang lines referring to Python
executables with any of the (regex) prefixes "/usr/bin/", "/usr/local/bin"
and "/usr/bin/env *", as well as binaries specified without any prefix.
For example, a shebang line of '#! /usr/bin/python' should work even
though there is unlikely to be an executable in the relative Windows
directory "\usr\bin". This means that many scripts can use a single
shebang line and be likely to work on both Unix and Windows without
modification.
The launcher will support fully-qualified paths to executables.
While this will make the script inherently non-portable, it is a
feature offered by Unix and would be useful for Windows users in
some cases.
The launcher will be capable of supporting implementations other than
CPython, such as jython and IronPython, but given both the absence of
common links on Unix (such as "/usr/bin/jython") and the inability for the
launcher to automatically locate the installation location of these
implementations on Windows, the launcher will support this via
customization options. Scripts taking advantage of this will not be
portable (as these customization options must be set to reflect the
configuration of the machine on which the launcher is running) but this
ability is nonetheless considered worthwhile.
On Unix, the user can control which specific version of Python is used
by adjusting the links in /usr/bin to point to the desired version. As
the launcher on Windows will not use Windows links, customization options
(exposed via both environment variables and INI files) will be used to
override the semantics for determining what version of Python will be
used. For example, while a shebang line of "/usr/bin/python2" will
automatically locate a Python 2.x implementation, an environment variable
can override exactly which Python 2.x implementation will be chosen.
Similarly for "/usr/bin/python" and "/usr/bin/python3". This is
specified in detail later in this PEP.
Shebang line parsing
If the first command-line argument does not start with a dash ('-')
character, an attempt will be made to open that argument as a file
and parse it for a shebang line according to the rules in [1]::
#! interpreter [optional-arg]
Once parsed, the command will be categorized according to the following rules:
* If the command starts with the definition of a customized command
followed by a whitespace character (including a newline), the customized
command will be used. See below for a description of customized
commands.
* The launcher will define a set of prefixes which are considered Unix
compatible commands to launch Python, namely "/usr/bin/python",
"/usr/local/bin/python", "/usr/bin/env python", and "python".
If a command starts with one of these strings, it will be treated as a
'virtual command' and the rules described in Python Version Qualifiers
(below) will be used to locate the executable to use.
* Otherwise the command is assumed to be directly ready to execute - ie.
a fully-qualified path (or a reference to an executable on the PATH)
optionally followed by arguments. The contents of the string will not
be parsed - it will be passed directly to the Windows CreateProcess
function after appending the name of the script and the launcher
command-line arguments. This means that the rules used by
CreateProcess will be used, including how relative path names and
executable references without extensions are treated. Notably, the
Windows command processor will not be used, so special rules used by the
command processor (such as automatic appending of extensions other than
'.exe', support for batch files, etc) will not be used.
The use of 'virtual' shebang lines is encouraged as this should
allow for portable shebang lines to be specified which work on
multiple operating systems and different installations of the same
operating system.
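The categorization rules above can be sketched in Python (a simplified illustration; the prefix list comes from this PEP, while the ``categorize`` function and the customized-commands mapping are hypothetical):

```python
# Simplified sketch of the launcher's shebang categorization rules.
# The virtual-command prefixes are those listed in this PEP; the
# function name and the customized-commands dict are hypothetical.

VIRTUAL_PREFIXES = (
    "/usr/bin/python",
    "/usr/local/bin/python",
    "/usr/bin/env python",
    "python",
)

def categorize(command, customized_commands):
    """Classify a parsed shebang command as customized, virtual or direct."""
    words = command.split()
    first_word = words[0] if words else ""
    # Rule 1: a customized command name followed by whitespace (or alone).
    if first_word in customized_commands:
        return ("customized", customized_commands[first_word])
    # Rule 2: Unix-compatible prefixes become 'virtual commands'.
    for prefix in VIRTUAL_PREFIXES:
        if command.startswith(prefix):
            return ("virtual", command)
    # Rule 3: anything else is passed directly to CreateProcess.
    return ("direct", command)
```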
If the first argument can not be opened as a file or if no valid
shebang line can be found, the launcher will act as if a shebang line of
'#!python' was found - ie., a default Python interpreter will be
located and the arguments passed to that. However, if a valid
shebang line is found but the process specified by that line can not
be started, the default interpreter will not be started - the failure
to create the specified child process will cause the launcher to display
an appropriate message and terminate with a specific exit code.
Configuration file
Two .ini files will be searched by the launcher - ``py.ini`` in the
current user's "application data" directory (i.e. the directory returned
by calling the Windows function SHGetFolderPath with CSIDL_LOCAL_APPDATA,
%USERPROFILE%\AppData\Local on Vista+,
%USERPROFILE%\Local Settings\Application Data on XP)
and ``py.ini`` in the same directory as the launcher. The same .ini
files are used for both the 'console' version of the launcher (i.e.
py.exe) and for the 'windows' version (i.e. pyw.exe).
Customization specified in the "application directory" will take
precedence over the file next to the executable, so a user who may not
have write access to the .ini file next to the launcher can override
commands in that global .ini file.
Virtual commands in shebang lines
Virtual Commands are shebang lines which start with strings which would
be expected to work on Unix platforms - examples include
'/usr/bin/python', '/usr/bin/env python' and 'python'. Optionally, the
virtual command may be suffixed with a version qualifier (see below),
such as '/usr/bin/python2' or '/usr/bin/python3.2'. The command executed
is based on the rules described in Python Version Qualifiers
below.
Customized Commands
The launcher will support the ability to define "Customized Commands" in a
Windows .ini file (ie, a file which can be parsed by the Windows function
GetPrivateProfileString). A section called '[commands]' can be created
with key names defining the virtual command and the value specifying the
actual command-line to be used for this virtual command.
For example, if an INI file has the contents:
[commands]
vpython=c:\bin\vpython.exe -foo
Then a shebang line of '#! vpython' in a script named 'doit.py' will
result in the launcher using the command-line 'c:\bin\vpython.exe -foo
doit.py'
The precise details about the names, locations and search order of the
.ini files are in the launcher documentation [4].
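The launcher itself performs this lookup with the Windows GetPrivateProfileString API, but the semantics of the ``[commands]`` section can be illustrated cross-platform with ``configparser`` (the helper name ``resolve_customized`` is hypothetical):

```python
# Illustrative sketch of the [commands] lookup described above, using
# configparser in place of the Windows GetPrivateProfileString API.
import configparser

def resolve_customized(command_name, ini_text):
    """Return the command line for a [commands] entry, or None."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    if parser.has_option("commands", command_name):
        return parser.get("commands", command_name)
    return None

ini = "[commands]\nvpython=c:\\bin\\vpython.exe -foo\n"
```

With the INI contents from the example above, ``resolve_customized("vpython", ini)`` yields the configured command line, to which the launcher then appends the script name and remaining arguments.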
Python Version Qualifiers
Some of the features described allow an optional Python version qualifier
to be used.
A version qualifier starts with a major version number and can optionally
be followed by a period ('.') and a minor version specifier. If the minor
qualifier is specified, it may optionally be followed by "-32" to indicate
the 32bit implementation of that version be used. Note that no "-64"
qualifier is necessary as this is the default implementation (see below).
On 64bit Windows with both 32bit and 64bit implementations of the
same (major.minor) Python version installed, the 64bit version will
always be preferred. This will be true for both 32bit and 64bit
implementations of the launcher - a 32bit launcher will prefer to
execute a 64bit Python installation of the specified version if
available. This is so the behavior of the launcher can be predicted
knowing only what versions are installed on the PC and without
regard to the order in which they were installed (ie, without knowing
whether a 32 or 64bit version of Python and corresponding launcher was
installed last). As noted above, an optional "-32" suffix can be used
on a version specifier to change this behaviour.
If no version qualifiers are found in a command, the environment variable
``PY_PYTHON`` can be set to specify the default version qualifier - the default
value is "2". Note this value could specify just a major version (e.g. "2") or
a major.minor qualifier (e.g. "2.6"), or even major.minor-32.
If no minor version qualifiers are found, the environment variable
``PY_PYTHON{major}`` (where ``{major}`` is the current major version qualifier
as determined above) can be set to specify the full version. If no such option
is found, the launcher will enumerate the installed Python versions and use
the latest minor release found for the major version, which is likely,
although not guaranteed, to be the most recently installed version in that
family.
In addition to environment variables, the same settings can be configured
in the .INI file used by the launcher. The section in the INI file is
called ``[defaults]`` and the key name will be the same as the
environment variables without the leading ``PY_`` prefix (and note that
the key names in the INI file are case insensitive.) The contents of
an environment variable will override things specified in the INI file.
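The qualifier syntax and the ``PY_PYTHON`` fallback described above can be sketched as follows (the helper names are hypothetical, not part of the launcher):

```python
# Hypothetical sketch of version-qualifier parsing and the PY_PYTHON
# default, following the rules described in this PEP.
import os
import re

def parse_qualifier(qualifier):
    """Split e.g. '2.6-32' into (major, minor, want_32bit)."""
    m = re.match(r"^(\d+)(?:\.(\d+))?(-32)?$", qualifier)
    if not m:
        raise ValueError("invalid version qualifier: %r" % qualifier)
    major, minor, suffix = m.groups()
    # No "-64" suffix exists: 64bit is the default where available.
    return major, minor, suffix == "-32"

def default_qualifier(env=os.environ):
    """PY_PYTHON supplies the default qualifier; '2' if unset."""
    return env.get("PY_PYTHON", "2")
```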
Command-line handling
Only the first command-line argument will be checked for a shebang line
and only if that argument does not start with a '-'.
If the only command-line argument is "-h" or "--help", the launcher will
print a small banner and command-line usage, then pass the argument to
the default Python. This will cause help for the launcher to be printed
followed by help for Python itself. The output from the launcher will
clearly indicate the extended help information is coming from the
launcher and not Python.
As a concession to interactively launching Python, the launcher will
support the first command-line argument optionally being a dash ("-")
followed by a version qualifier, as described above, to nominate a
specific version be used. For example, while "py.exe" may locate and
launch the latest Python 2.x implementation installed, a command-line such
as "py.exe -3" could specify the latest Python 3.x implementation be
launched, while "py.exe -2.6-32" could specify a 32bit implementation
of Python 2.6 be located and launched. If a Python 2.x implementation is
desired to be launched with the -3 flag, the command-line would need to be
similar to "py.exe -2 -3" (or the specific version of Python could
obviously be launched manually without use of this launcher.) Note that
this feature can not be combined with shebang processing: the file to be
scanned for a shebang line and this version argument must both be the
first command-line argument, so they are mutually exclusive.
All other arguments will be passed untouched to the child Python process.
Process Launching
The launcher offers some conveniences for Python developers working
interactively - for example, starting the launcher with no command-line
arguments will launch the default Python with no command-line arguments.
Further, command-line arguments will be supported to allow a specific
Python version to be launched interactively - however, these conveniences
must not detract from the primary purpose of launching scripts and must
be easy to avoid if desired.
The launcher creates a subprocess to start the actual
interpreter. See Discussion below for the rationale.
Discussion
It may be surprising that the launcher is installed into the
Windows directory, and not the System32 directory. The reason is
that the System32 directory is not on the Path of a 32-bit process
running on a 64-bit system. However, the Windows directory is
always on the path.
The launcher that is installed into the Windows directory is a 32-bit
executable so that the 32-bit CPython installer can provide the same
binary for both 32-bit and 64-bit Windows installations.
Ideally, the launcher process would execute Python directly inside
the same process, primarily so the parent of the launcher process could
terminate the launcher and have the Python interpreter terminate. If the
launcher executes Python as a sub-process and the parent of the launcher
terminates the launcher, the Python process will be unaffected.
However, there are a number of practical problems associated with this
approach. Windows does not support the execv* family of Unix functions,
so this could only be done by the launcher dynamically loading the Python
DLL, but this would have a number of side-effects. The most serious
side effect of this is that the value of sys.executable would refer to the
launcher instead of the Python implementation. Many Python scripts use the
value of sys.executable to launch child processes, and these scripts may
fail to work as expected if the launcher is used. Consider a "parent"
script with a shebang line of '#! /usr/bin/python3' which attempts to
launch a child script (with no shebang) via sys.executable - currently the
child is launched using the exact same version running the parent script.
If sys.executable referred to the launcher, the child would likely be
executed using a Python 2.x version and would likely fail with a
SyntaxError.
Another hurdle is the support for alternative Python implementations
using the "customized commands" feature described above, where loading
the command dynamically into a running executable is not possible.
The final hurdle is the rules above regarding 64bit and 32bit programs -
a 32bit launcher would be unable to load the 64bit version of Python and
vice-versa.
Given these considerations, the launcher will execute its command in a
child process, remaining alive while the child process is executing, then
terminate with the same exit code as returned by the child. To address
concerns regarding the termination of the launcher not killing the child,
the Win32 Job API will be used to arrange so that the child process is
automatically killed when the parent is terminated (although children of
that child process will continue as is the case now.) As this Windows API
is available in Windows XP and later, this launcher will not work on
Windows 2000 or earlier.
References
[1] http://linux.die.net/man/2/execve
[2] http://www.python.org/dev/peps/pep-0394/
[3] https://bitbucket.org/vinay.sajip/pylauncher
[4] https://bitbucket.org/vinay.sajip/pylauncher/src/tip/Doc/launcher.rst
Copyright
This document has been placed in the public domain.
pep-0398 Python 3.3 Release Schedule
| PEP: | 398 |
|---|---|
| Title: | Python 3.3 Release Schedule |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Georg Brandl <georg at python.org> |
| Status: | Active |
| Type: | Informational |
| Content-Type: | text/x-rst |
| Created: | 23-Mar-2011 |
| Python-Version: | 3.3 |
Contents
Abstract
This document describes the development and release schedule for Python 3.3. The schedule primarily concerns itself with PEP-sized items.
Release Manager and Crew
- 3.3 Release Manager: Georg Brandl
- Windows installers: Martin v. Löwis
- Mac installers: Ronald Oussoren/Ned Deily
- Documentation: Georg Brandl
3.3 Lifespan
3.3 will receive bugfix updates approximately every 4-6 months for approximately 18 months. After the release of 3.4.0 final, a final 3.3 bugfix update will be released. After that, security updates (source only) will be released until 5 years after the release of 3.3 final, which will be September 2017.
Release Schedule
3.3.0 schedule
- 3.3.0 alpha 1: March 5, 2012
- 3.3.0 alpha 2: April 2, 2012
- 3.3.0 alpha 3: May 1, 2012
- 3.3.0 alpha 4: May 31, 2012
- 3.3.0 beta 1: June 27, 2012
(No new features beyond this point.)
- 3.3.0 beta 2: August 12, 2012
- 3.3.0 candidate 1: August 24, 2012
- 3.3.0 candidate 2: September 9, 2012
- 3.3.0 candidate 3: September 24, 2012
- 3.3.0 final: September 29, 2012
3.3.1 schedule
- 3.3.1 candidate 1: March 23, 2013
- 3.3.1 final: April 6, 2013
3.3.2 schedule
- 3.3.2 final: May 13, 2013
3.3.3 schedule
- 3.3.3 candidate 1: October 27, 2013
- 3.3.3 candidate 2: November 9, 2013
- 3.3.3 final: November 16, 2013
3.3.4 schedule
- 3.3.4 candidate 1: January 26, 2014
- 3.3.4 final: February 9, 2014
3.3.5 schedule
Python 3.3.5 was the last regular maintenance release before 3.3 entered security-fix only mode.
- 3.3.5 candidate 1: February 22, 2014
- 3.3.5 candidate 2: March 1, 2014
- 3.3.5 final: March 8, 2014
3.3.6 schedule
- 3.3.6 candidate 1 (source-only release): October 4, 2014
- 3.3.6 final (source-only release): October 11, 2014
Features for 3.3
Implemented / Final PEPs:
- PEP 362: Function Signature Object
- PEP 380: Syntax for Delegating to a Subgenerator
- PEP 393: Flexible String Representation
- PEP 397: Python launcher for Windows
- PEP 399: Pure Python/C Accelerator Module Compatibility Requirements
- PEP 405: Python Virtual Environments
- PEP 409: Suppressing exception context
- PEP 412: Key-Sharing Dictionary
- PEP 414: Explicit Unicode Literal for Python 3.3
- PEP 415: Implement context suppression with exception attributes
- PEP 417: Including mock in the Standard Library
- PEP 418: Add monotonic time, performance counter, and process time functions
- PEP 420: Implicit Namespace Packages
- PEP 421: Adding sys.implementation
- PEP 3118: Revising the buffer protocol (protocol semantics finalised)
- PEP 3144: IP Address manipulation library
- PEP 3151: Reworking the OS and IO exception hierarchy
- PEP 3155: Qualified name for classes and functions
Other final large-scale changes:
- Addition of the "faulthandler" module
- Addition of the "lzma" module, and lzma/xz support in tarfile
- Implementing __import__ using importlib
- Addition of the C decimal implementation
- Switch of Windows build toolchain to VS 2010
Candidate PEPs:
- None
Other planned large-scale changes:
- None
Deferred to post-3.3:
- PEP 395: Qualified Names for Modules
- PEP 3143: Standard daemon process library
- PEP 3154: Pickle protocol version 4
- Breaking out standard library and docs in separate repos
- Addition of the "packaging" module, deprecating "distutils"
- Addition of the "regex" module
- Email version 6
- A standard event-loop interface (PEP by Jim Fulton pending)
Copyright
This document has been placed in the public domain.
pep-0399 Pure Python/C Accelerator Module Compatibility Requirements
| PEP: | 399 |
|---|---|
| Title: | Pure Python/C Accelerator Module Compatibility Requirements |
| Version: | 88219 |
| Last-Modified: | 2011-01-27 13:47:00 -0800 (Thu, 27 Jan 2011) |
| Author: | Brett Cannon <brett at python.org> |
| Status: | Final |
| Type: | Informational |
| Content-Type: | text/x-rst |
| Created: | 04-Apr-2011 |
| Python-Version: | 3.3 |
| Post-History: | 04-Apr-2011, 12-Apr-2011, 17-Jul-2011, 15-Aug-2011, 01-Jan-2013 |
Contents
Abstract
The Python standard library under CPython contains various instances of modules implemented in both pure Python and C (either entirely or partially). This PEP requires that in these instances that the C code must pass the test suite used for the pure Python code so as to act as much as a drop-in replacement as reasonably possible (C- and VM-specific tests are exempt). It is also required that new C-based modules lacking a pure Python equivalent implementation get special permission to be added to the standard library.
Rationale
Python has grown beyond the CPython virtual machine (VM). IronPython [1], Jython [2], and PyPy [3] are all currently viable alternatives to the CPython VM. The VM ecosystem that has sprung up around the Python programming language has led to Python being used in many different areas where CPython cannot be used, e.g., Jython allowing Python to be used in Java applications.
A problem all of the VMs other than CPython face is handling modules from the standard library that are implemented (to some extent) in C. Since other VMs do not typically support the entire C API of CPython [4] they are unable to use the code used to create the module. Often times this leads these other VMs to either re-implement the modules in pure Python or in the programming language used to implement the VM itself (e.g., in C# for IronPython). This duplication of effort between CPython, PyPy, Jython, and IronPython is extremely unfortunate as implementing a module at least in pure Python would help mitigate this duplicate effort.
The purpose of this PEP is to minimize this duplicate effort by mandating that all new modules added to Python's standard library must have a pure Python implementation unless special dispensation is given. This makes sure that a module in the stdlib is available to all VMs and not just to CPython (pre-existing modules that do not meet this requirement are exempt, although there is nothing preventing someone from adding in a pure Python implementation retroactively).
Re-implementing parts (or all) of a module in C (in the case of CPython) is still allowed for performance reasons, but any such accelerated code must pass the same test suite (sans VM- or C-specific tests) to verify semantics and prevent divergence. To accomplish this, the test suite for the module must have comprehensive coverage of the pure Python implementation before the acceleration code may be added.
Details
Starting in Python 3.3, any modules added to the standard library must have a pure Python implementation. This rule can only be ignored if the Python development team grants a special exemption for the module. Typically the exemption will be granted only when a module wraps a specific C-based library (e.g., sqlite3 [5]). In granting an exemption it will be recognized that the module will be considered exclusive to CPython and not part of Python's standard library that other VMs are expected to support. Usage of ctypes to provide an API for a C library will continue to be frowned upon as ctypes lacks compiler guarantees that C code typically relies upon to prevent certain errors from occurring (e.g., API changes).
Even though a pure Python implementation is mandated by this PEP, it does not preclude the use of a companion acceleration module. If an acceleration module is provided it is to be named the same as the module it is accelerating with an underscore attached as a prefix, e.g., _warnings for warnings. The common pattern to access the accelerated code from the pure Python implementation is to import it with an import *, e.g., from _warnings import *. This is typically done at the end of the module to allow it to overwrite specific Python objects with their accelerated equivalents. This kind of import can also be done before the end of the module when needed, e.g., an accelerated base class is provided but is then subclassed by Python code. This PEP does not mandate that pre-existing modules in the stdlib that lack a pure Python equivalent gain such a module. But if people do volunteer to provide and maintain a pure Python equivalent (e.g., the PyPy team volunteering their pure Python implementation of the csv module and maintaining it) then such code will be accepted. In those instances the C version is considered the reference implementation in terms of expected semantics.
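The naming and wildcard-import pattern described above can be sketched with a hypothetical module (the names ``process`` and ``_mymodule`` are illustrative, not real stdlib names; stdlib modules such as warnings/_warnings follow this shape):

```python
# Hypothetical stdlib module following the pattern above: the pure
# Python implementation is defined first, and an optional C accelerator
# named with a leading underscore (_mymodule) overrides it at the end.

def process(data):
    """Pure Python implementation; defines the reference semantics."""
    return sorted(data)

try:
    # If the accelerator exists, its names replace the Python ones.
    from _mymodule import *  # hypothetical accelerator module
except ImportError:
    # No accelerator available: the pure Python definitions stand.
    pass
```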
Any new accelerated code must act as a drop-in replacement as close to the pure Python implementation as reasonable. Technical details of the VM providing the accelerated code are allowed to differ as necessary, e.g., a class being a type when implemented in C. To verify that the Python and equivalent C code operate as similarly as possible, both code bases must be tested using the same tests which apply to the pure Python code (tests specific to the C code or any VM do not fall under this requirement). The test suite is expected to be extensive in order to verify expected semantics.
Acting as a drop-in replacement also dictates that no public API be provided in accelerated code that does not exist in the pure Python code. Without this requirement people could accidentally come to rely on a detail in the accelerated code which is not made available to other VMs that use the pure Python implementation. To help verify that the contract of semantic equivalence is being met, a module must be tested both with and without its accelerated code as thoroughly as possible.
As an example, to write tests which exercise both the pure Python and C accelerated versions of a module, a basic idiom can be followed:
from test.support import import_fresh_module
import unittest

c_heapq = import_fresh_module('heapq', fresh=['_heapq'])
py_heapq = import_fresh_module('heapq', blocked=['_heapq'])

class ExampleTest:

    def test_example(self):
        self.assertTrue(hasattr(self.module, 'heapify'))

class PyExampleTest(ExampleTest, unittest.TestCase):
    module = py_heapq

@unittest.skipUnless(c_heapq, 'requires the C _heapq module')
class CExampleTest(ExampleTest, unittest.TestCase):
    module = c_heapq

if __name__ == '__main__':
    unittest.main()
The test module defines a base class (ExampleTest) with test methods that access the heapq module through a self.module class attribute, and two subclasses that set this attribute to either the Python or the C version of the module. Note that only the two subclasses inherit from unittest.TestCase -- this prevents the ExampleTest class from being detected as a TestCase subclass by unittest test discovery. A skipUnless decorator can be added to the class that tests the C code in order to have these tests skipped when the C module is not available.
If this test were to provide extensive coverage for heapq.heappop() in the pure Python implementation then the accelerated C code would be allowed to be added to CPython's standard library. If it did not, then the test suite would need to be updated until proper coverage was provided before the accelerated C code could be added.
To also help with compatibility, C code should use abstract APIs on objects to prevent accidental dependence on specific types. For instance, if a function accepts a sequence then the C code should default to using PyObject_GetItem() instead of something like PyList_GetItem(). C code is allowed to have a fast path if the proper PyList_CheckExact() is used, but otherwise APIs should work with any object that duck types to the proper interface instead of a specific type.
References
| [1] | http://ironpython.net/ |
| [2] | http://www.jython.org/ |
| [3] | http://pypy.org/ |
| [4] | http://docs.python.org/py3k/c-api/index.html |
| [5] | http://docs.python.org/py3k/library/sqlite3.html |
Copyright
This document has been placed in the public domain.
pep-0400 Deprecate codecs.StreamReader and codecs.StreamWriter
| PEP: | 400 |
|---|---|
| Title: | Deprecate codecs.StreamReader and codecs.StreamWriter |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Victor Stinner <victor.stinner at gmail.com> |
| Status: | Deferred |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 28-May-2011 |
| Python-Version: | 3.3 |
Contents
Abstract
io.TextIOWrapper and codecs.StreamReaderWriter offer the same API [1]. TextIOWrapper has more features and is faster than StreamReaderWriter. Duplicate code means that bugs have to be fixed twice and that we may have subtle differences between the two implementations.
The codecs module was introduced in Python 2.0 (see the PEP 100). The io module was introduced in Python 2.6 and 3.0 (see the PEP 3116), and reimplemented in C in Python 2.7 and 3.1.
PEP Deferral
Further exploration of the concepts covered in this PEP has been deferred for lack of a current champion interested in promoting the goals of the PEP and collecting and incorporating feedback, and with sufficient available time to do so effectively.
Motivation
When the Python I/O model was updated for 3.0, the concept of a "stream-with-known-encoding" was introduced in the form of io.TextIOWrapper. As this class is critical to the performance of text-based I/O in Python 3, this module has an optimised C version which is used by CPython by default. Many corner cases in handling buffering, stateful codecs and universal newlines have been dealt with since the release of Python 3.0.
This new interface overlaps heavily with the legacy codecs.StreamReader, codecs.StreamWriter and codecs.StreamReaderWriter interfaces that were part of the original codec interface design in PEP 100. These interfaces are organised around the principle of an encoding with an associated stream (i.e. the reverse of the arrangement in the io module), so the original PEP 100 design required that codec writers provide appropriate StreamReader and StreamWriter implementations in addition to the core codec encode() and decode() methods. This places a heavy burden on codec authors providing these specialised implementations to correctly handle many of the corner cases (see Appendix A) that have now been dealt with by io.TextIOWrapper. While deeper integration between the codec and the stream allows for additional optimisations in theory, these optimisations have in practice either not been carried out, or else the associated code duplication means that the corner cases that have been fixed in io.TextIOWrapper are still not handled correctly in the various StreamReader and StreamWriter implementations.
Accordingly, this PEP proposes that:
- codecs.open() be updated to delegate to the builtin open() in Python 3.3;
- the legacy codecs.Stream* interfaces, including the streamreader and streamwriter attributes of codecs.CodecInfo be deprecated in Python 3.3.
Rationale
StreamReader and StreamWriter issues
- StreamReader is unable to translate newlines.
- StreamWriter doesn't support "line buffering" (flush if the input text contains a newline).
- StreamReader classes of the CJK encodings (e.g. GB18030) only support UNIX newlines ('\n').
- StreamReader and StreamWriter are stateful codecs but don't expose functions to control their state (getstate() or setstate()). Each codec has to handle corner cases, see Appendix A.
- StreamReader and StreamWriter are very similar to IncrementalDecoder and IncrementalEncoder, some code is duplicated for stateful codecs (e.g. UTF-16).
- Each codec has to reimplement its own StreamReader and StreamWriter class, even if it's trivial (just call the encoder/decoder).
- codecs.open(filename, "r") creates an io.TextIOWrapper object.
- No codec implements an optimized method in StreamReader or StreamWriter based on the specificities of the codec.
Issues in the bug tracker:
- Issue #5445 (2009-03-08): codecs.StreamWriter.writelines problem when passed generator
- Issue #7262 (2009-11-04): codecs.open() + eol (windows)
- Issue #8260 (2010-03-29): When I use codecs.open(...) and f.readline() follow up by f.read() return bad result
- Issue #8630 (2010-05-05): Keepends param in codec readline(s)
- Issue #10344 (2010-11-06): codecs.readline doesn't care buffering
- Issue #11461 (2011-03-10): Reading UTF-16 with codecs.readline() breaks on surrogate pairs
- Issue #12446 (2011-06-30): StreamReader Readlines behavior odd
- Issue #12508 (2011-07-06): Codecs Anomaly
- Issue #12512 (2011-07-07): codecs: StreamWriter issues with stateful codecs after a seek or with append mode
- Issue #12513 (2011-07-07): codec.StreamReaderWriter: issues with interlaced read-write
TextIOWrapper features
- TextIOWrapper supports any kind of newline, including translating newlines (to UNIX newlines), to read and write.
- TextIOWrapper reuses codecs incremental encoders and decoders (no duplication of code).
- The io module (TextIOWrapper) is faster than the codecs module (StreamReader). It is implemented in C, whereas codecs is implemented in Python.
- TextIOWrapper has a readahead algorithm which speeds up small reads: reading character by character or line by line (io is 10x to 25x faster than codecs on these operations).
- TextIOWrapper has a write buffer.
- TextIOWrapper.tell() is optimized.
- TextIOWrapper supports random access (read+write) using a single class, which would permit optimizing interlaced read-write (but no such optimization is implemented).
TextIOWrapper issues
- Issue #12215 (2011-05-30): TextIOWrapper: issues with interlaced read-write
Possible improvements of StreamReader and StreamWriter
By adding codec state read/write functions to the StreamReader and StreamWriter classes, it will become possible to fix issues with stateful codecs in a base class instead of in each stateful StreamReader and StreamWriter class.
It would be possible to change StreamReader and StreamWriter to make them use IncrementalDecoder and IncrementalEncoder.
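A generic writer built on the incremental machinery might be sketched as follows (a hypothetical, simplified class illustrating the idea, not proposed stdlib code):

```python
# Sketch: a StreamWriter-like class that delegates all encoding,
# including codec state such as the UTF-16 BOM, to the codec's
# IncrementalEncoder instead of reimplementing it per codec.
import codecs

class IncrementalStreamWriter:
    def __init__(self, stream, encoding, errors='strict'):
        self.stream = stream
        self.encoder = codecs.getincrementalencoder(encoding)(errors)

    def write(self, text):
        # The incremental encoder tracks state across calls, so the
        # BOM is emitted only once for stateful codecs like UTF-16.
        self.stream.write(self.encoder.encode(text))
```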
A codec can implement variants which are optimized for the specific encoding or intercept certain stream methods to add functionality or improve the encoding/decoding performance. TextIOWrapper cannot implement such optimizations, but TextIOWrapper uses incremental encoders and decoders and uses read and write buffers, so the overhead of incomplete inputs is negligible.
A lot more could be done for other variable length encoding codecs, e.g. UTF-8, since these often have problems near the end of a read due to missing bytes. The UTF-32-BE/LE codecs could simply multiply the character position by 4 to get the byte position.
Usage of StreamReader and StreamWriter
These classes are rarely used directly, but rather indirectly through codecs.open(). They are not used in the Python 3 standard library (except in the codecs module itself).
Some projects implement their own codec with StreamReader and StreamWriter, but don't use these classes.
Backwards Compatibility
Keep the public API, codecs.open
codecs.open() can be replaced by the builtin open() function. open() has a similar API but also more options. Both functions return file-like objects (with the same API).
codecs.open() was the only way to open a text file in Unicode mode until Python 2.6. Many Python 2 programs use this function. Removing codecs.open() implies more work to port programs from Python 2 to Python 3, especially projects using the same code base for the two Python versions (without using the 2to3 program).
codecs.open() is kept for backward compatibility with Python 2.
Deprecate StreamReader and StreamWriter
Instantiating StreamReader or StreamWriter must emit a DeprecationWarning in Python 3.3. Defining a subclass doesn't emit a DeprecationWarning.
codecs.open() will be changed to reuse the builtin open() function (TextIOWrapper) to read-write text files.
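The proposed delegation could look roughly like this (a hedged sketch of the idea, not the actual stdlib patch; the name ``codecs_open`` is illustrative, and the real codecs.open() has slightly different defaults):

```python
# Sketch of the proposal: codecs.open() forwarding text-mode opens to
# the builtin open(), which returns an io.TextIOWrapper. Illustrative
# only; not the stdlib implementation.

def codecs_open(filename, mode='r', encoding=None, errors=None,
                buffering=-1):
    # The builtin open() handles buffering, newline translation and
    # stateful codecs correctly via io.TextIOWrapper.
    return open(filename, mode, buffering=buffering,
                encoding=encoding, errors=errors)
```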
Alternative Approach
An alternative to the deprecation of the codecs.Stream* classes is to rename codecs.open() to codecs.open_stream(), and to create a new codecs.open() function reusing open() and so io.TextIOWrapper.
Appendix A: Issues with stateful codecs
It is difficult to use a stateful codec correctly with a stream. Some cases are supported by the codecs module, while io has no known remaining bugs related to stateful codecs. The main difference between the codecs and the io modules is that bugs have to be fixed in the StreamReader and/or StreamWriter classes of each codec for the codecs module, whereas bugs can be fixed only once in io.TextIOWrapper. Here are some examples of issues with stateful codecs.
Stateful codecs
Python supports the following stateful codecs:
- cp932
- cp949
- cp950
- euc_jis_2004
- euc_jisx0213
- euc_jp
- euc_kr
- gb18030
- gbk
- hz
- iso2022_jp
- iso2022_jp_1
- iso2022_jp_2
- iso2022_jp_2004
- iso2022_jp_3
- iso2022_jp_ext
- iso2022_kr
- shift_jis
- shift_jis_2004
- shift_jisx0213
- utf_8_sig
- utf_16
- utf_32
Read and seek(0)
with open(filename, 'w', encoding='utf-16') as f:
    f.write('abc')
    f.write('def')
    f.seek(0)
    assert f.read() == 'abcdef'
    f.seek(0)
    assert f.read() == 'abcdef'
The io and codecs modules support this use case correctly.
seek(n)
with open(filename, 'w', encoding='utf-16') as f:
    f.write('abc')
    pos = f.tell()
with open(filename, 'w', encoding='utf-16') as f:
    f.seek(pos)
    f.write('def')
    f.seek(0)
    f.write('###')
with open(filename, 'r', encoding='utf-16') as f:
    assert f.read() == '###def'
The io module supports this use case, whereas codecs fails because it writes a new BOM on the second write (issue #12512).
Append mode
with open(filename, 'w', encoding='utf-16') as f:
    f.write('abc')
with open(filename, 'a', encoding='utf-16') as f:
    f.write('def')
with open(filename, 'r', encoding='utf-16') as f:
    assert f.read() == 'abcdef'
The io module supports this use case, whereas codecs fails because it writes a new BOM on the second write (issue #12512).
Copyright
This document has been placed in the public domain.
pep-0401 BDFL Retirement
| PEP: | 401 |
|---|---|
| Title: | BDFL Retirement |
| Version: | $Revision$ |
| Last-Modified: | $Date: 2009-04-01 00:00:00 -0400 (Wed, 1 Apr 2009)$ |
| Author: | Barry Warsaw, Brett Cannon |
| Status: | April Fool! |
| Type: | Process |
| Content-Type: | text/x-rst |
| Created: | 01-Apr-2009 |
| Post-History: | 01-Apr-2009 |
Abstract
The BDFL, having shepherded Python development for 20 years, officially announces his retirement, effective immediately. Following a unanimous vote, his replacement is named.
Rationale
Guido wrote the original implementation of Python in 1989, and after nearly 20 years of leading the community, has decided to step aside as its Benevolent Dictator For Life. His official title is now Benevolent Dictator Emeritus Vacationing Indefinitely from the Language (BDEVIL). Guido leaves Python in the good hands of its new leader and its vibrant community, in order to train for his lifelong dream of climbing Mount Everest.
After unanimous vote of the Python Steering Union (not to be confused with the Python Secret Underground, which emphatically does not exist) at the 2009 Python Conference (PyCon [7] 2009), Guido's successor has been chosen: Barry Warsaw, or as he is affectionately known, Uncle Barry. Uncle Barry's official title is Friendly Language Uncle For Life (FLUFL).
Official Acts of the FLUFL
FLUFL Uncle Barry enacts the following decisions, in order to demonstrate his intention to lead the community in the same responsible and open manner as his predecessor, whose name escapes him:
- Recognized that the selection of Hg as the DVCS of choice was clear proof of the onset of the BDEVIL's insanity, the FLUFL reverts this decision and switches to Bzr instead, the only true choice.
- Recognized that the != inequality operator in Python 3.0 was a horrible, finger pain inducing mistake, the FLUFL reinstates the <> diamond operator as the sole spelling. This change is important enough to be implemented for, and released in Python 3.1. To help transition to this feature, a new future statement, from __future__ import barry_as_FLUFL has been added.
- Recognized that the print function in Python 3.0 was a horrible, pain-inducing mistake, the FLUFL reinstates the print statement. This change is important enough to be implemented for, and released in Python 3.0.2.
- Recognized that the disappointing adoption curve of Python 3.0 signals its abject failure, all work on Python 3.1 and subsequent Python 3.x versions is hereby terminated. All features in Python 3.0 shall be back ported to Python 2.7 which will be the official and sole next release. The Python 3.0 string and bytes types will be back ported to Python 2.6.2 for the convenience of developers.
- Recognized that C is a 20th century language with almost universal rejection by programmers under the age of 30, the CPython implementation will terminate with the release of Python 2.6.2 and 3.0.2. Thereafter, the reference implementation of Python will target the Parrot [1] virtual machine. Alternative implementations of Python (e.g. Jython [2], IronPython [3], and PyPy [4]) are officially discouraged but tolerated.
- Recognized that the Python Software Foundation [5] having fulfilled its mission admirably, is hereby disbanded. The Python Steering Union [6] (not to be confused with the Python Secret Underground, which emphatically does not exist), is now the sole steward for all of Python's intellectual property. All PSF funds are hereby transferred to the PSU (not that PSU, the other PSU).
References
| [1] | http://www.parrot.org |
| [2] | http://www.jython.org |
| [3] | http://www.ironpython.com |
| [4] | http://www.codespeak.net/pypy |
| [5] | http://www.python.org/psf |
| [6] | http://www.pythonlabs.com |
| [7] | http://us.pycon.org/ |
Copyright
This document is the property of the Python Steering Union (not to be confused with the Python Secret Underground, which emphatically does not exist). We suppose it's okay for you to read this, but don't even think about quoting, copying, modifying, or distributing it.
pep-0402 Simplified Package Layout and Partitioning
| PEP: | 402 |
|---|---|
| Title: | Simplified Package Layout and Partitioning |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | P.J. Eby |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 12-Jul-2011 |
| Python-Version: | 3.3 |
| Post-History: | 20-Jul-2011 |
| Replaces: | 382 |
Contents
Rejection Notice
On the first day of sprints at US PyCon 2012 we had a long and fruitful discussion about PEP 382 and PEP 402. We ended up rejecting both but a new PEP will be written to carry on in the spirit of PEP 402. Martin von Löwis wrote up a summary: [3].
Abstract
This PEP proposes an enhancement to Python's package importing to:
- Surprise users of other languages less,
- Make it easier to convert a module into a package, and
- Support dividing packages into separately installed components (a la "namespace packages", as described in PEP 382)
The proposed enhancements do not change the semantics of any currently-importable directory layouts, but make it possible for packages to use a simplified directory layout (that is not importable currently).
However, the proposed changes do NOT add any performance overhead to the importing of existing modules or packages, and performance for the new directory layout should be about the same as that of previous "namespace package" solutions (such as pkgutil.extend_path()).
The Problem
"Most packages are like modules. Their contents are highly interdependent and can't be pulled apart. [However,] some packages exist to provide a separate namespace. ... It should be possible to distribute sub-packages or submodules of these [namespace packages] independently."
—Jim Fulton, shortly before the release of Python 2.3 [1]
When new users come to Python from other languages, they are often confused by Python's package import semantics. At Google, for example, Guido received complaints from "a large crowd with pitchforks" [2] that the requirement for packages to contain an __init__ module was a "misfeature", and should be dropped.
In addition, users coming from languages like Java or Perl are sometimes confused by a difference in Python's import path searching.
In most other languages that have a similar path mechanism to Python's sys.path, a package is merely a namespace that contains modules or classes, and can thus be spread across multiple directories in the language's path. In Perl, for instance, a Foo::Bar module will be searched for in Foo/ subdirectories all along the module include path, not just in the first such subdirectory found.
Worse, this is not just a problem for new users: it prevents anyone from easily splitting a package into separately-installable components. In Perl terms, it would be as if every possible Net:: module on CPAN had to be bundled up and shipped in a single tarball!
For that reason, various workarounds for this latter limitation exist, circulated under the term "namespace packages". The Python standard library has provided one such workaround since Python 2.3 (via the pkgutil.extend_path() function), and the "setuptools" package provides another (via pkg_resources.declare_namespace()).
The workarounds themselves, however, fall prey to a third issue with Python's way of laying out packages in the filesystem.
Because a package must contain an __init__ module, any attempt to distribute modules for that package must necessarily include that __init__ module, if those modules are to be importable.
However, the very fact that each distribution of modules for a package must contain this (duplicated) __init__ module, means that OS vendors who package up these module distributions must somehow handle the conflict caused by several module distributions installing that __init__ module to the same location in the filesystem.
This led to the proposal of PEP 382 ("Namespace Packages") - a way to signal to Python's import machinery that a directory was importable, using unique filenames per module distribution.
However, there was more than one downside to this approach. Performance for all import operations would be affected, and the process of designating a package became even more complex. New terminology had to be invented to explain the solution, and so on.
As terminology discussions continued on the Import-SIG, it soon became apparent that the main reason it was so difficult to explain the concepts related to "namespace packages" was because Python's current way of handling packages is somewhat underpowered, when compared to other languages.
That is, in other popular languages with package systems, no special term is needed to describe "namespace packages", because all packages generally behave in the desired fashion.
Rather than being an isolated single directory with a special marker module (as in Python), packages in other languages are typically just the union of appropriately-named directories across the entire import or inclusion path.
In Perl, for example, the module Foo is always found in a Foo.pm file, and a module Foo::Bar is always found in a Foo/Bar.pm file. (In other words, there is One Obvious Way to find the location of a particular module.)
This is because Perl considers a module to be different from a package: the package is purely a namespace in which other modules may reside, and is only coincidentally the name of a module as well.
In current versions of Python, however, the module and the package are more tightly bound together. Foo is always a module -- whether it is found in Foo.py or Foo/__init__.py -- and it is tightly linked to its submodules (if any), which must reside in the exact same directory where the __init__.py was found.
On the positive side, this design choice means that a package is quite self-contained, and can be installed, copied, etc. as a unit just by performing an operation on the package's root directory.
On the negative side, however, it is non-intuitive for beginners, and requires a more complex step to turn a module into a package. If Foo begins its life as Foo.py, then it must be moved and renamed to Foo/__init__.py.
Conversely, if you intend to create a Foo.Bar module from the start, but have no particular module contents to put in Foo itself, then you have to create an empty and seemingly-irrelevant Foo/__init__.py file, just so that Foo.Bar can be imported.
(And these issues don't just confuse newcomers to the language, either: they annoy many experienced developers as well.)
So, after some discussion on the Import-SIG, this PEP was created as an alternative to PEP 382, in an attempt to solve all of the above problems, not just the "namespace package" use cases.
And, as a delightful side effect, the solution proposed in this PEP does not affect the import performance of ordinary modules or self-contained (i.e. __init__-based) packages.
The Solution
In the past, various proposals have been made to allow more intuitive approaches to package directory layout. However, most of them failed because of an apparent backward-compatibility problem.
That is, if the requirement for an __init__ module were simply dropped, it would open up the possibility for a directory named, say, string on sys.path, to block importing of the standard library string module.
Paradoxically, however, the failure of this approach does not arise from the elimination of the __init__ requirement!
Rather, the failure arises because the underlying approach takes for granted that a package is just ONE thing, instead of two.
In truth, a package comprises two separate, but related entities: a module (with its own, optional contents), and a namespace where other modules or packages can be found.
In current versions of Python, however, the module part (found in __init__) and the namespace for submodule imports (represented by the __path__ attribute) are both initialized at the same time, when the package is first imported.
And, if you assume this is the only way to initialize these two things, then there is no way to drop the need for an __init__ module, while still being backwards-compatible with existing directory layouts.
After all, as soon as you encounter a directory on sys.path matching the desired name, that means you've "found" the package, and must stop searching, right?
Well, not quite.
A Thought Experiment
Let's hop into the time machine for a moment, and pretend we're back in the early 1990s, shortly before Python packages and __init__.py have been invented. But, imagine that we are familiar with Perl-like package imports, and we want to implement a similar system in Python.
We'd still have Python's module imports to build on, so we could certainly conceive of having Foo.py as a parent Foo module for a Foo package. But how would we implement submodule and subpackage imports?
Well, if we didn't have the idea of __path__ attributes yet, we'd probably just search sys.path looking for Foo/Bar.py.
But we'd only do it when someone actually tried to import Foo.Bar.
NOT when they imported Foo.
And that lets us get rid of the backwards-compatibility problem of dropping the __init__ requirement, back here in 2011.
How?
Well, when we import Foo, we're not even looking for Foo/ directories on sys.path, because we don't care yet. The only point at which we care, is the point when somebody tries to actually import a submodule or subpackage of Foo.
That means that if Foo is a standard library module (for example), and I happen to have a Foo directory on sys.path (without an __init__.py, of course), then nothing breaks. The Foo module is still just a module, and it's still imported normally.
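A minimal sketch of that hypothetical lazy search (the function name and signature are illustrative, not part of any proposed API):

```python
import os
import sys

def find_submodule_file(package, submodule, search_path=None):
    """Sketch of the thought experiment: look for package/submodule.py
    along the path only when 'import package.submodule' is attempted,
    never when 'import package' itself runs."""
    entries = search_path if search_path is not None else sys.path
    for entry in entries:
        candidate = os.path.join(entry, package, submodule + '.py')
        if os.path.isfile(candidate):
            return candidate
    return None
```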
Self-Contained vs. "Virtual" Packages
Of course, in today's Python, trying to import Foo.Bar will fail if Foo is just a Foo.py module (and thus lacks a __path__ attribute).
So, this PEP proposes to dynamically create a __path__, in the case where one is missing.
That is, if I try to import Foo.Bar the proposed change to the import machinery will notice that the Foo module lacks a __path__, and will therefore try to build one before proceeding.
And it will do this by making a list of all the existing Foo/ subdirectories of the directories listed in sys.path.
If the list is empty, the import will fail with ImportError, just like today. But if the list is not empty, then it is saved in a new Foo.__path__ attribute, making the module a "virtual package".
That is, because it now has a valid __path__, we can proceed to import submodules or subpackages in the normal way.
Now, notice that this change does not affect "classic", self-contained packages that have an __init__ module in them. Such packages already have a __path__ attribute (initialized at import time) so the import machinery won't try to create another one later.
This means that (for example) the standard library email package will not be affected in any way by you having a bunch of unrelated directories named email on sys.path. (Even if they contain *.py files.)
But it does mean that if you want to turn your Foo module into a Foo package, all you have to do is add a Foo/ directory somewhere on sys.path, and start adding modules to it.
But what if you only want a "namespace package"? That is, a package that is only a namespace for various separately-distributed submodules and subpackages?
For example, if you're Zope Corporation, distributing dozens of separate tools like zc.buildout, each in packages under the zc namespace, you don't want to have to make and include an empty zc.py in every tool you ship. (And, if you're a Linux or other OS vendor, you don't want to deal with the package installation conflicts created by trying to install ten copies of zc.py to the same location!)
No problem. All we have to do is make one more minor tweak to the import process: if the "classic" import process fails to find a self-contained module or package (e.g., if import zc fails to find a zc.py or zc/__init__.py), then we once more try to build a __path__ by searching for all the zc/ directories on sys.path, and putting them in a list.
If this list is empty, we raise ImportError. But if it's non-empty, we create an empty zc module, and put the list in zc.__path__. Congratulations: zc is now a namespace-only, "pure virtual" package! It has no module contents, but you can still import submodules and subpackages from it, regardless of where they're located on sys.path.
(By the way, both of these additions to the import protocol (i.e. the dynamically-added __path__, and dynamically-created modules) apply recursively to child packages, using the parent package's __path__ in place of sys.path as a basis for generating a child __path__. This means that self-contained and virtual packages can contain each other without limitation, with the caveat that if you put a virtual package inside a self-contained one, it's gonna have a really short __path__!)
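The pure-virtual fallback described above can be sketched as follows (the helper name is illustrative; the real proposal hooks this into the import machinery rather than exposing a function):

```python
import os
import sys
import types

def make_pure_virtual(name, search_path=None):
    """Sketch of the proposed fallback: when neither name.py nor
    name/__init__.py exists, collect every name/ directory on the
    path; if any are found, create an empty module whose __path__
    is that list, making it a 'pure virtual' package."""
    entries = search_path if search_path is not None else sys.path
    path = [os.path.join(entry, name)
            for entry in entries
            if os.path.isdir(os.path.join(entry, name))]
    if not path:
        raise ImportError("no virtual package %r" % name)
    module = types.ModuleType(name)
    module.__path__ = path
    return module
```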
Backwards Compatibility and Performance
Notice that these two changes only affect import operations that today would result in ImportError. As a result, the performance of imports that do not involve virtual packages is unaffected, and potential backward compatibility issues are very restricted.
Today, if you try to import submodules or subpackages from a module with no __path__, it's an immediate error. And of course, if you don't have a zc.py or zc/__init__.py somewhere on sys.path today, import zc would likewise fail.
Thus, the only potential backwards-compatibility issues are:
Tools that expect package directories to have an __init__ module, that expect directories without an __init__ module to be unimportable, or that expect __path__ attributes to be static, will not recognize virtual packages as packages.
(In practice, this just means that tools will need updating to support virtual packages, e.g. by using pkgutil.walk_modules() instead of using hardcoded filesystem searches.)
Code that expects certain imports to fail may now do something unexpected. This should be fairly rare in practice, as most sane, non-test code does not import things that are expected not to exist!
The biggest likely exception to the above would be when a piece of code tries to check whether some package is installed by importing it. If this is done only by importing a top-level module (i.e., not checking for a __version__ or some other attribute), and there is a directory of the same name as the sought-for package on sys.path somewhere, and the package is not actually installed, then such code could be fooled into thinking a package is installed that really isn't.
For example, suppose someone writes a script (datagen.py) containing the following code:
try:
    import json
except ImportError:
    import simplejson as json
And runs it in a directory laid out like this:
datagen.py
json/
    foo.js
    bar.js
If import json succeeded due to the mere presence of the json/ subdirectory, the code would incorrectly believe that the json module was available, and proceed to fail with an error.
However, we can prevent corner cases like these from arising, simply by making one small change to the algorithm presented so far. Instead of allowing you to import a "pure virtual" package (like zc), we allow only importing of the contents of virtual packages.
That is, a statement like import zc should raise ImportError if there is no zc.py or zc/__init__.py on sys.path. But, doing import zc.buildout should still succeed, as long as there's a zc/buildout.py or zc/buildout/__init__.py on sys.path.
In other words, we don't allow pure virtual packages to be imported directly, only modules and self-contained packages. (This is an acceptable limitation, because there is no functional value to importing such a package by itself. After all, the module object will have no contents until you import at least one of its subpackages or submodules!)
Once zc.buildout has been successfully imported, though, there will be a zc module in sys.modules, and trying to import it will of course succeed. We are only preventing an initial import from succeeding, in order to prevent false-positive import successes when clashing subdirectories are present on sys.path.
So, with this slight change, the datagen.py example above will work correctly. When it does import json, the mere presence of a json/ directory will simply not affect the import process at all, even if it contains .py files. The json/ directory will still only be searched in the case where an import like import json.converter is attempted.
Meanwhile, tools that expect to locate packages and modules by walking a directory tree can be updated to use the existing pkgutil.walk_modules() API, and tools that need to inspect packages in memory should use the other APIs described in the Standard Library Changes/Additions section below.
Specification
A change is made to the existing import process, when importing names containing at least one . -- that is, imports of modules that have a parent package.
Specifically, if the parent package does not exist, or exists but lacks a __path__ attribute, an attempt is first made to create a "virtual path" for the parent package (following the algorithm described in the section on virtual paths, below).
If the computed "virtual path" is empty, an ImportError results, just as it would today. However, if a non-empty virtual path is obtained, the normal import of the submodule or subpackage proceeds, using that virtual path to find the submodule or subpackage. (Just as it would have with the parent's __path__, if the parent package had existed and had a __path__.)
When a submodule or subpackage is found (but not yet loaded), the parent package is created and added to sys.modules (if it didn't exist before), and its __path__ is set to the computed virtual path (if it wasn't already set).
In this way, when the actual loading of the submodule or subpackage occurs, it will see a parent package existing, and any relative imports will work correctly. However, if no submodule or subpackage exists, then the parent package will not be created, nor will a standalone module be converted into a package (by the addition of a spurious __path__ attribute).
Note, by the way, that this change must be applied recursively: that is, if foo and foo.bar are pure virtual packages, then import foo.bar.baz must wait until foo.bar.baz is found before creating module objects for both foo and foo.bar, and then create both of them together, properly setting the foo module's .bar attribute to point to the foo.bar module.
In this way, pure virtual packages are never directly importable: an import foo or import foo.bar by itself will fail, and the corresponding modules will not appear in sys.modules until they are needed to point to a successfully imported submodule or self-contained subpackage.
Virtual Paths
A virtual path is created by obtaining a PEP 302 "importer" object for each of the path entries found in sys.path (for a top-level module) or the parent __path__ (for a submodule).
(Note: because sys.meta_path importers are not associated with sys.path or __path__ entry strings, such importers do not participate in this process.)
Each importer is checked for a get_subpath() method, and if present, the method is called with the full name of the module/package the path is being constructed for. The return value is either a string representing a subdirectory for the requested package, or None if no such subdirectory exists.
The strings returned by the importers are added to the path list being built, in the same order as they are found. (None values and missing get_subpath() methods are simply skipped.)
The resulting list (whether empty or not) is then stored in a sys.virtual_package_paths dictionary, keyed by module name.
This dictionary has two purposes. First, it serves as a cache, in the event that more than one attempt is made to import a submodule of a virtual package.
Second, and more importantly, the dictionary can be used by code that extends sys.path at runtime to update imported packages' __path__ attributes accordingly. (See Standard Library Changes/Additions below for more details.)
In Python code, the virtual path construction algorithm would look something like this:
def get_virtual_path(modulename, parent_path=None):
    if modulename in sys.virtual_package_paths:
        return sys.virtual_package_paths[modulename]
    if parent_path is None:
        parent_path = sys.path
    path = []
    for entry in parent_path:
        # Obtain a PEP 302 importer object - see pkgutil module
        importer = pkgutil.get_importer(entry)
        if hasattr(importer, 'get_subpath'):
            subpath = importer.get_subpath(modulename)
            if subpath is not None:
                path.append(subpath)
    sys.virtual_package_paths[modulename] = path
    return path
And a function like this one should be exposed in the standard library as e.g. imp.get_virtual_path(), so that people creating __import__ replacements or sys.meta_path hooks can reuse it.
Standard Library Changes/Additions
The pkgutil module should be updated to handle this specification appropriately, including any necessary changes to extend_path(), iter_modules(), etc.
Specifically the proposed changes and additions to pkgutil are:
A new extend_virtual_paths(path_entry) function, to extend existing, already-imported virtual packages' __path__ attributes to include any portions found in a new sys.path entry. This function should be called by applications extending sys.path at runtime, e.g. when adding a plugin directory or an egg to the path.
The implementation of this function does a simple top-down traversal of sys.virtual_package_paths, and performs any necessary get_subpath() calls to identify what path entries need to be added to the virtual path for that package, given that path_entry has been added to sys.path. (Or, in the case of sub-packages, adding a derived subpath entry, based on their parent package's virtual path.)
(Note: this function must update both the path values in sys.virtual_package_paths as well as the __path__ attributes of any corresponding modules in sys.modules, even though in the common case they will both be the same list object.)
A new iter_virtual_packages(parent='') function to allow top-down traversal of virtual packages from sys.virtual_package_paths, by yielding the child virtual packages of parent. For example, calling iter_virtual_packages("zope") might yield zope.app and zope.products (if they are virtual packages listed in sys.virtual_package_paths), but not zope.foo.bar. (This function is needed to implement extend_virtual_paths(), but is also potentially useful for other code that needs to inspect imported virtual packages.)
ImpImporter.iter_modules() should be changed to also detect and yield the names of modules found in virtual packages.
In addition to the above changes, the zipimport importer should have its iter_modules() implementation similarly changed. (Note: current versions of Python implement this via a shim in pkgutil, so technically this is also a change to pkgutil.)
Last, but not least, the imp module (or importlib, if appropriate) should expose the algorithm described in the virtual paths section above, as a get_virtual_path(modulename, parent_path=None) function, so that creators of __import__ replacements can use it.
Implementation Notes
For users, developers, and distributors of virtual packages:
While virtual packages are easy to set up and use, there is still a time and place for using self-contained packages. While it's not strictly necessary, adding an __init__ module to your self-contained packages lets users of the package (and Python itself) know that all of the package's code will be found in that single subdirectory. In addition, it lets you define __all__, expose a public API, provide a package-level docstring, and do other things that make more sense for a self-contained project than for a mere "namespace" package.
sys.virtual_package_paths is allowed to contain entries for non-existent or not-yet-imported package names; code that uses its contents should not assume that every key in this dictionary is also present in sys.modules or that importing the name will necessarily succeed.
If you are changing a currently self-contained package into a virtual one, it's important to note that you can no longer use its __file__ attribute to locate data files stored in a package directory. Instead, you must search __path__ or use the __file__ of a submodule adjacent to the desired files, or of a self-contained subpackage that contains the desired files.
(Note: this caveat is already true for existing users of "namespace packages" today. That is, it is an inherent result of being able to partition a package, that you must know which partition the desired data file lives in. We mention it here simply so that new users converting from self-contained to virtual packages will also be aware of it.)
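The adjacent-submodule technique can be sketched like this (the helper name is illustrative, not a proposed API):

```python
import os

def data_file_near(submodule, filename):
    """Locate a data file stored next to a given submodule; useful
    when the parent package is virtual and therefore has no usable
    __file__ of its own."""
    return os.path.join(os.path.dirname(submodule.__file__), filename)
```

For example, data_file_near(email.message, 'message.py') resolves to a path inside the (self-contained) email package directory.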
XXX what is the __file__ of a "pure virtual" package? None? Some arbitrary string? The path of the first directory with a trailing separator? No matter what we put, some code is going to break, but the last choice might allow some code to accidentally work. Is that good or bad?
For those implementing PEP 302 importer objects:
Importers that support the iter_modules() method (used by pkgutil to locate importable modules and packages) and want to add virtual package support should modify their iter_modules() method so that it discovers and lists virtual packages as well as standard modules and packages. To do this, the importer should simply list all immediate subdirectory names in its jurisdiction that are valid Python identifiers.
XXX This might list a lot of not-really-packages. Should we require importable contents to exist? If so, how deep do we search, and how do we prevent e.g. link loops, or traversing onto different filesystems, etc.? Ick. Also, if virtual packages are listed, they still can't be imported, which is a problem for the way that pkgutil.walk_modules() is currently implemented.
"Meta" importers (i.e., importers placed on sys.meta_path) do not need to implement get_subpath(), because the method is only called on importers corresponding to sys.path entries and __path__ entries. If a meta importer wishes to support virtual packages, it must do so entirely within its own find_module() implementation.
Unfortunately, it is unlikely that any such implementation will be able to merge its package subpaths with those of other meta importers or sys.path importers, so the meaning of "supporting virtual packages" for a meta importer is currently undefined!
(However, since the intended use case for meta importers is to replace Python's normal import process entirely for some subset of modules, and the number of such importers currently implemented is quite small, this seems unlikely to be a big issue in practice.)
References
| [1] | "namespace" vs "module" packages (mailing list thread) (http://mail.zope.org/pipermail/zope3-dev/2002-December/004251.html) |
| [2] | "Dropping __init__.py requirement for subpackages" (http://mail.python.org/pipermail/python-dev/2006-April/064400.html) |
| [3] | Namespace Packages resolution (http://mail.python.org/pipermail/import-sig/2012-March/000421.html) |
Copyright
This document has been placed in the public domain.
pep-0403 General purpose decorator clause (aka "@in" clause)
| PEP: | 403 |
|---|---|
| Title: | General purpose decorator clause (aka "@in" clause) |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Nick Coghlan <ncoghlan at gmail.com> |
| Status: | Deferred |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 2011-10-13 |
| Python-Version: | 3.4 |
| Post-History: | 2011-10-13 |
| Resolution: | TBD |
Contents
Abstract
This PEP proposes the addition of a new @in decorator clause that makes it possible to override the name binding step of a function or class definition.
The new clause accepts a single simple statement that can make a forward reference to the trailing function or class definition.
This new clause is designed to be used whenever a "one-shot" function or class is needed, and placing the function or class definition before the statement that uses it actually makes the code harder to read. It also avoids any name shadowing concerns by making sure the new name is visible only to the statement in the @in clause.
This PEP is based heavily on many of the ideas in PEP 3150 (Statement Local Namespaces) so some elements of the rationale will be familiar to readers of that PEP. Both PEPs remain deferred for the time being, primarily due to the lack of compelling real world use cases in either PEP.
Basic Examples
Before diving into the long history of this problem and the detailed rationale for this specific proposed solution, here are a few simple examples of the kind of code it is designed to simplify.
As a trivial example, a weakref callback could be defined as follows:
@in x = weakref.ref(target, report_destruction)
def report_destruction(obj):
    print("{} is being destroyed".format(obj))
This contrasts with the current (conceptually) "out of order" syntax for this operation:
def report_destruction(obj):
    print("{} is being destroyed".format(obj))

x = weakref.ref(target, report_destruction)
That structure is OK when you're using the callable multiple times, but it's irritating to be forced into it for one-off operations.
If the repetition of the name seems especially annoying, then a throwaway name like f can be used instead:
@in x = weakref.ref(target, f)
def f(obj):
    print("{} is being destroyed".format(obj))
Similarly, a sorted operation on a particularly poorly defined type could now be defined as:
@in sorted_list = sorted(original, key=f)
def f(item):
    try:
        return item.calc_sort_order()
    except NotSortableError:
        return float('inf')
Rather than:
def force_sort(item):
    try:
        return item.calc_sort_order()
    except NotSortableError:
        return float('inf')

sorted_list = sorted(original, key=force_sort)
And early binding semantics in a list comprehension could be attained via:
@in funcs = [adder(i) for i in range(10)]
def adder(i):
    return lambda x: x + i
Proposal
This PEP proposes the addition of a new @in clause that is a variant of the existing class and function decorator syntax.
The new @in clause precedes the decorator lines, and allows forward references to the trailing function or class definition.
The trailing function or class definition is always named - the name of the trailing definition is then used to make the forward reference from the @in clause.
The @in clause is allowed to contain any simple statement (including those that don't make any sense in that context, such as pass - while such code would be legal, there wouldn't be any point in writing it). This permissive structure is easier to define and easier to explain, but a more restrictive approach that only permits operations that "make sense" would also be possible (see PEP 3150 for a list of possible candidates).
The @in clause will not create a new scope - all name binding operations aside from the trailing function or class definition will affect the containing scope.
The name used in the trailing function or class definition is only visible from the associated @in clause, and behaves as if it was an ordinary variable defined in that scope. If any nested scopes are created in either the @in clause or the trailing function or class definition, those scopes will see the trailing function or class definition rather than any other bindings for that name in the containing scope.
In a very real sense, this proposal is about making it possible to override the implicit "name = <defined function or class>" name binding operation that is part of every function or class definition, specifically in those cases where the local name binding isn't actually needed.
Under this PEP, an ordinary class or function definition:
@deco2
@deco1
def name():
    ...
can be explained as being roughly equivalent to:
@in name = deco2(deco1(name))
def name():
    ...
Syntax Change
Syntactically, only one new grammar rule is needed:
in_stmt: '@in' simple_stmt decorated
Grammar: http://hg.python.org/cpython/file/default/Grammar/Grammar
Design Discussion
Background
The question of "multi-line lambdas" has been a vexing one for many Python users for a very long time, and it took an exploration of Ruby's block functionality for me to finally understand why this bugs people so much: Python's demand that the function be named and introduced before the operation that needs it breaks the developer's flow of thought. They get to a point where they go "I need a one-shot operation that does <X>", and instead of being able to just say that directly, they instead have to back up, name a function to do <X>, then call that function from the operation they actually wanted to do in the first place. Lambda expressions can help sometimes, but they're no substitute for being able to use a full suite.
Ruby's block syntax also heavily inspired the style of the solution in this PEP, by making it clear that even when limited to one anonymous function per statement, anonymous functions could still be incredibly useful. Consider how many constructs Python has where one expression is responsible for the bulk of the heavy lifting:
- comprehensions, generator expressions, map(), filter()
- key arguments to sorted(), min(), max()
- partial function application
- provision of callbacks (e.g. for weak references or asynchronous IO)
- array broadcast operations in NumPy
However, adopting Ruby's block syntax directly won't work for Python, since the effectiveness of Ruby's blocks relies heavily on various conventions in the way functions are defined (specifically, using Ruby's yield syntax to call blocks directly and the &arg mechanism to accept a block as a function's final argument).
Since Python has relied on named functions for so long, the signatures of APIs that accept callbacks are far more diverse, thus requiring a solution that allows one-shot functions to be slotted in at the appropriate location.
The approach taken in this PEP is to retain the requirement to name the function explicitly, but allow the relative order of the definition and the statement that references it to be changed to match the developer's flow of thought. The rationale is essentially the same as that used when introducing decorators, but covering a broader set of applications.
Relation to PEP 3150
PEP 3150 (Statement Local Namespaces) describes its primary motivation as being to elevate ordinary assignment statements to be on par with class and def statements where the name of the item to be defined is presented to the reader in advance of the details of how the value of that item is calculated. This PEP achieves the same goal in a different way, by allowing the simple name binding of a standard function definition to be replaced with something else (like assigning the result of the function to a value).
Despite having the same author, the two PEPs are in direct competition with each other. PEP 403 represents a minimalist approach that attempts to achieve useful functionality with a minimum of change from the status quo. PEP 3150 instead aims for a more flexible standalone statement design, which requires a larger degree of change to the language.
Note that where PEP 403 is better suited to explaining the behaviour of generator expressions correctly, PEP 3150 is better able to explain the behaviour of decorator clauses in general. Both PEPs support adequate explanations for the semantics of container comprehensions.
Keyword Choice
The proposal definitely requires some kind of prefix to avoid parsing ambiguity and backwards compatibility problems with existing constructs. It also needs to be clearly highlighted to readers, since it declares that the following piece of code is going to be executed only after the trailing function or class definition has been executed.
The in keyword was chosen as an existing keyword that can be used to denote the concept of a forward reference.
The @ prefix was included in order to exploit the fact that Python programmers are already used to decorator syntax as an indication of out of order execution, where the function or class is actually defined first and then decorators are applied in reverse order.
For functions, the construct is intended to be read as "in <this statement that references NAME> define NAME as a function that does <operation>".
The mapping to English prose isn't as obvious for the class definition case, but the concept remains the same.
Better Debugging Support for Functions and Classes with Short Names
One of the objections to widespread use of lambda expressions is that they have a negative effect on traceback intelligibility and other aspects of introspection. Similar objections are raised regarding constructs that promote short, cryptic function names (including this one, which requires that the name of the trailing definition be supplied at least twice, encouraging the use of shorthand placeholder names like f).
However, the introduction of qualified names in PEP 3155 means that even anonymous classes and functions will now have different representations if they occur in different scopes. For example:
>>> def f():
...     return lambda: y
...
>>> f()
<function f.<locals>.<lambda> at 0x7f6f46faeae0>
Anonymous functions (or functions that share a name) within the same scope will still share representations (aside from the object ID), but this is still a major improvement over the historical situation where everything except the object ID was identical.
Possible Implementation Strategy
This proposal has at least one titanic advantage over PEP 3150: implementation should be relatively straightforward.
The @in clause will be included in the AST for the associated function or class definition and the statement that references it. When the @in clause is present, it will be emitted in place of the local name binding operation normally implied by a function or class definition.
The one potentially tricky part is changing the meaning of the references to the statement local function or namespace while within the scope of the in statement, but that shouldn't be too hard to address by maintaining some additional state within the compiler (it's much easier to handle this for a single name than it is for an unknown number of names in a full nested suite).
Explaining Container Comprehensions and Generator Expressions
One interesting feature of the proposed construct is that it can be used as a primitive to explain the scoping and execution order semantics of both generator expressions and container comprehensions:
seq2 = [x for y in seq if p(y) for x in y if q(x)]

# would be equivalent to

@in seq2 = f(seq)
def f(seq):
    result = []
    for y in seq:
        if p(y):
            for x in y:
                if q(x):
                    result.append(x)
    return result
The important point in this expansion is that it explains why comprehensions appear to misbehave at class scope: only the outermost iterator is evaluated at class scope, while all predicates, nested iterators and value expressions are evaluated inside a nested scope.
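The class scope behaviour that this expansion explains can be demonstrated in current Python, without any new syntax (the class and variable names here are purely illustrative):

```python
class C:
    seq = [1, 2, 3]
    # Works: the outermost iterable ("seq") is evaluated at class scope.
    doubled = [x * 2 for x in seq]

assert C.doubled == [2, 4, 6]

# Fails: the second reference to "seq" is evaluated inside the hidden
# function scope, which cannot see class-level names.
try:
    class D:
        seq = [1, 2, 3]
        pairs = [(x, y) for x in seq for y in seq]
    inner_seq_visible = True
except NameError:
    inner_seq_visible = False

assert inner_seq_visible is False
```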
An equivalent expansion is possible for generator expressions:
gen = (x for y in seq if p(y) for x in y if q(x))

# would be equivalent to

@in gen = g(seq)
def g(seq):
    for y in seq:
        if p(y):
            for x in y:
                if q(x):
                    yield x
More Examples
Calculating attributes without polluting the local namespace (from os.py):
# Current Python (manual namespace cleanup)
def _createenviron():
    ... # 27 line function

environ = _createenviron()
del _createenviron

# Becomes:
@in environ = _createenviron()
def _createenviron():
    ... # 27 line function
Loop early binding:
# Current Python (default argument hack)
funcs = [(lambda x, i=i: x + i) for i in range(10)]

# Becomes:
@in funcs = [adder(i) for i in range(10)]
def adder(i):
    return lambda x: x + i

# Or even:
@in funcs = [adder(i) for i in range(10)]
def adder(i):
    @in return incr
    def incr(x):
        return x + i
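The problem the default argument hack solves is observable in current Python; without it, every closure created in the loop sees the final value of the loop variable:

```python
# Late binding: all ten lambdas share the same "i", which is 9 by the
# time any of them is called.
late = [(lambda x: x + i) for i in range(10)]

# Default argument hack: each lambda captures the current "i" as a default.
early = [(lambda x, i=i: x + i) for i in range(10)]

assert late[0](0) == 9    # every closure sees the final i
assert late[3](0) == 9
assert early[0](0) == 0   # each closure captured its own i
assert early[3](0) == 3
```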
A trailing class can be used as a statement local namespace:
# Evaluate subexpressions only once
@in c = math.sqrt(x.a*x.a + x.b*x.b)
class x:
    a = calculate_a()
    b = calculate_b()
A function can be bound directly to a location which isn't a valid identifier:
@in dispatch[MyClass] = f
def f():
    ...
Constructs that verge on decorator abuse can be eliminated:
# Current Python
@call
def f():
    ...

# Becomes:
@in f()
def f():
    ...
Reference Implementation
None as yet.
Acknowledgements
Huge thanks to Gary Bernhardt for being blunt in pointing out that I had no idea what I was talking about in criticising Ruby's blocks, kicking off a rather enlightening process of investigation.
Rejected Concepts
To avoid retreading previously covered ground, some rejected alternatives are documented in this section.
Omitting the decorator prefix character
Earlier versions of this proposal omitted the @ prefix. However, without that prefix, the bare in keyword didn't associate the clause strongly enough with the subsequent function or class definition. Reusing the decorator prefix and explicitly characterising the new construct as a kind of decorator clause is intended to help users link the two concepts and see them as two variants of the same idea.
Anonymous Forward References
A previous incarnation of this PEP (see [1]) proposed a syntax where the new clause was introduced with : and the forward reference was written using @. Feedback on this variant was almost universally negative, as it was considered both ugly and excessively magical:
:x = weakref.ref(target, @)
def report_destruction(obj):
    print("{} is being destroyed".format(obj))
A more recent variant always used ... for forward references, along with genuinely anonymous function and class definitions. However, this degenerated quickly into a mass of unintelligible dots in more complex cases:
in funcs = [...(i) for i in range(10)]
def ...(i):
    in return ...
    def ...(x):
        return x + i

in c = math.sqrt(....a*....a + ....b*....b)
class ...:
    a = calculate_a()
    b = calculate_b()
Using a nested suite
The problems with using a full nested suite are best described in PEP 3150. It's comparatively difficult to implement properly, the scoping semantics are harder to explain and it creates quite a few situations where there are two ways to do it without clear guidelines for choosing between them (as almost any construct that can be expressed with ordinary imperative code could instead be expressed using a given statement). While the PEP does propose some new PEP 8 guidelines to help address that last problem, the difficulties in implementation are not so easily dealt with.
By contrast, the decorator inspired syntax in this PEP explicitly limits the new feature to cases where it should actually improve readability, rather than harming it. As in the case of the original introduction of decorators, the idea of this new syntax is that if it can be used (i.e. the local name binding of the function is completely unnecessary) then it probably should be used.
Another possible variant of this idea is to keep the decorator based semantics of this PEP, while adopting the prettier syntax from PEP 3150:
x = weakref.ref(target, report_destruction) given:
    def report_destruction(obj):
        print("{} is being destroyed".format(obj))
There are a couple of problems with this approach. The main issue is that this syntax variant uses something that looks like a suite, but really isn't one. A secondary concern is that it's not clear how the compiler will know which name(s) in the leading expression are forward references (although that could potentially be addressed through a suitable definition of the suite-that-is-not-a-suite in the language grammar).
However, a nested suite has not yet been ruled out completely. The latest version of PEP 3150 uses explicit forward reference and name binding schemes that greatly simplify the semantics of the statement, and it does offer the advantage of allowing the definition of arbitrary subexpressions rather than being restricted to a single function or class definition.
References
| [1] | Start of python-ideas thread: http://mail.python.org/pipermail/python-ideas/2011-October/012276.html |
Copyright
This document has been placed in the public domain.
pep-0404 Python 2.8 Un-release Schedule
| PEP: | 404 |
|---|---|
| Title: | Python 2.8 Un-release Schedule |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Barry Warsaw <barry at python.org> |
| Status: | Final |
| Type: | Informational |
| Content-Type: | text/x-rst |
| Created: | 2011-11-09 |
| Python-Version: | 2.8 |
Contents
Abstract
This document describes the un-development and un-release schedule for Python 2.8.
Un-release Manager and Crew
| Position | Name |
|---|---|
| 2.8 Un-release Manager | Cardinal Biggles |
Official pronouncement
Rule number six: there is no official Python 2.8 release. There never will be an official Python 2.8 release. It is an ex-release. Python 2.7 is the end of the Python 2 line of development.
Upgrade path
The official upgrade path from Python 2.7 is to Python 3.
And Now For Something Completely Different
In all seriousness, there are important reasons why there won't be an official Python 2.8 release, and why you should plan to migrate instead to Python 3.
Python is (as of this writing) more than 20 years old, and Guido and the community have learned a lot in those intervening years. Guido's original concept for Python 3 was to make changes to the language primarily to remove the warts that had grown in the preceding versions. Python 3 was not to be a complete redesign, but instead an evolution of the language, and while maintaining full backward compatibility with Python 2 was explicitly off-the-table, neither were gratuitous changes in syntax or semantics acceptable. In most cases, Python 2 code can be translated fairly easily to Python 3, sometimes entirely mechanically by such tools as 2to3 [1] (there's also a non-trivial subset of the language that will run without modification on both 2.7 and 3.x).
Because maintaining multiple versions of Python is a significant drag on the resources of the Python developers, and because the improvements to the language and libraries embodied in Python 3 are so important, it was decided to end the Python 2 lineage with Python 2.7. Thus, all new development occurs in the Python 3 line of development, and there will never be an official Python 2.8 release. Python 2.7 will however be maintained for longer than the usual period of time.
Here are some highlights of the significant improvements in Python 3. You can read in more detail on the differences [2] between Python 2 and Python 3. There are also many good guides on porting [3] from Python 2 to Python 3.
Strings and bytes
Python 2's basic original strings are called 8-bit strings, and they play a dual role in Python 2 as both ASCII text and as byte sequences. While Python 2 also has a unicode string type, the fundamental ambiguity of the core string type, coupled with Python 2's default behavior of supporting automatic coercion from 8-bit strings to unicode objects when the two are combined, often leads to UnicodeErrors. Python 3's standard string type is Unicode based, and Python 3 adds a dedicated bytes type, but critically, no automatic coercion between bytes and unicode strings is provided. The closest the language gets to implicit coercion are a few text-based APIs that assume a default encoding (usually UTF-8) if no encoding is explicitly stated. Thus, the core interpreter, its I/O libraries, module names, etc. are clear in their distinction between unicode strings and bytes. Python 3's unicode support even extends to the filesystem, so that non-ASCII file names are natively supported.
This string/bytes clarity is often a source of difficulty in transitioning existing code to Python 3, because many third party libraries and applications are themselves ambiguous in this distinction. Once migrated though, most UnicodeErrors can be eliminated.
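The strict separation described above is easy to observe in any Python 3 interpreter; the explicit encode/decode step replaces Python 2's implicit coercion:

```python
# Python 3 keeps text (str) and binary data (bytes) strictly separate.
text = "café"                    # str: a sequence of Unicode code points
data = text.encode("utf-8")      # bytes: produced by an explicit encoding step

assert isinstance(data, bytes)
assert data.decode("utf-8") == text

# Combining the two raises TypeError; there is no automatic coercion.
try:
    combined = text + data
except TypeError:
    combined = None

assert combined is None
```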
Numbers
Python 2 has two basic integer types, a native machine-sized int type, and an arbitrary length long type. These have been merged in Python 3 into a single int type analogous to Python 2's long type.
In addition, integer division now produces floating point numbers for non-integer results.
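Both changes can be demonstrated in a couple of lines of Python 3:

```python
# A single arbitrary-precision int type (Python 2's long, renamed).
assert isinstance(10 ** 100, int)

# True division returns a float; floor division is spelled //.
assert 7 / 2 == 3.5
assert 7 // 2 == 3
```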
Classes
Python 2 has two core class hierarchies, often called classic classes and new-style classes. The latter allow for such things as inheriting from the builtin basic types, support descriptor based tools like the property builtin and provide a generally more sane and coherent system for dealing with multiple inheritance. Python 3 provided the opportunity to completely drop support for classic classes, so all classes in Python 3 automatically use the new-style semantics (although that's a misnomer now). There is no need to explicitly inherit from object or set the default metatype to enable them (in fact, setting a default metatype at the module level is no longer supported - the default metatype is always object).
The mechanism for explicitly specifying a metaclass has also changed to use a metaclass keyword argument in the class header line rather than a __metaclass__ magic attribute in the class body.
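The new spelling looks like this in Python 3 (the metaclass itself is an illustrative example, not part of the PEP):

```python
# A toy metaclass that tags every class it creates.
class Meta(type):
    def __new__(mcls, name, bases, namespace):
        namespace["tagged"] = True
        return super().__new__(mcls, name, bases, namespace)

# Python 3 spelling: a metaclass keyword argument in the class header,
# rather than a __metaclass__ attribute in the class body.
class Example(metaclass=Meta):
    pass

assert Example.tagged is True
assert type(Example) is Meta
```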
Multiple spellings
There are many cases in Python 2 where multiple spellings of some constructs exist, such as repr() and backticks, or the two inequality operators != and <>. In all cases, Python 3 has chosen exactly one spelling and removed the other (e.g. repr() and != were kept).
Imports
In Python 3, implicit relative imports within packages are no longer available - only absolute imports and explicit relative imports are supported. In addition, star imports (e.g. from x import *) are only permitted in module level code.
Also, some areas of the standard library have been reorganized to make the naming scheme more intuitive. Some rarely used builtins have been relocated to standard library modules.
Iterators and views
Many APIs, which in Python 2 returned concrete lists, in Python 3 now return iterators or lightweight views.
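For example, dict.keys() now returns a live view and map() returns a one-shot iterator:

```python
d = {"a": 1, "b": 2}
keys = d.keys()                 # a view in Python 3, not a list
assert not isinstance(keys, list)

d["c"] = 3
assert "c" in keys              # views reflect later changes to the dict

m = map(abs, [-1, -2, 3])       # map() returns an iterator, not a list
assert not isinstance(m, list)
assert list(m) == [1, 2, 3]
assert list(m) == []            # iterators are exhausted after one pass
```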
References
| [1] | http://docs.python.org/library/2to3.html |
| [2] | http://docs.python.org/release/3.0.1/whatsnew/3.0.html |
| [3] | http://python3porting.com/ |
Copyright
This document has been placed in the public domain.
pep-0405 Python Virtual Environments
| PEP: | 405 |
|---|---|
| Title: | Python Virtual Environments |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Carl Meyer <carl at oddbird.net> |
| BDFL-Delegate: | Nick Coghlan |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 13-Jun-2011 |
| Python-Version: | 3.3 |
| Post-History: | 24-Oct-2011, 28-Oct-2011, 06-Mar-2012, 24-May-2012 |
| Resolution: | http://mail.python.org/pipermail/python-dev/2012-May/119668.html |
Contents
Abstract
This PEP proposes to add to Python a mechanism for lightweight "virtual environments" with their own site directories, optionally isolated from system site directories. Each virtual environment has its own Python binary (allowing creation of environments with various Python versions) and can have its own independent set of installed Python packages in its site directories, but shares the standard library with the base installed Python.
Motivation
The utility of Python virtual environments has already been well established by the popularity of existing third-party virtual-environment tools, primarily Ian Bicking's virtualenv [1]. Virtual environments are already widely used for dependency management and isolation, ease of installing and using Python packages without system-administrator access, and automated testing of Python software across multiple Python versions, among other uses.
Existing virtual environment tools suffer from lack of support from the behavior of Python itself. Tools such as rvirtualenv [2], which do not copy the Python binary into the virtual environment, cannot provide reliable isolation from system site directories. Virtualenv, which does copy the Python binary, is forced to duplicate much of Python's site module and manually symlink/copy an ever-changing set of standard-library modules into the virtual environment in order to perform a delicate boot-strapping dance at every startup. (Virtualenv must copy the binary in order to provide isolation, as Python dereferences a symlinked executable before searching for sys.prefix.)
The PYTHONHOME environment variable, Python's only existing built-in solution for virtual environments, requires copying/symlinking the entire standard library into every environment. Copying the whole standard library is not a lightweight solution, and cross-platform support for symlinks remains inconsistent (even on Windows platforms that do support them, creating them often requires administrator privileges).
A virtual environment mechanism integrated with Python and drawing on years of experience with existing third-party tools can lower maintenance, raise reliability, and be more easily available to all Python users.
Specification
When the Python binary is executed, it attempts to determine its prefix (which it stores in sys.prefix), which is then used to find the standard library and other key files, and by the site module to determine the location of the site-package directories. Currently the prefix is found (assuming PYTHONHOME is not set) by first walking up the filesystem tree looking for a marker file (os.py) that signifies the presence of the standard library, and if none is found, falling back to the build-time prefix hardcoded in the binary.
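The marker-file search can be sketched as follows. This is a simplified, hypothetical model of the behaviour described above (the real logic lives in CPython's C-level startup code and is considerably more involved); `find_prefix`, the `landmark` default, and the injected `exists` predicate are all illustrative:

```python
import posixpath

def find_prefix(executable, landmark="lib/python3.3/os.py",
                exists=lambda path: False):
    # Walk up from the directory containing the executable, looking for
    # the stdlib marker file; return the first directory that has it.
    path = posixpath.dirname(executable)
    while path not in ("/", ""):
        if exists(posixpath.join(path, landmark)):
            return path
        path = posixpath.dirname(path)
    return None          # caller falls back to the build-time prefix

marker = "/opt/py/lib/python3.3/os.py"
assert find_prefix("/opt/py/bin/python3",
                   exists=lambda p: p == marker) == "/opt/py"
assert find_prefix("/opt/py/bin/python3") is None
```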
This PEP proposes to add a new first step to this search. If a pyvenv.cfg file is found either adjacent to the Python executable or one directory above it (if the executable is a symlink, it is not dereferenced), this file is scanned for lines of the form key = value. If a home key is found, this signifies that the Python binary belongs to a virtual environment, and the value of the home key is the directory containing the Python executable used to create this virtual environment.
In this case, prefix-finding continues as normal using the value of the home key as the effective Python binary location, which finds the prefix of the base installation. sys.base_prefix is set to this value, while sys.prefix is set to the directory containing pyvenv.cfg.
(If pyvenv.cfg is not found or does not contain the home key, prefix-finding continues normally, and sys.prefix will be equal to sys.base_prefix.)
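The key = value scan described above is simple enough to sketch in a few lines; this is an illustrative parser, not the actual CPython implementation:

```python
def parse_pyvenv_cfg(text):
    # Scan lines of the form "key = value", as described in the spec;
    # a "home" key marks the file as belonging to a virtual environment.
    config = {}
    for line in text.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            config[key.strip()] = value.strip()
    return config

cfg = parse_pyvenv_cfg(
    "home = /usr/bin\n"
    "include-system-site-packages = false\n"
)
assert cfg["home"] == "/usr/bin"
assert cfg["include-system-site-packages"] == "false"
```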
Also, sys.base_exec_prefix is added, and handled similarly with regard to sys.exec_prefix. (sys.exec_prefix is the equivalent of sys.prefix, but for platform-specific files; by default it has the same value as sys.prefix.)
The site and sysconfig standard-library modules are modified such that the standard library and header files are found relative to sys.base_prefix / sys.base_exec_prefix, while site-package directories ("purelib" and "platlib", in sysconfig terms) are still found relative to sys.prefix / sys.exec_prefix.
Thus, a Python virtual environment in its simplest form would consist of nothing more than a copy or symlink of the Python binary accompanied by a pyvenv.cfg file and a site-packages directory.
Isolation from system site-packages
By default, a virtual environment is entirely isolated from the system-level site-packages directories.
If the pyvenv.cfg file also contains a key include-system-site-packages with a value of true (not case sensitive), the site module will also add the system site directories to sys.path after the virtual environment site directories. Thus system-installed packages will still be importable, but a package of the same name installed in the virtual environment will take precedence.
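The resulting precedence can be modelled as a toy path construction (the directory names and helper function are illustrative only):

```python
# Venv site directories always come first, so a venv-installed package
# shadows a system-installed package of the same name.
VENV_SITES = ["/venv/lib/python3.3/site-packages"]
SYSTEM_SITES = ["/usr/lib/python3.3/site-packages"]

def build_site_path(include_system_site_packages):
    path = list(VENV_SITES)
    if include_system_site_packages:
        path += SYSTEM_SITES     # appended after the venv directories
    return path

assert build_site_path(False) == VENV_SITES
assert build_site_path(True) == VENV_SITES + SYSTEM_SITES
```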
PEP 370 user-level site-packages are considered part of the system site-packages for venv purposes: they are not available from an isolated venv, but are available from an include-system-site-packages = true venv.
Creating virtual environments
This PEP also proposes adding a new venv module to the standard library which implements the creation of virtual environments. This module can be executed using the -m flag:
python3 -m venv /path/to/new/virtual/environment
A pyvenv installed script is also provided to make this more convenient:
pyvenv /path/to/new/virtual/environment
Running this command creates the target directory (creating any parent directories that don't exist already) and places a pyvenv.cfg file in it with a home key pointing to the Python installation the command was run from. It also creates a bin/ (or Scripts on Windows) subdirectory containing a copy (or symlink) of the python3 executable, and the pysetup3 script from the packaging standard library module (to facilitate easy installation of packages from PyPI into the new venv). And it creates an (initially empty) lib/pythonX.Y/site-packages (or Lib\site-packages on Windows) subdirectory.
If the target directory already exists an error will be raised, unless the --clear option was provided, in which case the target directory will be deleted and virtual environment creation will proceed as usual.
The created pyvenv.cfg file also includes the include-system-site-packages key, set to true if pyvenv is run with the --system-site-packages option, false by default.
Multiple paths can be given to pyvenv, in which case an identical venv will be created, according to the given options, at each provided path.
The venv module also places "shell activation scripts" for POSIX and Windows systems in the bin or Scripts directory of the venv. These scripts simply add the virtual environment's bin (or Scripts) directory to the front of the user's shell PATH. This is not strictly necessary for use of a virtual environment (as an explicit path to the venv's python binary or scripts can just as well be used), but it is convenient.
In order to allow pysetup and other Python package managers to install packages into the virtual environment the same way they would install into a normal Python installation, and avoid special-casing virtual environments in sysconfig beyond using sys.base_prefix in place of sys.prefix where appropriate, the internal virtual environment layout mimics the layout of the Python installation itself on each platform. So a typical virtual environment layout on a POSIX system would be:
pyvenv.cfg
bin/python3
bin/python
bin/pysetup3
include/
lib/python3.3/site-packages/
While on a Windows system:
pyvenv.cfg
Scripts/python.exe
Scripts/python3.dll
Scripts/pysetup3.exe
Scripts/pysetup3-script.py
... other DLLs and pyds...
Include/
Lib/site-packages/
Third-party packages installed into the virtual environment will have their Python modules placed in the site-packages directory, and their executables placed in bin/ or Scripts.
Note
On a normal Windows system-level installation, the Python binary itself wouldn't go inside the "Scripts/" subdirectory, as it does in the default venv layout. This is useful in a virtual environment so that a user only has to add a single directory to their shell PATH in order to effectively "activate" the virtual environment.
Note
On Windows, it is necessary to also copy or symlink DLLs and pyd files from compiled stdlib modules into the env, because if the venv is created from a non-system-wide Python installation, Windows won't be able to find the Python installation's copies of those files when Python is run from the venv.
Sysconfig install schemes and user-site
This approach explicitly chooses not to introduce a new sysconfig install scheme for venvs. Rather, by modifying sys.prefix we ensure that existing install schemes which base locations on sys.prefix will simply work in a venv. Installation to other install schemes (for instance, the user-site schemes) whose paths are not relative to sys.prefix, will not be affected by a venv at all.
It may be feasible to create an alternative implementation of Python virtual environments based on a virtual-specific sysconfig scheme, but it would be less robust, as it would require more code to be aware of whether it is operating within a virtual environment or not.
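The reliance on sys.prefix-relative schemes can be observed directly with the sysconfig module (a minimal sketch; the exact paths vary by platform and install scheme):

```python
import sysconfig

# The default install schemes compute locations from sys.prefix, so
# when a venv redirects sys.prefix, these paths land inside the venv
# with no venv-specific scheme needed.
purelib = sysconfig.get_path("purelib")  # e.g. .../lib/python3.X/site-packages
stdlib = sysconfig.get_path("stdlib")
print(purelib)
print(stdlib)
```

In a venv, purelib resolves under the venv directory, while an implementation following this PEP resolves stdlib against the base installation.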
Copies versus symlinks
The technique in this PEP works equally well in general with a copied or symlinked Python binary (and other needed DLLs on Windows). Symlinking is preferable where possible, because in the case of an upgrade to the underlying Python installation, a Python executable copied in a venv might become out-of-sync with the installed standard library and require manual upgrade.
There are some cross-platform difficulties with symlinks:
- Not all Windows versions support symlinks, and even on those that do, creating them often requires administrator privileges.
- On OS X framework builds of Python, sys.executable is just a stub that executes the real Python binary. Symlinking this stub does not work; it must be copied. (Fortunately the stub is also small, and not changed by bugfix upgrades to Python, so copying it is not an issue).
Thus, this PEP proposes to symlink the binary on all platforms except Windows and OS X framework builds. A --symlink option is available to force the use of symlinks on Windows versions that support them, if the appropriate permissions are available. (This option has no effect on OS X framework builds, since symlinking can never work there and has no advantages).
On Windows, if --symlink is not used, this means that if the underlying Python installation is upgraded, the Python binary and DLLs in the venv should be updated, or there could be issues of mismatch with the upgraded standard library. The pyvenv script accepts a --upgrade option for easily performing this upgrade on an existing venv.
Include files
Current virtualenv handles include files in this way:
On POSIX systems where the installed Python's include files are found in ${base_prefix}/include/pythonX.X, virtualenv creates ${venv}/include/ and symlinks ${base_prefix}/include/pythonX.X to ${venv}/include/pythonX.X. On Windows, where Python's include files are found in ${base_prefix}/Include and symlinks are not reliably available, virtualenv copies ${base_prefix}/Include to ${venv}/Include. This ensures that extension modules built and installed within the virtualenv will always find the Python header files they need in the expected location relative to sys.prefix.
This solution is not ideal when an extension module installs its own header files, as the default installation location for those header files may be a symlink to a system directory that may not be writable. One installer, pip, explicitly works around this by installing header files to a nonstandard location ${venv}/include/site/pythonX.X/, as in Python there's currently no standard abstraction for a site-specific include directory.
This PEP proposes a slightly different approach, though one with essentially the same effect and the same set of advantages and disadvantages. Rather than symlinking or copying include files into the venv, we simply modify the sysconfig schemes so that header files are always sought relative to base_prefix rather than prefix. (We also create an include/ directory within the venv, so installers have somewhere to put include files installed within the env).
Better handling of include files in distutils/packaging and, by extension, pyvenv, is an area that may deserve its own future PEP. For now, we note that the behavior of virtualenv has thus far proved itself to be at least "good enough" in practice.
API
The high-level method described above makes use of a simple API which provides mechanisms for third-party virtual environment creators to customize environment creation according to their needs.
The venv module contains an EnvBuilder class which accepts the following keyword arguments on instantiation:
- system_site_packages - A Boolean value indicating that the system Python site-packages should be available to the environment. Defaults to False.
- clear - A Boolean value which, if true, will delete any existing target directory instead of raising an exception. Defaults to False.
- symlinks - A Boolean value indicating whether to attempt to symlink the Python binary (and any necessary DLLs or other binaries, e.g. pythonw.exe), rather than copying. Defaults to False.
The instantiated env-builder has a create method, which takes as required argument the path (absolute or relative to the current directory) of the target directory which is to contain the virtual environment. The create method either creates the environment in the specified directory, or raises an appropriate exception.
The venv module also provides a module-level create function as a convenience:
def create(env_dir,
           system_site_packages=False, clear=False, symlinks=False):
    builder = EnvBuilder(
        system_site_packages=system_site_packages,
        clear=clear,
        symlinks=symlinks)
builder.create(env_dir)
Creators of third-party virtual environment tools are free to use the provided EnvBuilder class as a base class.
The create method of the EnvBuilder class illustrates the hooks available for customization:
def create(self, env_dir):
"""
Create a virtualized Python environment in a directory.
:param env_dir: The target directory to create an environment in.
"""
env_dir = os.path.abspath(env_dir)
context = self.create_directories(env_dir)
self.create_configuration(context)
self.setup_python(context)
self.post_setup(context)
Each of the methods create_directories, create_configuration, setup_python, and post_setup can be overridden. The functions of these methods are:
- create_directories - creates the environment directory and all necessary directories, and returns a context object. This is just a holder for attributes (such as paths), for use by the other methods.
- create_configuration - creates the pyvenv.cfg configuration file in the environment.
- setup_python - creates a copy of the Python executable (and, under Windows, DLLs) in the environment.
- post_setup - A (no-op by default) hook method which can be overridden in third party subclasses to pre-install packages or install scripts in the virtual environment.
In addition, EnvBuilder provides a utility method that can be called from post_setup in subclasses to assist in installing custom scripts into the virtual environment. The method install_scripts accepts as arguments the context object (see above) and a path to a directory. The directory should contain subdirectories "common", "posix", "nt", each containing scripts destined for the bin directory in the environment. The contents of "common" and the directory corresponding to os.name are copied after doing some text replacement of placeholders:
- __VENV_DIR__ is replaced with absolute path of the environment directory.
- __VENV_NAME__ is replaced with the environment name (final path segment of environment directory).
- __VENV_BIN_NAME__ is replaced with the name of the bin directory (either bin or Scripts).
- __VENV_PYTHON__ is replaced with the absolute path of the environment's executable.
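As a sketch of how a third-party tool might use this hook (the subclass name and script directory are hypothetical; EnvBuilder and install_scripts are as described above):

```python
import venv

class ScriptInstallingEnvBuilder(venv.EnvBuilder):
    """Hypothetical subclass that drops custom scripts into the venv."""

    def __init__(self, script_dir, **kwargs):
        super().__init__(**kwargs)
        # script_dir is expected to contain "common", "posix" and "nt"
        # subdirectories holding the scripts to install.
        self.script_dir = script_dir

    def post_setup(self, context):
        # Copy the scripts into the venv's bin (or Scripts) directory,
        # substituting placeholders such as __VENV_PYTHON__ on the way.
        self.install_scripts(context, self.script_dir)
```

A tool would then call ScriptInstallingEnvBuilder("/path/to/scripts").create("/path/to/venv") just as with the base class.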
The DistributeEnvBuilder subclass in the reference implementation illustrates how the customization hook can be used in practice to pre-install Distribute into the virtual environment. It's not envisaged that DistributeEnvBuilder will actually be added to Python core, but it makes the reference implementation more immediately useful for testing and exploratory purposes.
Backwards Compatibility
Splitting the meanings of sys.prefix
Any virtual environment tool along these lines (which attempts to isolate site-packages, while still making use of the base Python's standard library with no need for it to be symlinked into the virtual environment) is proposing a split between two different meanings (among others) that are currently both wrapped up in sys.prefix: the answers to the questions "Where is the standard library?" and "Where is the site-packages location where third-party modules should be installed?"
This split could be handled by introducing a new sys attribute for either the former prefix or the latter prefix. Either option potentially introduces some backwards-incompatibility with software written to assume the other meaning for sys.prefix. (Such software should preferably be using the APIs in the site and sysconfig modules to answer these questions rather than using sys.prefix directly, in which case there is no backwards-compatibility issue, but in practice sys.prefix is sometimes used.)
The documentation [7] for sys.prefix describes it as "A string giving the site-specific directory prefix where the platform independent Python files are installed," and specifically mentions the standard library and header files as found under sys.prefix. It does not mention site-packages.
Maintaining this documented definition would mean leaving sys.prefix pointing to the base system installation (which is where the standard library and header files are found), and introducing a new value in sys (something like sys.site_prefix) to point to the prefix for site-packages. This would maintain the documented semantics of sys.prefix, but risk breaking isolation if third-party code uses sys.prefix rather than sys.site_prefix or the appropriate site API to find site-packages directories.
The most notable case is probably setuptools [3] and its fork distribute [4], which mostly use distutils and sysconfig APIs, but do use sys.prefix directly to build up a list of site directories for pre-flight checking where pth files can usefully be placed.
Otherwise, a Google Code Search [5] turns up what appears to be a roughly even mix of usage between packages using sys.prefix to build up a site-packages path and packages using it to e.g. eliminate the standard library from code-execution tracing.
Although it requires modifying the documented definition of sys.prefix, this PEP prefers to have sys.prefix point to the virtual environment (where site-packages is found), and introduce sys.base_prefix to point to the standard library and Python header files. Rationale for this choice:
- It is preferable to err on the side of greater isolation of the virtual environment.
- Virtualenv already modifies sys.prefix to point at the virtual environment, and in practice this has not been a problem.
- No modification is required to setuptools/distribute.
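The practical consequence of this choice is easy to check at runtime (a minimal sketch; sys.base_prefix exists from Python 3.3 onwards):

```python
import sys

def running_in_venv():
    # With sys.prefix repointed at the venv and sys.base_prefix left
    # at the base installation, the two differ exactly when a PEP 405
    # virtual environment is active.
    base = getattr(sys, "base_prefix", sys.prefix)
    return sys.prefix != base

print(running_in_venv())
```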
Impact on other Python implementations
The majority of this PEP's changes occur in the standard library, which is shared by other Python implementations and should not present any problem.
Other Python implementations will need to replicate the new sys.prefix-finding behavior of the interpreter bootstrap, including locating and parsing the pyvenv.cfg file, if it is present.
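A minimal sketch of the parsing such an implementation would need, assuming the simple "key = value" line format of pyvenv.cfg (the helper name is hypothetical):

```python
def parse_pyvenv_cfg(path):
    """Read a pyvenv.cfg file into a dict of key/value strings."""
    config = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            key, sep, value = line.partition("=")
            if sep:  # skip blank or malformed lines
                config[key.strip()] = value.strip()
    return config
```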
Reference Implementation
The reference implementation is found in a clone of the CPython Mercurial repository [6]. To test it, build and run bin/pyvenv /path/to/new/venv to create a virtual environment.
References
| [1] | http://www.virtualenv.org |
| [2] | https://github.com/kvbik/rvirtualenv |
| [3] | http://peak.telecommunity.com/DevCenter/setuptools |
| [4] | http://packages.python.org/distribute/ |
| [5] | http://www.google.com/codesearch#search/&q=sys.prefix&p=1&type=cs |
| [6] | http://hg.python.org/sandbox/vsajip#venv |
| [7] | http://docs.python.org/dev/library/sys.html#sys.prefix |
Copyright
This document has been placed in the public domain.
pep-0406 Improved Encapsulation of Import State
| PEP: | 406 |
|---|---|
| Title: | Improved Encapsulation of Import State |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Nick Coghlan <ncoghlan at gmail.com>, Greg Slodkowicz <jergosh at gmail.com> |
| Status: | Withdrawn |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 4-Jul-2011 |
| Python-Version: | 3.4 |
| Post-History: | 31-Jul-2011, 13-Nov-2011, 4-Dec-2011 |
Contents
Abstract
This PEP proposes the introduction of a new 'ImportEngine' class as part of importlib which would encapsulate all state related to importing modules into a single object. Creating new instances of this object would then provide an alternative to completely replacing the built-in implementation of the import statement, by overriding the __import__() function. To work with the builtin import functionality and importing via import engine objects, this PEP proposes a context management based approach to temporarily replacing the global import state.
The PEP also proposes inclusion of a GlobalImportEngine subclass and a globally accessible instance of that class, which "writes through" to the process global state. This provides a backwards compatible bridge between the proposed encapsulated API and the legacy process global state, and allows straightforward support for related state updates (e.g. selectively invalidating path cache entries when sys.path is modified).
PEP Withdrawal
The import system has seen substantial changes since this PEP was originally written, as part of PEP 420 in Python 3.3 and PEP 451 in Python 3.4.
While providing an encapsulation of the import state is still highly desirable, it is better tackled in a new PEP using PEP 451 as a foundation, and permitting only the use of PEP 451 compatible finders and loaders (as those avoid many of the issues of direct manipulation of global state associated with the previous loader API).
Rationale
Currently, most state related to the import system is stored as module level attributes in the sys module. The one exception is the import lock, which is not accessible directly, but only via the related functions in the imp module. The current process global import state comprises:
- sys.modules
- sys.path
- sys.path_hooks
- sys.meta_path
- sys.path_importer_cache
- the import lock (imp.lock_held()/acquire_lock()/release_lock())
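A sketch of capturing that state as a single object, foreshadowing the encapsulation this PEP proposes (the function name is hypothetical):

```python
import sys

def snapshot_import_state():
    # Shallow-copy each piece of the process-global import state
    # listed above (the import lock has no copyable value).
    return {
        "modules": dict(sys.modules),
        "path": list(sys.path),
        "path_hooks": list(sys.path_hooks),
        "meta_path": list(sys.meta_path),
        "path_importer_cache": dict(sys.path_importer_cache),
    }

state = snapshot_import_state()
```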
Isolating this state would allow multiple import states to be conveniently stored within a process. Placing the import functionality in a self-contained object would also allow subclassing to add additional features (e.g. module import notifications or fine-grained control over which modules can be imported). The engine would also be subclassed to make it possible to use the import engine API to interact with the existing process-global state.
The namespace PEPs (especially PEP 402) raise a potential need for additional process global state, in order to correctly update package paths as sys.path is modified.
Finally, providing a coherent object for all this state makes it feasible to also provide context management features that allow the import state to be temporarily substituted.
Proposal
We propose introducing an ImportEngine class to encapsulate import functionality. This includes an __import__() method which can be used as an alternative to the built-in __import__() when desired and also an import_module() method, equivalent to importlib.import_module() [3].
Since there are global import state invariants that are assumed and should be maintained, we introduce a GlobalImportState class with an interface identical to ImportEngine but directly accessing the current global import state. This can be easily implemented using class properties.
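The proposed importlib.engine API was never implemented; the following toy sketch (all names hypothetical) only illustrates the central idea of holding the module cache as instance state rather than in sys:

```python
import importlib.util
import sys

class ToyImportEngine:
    """Toy illustration of per-instance import state. This is NOT the
    proposed importlib.engine.ImportEngine API, just the idea."""

    def __init__(self):
        self.modules = {}          # instance-level stand-in for sys.modules
        self.path = list(sys.path)

    def import_module(self, name):
        if name in self.modules:
            return self.modules[name]
        # Finding the spec still consults the global finders here; a
        # real engine would route this through its own state as well.
        spec = importlib.util.find_spec(name)
        module = importlib.util.module_from_spec(spec)
        self.modules[name] = module
        spec.loader.exec_module(module)
        return module
```

Each engine instance caches its own copies of imported modules, so two engines can hold independent module state within one process.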
Specification
ImportEngine API
The proposed extension consists of the following objects:
importlib.engine.ImportEngine
from_engine(self, other)
Create a new import object from another ImportEngine instance. The new object is initialised with a copy of the state in other. When called on importlib.engine.sysengine, from_engine() can be used to create an ImportEngine object with a copy of the global import state.
__import__(self, name, globals={}, locals={}, fromlist=[], level=0)
Reimplementation of the builtin __import__() function. The import of a module will proceed using the state stored in the ImportEngine instance rather than the global import state. For full documentation of __import__ functionality, see [2]. __import__() from ImportEngine and its subclasses can be used to customise the behaviour of the import statement by replacing __builtin__.__import__ with ImportEngine().__import__.
import_module(name, package=None)
A reimplementation of importlib.import_module() which uses the import state stored in the ImportEngine instance. See [3] for a full reference.
modules, path, path_hooks, meta_path, path_importer_cache
Instance-specific versions of their process-global sys equivalents.
importlib.engine.GlobalImportEngine(ImportEngine)
Convenience class to provide engine-like access to the global state. Provides __import__(), import_module() and from_engine() methods like ImportEngine but writes through to the global state in sys.
To support various namespace package mechanisms, when sys.path is altered, tools like pkgutil.extend_path should be used to also modify other parts of the import state (in this case, package __path__ attributes). The path importer cache should also be invalidated when a variety of changes are made.
The ImportEngine API will provide convenience methods that automatically make related import state updates as part of a single operation.
Global variables
importlib.engine.sysengine
A precreated instance of GlobalImportEngine. Intended for use by importers and loaders that have been updated to accept optional engine parameters and with ImportEngine.from_engine(sysengine) to start with a copy of the process global import state.
No changes to finder/loader interfaces
Rather than attempting to update the PEP 302 APIs to accept additional state, this PEP proposes that ImportEngine support the context management protocol (similar to the context substitution mechanisms in the decimal module).
The context management mechanism for ImportEngine would:
- On entry:
  - Acquire the import lock
  - Substitute the global import state with the import engine's own state
- On exit:
  - Restore the previous global import state
  - Release the import lock
The precise API for this is TBD (but will probably use a distinct context management object, along the lines of that created by decimal.localcontext).
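Pending that design, the entry/exit behaviour can be sketched with a plain context manager. This hypothetical example substitutes only sys.path, since (as the "Scope of substitution" section notes) replacing sys.modules is unreliable:

```python
import sys
from contextlib import contextmanager

@contextmanager
def substituted_sys_path(new_path):
    # On entry: swap in the engine's state (here, just sys.path).
    # On exit: restore the previous global state, even on error.
    saved = sys.path
    sys.path = list(new_path)
    try:
        yield
    finally:
        sys.path = saved
```

A real ImportEngine context manager would additionally acquire and release the import lock around the substitution.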
Open Issues
API design for falling back to global import state
The current proposal relies on the from_engine() API to fall back to the global import state. It may be desirable to offer a variant that instead falls back to the global import state dynamically.
However, one big advantage of starting with an "as isolated as possible" design is that it becomes possible to experiment with subclasses that blur the boundaries between the engine instance state and the process global state in various ways.
Builtin and extension modules must be process global
Due to platform limitations, only one copy of each builtin and extension module can readily exist in each process. Accordingly, it is impossible for each ImportEngine instance to load such modules independently.
The simplest solution is for ImportEngine to refuse to load such modules, raising ImportError. GlobalImportEngine would be able to load them normally.
ImportEngine will still return such modules from a prepopulated module cache - it's only loading them directly which causes problems.
Scope of substitution
Related to the previous open issue is the question of what state to substitute when using the context management API. It is currently the case that replacing sys.modules can be unreliable due to cached references and there's the underlying fact that having independent copies of some modules is simply impossible due to platform limitations.
As part of this PEP, it will be necessary to document explicitly:
- Which parts of the global import state can be substituted (and declare code which caches references to that state without dealing with the substitution case buggy)
- Which parts must be modified in-place (and hence are not substituted by the ImportEngine context management API, or otherwise scoped to ImportEngine instances)
Reference Implementation
A reference implementation [4] for an earlier draft of this PEP, based on Brett Cannon's importlib has been developed by Greg Slodkowicz as part of the 2011 Google Summer of Code. Note that the current implementation avoids modifying existing code, and hence duplicates a lot of things unnecessarily. An actual implementation would just modify any such affected code in place.
That earlier draft of the PEP proposed changing the PEP 302 APIs to support passing in an optional engine instance. This had the (serious) downside of not correctly affecting further imports from the imported module, hence the change to the context management based proposal for substituting the global state.
References
| [1] | PEP 302, New Import Hooks, J van Rossum, Moore (http://www.python.org/dev/peps/pep-0302) |
| [2] | __import__() builtin function, The Python Standard Library documentation (http://docs.python.org/library/functions.html#__import__) |
| [3] | (1, 2) Importlib documentation, Cannon (http://docs.python.org/dev/library/importlib) |
| [4] | Reference implementation (https://bitbucket.org/jergosh/gsoc_import_engine/src/default/Lib/importlib/engine.py) |
Copyright
This document has been placed in the public domain.
pep-0407 New release cycle and introducing long-term support versions
| PEP: | 407 |
|---|---|
| Title: | New release cycle and introducing long-term support versions |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Antoine Pitrou <solipsis at pitrou.net>, Georg Brandl <georg at python.org>, Barry Warsaw <barry at python.org> |
| Status: | Deferred |
| Type: | Process |
| Content-Type: | text/x-rst |
| Created: | 2012-01-12 |
| Post-History: | http://mail.python.org/pipermail/python-dev/2012-January/115838.html |
| Resolution: | TBD |
Contents
Abstract
Finding a release cycle for an open-source project is a delicate exercise in managing mutually contradicting constraints: developer manpower, availability of release management volunteers, ease of maintenance for users and third-party packagers, quick availability of new features (and behavioural changes), availability of bug fixes without pulling in new features or behavioural changes.
The current release cycle errs on the conservative side. It is adequate for people who value stability over reactivity. This PEP is an attempt to keep the stability that has become a Python trademark, while offering a more fluid release of features, by introducing the notion of long-term support versions.
Scope
This PEP doesn't try to change the maintenance period or release scheme for the 2.7 branch. Only 3.x versions are considered.
Proposal
Under the proposed scheme, there would be two kinds of feature versions (sometimes dubbed "minor versions", for example 3.2 or 3.3): normal feature versions and long-term support (LTS) versions.
Normal feature versions would get either zero or at most one bugfix release; the latter only if needed to fix critical issues. Security fix handling for these branches needs to be decided.
LTS versions would get regular bugfix releases until the next LTS version is out. They then would go into security fixes mode, up to a termination date at the release manager's discretion.
Periodicity
A new feature version would be released every X months. We tentatively propose X = 6 months.
LTS versions would be one out of N feature versions. We tentatively propose N = 4.
With these figures, a new LTS version would be out every 24 months, and remain supported until the next LTS version 24 months later. This is mildly similar to today's 18-month bugfix cycle for every feature version.
Pre-release versions
More frequent feature releases imply a smaller number of disruptive changes per release. Therefore, the number of pre-release builds (alphas and betas) can be brought down considerably. Two alpha builds and a single beta build would probably be enough in the regular case. The number of release candidates depends, as usual, on the number of last-minute fixes before final release.
Effects
Effect on development cycle
More feature releases might mean more stress on the development and release management teams. This is quantitatively alleviated by the smaller number of pre-release versions; and qualitatively by the lesser amount of disruptive changes (meaning less potential for breakage). The shorter feature freeze period (after the first beta build until the final release) is easier to accept. The rush for adding features just before feature freeze should also be much smaller.
Effect on bugfix cycle
The effect on fixing bugs should be minimal with the proposed figures. The same number of branches would be simultaneously open for bugfix maintenance (two until 2.x is terminated, then one).
Effect on workflow
The workflow for new features would be the same: developers would only commit them on the default branch.
The workflow for bug fixes would be slightly updated: developers would commit bug fixes to the current LTS branch (for example 3.3) and then merge them into default.
If some critical fixes are needed to a non-LTS version, they can be grafted from the current LTS branch to the non-LTS branch, just like fixes are ported from 3.x to 2.7 today.
Effect on the community
People who value stability can just synchronize on the LTS releases which, with the proposed figures, would give a similar support cycle (both in duration and in stability).
People who value reactivity and access to new features (without taking the risk to install alpha versions or Mercurial snapshots) would get much more value from the new release cycle than currently.
People who want to contribute new features or improvements would be more motivated to do so, knowing that their contributions will be more quickly available to normal users. Also, a smaller feature freeze period makes it less cumbersome to interact with contributors of features.
Discussion
These are open issues that should be worked out during discussion:
- Decide on X (months between feature releases) and N (feature releases per LTS release) as defined above.
- For given values of X and N, is the no-bugfix-releases policy for non-LTS versions feasible?
- What is the policy for security fixes?
- Restrict new syntax and similar changes (i.e. everything that was prohibited by PEP 3003) to LTS versions?
- What is the effect on packagers such as Linux distributions?
- How will release version numbers or other identifying and marketing material make it clear to users which versions are normal feature releases and which are LTS releases? How do we manage user expectations?
- Does the faster release cycle mean we could some day reach 3.10 and above? Some people expressed a tacit expectation that version numbers always fit in one decimal digit.
A community poll or survey to collect opinions from the greater Python community would be valuable before making a final decision.
Copyright
This document has been placed in the public domain.
pep-0408 Standard library __preview__ package
| PEP: | 408 |
|---|---|
| Title: | Standard library __preview__ package |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Nick Coghlan <ncoghlan at gmail.com>, Eli Bendersky <eliben at gmail.com> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 2012-01-07 |
| Python-Version: | 3.3 |
| Post-History: | 2012-01-27 |
| Resolution: | http://mail.python.org/pipermail/python-dev/2012-January/115962.html |
Contents
Abstract
The process of including a new module into the Python standard library is hindered by the API lock-in and promise of backward compatibility implied by a module being formally part of Python. This PEP proposes a transitional state for modules - inclusion in a special __preview__ package for the duration of a minor release (roughly 18 months) prior to full acceptance into the standard library. On one hand, this state provides the module with the benefits of being formally part of the Python distribution. On the other hand, the core development team explicitly states that no promises are made with regards to the module's eventual full inclusion into the standard library, or to the stability of its API, which may change for the next release.
PEP Rejection
Based on his experience with a similar "labs" namespace in Google App Engine, Guido has rejected this PEP [3] in favour of the simpler alternative of explicitly marking provisional modules as such in their documentation.
If a module is otherwise considered suitable for standard library inclusion, but some concerns remain regarding maintainability or certain API details, then the module can be accepted on a provisional basis. While it is considered an unlikely outcome, such modules may be removed from the standard library without a deprecation period if the lingering concerns prove well-founded.
As part of the same announcement, Guido explicitly accepted Matthew Barnett's 'regex' module [4] as a provisional addition to the standard library for Python 3.3 (using the 'regex' name, rather than as a drop-in replacement for the existing 're' module).
Proposal - the __preview__ package
Whenever the Python core development team decides that a new module should be included into the standard library, but isn't entirely sure about whether the module's API is optimal, the module can be placed in a special package named __preview__ for a single minor release.
In the next minor release, the module may either be "graduated" into the standard library (and occupy its natural place within its namespace, leaving the __preview__ package), or be rejected and removed entirely from the Python source tree. If the module ends up graduating into the standard library after spending a minor release in __preview__, its API may be changed according to accumulated feedback. The core development team explicitly makes no guarantees about API stability and backward compatibility of modules in __preview__.
Entry into the __preview__ package marks the start of a transition of the module into the standard library. It means that the core development team assumes responsibility for the module, similarly to any other module in the standard library.
Which modules should go through __preview__
We expect most modules proposed for addition into the Python standard library to go through a minor release in __preview__. There may, however, be some exceptions, such as modules that use a pre-defined API (for example lzma, which generally follows the API of the existing bz2 module), or modules with an API that has wide acceptance in the Python development community.
In any case, modules that are proposed to be added to the standard library, whether via __preview__ or directly, must fulfill the acceptance conditions set by PEP 2.
It is important to stress that the aim of this proposal is not to make the process of adding new modules to the standard library more difficult. On the contrary, it tries to provide a means to add more useful libraries. Modules which are obvious candidates for entry can be added as before. Modules which due to uncertainties about the API could be stalled for a long time now have a means to still be distributed with Python, via an incubation period in the __preview__ package.
Criteria for "graduation"
In principle, most modules in the __preview__ package should eventually graduate to the stable standard library. Some reasons for not graduating are:
- The module may prove to be unstable or fragile, without sufficient developer support to maintain it.
- A much better alternative module may be found during the preview release
Essentially, the decision will be made by the core developers on a per-case basis. The point to emphasize here is that a module's appearance in the __preview__ package in some release does not guarantee it will continue being part of Python in the next release.
Example
Suppose the example module is a candidate for inclusion in the standard library, but some Python developers aren't convinced that it presents the best API for the problem it intends to solve. The module can then be added to the __preview__ package in release 3.X, importable via:
from __preview__ import example
Assuming the module is then promoted to the standard library proper in release 3.X+1, it will be moved to a permanent location in the library:
import example
And importing it from __preview__ will no longer work.
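Code written against an incubating module could shield itself from the move with a graceful-fallback import. The helper below is only a sketch of that pattern; `load_preview` and the module names are illustrative, not part of the proposal:

```python
import importlib

def load_preview(name):
    """Try the graduated (stable) location first, then the __preview__ package."""
    for target in (name, "__preview__." + name):
        try:
            return importlib.import_module(target)
        except ImportError:
            continue
    raise ImportError("no module named %r" % name)

# A module already in the standard library resolves at its final location:
json = load_preview("json")
```

The same call keeps working before and after graduation, at the cost of one extra import attempt in the __preview__ release.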
Rationale
Benefits for the core development team
Currently, the core developers are reluctant to add new interfaces to the standard library, because as soon as they're published in a release, API design mistakes get locked in due to backward compatibility concerns.
By gating all major API additions through some kind of a preview mechanism for a full release, we get one full release cycle of community feedback before we lock in the APIs with our standard backward compatibility guarantee.
We can also start integrating preview modules with the rest of the standard library early, so long as we make it clear to packagers that the preview modules should not be considered optional. The only difference between preview APIs and the rest of the standard library is that preview APIs are explicitly exempted from the usual backward compatibility guarantees.
Essentially, the __preview__ package is intended to lower the risk of locking in minor API design mistakes for extended periods of time. Currently, this concern can block new additions, even when the core development team consensus is that a particular addition is a good idea in principle.
Benefits for end users
For future end users, the broadest benefit lies in a better "out-of-the-box" experience - rather than being told "oh, the standard library tools for task X are horrible, download this 3rd party library instead", those superior tools are more likely to be just an import away.
For environments where developers are required to conduct due diligence on their upstream dependencies (severely harming the cost-effectiveness of, or even ruling out entirely, much of the material on PyPI), the key benefit lies in ensuring that anything in the __preview__ package is clearly under python-dev's aegis from at least the following perspectives:
- Licensing: Redistributed by the PSF under a Contributor Licensing Agreement.
- Documentation: The documentation of the module is published and organized via the standard Python documentation tools (i.e. ReST source, output generated with Sphinx and published on http://docs.python.org).
- Testing: The module test suites are run on the python.org buildbot fleet and results published via http://www.python.org/dev/buildbot.
- Issue management: Bugs and feature requests are handled on http://bugs.python.org
- Source control: The master repository for the software is published on http://hg.python.org.
Candidates for inclusion into __preview__
For Python 3.3, there are a number of clear current candidates:
- regex (http://pypi.python.org/pypi/regex)
- daemon (PEP 3143)
- ipaddr (PEP 3144)
Other possible future use cases include:
- Improved HTTP modules (e.g. requests)
- HTML 5 parsing support (e.g. html5lib)
- Improved URL/URI/IRI parsing
- A standard image API (PEP 368)
- Encapsulation of the import state (PEP 368)
- Standard event loop API (PEP 3153)
- A binary version of WSGI for Python 3 (e.g. PEP 444)
- Generic function support (e.g. simplegeneric)
Relationship with PEP 407
PEP 407 proposes a change to the core Python release cycle to permit interim releases every 6 months (perhaps limited to standard library updates). If such a change to the release cycle is made, the following policy for the __preview__ namespace is suggested:
- For long term support releases, the __preview__ namespace would always be empty.
- New modules would be accepted into the __preview__ namespace only in interim releases that immediately follow a long term support release.
- All modules added will either be migrated to their final location in the standard library or dropped entirely prior to the next long term support release.
Rejected alternatives and variations
Using __future__
Python already has a "forward-looking" namespace in the form of the __future__ module, so it's reasonable to ask why that can't be re-used for this new purpose.
There are two reasons why doing so is not appropriate:
1. The __future__ module is linked to a separate compiler directives feature that can actually change the way the Python interpreter compiles a module. We don't want that for the preview package - we just want an ordinary Python package.
2. The __future__ module comes with an express promise that names will be maintained in perpetuity, long after the associated features have become the compiler's default behaviour. Again, this is precisely the opposite of what is intended for the preview package - it is almost certain that all names added to the preview will be removed at some point, most likely due to their being moved to a permanent home in the standard library, but also potentially due to their being reverted to third party package status (if community feedback suggests the proposed addition is irredeemably broken).
Versioning the package
One proposed alternative [1] was to add explicit versioning to the __preview__ package, i.e. __preview34__. We think that it's better to simply define that a module being in __preview__ in Python 3.X will either graduate to the normal standard library namespace in Python 3.X+1 or will disappear from the Python source tree altogether. Versioning the __preview__ package complicates the process and does not align well with the main intent of this proposal.
Using a package name without leading and trailing underscores
It was proposed [1] to use a package name like preview or exp, instead of __preview__. This was rejected in the discussion due to the special meaning a "dunder" package name (that is, a name with leading and trailing double-underscores) conveys in Python. Besides, a non-dunder name would suggest normal standard library API stability guarantees, which is not the intention of the __preview__ package.
Preserving pickle compatibility
A pickled class instance based on a module in __preview__ in release 3.X won't be unpickle-able in release 3.X+1, where the module won't be in __preview__. Special code may be added to make this work, but this goes against the intent of this proposal, since it implies backward compatibility. Therefore, this PEP does not propose to preserve pickle compatibility.
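The incompatibility is easy to see in miniature: a pickle records the dotted path of the defining module, so once a class moves out of __preview__ that recorded path no longer resolves. A small illustration (the Example class is a stand-in, not a real __preview__ module):

```python
import pickle

class Example:
    """Stand-in for a class defined in a __preview__ module."""
    pass

data = pickle.dumps(Example())
# The defining module's name is baked into the pickle stream; moving the
# class to another module therefore invalidates previously written pickles.
assert Example.__module__.encode() in data
```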
Credits
Dj Gilcrease initially proposed the idea of having a __preview__ package in Python [2]. Although his original proposal uses the name __experimental__, we feel that __preview__ conveys the meaning of this package in a better way.
References
| [1] | Discussed in this thread: http://mail.python.org/pipermail/python-ideas/2012-January/013246.html |
| [2] | http://mail.python.org/pipermail/python-ideas/2011-August/011278.html |
| [3] | Guido's decision: http://mail.python.org/pipermail/python-dev/2012-January/115962.html |
| [4] | Proposal for inclusion of regex: http://bugs.python.org/issue2636 |
Copyright
This document has been placed in the public domain.
pep-0409 Suppressing exception context
| PEP: | 409 |
|---|---|
| Title: | Suppressing exception context |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Ethan Furman <ethan at stoneleaf.us> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 26-Jan-2012 |
| Post-History: | 30-Aug-2002, 01-Feb-2012, 03-Feb-2012 |
| Superseded-By: | 415 |
| Resolution: | http://mail.python.org/pipermail/python-dev/2012-February/116136.html |
Contents
Abstract
One of the open issues from PEP 3134 is suppressing context: currently there is no way to do it. This PEP proposes one.
Rationale
There are two basic ways to generate exceptions:
- Python does it (buggy code, missing resources, ending loops, etc.)
- manually (with a raise statement)
When writing libraries, or even just custom classes, it can become necessary to raise exceptions; moreover it can be useful, even necessary, to change from one exception to another. To take an example from my dbf module:
try:
value = int(value)
except Exception:
raise DbfError(...)
Whatever the original exception was (ValueError, TypeError, or something else) is irrelevant. The exception from this point on is a DbfError, and the original exception is of no value. However, if this exception is printed, we would currently see both.
Alternatives
Several possibilities have been put forth:
raise as NewException()
Reuses the as keyword; can be confusing since we are not really reraising the originating exception
raise NewException() from None
Follows existing syntax of explicitly declaring the originating exception
exc = NewException(); exc.__context__ = None; raise exc
A very verbose version of the previous method
raise NewException.no_context(...)
Make context suppression a class method.
All of the above options will require changes to the core.
Proposal
I propose going with the second option:
raise NewException from None
It has the advantage of using the existing pattern of explicitly setting the cause:
raise KeyError() from NameError()
but because the cause is None the previous context is not displayed by the default exception printing routines.
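In terms of the dbf example above, the proposal lets the library raise a clean error while still recording the original exception. The snippet below runs on Python 3.3+, where the implementation ultimately adopted via PEP 415 records the suppression in __suppress_context__:

```python
class DbfError(Exception):
    pass

def to_int(value):
    try:
        return int(value)
    except ValueError:
        # Suppress the uninteresting ValueError from the traceback display
        raise DbfError("cannot convert %r" % (value,)) from None

try:
    to_int("spam")
except DbfError as exc:
    assert exc.__cause__ is None                    # "from None" sets the cause
    assert exc.__suppress_context__                 # so the context is not printed
    assert isinstance(exc.__context__, ValueError)  # but it is still preserved
```

Note that the original exception remains reachable via __context__ for debugging, even though the default printing routines omit it.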
Implementation Discussion
Note: after acceptance of this PEP, a cleaner implementation mechanism was proposed and accepted in PEP 415. Refer to that PEP for more details on the implementation actually used in Python 3.3.
Currently, None is the default for both __context__ and __cause__. In order to support raise ... from None (which would set __cause__ to None) we need a different default value for __cause__. Several ideas were put forth on how to implement this at the language level:
Overwrite the previous exception information (side-stepping the issue and leaving __cause__ at None).
Rejected as this can seriously hinder debugging due to poor error messages [1].
Use one of the boolean values in __cause__: False would be the default value, and would be replaced when from ... was used with the explicitly chained exception or None.
Rejected as this encourages the use of two different object types for __cause__ with one of them (boolean) not allowed to have the full range of possible values (True would never be used).
Create a special exception class, __NoException__.
Rejected as possibly confusing, possibly being mistakenly raised by users, and not being a truly unique value as None, True, and False are.
Use Ellipsis as the default value (the ... singleton).
Accepted.
Ellipses are commonly used in English as placeholders when words are omitted. This works in our favor here as a signal that __cause__ is omitted, so look in __context__ for more details.
Ellipsis is not an exception, so cannot be raised.
There is only one Ellipsis, so no unused values.
Error information is not thrown away, so custom code can trace the entire exception chain even if the default code does not.
Language Details
To support raise Exception from None, __context__ will stay as it is, but __cause__ will start out as Ellipsis and will change to None when the raise Exception from None method is used.
| form | __context__ | __cause__ |
|---|---|---|
| raise | None | Ellipsis |
| reraise | previous exception | Ellipsis |
| reraise from None \| ChainedException | previous exception | None \| explicitly chained exception |
The default exception printing routine will then:
- If __cause__ is Ellipsis, the __context__ (if any) will be printed.
- If __cause__ is None, the __context__ will not be printed.
- If __cause__ is anything else, __cause__ will be printed.
In both of the latter cases the exception chain will stop being followed.
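The three rules above can be sketched as a chain-walking function. Plain objects stand in for exceptions here, because released CPython ultimately adopted PEP 415's __suppress_context__ mechanism rather than the Ellipsis default described in this PEP, so real exception objects don't carry these proposed defaults:

```python
class FakeExc:
    """Minimal stand-in carrying the proposed __cause__/__context__ defaults."""
    def __init__(self, name, cause=Ellipsis, context=None):
        self.name = name
        self.__cause__ = cause
        self.__context__ = context

def chain_to_print(exc):
    """Names of the exceptions the proposed default printer would show."""
    shown = [exc.name]
    # Rule 1: while __cause__ is Ellipsis, follow __context__ (if any).
    while exc.__cause__ is Ellipsis and exc.__context__ is not None:
        exc = exc.__context__
        shown.append(exc.name)
    if exc.__cause__ is None:            # Rule 2: suppressed, show nothing more
        return shown
    if exc.__cause__ is not Ellipsis:    # Rule 3: show the explicit cause,
        shown.append(exc.__cause__.name) # then stop following the chain
    return shown

inner = FakeExc("ValueError")
assert chain_to_print(FakeExc("DbfError", context=inner)) == ["DbfError", "ValueError"]
assert chain_to_print(FakeExc("DbfError", cause=None, context=inner)) == ["DbfError"]
```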
The default value for __cause__ is now Ellipsis, so raise Exception from Cause is simply syntactic sugar for:
_exc = NewException()
_exc.__cause__ = Cause()
raise _exc
Ellipsis, as well as None, is now allowed as a cause:
raise Exception from Ellipsis
Patches
There is a patch for CPython implementing this attached to Issue 6210 [2].
References
Discussion and refinements in this thread on python-dev [3].
| [1] | http://bugs.python.org/msg152294 |
| [2] | http://bugs.python.org/issue6210 |
| [3] | http://mail.python.org/pipermail/python-dev/2012-January/115838.html |
Copyright
This document has been placed in the public domain.
pep-0410 Use decimal.Decimal type for timestamps
| PEP: | 410 |
|---|---|
| Title: | Use decimal.Decimal type for timestamps |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Victor Stinner <victor.stinner at gmail.com> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 01-February-2012 |
| Python-Version: | 3.3 |
| Resolution: | http://mail.python.org/pipermail/python-dev/2012-February/116837.html |
Contents
Rejection Notice
This PEP is rejected. See http://mail.python.org/pipermail/python-dev/2012-February/116837.html.
Abstract
Decimal becomes the official type for high-resolution timestamps to make Python support new functions using a nanosecond resolution without loss of precision.
Rationale
Python 2.3 introduced float timestamps to support sub-second resolutions. os.stat() uses float timestamps by default since Python 2.5. Python 3.3 introduced functions supporting nanosecond resolutions:
- os module: futimens(), utimensat()
- time module: clock_gettime(), clock_getres(), monotonic(), wallclock()
os.stat() reads nanosecond timestamps but returns timestamps as float.
The Python float type uses the binary64 format of the IEEE 754 standard. With a resolution of one nanosecond (10^-9), float timestamps lose precision for values bigger than 2^24 seconds (194 days: 1970-07-14 for an Epoch timestamp).
Nanosecond resolution is required to set the exact modification time on filesystems supporting nanosecond timestamps (e.g. ext4, btrfs, NTFS, ...). It helps also to compare the modification time to check if a file is newer than another file. Use cases: copy the modification time of a file using shutil.copystat(), create a TAR archive with the tarfile module, manage a mailbox with the mailbox module, etc.
An arbitrary resolution is preferred over a fixed resolution (like nanosecond) to not have to change the API when a better resolution is required. For example, the NTP protocol uses fractions of 2^32 seconds (approximately 2.3 × 10^-10 second), whereas NTP version 4 uses fractions of 2^64 seconds (5.4 × 10^-20 second).
Note
With a resolution of 1 microsecond (10^-6), float timestamps lose precision for values bigger than 2^33 seconds (272 years: 2242-03-16 for an Epoch timestamp). With a resolution of 100 nanoseconds (10^-7, the resolution used on Windows), float timestamps lose precision for values bigger than 2^29 seconds (17 years: 1987-01-05 for an Epoch timestamp).
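The nanosecond claim is easy to verify directly: binary64 has a 52-bit mantissa, so near 2^24 the spacing between representable floats is 2^-28 seconds (about 3.7 nanoseconds), and adding a single nanosecond is rounded away:

```python
# Rough check of the precision limit stated above.
t = 2.0 ** 24                 # ~194 days after the Epoch
assert t + 1e-9 == t          # one nanosecond is lost to rounding
assert 1.0 + 1e-9 != 1.0      # near the Epoch the same nanosecond survives
```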
Specification
Add decimal.Decimal as a new type for timestamps. Decimal supports any timestamp resolution, supports arithmetic operations and is comparable. It is possible to coerce a Decimal to float, even if the conversion may lose precision. The clock resolution can also be stored in a Decimal object.
Add an optional timestamp argument to:
- os module: fstat(), fstatat(), lstat(), stat() (st_atime, st_ctime and st_mtime fields of the stat structure), sched_rr_get_interval(), times(), wait3() and wait4()
- resource module: ru_utime and ru_stime fields of getrusage()
- signal module: getitimer(), setitimer()
- time module: clock(), clock_gettime(), clock_getres(), monotonic(), time() and wallclock()
The timestamp argument value can be float or Decimal; float is still the default for backward compatibility. The following functions support Decimal as input:
- datetime module: date.fromtimestamp(), datetime.fromtimestamp() and datetime.utcfromtimestamp()
- os module: futimes(), futimesat(), lutimes(), utime()
- select module: epoll.poll(), kqueue.control(), select()
- signal module: setitimer(), sigtimedwait()
- time module: ctime(), gmtime(), localtime(), sleep()
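A sketch of how a lossless Decimal timestamp could be built from a C-style timespec pair; the helper name here is illustrative, not part of the proposed API:

```python
from decimal import Decimal

def decimal_timestamp(sec, nsec):
    # Both terms are exactly representable in decimal floating point,
    # so no precision is lost, unlike with binary64 float.
    return Decimal(sec) + Decimal(nsec) / Decimal(10 ** 9)

ts = decimal_timestamp(1329000000, 123456789)
assert ts == Decimal("1329000000.123456789")   # exact to the nanosecond
assert float(ts) != ts                         # coercion to float rounds
```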
The os.stat_float_times() function is deprecated: use an explicit cast using int() instead.
Note
The decimal module is implemented in Python and is slower than float, but there is a new C implementation which is almost ready for inclusion in CPython.
Backwards Compatibility
The default timestamp type (float) is unchanged, so there is no impact on backward compatibility nor on performance. The new timestamp type, decimal.Decimal, is only returned when requested explicitly.
Objection: clocks accuracy
Computer clocks and operating systems are inaccurate and fail to provide nanosecond accuracy in practice. A nanosecond is what it takes to execute a couple of CPU instructions. Even on a real-time operating system, a nanosecond-precise measurement is already obsolete when it starts being processed by the higher-level application. A single cache miss in the CPU will make the precision worthless.
Note
Linux actually is able to measure time in nanosecond precision, even though it is not able to keep its clock synchronized to UTC with a nanosecond accuracy.
Alternatives: Timestamp types
To support timestamps with an arbitrary or nanosecond resolution, the following types have been considered:
- decimal.Decimal
- number of nanoseconds
- 128-bits float
- datetime.datetime
- datetime.timedelta
- tuple of integers
- timespec structure
Criteria:
- Doing arithmetic on timestamps must be possible
- Timestamps must be comparable
- An arbitrary resolution, or at least a resolution of one nanosecond without losing precision
- It should be possible to coerce the new timestamp to float for backward compatibility
A resolution of one nanosecond is enough to support all current C functions.
The best resolution used by operating systems is one nanosecond. In practice, the accuracy of most clocks is closer to microseconds than to nanoseconds. So it sounds reasonable to use a fixed resolution of one nanosecond.
Number of nanoseconds (int)
A nanosecond resolution is enough for all current C functions and so a timestamp can simply be a number of nanoseconds, an integer, not a float.
The number of nanoseconds format has been rejected because it would require adding new specialized functions for this format, since it is not possible to differentiate a number of nanoseconds from a number of seconds just by checking the object type.
128-bits float
Add a new IEEE 754-2008 quad-precision binary float type. The IEEE 754-2008 quad precision float has 1 sign bit, 15 bits of exponent and 112 bits of mantissa. 128-bits float is supported by GCC (4.3), Clang and ICC compilers.
Python must be portable and so cannot rely on a type only available on some platforms. For example, Visual C++ 2008 doesn't support 128-bits float, whereas it is used to build the official Windows executables. Another example: GCC 4.3 does not support __float128 in 32-bit mode on x86 (but GCC 4.4 does).
There is also a license issue: GCC uses the MPFR library for 128-bits float, a library distributed under the GNU LGPL license. This license is not compatible with the Python license.
Note
The x87 floating point unit of Intel CPUs supports 80-bit floats. This format is not supported by the SSE instruction set, which is now preferred over x87, especially on x86_64. Other CPU vendors don't support 80-bit floats.
datetime.datetime
The datetime.datetime type is the natural choice for a timestamp because it is clear that this type contains a timestamp, whereas int, float and Decimal are raw numbers. It is an absolute timestamp and so is well defined. It gives direct access to the year, month, day, hours, minutes and seconds. It has methods related to time like methods to format the timestamp as string (e.g. datetime.datetime.strftime).
The major issue is that, except for os.stat(), time.time() and time.clock_gettime(time.CLOCK_REALTIME), all time functions have an unspecified starting point and no timezone information, and so cannot be converted to datetime.datetime.
datetime.datetime also has issues with timezones. For example, a datetime object without a timezone (naive) and a datetime with a timezone (aware) cannot be compared. There is also an ordering issue with daylight saving time (DST) in the duplicate hour when switching from DST to normal time.
datetime.datetime has been rejected because it cannot be used for functions using an unspecified starting point like os.times() or time.clock().
For time.time() and time.clock_gettime(time.CLOCK_REALTIME): it is already possible to get the current time as a datetime.datetime object using:
datetime.datetime.now(datetime.timezone.utc)
For os.stat(), it is simple to create a datetime.datetime object from a decimal.Decimal timestamp in the UTC timezone:
datetime.datetime.fromtimestamp(value, datetime.timezone.utc)
Note
datetime.datetime only supports microsecond resolution, but can be enhanced to support nanosecond.
datetime.timedelta
datetime.timedelta is the natural choice for a relative timestamp because it is clear that this type contains a timestamp, whereas int, float and Decimal are raw numbers. It can be used with datetime.datetime to get an absolute timestamp when the starting point is known.
datetime.timedelta has been rejected because it cannot be coerced to float and has a fixed resolution. One new standard timestamp type is enough, Decimal is preferred over datetime.timedelta. Converting a datetime.timedelta to float requires an explicit call to the datetime.timedelta.total_seconds() method.
Note
datetime.timedelta only supports microsecond resolution, but can be enhanced to support nanosecond.
Tuple of integers
To expose C functions in Python, a tuple of integers is the natural choice to store a timestamp because the C language uses structures with integer fields (e.g. the timeval and timespec structures). Using only integers avoids the loss of precision (Python supports integers of arbitrary length). Creating and parsing a tuple of integers is simple and fast.
Depending on the exact format of the tuple, the precision can be arbitrary or fixed. The precision can be chosen so that the loss of precision stays smaller than an arbitrary limit like one nanosecond.
Different formats have been proposed:
- A: (numerator, denominator)
- value = numerator / denominator
- resolution = 1 / denominator
- denominator > 0
- B: (seconds, numerator, denominator)
- value = seconds + numerator / denominator
- resolution = 1 / denominator
- 0 <= numerator < denominator
- denominator > 0
- C: (intpart, floatpart, base, exponent)
- value = intpart + floatpart / base^exponent
- resolution = 1 / base^exponent
- 0 <= floatpart < base^exponent
- base > 0
- exponent >= 0
- D: (intpart, floatpart, exponent)
- value = intpart + floatpart / 10^exponent
- resolution = 1 / 10^exponent
- 0 <= floatpart < 10^exponent
- exponent >= 0
- E: (sec, nsec)
- value = sec + nsec × 10^-9
- resolution = 10^-9 (one nanosecond)
- 0 <= nsec < 10^9
All formats support an arbitrary resolution, except format (E).
The format (D) may not be able to store the exact value (a possible loss of precision) if the clock frequency is arbitrary and cannot be expressed as a power of 10. The format (C) has a similar issue, but in such a case it is possible to use base=frequency and exponent=1.
The formats (C), (D) and (E) allow optimization for conversion to float if the base is 2 and to decimal.Decimal if the base is 10.
The format (A) is a simple fraction. It supports arbitrary precision, is simple (only two fields), only requires a simple division to get the floating point value, and is already used by float.as_integer_ratio().
To simplify the implementation (especially the C implementation to avoid integer overflow), a numerator bigger than the denominator can be accepted. The tuple may be normalized later.
Tuples of integers have been rejected because they don't support arithmetic operations.
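Format (A) is the representation float already exposes via as_integer_ratio(), which makes lossless conversion straightforward. A quick illustration of both the strength and the weakness noted above:

```python
from decimal import Decimal
from fractions import Fraction

num, den = (1.5).as_integer_ratio()           # format (A): (numerator, denominator)
assert (num, den) == (3, 2)
assert Fraction(num, den) == Fraction(3, 2)   # exact arithmetic is possible...
assert Decimal(num) / Decimal(den) == Decimal("1.5")
# ...but a bare tuple supports neither "+" nor ordering in a useful way,
# which is the reason this PEP rejects tuples as the timestamp type.
```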
Note
On Windows, the QueryPerformanceCounter() clock uses the frequency of the processor, which is an arbitrary number and so may not be a power of 2 or 10. The frequency can be read using QueryPerformanceFrequency().
timespec structure
timespec is the C structure used to store timestamps with a nanosecond resolution. Python can use a type with the same structure: (seconds, nanoseconds). For convenience, arithmetic operations on timespec are supported.
Example of an incomplete timespec type supporting addition, subtraction and coercion to float:
class timespec(tuple):
    def __new__(cls, sec, nsec):
        if not isinstance(sec, int):
            raise TypeError
        if not isinstance(nsec, int):
            raise TypeError
        asec, nsec = divmod(nsec, 10 ** 9)
        sec += asec
        obj = tuple.__new__(cls, (sec, nsec))
        obj.sec = sec
        obj.nsec = nsec
        return obj

    def __float__(self):
        return self.sec + self.nsec * 1e-9

    def total_nanoseconds(self):
        return self.sec * 10 ** 9 + self.nsec

    def __add__(self, other):
        if not isinstance(other, timespec):
            raise TypeError
        ns_sum = self.total_nanoseconds() + other.total_nanoseconds()
        return timespec(*divmod(ns_sum, 10 ** 9))

    def __sub__(self, other):
        if not isinstance(other, timespec):
            raise TypeError
        ns_diff = self.total_nanoseconds() - other.total_nanoseconds()
        return timespec(*divmod(ns_diff, 10 ** 9))

    def __str__(self):
        if self.sec < 0 and self.nsec:
            sec = abs(1 + self.sec)
            nsec = 10 ** 9 - self.nsec
            return '-%i.%09u' % (sec, nsec)
        else:
            return '%i.%09u' % (self.sec, self.nsec)

    def __repr__(self):
        return '<timespec(%s, %s)>' % (self.sec, self.nsec)
The timespec type is similar to the format (E) of tuples of integer, except that it supports arithmetic and coercion to float.
The timespec type was rejected because it only supports nanosecond resolution and requires implementing each arithmetic operation, whereas the Decimal type is already implemented and well tested.
Alternatives: API design
Add a string argument to specify the return type
Add a string argument to functions returning timestamps, for example time.time(format="datetime"). A string is more extensible than a type: it is possible to request a format that has no type, like a tuple of integers.
This API was rejected because it would require implicitly importing modules to instantiate objects (e.g. importing datetime to create a datetime.datetime). Importing a module may raise an exception and may be slow; such behaviour is unexpected and surprising.
Add a global flag to change the timestamp type
A global flag like os.stat_decimal_times(), similar to os.stat_float_times(), can be added to set globally the timestamp type.
A global flag may cause issues with libraries and applications expecting float instead of Decimal. Decimal is not fully compatible with float. float+Decimal raises a TypeError for example. The os.stat_float_times() case is different because an int can be coerced to float and int+float gives float.
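The incompatibility mentioned above is directly observable, as a quick check (not part of the PEP) shows:

```python
from decimal import Decimal

# Mixing float and Decimal in arithmetic raises TypeError...
try:
    1.0 + Decimal("1.0")
except TypeError:
    mixed = None
else:
    mixed = "unexpected"
assert mixed is None

# ...whereas the os.stat_float_times() precedent relied on int -> float
# coercion, which is always safe:
assert 1 + 1.0 == 2.0
```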
Add a protocol to create a timestamp
Instead of hard coding how timestamps are created, a new protocol can be added to create a timestamp from a fraction.
For example, time.time(timestamp=type) would call the class method type.__fromfraction__(numerator, denominator) to create a timestamp object of the specified type. If the type doesn't support the protocol, a fallback is used: type(numerator) / type(denominator).
A variant is to use a "converter" callback to create a timestamp. Example creating a float timestamp:
def timestamp_to_float(numerator, denominator):
    return float(numerator) / float(denominator)
Common converters can be provided by time, datetime and other modules, or maybe a specific "hires" module. Users can define their own converters.
Such a protocol has a limitation: the timestamp structure has to be decided once and cannot be changed later. For example, adding a timezone or the absolute start of the timestamp would break the API.
The protocol proposal was rejected as excessive given the requirements, but the specific syntax proposed (time.time(timestamp=type)) allows this to be introduced later if compelling use cases are discovered.
Note
Other formats may be used instead of a fraction: see the tuple of integers section for example.
Add new fields to os.stat
To get the creation, modification and access time of a file with a nanosecond resolution, three fields can be added to os.stat() structure.
The new fields can be timestamps with nanosecond resolution (e.g. Decimal) or the nanosecond part of each timestamp (int).
If the new fields are timestamps with nanosecond resolution, populating the extra fields would be time consuming. Any call to os.stat() would be slower, even if os.stat() is only called to check if a file exists. A parameter can be added to os.stat() to make these fields optional, the structure would have a variable number of fields.
If the new fields only contain the fractional part (nanoseconds), os.stat() would be efficient. These fields would always be present, and so set to zero if the operating system does not support sub-second resolution. Splitting a timestamp in two parts, seconds and nanoseconds, is similar to the timespec type and the tuple of integers, and so has the same drawbacks.
Adding new fields to the os.stat() structure does not solve the nanosecond issue in other modules (e.g. the time module).
Add a boolean argument
Because we only need one new type (Decimal), a simple boolean flag can be added. Example: time.time(decimal=True) or time.time(hires=True).
Such a flag would require a hidden import, which is considered bad practice.
The boolean argument API was rejected because it is not "pythonic": selecting the return type through a parameter value, as in time.time(timestamp=type), is preferred over a boolean parameter (a flag).
Add new functions
Add new functions for each type, examples:
- time.clock_decimal()
- time.time_decimal()
- os.stat_decimal()
- os.stat_timespec()
- etc.
Adding a new function for each function creating timestamps duplicates a lot of code and would be a pain to maintain.
Add a new hires module
Add a new module called "hires" with the same API as the time module, except that it would return high-resolution timestamps, e.g. decimal.Decimal. Adding a new module avoids linking low-level modules like time or os to the decimal module.
This idea was rejected because it would require duplicating most of the code of the time module, would be a pain to maintain, and timestamps are used in modules other than the time module, e.g. signal.sigtimedwait(), select.select(), resource.getrusage(), os.stat(), etc. Duplicating the code of each such module is not acceptable.
Links
Python:
Other languages:
- Ruby (1.9.3): the Time class supports picosecond resolution (10^-12)
- .NET framework, DateTime type: number of 100-nanosecond intervals that have elapsed since 12:00:00 midnight, January 1, 0001. DateTime.Ticks uses a signed 64-bit integer.
- Java (1.5), System.nanoTime(): wallclock with an unspecified starting point as a number of nanoseconds; uses a signed 64-bit integer (long).
- Perl, Time::HiRes module: uses float, and so has the same nanosecond-resolution precision loss issue as Python float timestamps.
Copyright
This document has been placed in the public domain.
pep-0411 Provisional packages in the Python standard library
| PEP: | 411 |
|---|---|
| Title: | Provisional packages in the Python standard library |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Nick Coghlan <ncoghlan at gmail.com>, Eli Bendersky <eliben at gmail.com> |
| Status: | Accepted |
| Type: | Informational |
| Content-Type: | text/x-rst |
| Created: | 2012-02-10 |
| Python-Version: | 3.3 |
| Post-History: | 2012-02-10, 2012-03-24 |
Contents
Abstract
The process of including a new package into the Python standard library is hindered by the API lock-in and promise of backward compatibility implied by a package being formally part of Python. This PEP describes a methodology for marking a standard library package "provisional" for the period of a single feature release. A provisional package may have its API modified prior to "graduating" into a "stable" state. On one hand, this state provides the package with the benefits of being formally part of the Python distribution. On the other hand, the core development team explicitly states that no promises are made with regards to the stability of the package's API, which may change for the next release. While it is considered an unlikely outcome, such packages may even be removed from the standard library without a deprecation period if the concerns regarding their API or maintenance prove well-founded.
Proposal - a documented provisional state
Whenever the Python core development team decides that a new package should be included into the standard library, but isn't entirely sure about whether the package's API is optimal, the package can be included and marked as "provisional".
In the next feature release, the package may either be "graduated" into a normal "stable" state in the standard library, remain in provisional state, or be rejected and removed entirely from the Python source tree. If the package ends up graduating into the stable state after being provisional, its API may be changed according to accumulated feedback. The core development team explicitly makes no guarantees about API stability and backward compatibility of provisional packages.
Marking a package provisional
A package will be marked provisional by a notice in its documentation page and its docstring. The following paragraph will be added as a note at the top of the documentation page:
The <X> package has been included in the standard library on a provisional basis. Backwards incompatible changes (up to and including removal of the package) may occur if deemed necessary by the core developers.
The phrase "provisional basis" will then be a link to the glossary term "provisional package", defined as:
A provisional package is one which has been deliberately excluded from the standard library's backwards compatibility guarantees. While major changes to such packages are not expected, as long as they are marked provisional, backwards incompatible changes (up to and including removal of the package) may occur if deemed necessary by core developers. Such changes will not be made gratuitously -- they will occur only if serious flaws are uncovered that were missed prior to the inclusion of the package.
This process allows the standard library to continue to evolve over time, without locking in problematic design errors for extended periods of time. See PEP 411 for more details.
The following will be added to the start of the package's docstring:
The API of this package is currently provisional. Refer to the documentation for details.
Moving a package from the provisional to the stable state simply implies removing these notes from its documentation page and docstring.
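The docstring notice described above is the only machine-visible marker a provisional package carries. As a purely illustrative sketch (this helper is hypothetical and not part of the PEP), one could detect the marker by inspecting a module's `__doc__`:

```python
import types

# The notice the PEP says will be prepended to a provisional package's docstring.
PROVISIONAL_NOTE = (
    "The API of this package is currently provisional. "
    "Refer to the documentation for details."
)

def is_marked_provisional(module) -> bool:
    """Heuristic (hypothetical) check: does the docstring start with the notice?"""
    return bool(module.__doc__) and module.__doc__.lstrip().startswith(PROVISIONAL_NOTE)

# Simulate a provisional module and a stable one.
demo = types.ModuleType("demo", PROVISIONAL_NOTE + "\n\nA demo package.")
assert is_marked_provisional(demo)
assert not is_marked_provisional(types)
```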
Which packages should go through the provisional state
We expect most packages proposed for addition into the Python standard library to go through a feature release in the provisional state. There may, however, be some exceptions, such as packages that use a pre-defined API (for example lzma, which generally follows the API of the existing bz2 package), or packages with an API that has wide acceptance in the Python development community.
In any case, packages that are proposed to be added to the standard library, whether via the provisional state or directly, must fulfill the acceptance conditions set by PEP 2.
Criteria for "graduation"
In principle, most provisional packages should eventually graduate to the stable standard library. Some reasons for not graduating are:
- The package may prove to be unstable or fragile, without sufficient developer support to maintain it.
- A much better alternative package may be found during the preview release.
Essentially, the decision will be made by the core developers on a case-by-case basis. The point to emphasize here is that a package's inclusion in the standard library as "provisional" in some release does not guarantee it will continue being part of Python in the next release. At the same time, the bar for making changes in a provisional package is quite high. We expect that most of the API of most provisional packages will be unchanged at graduation. Withdrawals are expected to be rare.
Rationale
Benefits for the core development team
Currently, the core developers are reluctant to add new interfaces to the standard library, because as soon as an interface is published in a release, any API design mistakes are locked in by backward compatibility concerns.
By gating all major API additions through some kind of a provisional mechanism for a full release, we get one full release cycle of community feedback before we lock in the APIs with our standard backward compatibility guarantee.
We can also start integrating provisional packages with the rest of the standard library early, so long as we make it clear to packagers that the provisional packages should not be considered optional. The only difference between provisional APIs and the rest of the standard library is that provisional APIs are explicitly exempted from the usual backward compatibility guarantees.
Benefits for end users
For future end users, the broadest benefit lies in a better "out-of-the-box" experience - rather than being told "oh, the standard library tools for task X are horrible, download this 3rd party library instead", those superior tools are more likely to be just an import away.
For environments where developers are required to conduct due diligence on their upstream dependencies (severely harming the cost-effectiveness of, or even ruling out entirely, much of the material on PyPI), the key benefit lies in ensuring that all packages in the provisional state are clearly under python-dev's aegis from at least the following perspectives:
- Licensing: Redistributed by the PSF under a Contributor Licensing Agreement.
- Documentation: The documentation of the package is published and organized via the standard Python documentation tools (i.e. ReST source, output generated with Sphinx and published on http://docs.python.org).
- Testing: The package test suites are run on the python.org buildbot fleet and results published via http://www.python.org/dev/buildbot.
- Issue management: Bugs and feature requests are handled on http://bugs.python.org
- Source control: The master repository for the software is published on http://hg.python.org.
Candidates for provisional inclusion into the standard library
For Python 3.3, there are a number of clear current candidates:
- regex (http://pypi.python.org/pypi/regex) - approved by Guido [1].
- daemon (PEP 3143)
- ipaddr (PEP 3144)
Other possible future use cases include:
- Improved HTTP modules (e.g. requests)
- HTML 5 parsing support (e.g. html5lib)
- Improved URL/URI/IRI parsing
- A standard image API (PEP 368)
- Improved encapsulation of import state (PEP 406)
- Standard event loop API (PEP 3153)
- A binary version of WSGI for Python 3 (e.g. PEP 444)
- Generic function support (e.g. simplegeneric)
Copyright
This document has been placed in the public domain.
pep-0412 Key-Sharing Dictionary
| PEP: | 412 |
|---|---|
| Title: | Key-Sharing Dictionary |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Mark Shannon <mark at hotpy.org> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 08-Feb-2012 |
| Python-Version: | 3.3 or 3.4 |
| Post-History: | 08-Feb-2012 |
Contents
Abstract
This PEP proposes a change in the implementation of the builtin dictionary type dict. The new implementation allows dictionaries which are used as attribute dictionaries (the __dict__ attribute of an object) to share keys with other attribute dictionaries of instances of the same class.
Motivation
The current dictionary implementation uses more memory than is necessary when used as a container for object attributes as the keys are replicated for each instance rather than being shared across many instances of the same class. Despite this, the current dictionary implementation is finely tuned and performs very well as a general-purpose mapping object.
By separating the keys (and hashes) from the values it is possible to share the keys between multiple dictionaries and improve memory use. By ensuring that keys are separated from the values only when beneficial, it is possible to retain the high-performance of the current dictionary implementation when used as a general-purpose mapping object.
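The precondition for this sharing is observable from Python itself: attribute dictionaries of instances of one class normally carry exactly the same keys, which is what makes a single shared keys table feasible. A minimal illustration:

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

a, b = Point(1, 2), Point(3, 4)

# Both attribute dictionaries hold the same keys; under the key-sharing
# scheme the keys (and their hashes) live in one table cached on the class,
# while each instance keeps only its own values array.
assert list(a.__dict__) == list(b.__dict__) == ["x", "y"]
assert a.__dict__ == {"x": 1, "y": 2}
```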
Behaviour
The new dictionary behaves in the same way as the old implementation. It fully conforms to the Python API, the C API and the ABI.
Performance
Memory Usage
Reduction in memory use is directly related to the number of dictionaries with shared keys in existence at any time. These dictionaries are typically half the size of the current dictionary implementation.
Benchmarking shows that memory use is reduced by 10% to 20% for object-oriented programs with no significant change in memory use for other programs.
Speed
The performance of the new implementation is dominated by memory locality effects. When keys are not shared (for example in module dictionaries and dictionaries explicitly created with dict() or {}) then performance is unchanged (within a percent or two) from the current implementation.
For the shared keys case, the new implementation tends to separate keys from values, but reduces total memory usage. This will improve performance in many cases as the effects of reduced memory usage outweigh the loss of locality, but some programs may show a small slow down.
Benchmarking shows no significant change of speed for most benchmarks. Object-oriented benchmarks show small speed ups when they create large numbers of objects of the same class (the gcbench benchmark shows a 10% speed up; this is likely to be an upper limit).
Implementation
Both the old and new dictionaries consist of a fixed-sized dict struct and a re-sizeable table. In the new dictionary the table can be further split into a keys table and values array. The keys table holds the keys and hashes and (for non-split tables) the values as well. It differs only from the original implementation in that it contains a number of fields that were previously in the dict struct. If a table is split the values in the keys table are ignored, instead the values are held in a separate array.
Split-Table dictionaries
When dictionaries are created to fill the __dict__ slot of an object, they are created in split form. The keys table is cached in the type, potentially allowing all attribute dictionaries of instances of one class to share keys. In the event of the keys of these dictionaries starting to diverge, individual dictionaries will lazily convert to the combined-table form. This ensures good memory use in the common case, and correctness in all cases.
When resizing a split dictionary it is converted to a combined table. If resizing is a result of storing an instance attribute, and there is only one instance of the class, then the dictionary will be re-split immediately. Since most object-oriented code sets attributes in the __init__ method, all attributes will be set before a second instance is created, and no further resizing will be necessary, as all subsequent instance dictionaries will have the correct size. For more complex usage patterns it is impossible to know the best approach in advance, so the implementation allows extra insertions up to the point of a resize, at which point it reverts to the combined table (non-shared keys).
A deletion from a split dictionary does not change the keys table, it simply removes the value from the values array.
Combined-Table dictionaries
Explicit dictionaries (dict() or {}), module dictionaries and most other dictionaries are created as combined-table dictionaries. A combined-table dictionary never becomes a split-table dictionary. Combined tables are laid out in much the same way as the tables in the old dictionary, resulting in very similar performance.
Reference Implementation
The new dictionary implementation is available at [1].
Pros and Cons
Pros
Significant memory savings for object-oriented applications. Small improvement to speed for programs which create lots of similar objects.
Cons
Change to data structures: Third party modules which meddle with the internals of the dictionary implementation will break.
Changes to repr() output and iteration order: For most cases, this will be unchanged. However for some split-table dictionaries the iteration order will change.
Neither of these cons should be a problem. Modules which meddle with the internals of the dictionary implementation are already broken and should be fixed to use the API. The iteration order of dictionaries was never defined and has always been arbitrary; it is different for Jython and PyPy.
Alternative Implementation
An alternative implementation for split tables, which could save even more memory, is to store an index in the value field of the keys table (instead of ignoring the value field). This index would state explicitly where in the values array to look. The values array would then require only one field for each usable slot in the keys table, rather than one for each slot.
This "indexed" version would reduce the size of value array by about one third. The keys table would need an extra "values_size" field, increasing the size of combined dicts by one word. The extra indirection adds more complexity to the code, potentially reducing performance a little.
The "indexed" version will not be included in this implementation, but should be considered deferred rather than rejected, pending further experimentation.
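A toy Python model of this indexed layout may make the extra indirection concrete (purely illustrative; the real implementation is C, and the class and field names here are invented):

```python
# Toy model of the "indexed" split-table layout: a shared keys table maps
# each key to an index into a compact per-instance values array.
class IndexedSplitDict:
    def __init__(self, shared_keys):
        # shared_keys maps key -> slot index; it is shared between instances.
        self._keys = shared_keys
        self._values = [None] * len(shared_keys)  # one field per usable slot

    def __getitem__(self, key):
        return self._values[self._keys[key]]

    def __setitem__(self, key, value):
        self._values[self._keys[key]] = value

shared = {"x": 0, "y": 1}
d1, d2 = IndexedSplitDict(shared), IndexedSplitDict(shared)
d1["x"] = 10
d2["x"] = 99
# The keys table is shared, but each dictionary has independent values.
assert d1["x"] == 10 and d2["x"] == 99
assert d1._keys is d2._keys
```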
References
| [1] | Reference Implementation: https://bitbucket.org/markshannon/cpython_new_dict |
Copyright
This document has been placed in the public domain.
pep-0413 Faster evolution of the Python Standard Library
| PEP: | 413 |
|---|---|
| Title: | Faster evolution of the Python Standard Library |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Nick Coghlan <ncoghlan at gmail.com> |
| Status: | Withdrawn |
| Type: | Process |
| Content-Type: | text/x-rst |
| Created: | 2012-02-24 |
| Post-History: | 2012-02-24, 2012-02-25 |
| Resolution: | TBD |
Contents
- PEP Withdrawal
- Abstract
- Rationale
- Proposal
- User Scenarios
- Novice user, downloading Python from python.org in March 2013
- Novice user, attempting to judge currency of third party documentation
- Novice user, looking for an extension module binary release
- Extension module author, deciding whether or not to make a binary release
- Python developer, deciding priority of eliminating a Deprecation Warning
- Alternative interpreter implementor, updating with new features
- Python developer, deciding their minimum version dependency
- Python developers, attempting to reproduce a tracker issue
- CPython release managers, handling a security fix
- Effects
- Handling News Updates
- Other benefits of reduced version coupling
- Other Questions
- Acknowledgements
- References
- Copyright
PEP Withdrawal
With the acceptance of PEP 453 meaning that pip will be available to most new Python users by default, this will hopefully reduce the pressure to add new modules to the standard library before they are sufficiently mature.
The last couple of years have also seen increased usage of the model where a standard library package also has an equivalent available from the Python Package Index that also supports older versions of Python.
Given these two developments and the level of engagement throughout the Python 3.4 release cycle, the PEP author no longer feels it would be appropriate to make such a fundamental change to the standard library development process.
Abstract
This PEP proposes the adoption of a separate versioning scheme for the standard library (distinct from, but coupled to, the existing language versioning scheme) that allows accelerated releases of the Python standard library, while maintaining (or even slowing down) the current rate of change in the core language definition.
Like PEP 407, it aims to adjust the current balance between measured change that allows the broader community time to adapt and being able to keep pace with external influences that evolve more rapidly than the current release cycle can handle (this problem is particularly notable for standard library elements that relate to web technologies).
However, it's more conservative in its aims than PEP 407, seeking to restrict the increased pace of development to builtin and standard library interfaces, without affecting the rate of change for other elements such as the language syntax and version numbering as well as the CPython binary API and bytecode format.
Rationale
To quote the PEP 407 abstract:
Finding a release cycle for an open-source project is a delicate exercise in managing mutually contradicting constraints: developer manpower, availability of release management volunteers, ease of maintenance for users and third-party packagers, quick availability of new features (and behavioural changes), availability of bug fixes without pulling in new features or behavioural changes.
The current release cycle errs on the conservative side. It is adequate for people who value stability over reactivity. This PEP is an attempt to keep the stability that has become a Python trademark, while offering a more fluid release of features, by introducing the notion of long-term support versions.
I agree with the PEP 407 authors that the current release cycle of the standard library is too slow to effectively cope with the pace of change in some key programming areas (specifically, web protocols and related technologies, including databases, templating and serialisation formats).
However, I have written this competing PEP because I believe that the approach proposed in PEP 407 of offering full, potentially binary incompatible releases of CPython every 6 months places too great a burden on the wider Python ecosystem.
Under the current CPython release cycle, distributors of key binary extensions will often support Python releases even after the CPython branches enter "security fix only" mode (for example, Twisted currently ships binaries for 2.5, 2.6 and 2.7, NumPy and SciPy support those three along with 3.1 and 3.2, PyGame adds a 2.4 binary release, wxPython provides both 32-bit and 64-bit binaries for 2.6 and 2.7, etc).
If CPython were to triple (or more) its rate of releases, the developers of those libraries (many of which are even more resource starved than CPython) would face an unpalatable choice: either adopt the faster release cycle themselves (up to 18 simultaneous binary releases for PyGame!), drop older Python versions more quickly, or else tell their users to stick to the CPython LTS releases (thus defeating the entire point of speeding up the CPython release cycle in the first place).
Similarly, many support tools for Python (e.g. syntax highlighters) can take quite some time to catch up with language level changes.
At a cultural level, the Python community is also accustomed to a certain meaning for Python version numbers - they're linked to deprecation periods, support periods, all sorts of things. PEP 407 proposes that collective knowledge all be swept aside, without offering a compelling rationale for why such a course of action is actually necessary (aside from, perhaps, making the lives of the CPython core developers a little easier at the expense of everyone else).
However, if we go back to the primary rationale for increasing the pace of change (i.e. more timely support for web protocols and related technologies), we can note that those only require standard library changes. That means many (perhaps even most) of the negative effects on the wider community can be avoided by explicitly limiting which parts of CPython are affected by the new release cycle, and allowing other parts to evolve at their current, more sedate, pace.
Proposal
This PEP proposes the introduction of a new kind of CPython release: "standard library releases". As with PEP 407, this will give CPython 3 kinds of release:
- Language release: "x.y.0"
- Maintenance release: "x.y.z" (where z > 0)
- Standard library release: "x.y (xy.z)" (where z > 0)
Under this scheme, an unqualified version reference (such as "3.3") would always refer to the most recent corresponding language or maintenance release. It will never be used without qualification to refer to a standard library release (at least, not by python-dev - obviously, we can only set an example, not force the rest of the Python ecosystem to go along with it).
Language releases will continue as they are now, as new versions of the Python language definition, along with a new version of the CPython interpreter and the Python standard library. Accordingly, a language release may contain any and all of the following changes:
- new language syntax
- new standard library changes (see below)
- new deprecation warnings
- removal of previously deprecated features
- changes to the emitted bytecode
- changes to the AST
- any other significant changes to the compilation toolchain
- changes to the core interpreter eval loop
- binary incompatible changes to the C ABI (although the PEP 384 stable ABI must still be preserved)
- bug fixes
Maintenance releases will also continue as they do today, being strictly limited to bug fixes for the corresponding language release. No new features or radical internal changes are permitted.
The new standard library releases will occur in parallel with each maintenance release and will be qualified with a new version identifier documenting the standard library version. Standard library releases may include the following changes:
- new features in pure Python modules
- new features in C extension modules (subject to PEP 399 compatibility requirements)
- new features in language builtins (provided the C ABI remains unaffected)
- bug fixes from the corresponding maintenance release
Standard library version identifiers are constructed by combining the major and minor version numbers for the Python language release into a single two digit number and then appending a sequential standard library version identifier.
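The construction above can be sketched as a tiny helper (hypothetical; nothing like this was ever added to CPython, since the PEP was withdrawn):

```python
def stdlib_version(major: int, minor: int, serial: int) -> str:
    """Build the proposed standard library version string.

    The language major and minor numbers are fused into a two-digit
    prefix, then the sequential stdlib serial is appended.
    """
    return f"{major}{minor}.{serial}"

assert stdlib_version(3, 3, 1) == "33.1"   # first 3.3 stdlib release
assert stdlib_version(3, 4, 2) == "34.2"
```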
Release Cycle
When maintenance releases are created, two new versions of Python would actually be published on python.org (using the first 3.3 maintenance release, planned for February 2013 as an example):
3.3.1       # Maintenance release
3.3 (33.1)  # Standard library release
A further 6 months later, the next 3.3 maintenance release would again be accompanied by a new standard library release:
3.3.2       # Maintenance release
3.3 (33.2)  # Standard library release
Again, the standard library release would be binary compatible with the previous language release, merely offering additional features at the Python level.
Finally, 18 months after the release of 3.3, a new language release would be made around the same time as the final 3.3 maintenance and standard library releases:
3.3.3       # Maintenance release
3.3 (33.3)  # Standard library release
3.4.0       # Language release
The 3.4 release cycle would then follow a similar pattern to that for 3.3:
3.4.1       # Maintenance release
3.4 (34.1)  # Standard library release
3.4.2       # Maintenance release
3.4 (34.2)  # Standard library release
3.4.3       # Maintenance release
3.4 (34.3)  # Standard library release
3.5.0       # Language release
Programmatic Version Identification
To expose the new version details programmatically, this PEP proposes the addition of a new sys.stdlib_info attribute that records the new standard library version above and beyond the underlying interpreter version. Using the initial Python 3.3 release as an example:
sys.stdlib_info(python=33, version=0, releaselevel='final', serial=0)
This information would also be included in the sys.version string:
Python 3.3.0 (33.0, default, Feb 17 2012, 23:03:41) [GCC 4.6.1]
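Since sys.stdlib_info was never implemented (the PEP was withdrawn), the proposed attribute can be modelled with a namedtuple to show how a dependency check against it might have looked; the helper name is invented for illustration:

```python
from collections import namedtuple

# Model of the proposed (never implemented) sys.stdlib_info attribute.
StdlibInfo = namedtuple("StdlibInfo", "python version releaselevel serial")
stdlib_info = StdlibInfo(python=33, version=0, releaselevel="final", serial=0)

def requires_stdlib(info, python: int, version: int) -> bool:
    """Check a '3.3 (33.1)'-style dependency against the proposed attribute."""
    return (info.python, info.version) >= (python, version)

assert requires_stdlib(stdlib_info, 33, 0)       # 3.3 (33.0) is available
assert not requires_stdlib(stdlib_info, 33, 1)   # 3.3 (33.1) is not yet
```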
Security Fixes and Other "Out of Cycle" Releases
For maintenance releases the process of handling out-of-cycle releases (for example, to fix a security issue or resolve a critical bug in a new release), remains the same as it is now: the minor version number is incremented and a new release is made incorporating the required bug fixes, as well as any other bug fixes that have been committed since the previous release.
For standard library releases, the process is essentially the same, but the corresponding "What's New?" document may require some tidying up for the release (as the standard library release may incorporate new features, not just bug fixes).
User Scenarios
The versioning scheme proposed above is based on a number of user scenarios that are likely to be encountered if this scheme is adopted. In each case, the scenario is described for the status quo (i.e. the current slow release cycle), the versioning scheme in this PEP, and the free-wheeling minor version number scheme proposed in PEP 407.
To give away the ending, the point of using a separate version number is that for almost all scenarios, the important number is the language version, not the standard library version. Most users won't even need to care that the standard library version number exists. In the two identified cases where it matters, providing it as a separate number is actually clearer and more explicit than embedding the two different kinds of number into a single sequence and then tagging some of the numbers in the unified sequence as special.
Novice user, downloading Python from python.org in March 2013
Status quo: must choose between 3.3 and 2.7
This PEP: must choose between 3.3 (33.1), 3.3 and 2.7.
PEP 407: must choose between 3.4, 3.3 (LTS) and 2.7.
Verdict: explaining the meaning of a Long Term Support release is about as complicated as explaining the meaning of the proposed standard library release version numbers. I call this a tie.
Novice user, attempting to judge currency of third party documentation
Status quo: minor version differences indicate 18-24 months of language evolution
This PEP: same as status quo for language core, standard library version numbers indicate 6 months of standard library evolution.
PEP 407: minor version differences indicate 18-24 months of language evolution up to 3.3, then 6 months of language evolution thereafter.
Verdict: Since language changes and deprecations can have a much bigger effect on the accuracy of third party documentation than the addition of new features to the standard library, I'm calling this a win for the scheme in this PEP.
Novice user, looking for an extension module binary release
Status quo: look for the binary corresponding to the Python version you are running.
This PEP: same as status quo.
PEP 407 (full releases): same as status quo, but corresponding binary version is more likely to be missing (or, if it does exist, has to be found amongst a much larger list of alternatives).
PEP 407 (ABI updates limited to LTS releases): all binary release pages will need to tell users that Python 3.3, 3.4 and 3.5 all need the 3.3 binary.
Verdict: I call this a clear win for the scheme in this PEP. Absolutely nothing changes from the current situation, since the standard library version is actually irrelevant in this case (only binary extension compatibility is important).
Extension module author, deciding whether or not to make a binary release
Status quo: unless using the PEP 384 stable ABI, a new binary release is needed every time the minor version number changes.
This PEP: same as status quo.
PEP 407 (full releases): same as status quo, but becomes a far more frequent occurrence.
PEP 407 (ABI updates limited to LTS releases): before deciding, must first look up whether the new release is an LTS release or an interim release. If it is an LTS release, then a new build is necessary.
Verdict: I call this another clear win for the scheme in this PEP. As with the end user facing side of this problem, the standard library version is actually irrelevant in this case. Moving that information out to a separate number avoids creating unnecessary confusion.
Python developer, deciding priority of eliminating a Deprecation Warning
Status quo: code that triggers deprecation warnings is not guaranteed to run on a version of Python with a higher minor version number.
This PEP: same as status quo
PEP 407: unclear, as the PEP doesn't currently spell this out. Assuming the deprecation cycle is linked to LTS releases, then upgrading to a non-LTS release is safe but upgrading to the next LTS release may require avoiding the deprecated construct.
Verdict: another clear win for the scheme in this PEP since, once again, the standard library version is irrelevant in this scenario.
Alternative interpreter implementor, updating with new features
Status quo: new Python versions arrive infrequently, but are a mish-mash of standard library updates and core language definition and interpreter changes.
This PEP: standard library updates, which are easier to integrate, are made available more frequently in a form that is clearly and explicitly compatible with the previous version of the language definition. This means that, once an alternative implementation catches up to Python 3.3, they should have a much easier time incorporating standard library features as they happen (especially pure Python changes), leaving minor version number updates as the only task that requires updates to their core compilation and execution components.
PEP 407 (full releases): same as status quo, but becomes a far more frequent occurrence.
PEP 407 (language updates limited to LTS releases): unclear, as the PEP doesn't currently spell out a specific development strategy. Assuming a 3.3 compatibility branch is adopted (as proposed in this PEP), then the outcome would be much the same, but the version number signalling would be slightly less clear (since you would have to check to see if a particular release was an LTS release or not).
Verdict: while not as clear cut as some previous scenarios, I'm still calling this one in favour of the scheme in this PEP. Explicit is better than implicit, and the scheme in this PEP makes a clear split between the two different kinds of update rather than adding a separate "LTS" tag to an otherwise ordinary release number. Tagging a particular version as being special is great for communicating with version control systems and associated automated tools, but it's a lousy way to communicate information to other humans.
Python developer, deciding their minimum version dependency
Status quo: look for "version added" or "version changed" markers in the documentation, check against sys.version_info
This PEP: look for "version added" or "version changed" markers in the documentation. If written as a bare Python version, such as "3.3", check against sys.version_info. If qualified with a standard library version, such as "3.3 (33.1)", check against sys.stdlib_info.
PEP 407: same as status quo
Verdict: the scheme in this PEP actually allows third party libraries to be more explicit about their rate of adoption of standard library features. More conservative projects will likely pin their dependency to the language version and avoid features added in the standard library releases. Faster moving projects could instead declare their dependency on a particular standard library version. However, since PEP 407 does have the advantage of preserving the status quo, I'm calling this one for PEP 407 (albeit with a slim margin).
Python developers, attempting to reproduce a tracker issue
Status quo: if not already provided, ask the reporter which version of Python they're using. This is often done by asking for the first two lines displayed by the interactive prompt or the value of sys.version.
This PEP: same as the status quo (as sys.version will be updated to also include the standard library version), but may be needed on additional occasions (where the user knew enough to state their Python version, but that proved to be insufficient to reproduce the fault).
PEP 407: same as the status quo
Verdict: another marginal win for PEP 407. The new standard library version is an extra piece of information that users may need to pass back to developers when reporting issues with Python libraries (or Python itself, on our own tracker). However, by including it in sys.version, many fault reports will already include it, and it is easy to request if needed.
CPython release managers, handling a security fix
Status quo: create a new maintenance release incorporating the security fix and any other bug fixes under source control. Also create source releases for any branches open solely for security fixes.
This PEP: same as the status quo for maintenance branches. Also create a new standard library release (potentially incorporating new features along with the security fix). For security branches, create source releases for both the former maintenance branch and the standard library update branch.
PEP 407: same as the status quo for maintenance and security branches, but handling security fixes for non-LTS releases is currently an open question.
Verdict: until PEP 407 is updated to actually address this scenario, a clear win for this PEP.
Effects
Effect on development cycle
Similar to PEP 407, this PEP will break up the delivery of new features into more discrete chunks. Instead of a whole raft of changes landing all at once in a language release, each language release will be limited to 6 months' worth of standard library changes, as well as any changes associated with new syntax.
Effect on workflow
This PEP proposes the creation of a single additional branch for use in the normal workflow. After the release of 3.3, the following branches would be in use:
2.7         # Maintenance branch, no change
3.3         # Maintenance branch, as for 3.2
3.3-compat  # New branch, backwards compatible changes
default     # Language changes, standard library updates that depend on them
When working on a new feature, developers will need to decide whether or not it is an acceptable change for a standard library release. If so, then it should be checked in on 3.3-compat and then merged to default. Otherwise it should be checked in directly to default.
The "version added" and "version changed" markers for any changes made on the 3.3-compat branch would need to be flagged with both the language version and the standard library version. For example: "3.3 (33.1)".
Any changes made directly on the default branch would just be flagged with "3.4" as usual.
The 3.3-compat branch would be closed to normal development at the same time as the 3.3 maintenance branch. The 3.3-compat branch would remain open for security fixes for the same period of time as the 3.3 maintenance branch.
Effect on bugfix cycle
The effect on the bug fix workflow is essentially the same as that on the workflow for new features - there is one additional branch to pass through before the change reaches the default branch.
If critical bugs are found in a maintenance release, then new maintenance and standard library releases will be created to resolve the problem. The final part of the version number will be incremented for both the language version and the standard library version.
If critical bugs are found in a standard library release that do not affect the associated maintenance release, then only a new standard library release will be created and only the standard library's version number will be incremented.
Note that in these circumstances, the standard library release may include additional features, rather than just containing the bug fix. It is assumed that anyone that cares about receiving only bug fixes without any new features mixed in will already be relying strictly on the maintenance releases rather than using the new standard library releases.
Effect on the community
PEP 407 has this to say about the effects on the community:
People who value stability can just synchronize on the LTS releases which, with the proposed figures, would give a similar support cycle (both in duration and in stability).
I believe this statement is just plain wrong. Life isn't that simple. Instead, developers of third party modules and frameworks will come under pressure to support the full pace of the new release cycle with binary updates, teachers and book authors will receive complaints that they're only covering an "old" version of Python ("You're only using 3.3, the latest is 3.5!"), etc.
As the minor version number starts climbing 3 times faster than it has in the past, I believe perceptions of language stability would also fall (whether such opinions were justified or not).
I believe isolating the increased pace of change to the standard library, and clearly delineating it with a separate version number will greatly reassure the rest of the community that no, we're not suddenly asking them to triple their own rate of development. Instead, we're merely going to ship standard library updates for the next language release in 6-monthly installments rather than delaying them all until the next language definition update, even those changes that are backwards compatible with the previously released version of Python.
The community benefits listed in PEP 407 are equally applicable to this PEP, at least as far as the standard library is concerned:
People who value reactivity and access to new features (without taking the risk to install alpha versions or Mercurial snapshots) would get much more value from the new release cycle than currently.
People who want to contribute new features or improvements would be more motivated to do so, knowing that their contributions will be more quickly available to normal users.
If the faster release cycle encourages more people to focus on contributing to the standard library rather than proposing changes to the language definition, I don't see that as a bad thing.
Handling News Updates
What's New?
The "What's New" documents would be split out into separate documents for standard library releases and language releases. So, during the 3.3 release cycle, we would see:
- What's New in Python 3.3?
- What's New in the Python Standard Library 33.1?
- What's New in the Python Standard Library 33.2?
- What's New in the Python Standard Library 33.3?
And then finally, we would see the next language release:
- What's New in Python 3.4?
For the benefit of users that ignore standard library releases, the 3.4 What's New would link back to the What's New documents for each of the standard library releases in the 3.3 series.
NEWS
Merge conflicts on the NEWS file are already a hassle. Since this PEP proposes introduction of an additional branch into the normal workflow, resolving this becomes even more critical. While Mercurial phases may help to some degree, it would be good to eliminate the problem entirely.
One suggestion from Barry Warsaw is to adopt a non-conflicting separate-files-per-change approach, similar to that used by Twisted [2].
Given that the current manually updated NEWS file will be used for the 3.3.0 release, one possible layout for such an approach might look like:
Misc/
    NEWS  # Now autogenerated from news_entries
    news_entries/
        3.3/
            NEWS  # Original 3.3 NEWS file
            maint.1/  # Maintenance branch changes
                core/
                    <news entries>
                builtins/
                    <news entries>
                extensions/
                    <news entries>
                library/
                    <news entries>
                documentation/
                    <news entries>
                tests/
                    <news entries>
            compat.1/  # Compatibility branch changes
                builtins/
                    <news entries>
                extensions/
                    <news entries>
                library/
                    <news entries>
                documentation/
                    <news entries>
                tests/
                    <news entries>
            # Add maint.2, compat.2 etc as releases are made
        3.4/
            core/
                <news entries>
            builtins/
                <news entries>
            extensions/
                <news entries>
            library/
                <news entries>
            documentation/
                <news entries>
            tests/
                <news entries>
            # Add maint.1, compat.1 etc as releases are made
Putting the version information in the directory hierarchy isn't strictly necessary (since the NEWS file generator could figure it out from the version history), but doing so makes it easier for humans to keep the different versions in order.
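A generator for such a layout could be sketched as below. This is purely illustrative: it is simplified to a single level of release directories (the nested maint.N/compat.N handling is omitted), and all names are ours rather than an agreed design:

```python
from pathlib import Path

# Hedged sketch: aggregate one-file-per-change news entries (in a layout
# like the hypothetical Misc/news_entries above) into a single NEWS text.
def generate_news(entries_root):
    lines = []
    for release in sorted(Path(entries_root).iterdir(), reverse=True):
        if not release.is_dir():
            continue
        lines.append("What's New in %s?" % release.name)
        for category in sorted(p for p in release.iterdir() if p.is_dir()):
            lines.append("")
            lines.append(category.name + ":")
            for entry in sorted(category.glob("*")):
                # One file per change; the file body is the news entry text.
                lines.append("- " + entry.read_text().strip())
        lines.append("")
    return "\n".join(lines)
```

Because each change lives in its own file, two branches touching different changes can no longer conflict on a shared NEWS file when merged.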
Other benefits of reduced version coupling
Slowing down the language release cycle
The current release cycle is a compromise between the desire for stability in the core language definition and C extension ABI, and the desire to get new features (most notably standard library updates) into users' hands more quickly.
Decoupling the standard library release cycle (to some degree) from that of the core language definition provides an opportunity to actually slow down the rate of change in the language definition. The language moratorium for Python 3.2 effectively slowed that cycle down to more than 3 years (3.1: June 2009, 3.3: August 2012) without causing any major problems or complaints.
The NEWS file management scheme described above is actually designed to allow us the flexibility to slow down language releases at the same time as standard library releases become more frequent.
As a simple example, if a full two years was allowed between 3.3 and 3.4, the 3.3 release cycle would end up looking like:
3.2.4       # Maintenance release
3.3.0       # Language release
3.3.1       # Maintenance release
3.3 (33.1)  # Standard library release
3.3.2       # Maintenance release
3.3 (33.2)  # Standard library release
3.3.3       # Maintenance release
3.3 (33.3)  # Standard library release
3.3.4       # Maintenance release
3.3 (33.4)  # Standard library release
3.4.0       # Language release
The elegance of the proposed branch structure and NEWS entry layout is that this decision wouldn't really need to be made until shortly before the planned 3.4 release date. At that point, the decision could be made to postpone the 3.4 release and keep the 3.3 and 3.3-compat branches open after the 3.3.3 maintenance release and the 3.3 (33.3) standard library release, thus adding another standard library release to the cycle. The choice between another standard library release or a full language release would then be available every 6 months after that.
Further increasing the pace of standard library development
As noted in the previous section, one benefit of the scheme proposed in this PEP is that it largely decouples the language release cycle from the standard library release cycle. The standard library could be updated every 3 months, or even once a month, without any flow-on effects on the language version numbering or the perceived stability of the core language.
While that pace of development isn't practical as long as the binary installer creation for Windows and Mac OS X involves several manual steps (including manual testing) and for as long as we don't have separate "<branch>-release" trees that only receive versions that have been marked as good by the stable buildbots, it's still a useful criterion to keep in mind when considering proposed new versioning schemes: what if we eventually want to make standard library releases even faster than every 6 months?
If the practical issues were ever resolved, then the separate standard library versioning scheme in this PEP could handle it. The tagged version number approach proposed in PEP 407 could not (at least, not without a lot of user confusion and uncertainty).
Other Questions
Why not use the major version number?
The simplest and most logical solution would actually be to map the major.minor.micro version numbers to the language version, stdlib version and maintenance release version respectively.
Instead of releasing Python 3.3.0, we would instead release Python 4.0.0 and the release cycle would look like:
4.0.0  # Language release
4.0.1  # Maintenance release
4.1.0  # Standard library release
4.0.2  # Maintenance release
4.2.0  # Standard library release
4.0.3  # Maintenance release
4.3.0  # Standard library release
5.0.0  # Language release
However, the ongoing pain of the Python 2 -> Python 3 transition (and associated workarounds like the python3 and python2 symlinks to refer directly to the desired release series) means that this simple option isn't viable for historical reasons.
One way that this simple approach could be made to work is to merge the current major and minor version numbers directly into a 2-digit major version number:
33.0.0  # Language release
33.0.1  # Maintenance release
33.1.0  # Standard library release
33.0.2  # Maintenance release
33.2.0  # Standard library release
33.0.3  # Maintenance release
33.3.0  # Standard library release
34.0.0  # Language release
Why not use a four part version number?
Another simple versioning scheme would just add a "standard library" version into the existing versioning scheme:
3.3.0.0  # Language release
3.3.0.1  # Maintenance release
3.3.1.0  # Standard library release
3.3.0.2  # Maintenance release
3.3.2.0  # Standard library release
3.3.0.3  # Maintenance release
3.3.3.0  # Standard library release
3.4.0.0  # Language release
However, this scheme isn't viable due to backwards compatibility constraints on the sys.version_info structure.
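The constraint is easy to see in the kind of version check that is ubiquitous in third party code, all of which assumes the current layout of the structure:

```python
import sys

# Existing code throughout the ecosystem slices and compares
# sys.version_info assuming its current (major, minor, micro,
# releaselevel, serial) layout. Inserting a standard library field into
# the tuple would silently change what expressions like these mean:
major, minor, micro = sys.version_info[:3]
if sys.version_info >= (3, 2):   # a typical third party feature guard
    pass
```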
Why not use a date-based versioning scheme?
Earlier versions of this PEP proposed a date-based versioning scheme for the standard library. However, such a scheme made it very difficult to handle out-of-cycle releases to fix security issues and other critical bugs in standard library releases, as it required the following steps:
- Change the release version number to the date of the current month.
- Update the What's New, NEWS and documentation to refer to the new release number.
- Make the new release.
With the sequential scheme now proposed, such releases should at most require a little tidying up of the What's New document before making the release.
Why isn't PEP 384 enough?
PEP 384 introduced the notion of a "Stable ABI" for CPython, a limited subset of the full C ABI that is guaranteed to remain stable. Extensions built against the stable ABI should be able to support all subsequent Python versions with the same binary.
This will help new projects to avoid coupling their C extension modules too closely to a specific version of CPython. For existing modules, however, migrating to the stable ABI can involve quite a lot of work (especially for extension modules that define a lot of classes). With limited development resources available, any time spent on such a change is time that could otherwise have been spent working on features that offer more direct benefits to end users.
There are also other benefits to separate versioning (as described above) that are not directly related to the question of binary compatibility with third party C extensions.
Why no binary compatible additions to the C ABI in standard library releases?
There's a case to be made that additions to the CPython C ABI could reasonably be permitted in standard library releases. This would give C extension authors the same freedom as any other package or module author to depend either on a particular language version or on a standard library version.
The PEP currently associates the interpreter version with the language version, and therefore limits major interpreter changes (including C ABI additions) to the language releases.
An alternative, internally consistent, approach would be to link the interpreter version with the standard library version, with only changes that may affect backwards compatibility limited to language releases.
Under such a scheme, the following changes would be acceptable in standard library releases:
- Standard library updates
- new features in pure Python modules
- new features in C extension modules (subject to PEP 399 compatibility requirements)
- new features in language builtins
- Interpreter implementation updates
- binary compatible additions to the C ABI
- changes to the compilation toolchain that do not affect the AST or alter the bytecode magic number
- changes to the core interpreter eval loop
- bug fixes from the corresponding maintenance release
And the following changes would be acceptable in language releases:
- new language syntax
- any updates acceptable in a standard library release
- new deprecation warnings
- removal of previously deprecated features
- changes to the AST
- changes to the emitted bytecode that require altering the magic number
- binary incompatible changes to the C ABI (although the PEP 384 stable ABI must still be preserved)
While such an approach could probably be made to work, there does not appear to be a compelling justification for it, and the approach currently described in the PEP is simpler and easier to explain.
Why not separate out the standard library entirely?
A concept that is occasionally discussed is the idea of making the standard library truly independent from the CPython reference implementation.
My personal opinion is that actually making such a change would involve a lot of work for next to no pay-off. CPython without the standard library is useless (the build chain won't even run, let alone the test suite). Nor can you create a standalone pure Python standard library, because too many "standard library modules" are actually tightly linked to the internal details of their respective interpreters (for example, the builtins, weakref, gc, sys, inspect, ast).
Creating a separate CPython development branch that is kept compatible with the previous language release, and making releases from that branch that are identified with a separate standard library version number should provide most of the benefits of a separate standard library repository with only a fraction of the pain.
Acknowledgements
Thanks go to the PEP 407 authors for starting this discussion, as well as to those authors and Larry Hastings for initial discussions of the proposal made in this PEP.
References
| [1] | PEP 407: New release cycle and introducing long-term support versions http://www.python.org/dev/peps/pep-0407/ |
| [2] | Twisted's "topfiles" approach to NEWS generation http://twistedmatrix.com/trac/wiki/ReviewProcess#Newsfiles |
Copyright
This document has been placed in the public domain.
pep-0414 Explicit Unicode Literal for Python 3.3
| PEP: | 414 |
|---|---|
| Title: | Explicit Unicode Literal for Python 3.3 |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Armin Ronacher <armin.ronacher at active-4.com>, Nick Coghlan <ncoghlan at gmail.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 15-Feb-2012 |
| Post-History: | 28-Feb-2012, 04-Mar-2012 |
| Resolution: | http://mail.python.org/pipermail/python-dev/2012-February/116995.html |
Contents
Abstract
This document proposes the reintegration of an explicit unicode literal from Python 2.x to the Python 3.x language specification, in order to reduce the volume of changes needed when porting Unicode-aware Python 2 applications to Python 3.
BDFL Pronouncement
This PEP has been formally accepted for Python 3.3:
I'm accepting the PEP. It's about as harmless as they come. Make it so.
Proposal
This PEP proposes that Python 3.3 restore support for Python 2's Unicode literal syntax, substantially increasing the number of lines of existing Python 2 code in Unicode aware applications that will run without modification on Python 3.
Specifically, the Python 3 definition for string literal prefixes will be expanded to allow:
"u" | "U"
in addition to the currently supported:
"r" | "R"
The following will all denote ordinary Python 3 strings:
'text'
"text"
'''text'''
"""text"""
u'text'
u"text"
u'''text'''
u"""text"""
U'text'
U"text"
U'''text'''
U"""text"""
No changes are proposed to Python 3's actual Unicode handling, only to the acceptable forms for string literals.
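The equivalence can be checked directly on any interpreter that implements the PEP (Python 3.3 or later):

```python
# Under this PEP the u/U prefix is accepted but has no effect: the result
# is an ordinary str object, identical to the unprefixed literal.
plain = 'text'
prefixed = u'text'
assert prefixed == plain
assert type(prefixed) is str
```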
Exclusion of "Raw" Unicode Literals
Python 2 supports a concept of "raw" Unicode literals that don't meet the conventional definition of a raw string: \uXXXX and \UXXXXXXXX escape sequences are still processed by the compiler and converted to the appropriate Unicode code points when creating the associated Unicode objects.
Python 3 has no corresponding concept - the compiler performs no preprocessing of the contents of raw string literals. This matches the behaviour of 8-bit raw string literals in Python 2.
Since such strings are rarely used and would be interpreted differently in Python 3 if permitted, it was decided that leaving them out entirely was a better choice. Code which uses them will thus still fail immediately on Python 3 (with a SyntaxError), rather than potentially producing different output.
To get equivalent behaviour that will run on both Python 2 and Python 3, either an ordinary Unicode literal can be used (with appropriate additional escaping within the string), or else string concatenation or string formatting can be used to combine the raw portions of the string with those that require the use of Unicode escape sequences.
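For example, under Python 3 semantics:

```python
# In Python 3, raw string literals receive no escape processing at all,
# so a Python 2 "raw unicode" literal like ur'A\d+' has no direct
# equivalent:
raw = r'\u0041'
assert len(raw) == 6        # backslash, u, 0, 0, 4, 1
# An ordinary literal processes the escape:
assert '\u0041' == 'A'
# Portable alternative: combine the cooked and raw pieces explicitly.
pattern = '\u0041' + r'\d+'
assert pattern == 'A\\d+'
```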
Note that when using from __future__ import unicode_literals in Python 2, the nominally "raw" Unicode string literals will process \uXXXX and \UXXXXXXXX escape sequences, just like Python 2 strings explicitly marked with the "raw Unicode" prefix.
Author's Note
This PEP was originally written by Armin Ronacher, and Guido's approval was given based on that version.
The currently published version has been rewritten by Nick Coghlan to include additional historical details and rationale that were taken into account when Guido made his decision, but were not explicitly documented in Armin's version of the PEP.
Readers should be aware that many of the arguments in this PEP are not technical ones. Instead, they relate heavily to the social and personal aspects of software development.
Rationale
With the release of a Python 3 compatible version of the Web Services Gateway Interface (WSGI) specification (PEP 3333) for Python 3.2, many parts of the Python web ecosystem have been making a concerted effort to support Python 3 without adversely affecting their existing developer and user communities.
One major item of feedback from key developers in those communities, including Chris McDonough (WebOb, Pyramid), Armin Ronacher (Flask, Werkzeug), Jacob Kaplan-Moss (Django) and Kenneth Reitz (requests) is that the requirement to change the spelling of every Unicode literal in an application (regardless of how that is accomplished) is a key stumbling block for porting efforts.
In particular, unlike many of the other Python 3 changes, it isn't one that framework and library authors can easily handle on behalf of their users. Most of those users couldn't care less about the "purity" of the Python language specification, they just want their websites and applications to work as well as possible.
While it is the Python web community that has been most vocal in highlighting this concern, it is expected that other highly Unicode aware domains (such as GUI development) may run into similar issues as they (and their communities) start making concerted efforts to support Python 3.
Common Objections
Complaint: This PEP may harm adoption of Python 3.2
This complaint is interesting, as it carries within it a tacit admission that this PEP will make it easier to port Unicode aware Python 2 applications to Python 3.
There are many existing Python communities that are prepared to put up with the constraints imposed by the existing suite of porting tools, or to update their Python 2 code bases sufficiently that the problems are minimised.
This PEP is not for those communities. Instead, it is designed specifically to help people that don't want to put up with those difficulties.
However, since the proposal is for a comparatively small tweak to the language syntax with no semantic changes, it is feasible to support it as a third party import hook. While such an import hook imposes some import time overhead, and requires additional steps from each application that needs it to get the hook in place, it allows applications that target Python 3.2 to use libraries and frameworks that would otherwise only run on Python 3.3+ due to their use of unicode literal prefixes.
One such import hook project is Vinay Sajip's uprefix [4].
For those that prefer to translate their code in advance rather than converting on the fly at import time, Armin Ronacher is working on a hook that runs at install time rather than during import [5].
Combining the two approaches is of course also possible. For example, the import hook could be used for rapid edit-test cycles during local development, but the install hook for continuous integration tasks and deployment on Python 3.2.
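The kind of source translation such hooks perform could be sketched as below. This is only an illustration of the idea (the function name is ours, and real hooks such as uprefix are considerably more thorough):

```python
import io
import tokenize

# Hedged sketch: strip the u/U prefix from string literals so the result
# compiles on Python 3.2, which rejects the prefix at the syntax level.
def strip_u_prefixes(source):
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        text = tok.string
        if tok.type == tokenize.STRING and text[:1] in ('u', 'U'):
            text = text[1:]
        # Passing (type, string) pairs puts untokenize in compatibility
        # mode, which rebuilds valid (if loosely spaced) source text.
        tokens.append((tok.type, text))
    return tokenize.untokenize(tokens)
```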
The approaches described in this section may prove useful, for example, for applications that wish to target Python 3 on the Ubuntu 12.04 LTS release, which will ship with Python 2.7 and 3.2 as officially supported Python versions.
Complaint: Python 3 shouldn't be made worse just to support porting from Python 2
This is indeed one of the key design principles of Python 3. However, one of the key design principles of Python as a whole is that "practicality beats purity". If we're going to impose a significant burden on third party developers, we should have a solid rationale for doing so.
In most cases, the rationale for backwards incompatible Python 3 changes is either to improve code correctness (for example, stricter default separation of binary and text data and integer division upgrading to floats when necessary), reduce typical memory usage (for example, increased usage of iterators and views over concrete lists), or to remove distracting nuisances that make Python code harder to read without increasing its expressiveness (for example, the comma based syntax for naming caught exceptions). Changes backed by such reasoning are not going to be reverted, regardless of objections from Python 2 developers attempting to make the transition to Python 3.
In many cases, Python 2 offered two ways of doing things for historical reasons. For example, inequality could be tested with both != and <> and integer literals could be specified with an optional L suffix. Such redundancies have been eliminated in Python 3, which reduces the overall size of the language and improves consistency across developers.
In the original Python 3 design (up to and including Python 3.2), the explicit prefix syntax for unicode literals was deemed to fall into this category, as it is completely unnecessary in Python 3. However, the difference between those other cases and unicode literals is that the unicode literal prefix is not redundant in Python 2 code: it is a programmatically significant distinction that needs to be preserved in some fashion to avoid losing information.
While porting tools were created to help with the transition (see next section) it still creates an additional burden on heavy users of unicode strings in Python 2, solely so that future developers learning Python 3 don't need to be told "For historical reasons, string literals may have an optional u or U prefix. Never use this yourselves, it's just there to help with porting from an earlier version of the language."
Plenty of students learning Python 2 received similar warnings regarding string exceptions without being confused or irreparably stunted in their growth as Python developers. It will be the same with this feature.
This point is further reinforced by the fact that Python 3 still allows the uppercase variants of the B and R prefixes for bytes literals and raw bytes and string literals. If the potential for confusion due to string prefix variants is that significant, where was the outcry asking that these redundant prefixes be removed along with all the other redundancies that were eliminated in Python 3?
Just as support for string exceptions was eliminated from Python 2 using the normal deprecation process, support for redundant string prefix characters (specifically, B, R, u, U) may eventually be eliminated from Python 3, regardless of the current acceptance of this PEP. However, such a change will likely only occur once third party libraries supporting Python 2.7 are about as common as libraries supporting Python 2.2 or 2.3 are today.
Complaint: The WSGI "native strings" concept is an ugly hack
One reason the removal of unicode literals has provoked such concern amongst the web development community is that the updated WSGI specification had to make a few compromises to minimise the disruption for existing web servers that provide a WSGI-compatible interface (this was deemed necessary in order to make the updated standard a viable target for web application authors and web framework developers).
One of those compromises is the concept of a "native string". WSGI defines three different kinds of string:
- text strings: handled as unicode in Python 2 and str in Python 3
- native strings: handled as str in both Python 2 and Python 3
- binary data: handled as str in Python 2 and bytes in Python 3
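A minimal Python 3 WSGI application shows all three kinds in one place (the application itself is our illustration, not part of the WSGI specification):

```python
# Illustrative WSGI application: environ keys, the status line and header
# names/values are native strings, while the response body is binary data.
def app(environ, start_response):
    path = environ['PATH_INFO']                  # native string (str)
    status = '200 OK'                            # native string (str)
    headers = [('Content-Type', 'text/plain; charset=utf-8')]
    start_response(status, headers)              # native strings throughout
    body = 'Hello from ' + path                  # text string (str)
    return [body.encode('utf-8')]                # binary data (bytes)
```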
Some developers consider WSGI's "native strings" to be an ugly hack, as they are explicitly documented as being used solely for latin-1 decoded "text", regardless of the actual encoding of the underlying data. Using this approach bypasses many of the updates to Python 3's data model that are designed to encourage correct handling of text encodings. However, it generally works due to the specific details of the problem domain - web server and web framework developers are some of the individuals most aware of how blurry the line can get between binary data and text when working with HTTP and related protocols, and how important it is to understand the implications of the encodings in use when manipulating encoded text data. At the application level most of these details are hidden from the developer by the web frameworks and support libraries (both in Python 2 and in Python 3).
In practice, native strings are a useful concept because there are some APIs (both in the standard library and in third party frameworks and packages) and some internal interpreter details that are designed primarily to work with str. These components often don't support unicode in Python 2 or bytes in Python 3, or, if they do, require additional encoding details and/or impose constraints that don't apply to the str variants.
Some example of interfaces that are best handled by using actual str instances are:
- Python identifiers (as attributes, dict keys, class names, module names, import references, etc)
- URLs for the most part as well as HTTP headers in urllib/http servers
- WSGI environment keys and CGI-inherited values
- Python source code for dynamic compilation and AST hacks
- Exception messages
- __repr__ return value
- preferred filesystem paths
- preferred OS environment
In Python 2.6 and 2.7, these distinctions are most naturally expressed as follows:
- u"": text string (unicode)
- "": native string (str)
- b"": binary data (str, also aliased as bytes)
In Python 3, the latin-1 decoded native strings are not distinguished from any other text strings:
- "": text string (str)
- "": native string (str)
- b"": binary data (bytes)
If from __future__ import unicode_literals is used to modify the behaviour of Python 2, then, along with an appropriate definition of n(), the distinction can be expressed as:
- "": text string
- n(""): native string
- b"": binary data
(While n=str works for simple cases, it can sometimes have problems due to non-ASCII source encodings)
In the common subset of Python 2 and Python 3 (with appropriate specification of a source encoding and definitions of the u() and b() helper functions), they can be expressed as:
- u(""): text string
- "": native string
- b(""): binary data
That last approach is the only variant that supports Python 2.5 and earlier.
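One possible definition of the u() and b() helpers mentioned above is sketched below (following the approach used by the six compatibility library; real projects typically just depend on such a library rather than defining their own):

```python
import sys

# Hedged sketch of cross-version u()/b() helpers. The Python 2 branch is
# never executed on Python 3, so the reference to unicode() is safe there.
if sys.version_info[0] >= 3:
    def u(s):
        return s                        # already a text string
    def b(s):
        return s.encode('latin-1')      # produce binary data
else:
    def u(s):
        # Process unicode escapes while leaving literal backslashes alone.
        return unicode(s.replace(r'\\', r'\\\\'), 'unicode_escape')
    def b(s):
        return s                        # str is already binary data
```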
Of all the alternatives, the format currently supported in Python 2.6 and 2.7 is by far the cleanest approach that clearly distinguishes the three desired kinds of behaviour. With this PEP, that format will also be supported in Python 3.3+. It will also be supported in Python 3.1 and 3.2 through the use of import and install hooks. While it is significantly less likely, it is also conceivable that the hooks could be adapted to allow the use of the b prefix on Python 2.5.
Complaint: The existing tools should be good enough for everyone
A commonly expressed sentiment from developers that have already successfully ported applications to Python 3 is along the lines of "if you think it's hard, you're doing it wrong" or "it's not that hard, just try it!". While it is no doubt unintentional, these responses all have the effect of telling the people that are pointing out inadequacies in the current porting toolset "there's nothing wrong with the porting tools, you just suck and don't know how to use them properly".
These responses are a case of completely missing the point of what people are complaining about. The feedback that resulted in this PEP isn't due to people complaining that ports aren't possible. Instead, the feedback is coming from people that have successfully completed ports and are objecting that they found the experience thoroughly unpleasant for the class of application that they needed to port (specifically, Unicode aware web frameworks and support libraries).
This is a subjective appraisal, and it's the reason why the Python 3 porting tools ecosystem is a case where the "one obvious way to do it" philosophy emphatically does not apply. While it was originally intended that "develop in Python 2, convert with 2to3, test both" would be the standard way to develop for both versions in parallel, in practice, the needs of different projects and developer communities have proven to be sufficiently diverse that a variety of approaches have been devised, allowing each group to select an approach that best fits their needs.
Lennart Regebro has produced an excellent overview of the available migration strategies [2], and a similar review is provided in the official porting guide [3]. (Note that the official guidance has softened to "it depends on your specific situation" since Lennart wrote his overview).
However, both of those guides are written from the founding assumption that all of the developers involved are already committed to the idea of supporting Python 3. They make no allowance for the social aspects of such a change when you're interacting with a user base that may not be especially tolerant of disruptions without a clear benefit, or are trying to persuade Python 2 focused upstream developers to accept patches that are solely about improving Python 3 forward compatibility.
With the current porting toolset, every migration strategy will result in changes to every Unicode literal in a project. No exceptions. They will be converted to either an unprefixed string literal (if the project decides to adopt the unicode_literals import) or else to a converter call like u("text").
If the unicode_literals import approach is employed, but is not adopted across the entire project at the same time, then the meaning of a bare string literal may become annoyingly ambiguous. This problem can be particularly pernicious for aggregated software, like a Django site - in such a situation, some files may end up using the unicode_literals import and others may not, creating definite potential for confusion.
While these problems are clearly solvable at a technical level, they're a completely unnecessary distraction at the social level. Developer energy should be reserved for addressing real technical difficulties associated with the Python 3 transition (like distinguishing their 8-bit text strings from their binary data). They shouldn't be punished with additional code changes (even automated ones) solely due to the fact that they have already explicitly identified their Unicode strings in Python 2.
Armin Ronacher has created an experimental extension to 2to3 which only modernizes Python code to the extent that it runs on Python 2.7 or later with support from the cross-version compatibility six library. This tool is available as python-modernize [1]. Currently, the deltas generated by this tool will affect every Unicode literal in the converted source. This will create legitimate concerns amongst upstream developers asked to accept such changes, and amongst framework users being asked to change their applications.
However, by eliminating the noise from changes to the Unicode literal syntax, many projects could be cleanly and (comparatively) non-controversially made forward compatible with Python 3.3+ just by running python-modernize and applying the recommended changes.
References
| [1] | Python-Modernize (http://github.com/mitsuhiko/python-modernize) |
| [2] | Porting to Python 3: Migration Strategies (http://python3porting.com/strategies.html) |
| [3] | Porting Python 2 Code to Python 3 (http://docs.python.org/howto/pyporting.html) |
| [4] | uprefix import hook project (https://bitbucket.org/vinay.sajip/uprefix) |
| [5] | install hook to remove unicode string prefix characters (https://github.com/mitsuhiko/unicode-literals-pep/tree/master/install-hook) |
Copyright
This document has been placed in the public domain.
pep-0415 Implement context suppression with exception attributes
| PEP: | 415 |
|---|---|
| Title: | Implement context suppression with exception attributes |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Benjamin Peterson <benjamin at python.org> |
| BDFL-Delegate: | Nick Coghlan |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 26-Feb-2012 |
| Python-Version: | 3.3 |
| Post-History: | 26-Feb-2012 |
| Replaces: | 409 |
| Resolution: | http://mail.python.org/pipermail/python-dev/2012-May/119467.html |
Abstract
PEP 409 introduced support for the raise exc from None construct to allow the display of the exception context to be explicitly suppressed. This PEP retains the language level changes already implemented in PEP 409, but replaces the underlying implementation mechanism with a simpler approach based on a new __suppress_context__ attribute on all BaseException instances.
PEP Acceptance
This PEP was accepted by Nick Coghlan on the 14th of May, 2012.
Rationale
PEP 409 changes __cause__ to be Ellipsis by default. Then if __cause__ is set to None by raise exc from None, no context or cause will be printed should the exception be uncaught.
The main problem with this scheme is that it complicates the role of __cause__. __cause__ should indicate the cause of the exception, not whether __context__ should be printed. This use of __cause__ is also not easily extended in the future. For example, we may someday want to allow the programmer to select which of __context__ and __cause__ will be printed. The PEP 409 implementation is not amenable to this.
The use of Ellipsis is a hack. Before PEP 409, Ellipsis was used exclusively in extended slicing. Extended slicing has nothing to do with exceptions, so it's not clear to someone inspecting an exception object why __cause__ should be set to Ellipsis. Using Ellipsis by default for __cause__ makes it asymmetrical with __context__.
Proposal
A new attribute on BaseException, __suppress_context__, will be introduced. Whenever __cause__ is set, __suppress_context__ will be set to True. In particular, raise exc from cause syntax will set exc.__suppress_context__ to True. Exception printing code will check for that attribute to determine whether context and cause will be printed. __cause__ will return to its original purpose and values.
There is precedent for __suppress_context__ in the print_line_and_file exception attribute.
To summarize, raise exc from cause will be equivalent to:
exc.__cause__ = cause
raise exc
where exc.__cause__ = cause implicitly sets exc.__suppress_context__.
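A minimal sketch of the proposed semantics, runnable on Python 3.3+ where this PEP was implemented: raising "from None" sets __suppress_context__, while __context__ is still recorded.

```python
# raise ... from None suppresses context display but keeps __context__.
try:
    try:
        1 / 0
    except ZeroDivisionError:
        raise KeyError("translated error") from None
except KeyError as exc:
    caught = exc

print(caught.__suppress_context__)        # True on Python 3.3+
print(caught.__cause__)                   # None: no explicit cause
print(type(caught.__context__).__name__)  # the context is still recorded
```

Because only __suppress_context__ (not __cause__) controls printing, a debugger can still inspect the original ZeroDivisionError even though the traceback display omits it.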
Patches
There is a patch on Issue 14133 [1].
Copyright
This document has been placed in the public domain.
pep-0416 Add a frozendict builtin type
| PEP: | 416 |
|---|---|
| Title: | Add a frozendict builtin type |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Victor Stinner <victor.stinner at gmail.com> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 29-February-2012 |
| Python-Version: | 3.3 |
Contents
Rejection Notice
I'm rejecting this PEP. A number of reasons (not exhaustive):
- According to Raymond Hettinger, use of frozendict is low. Those that do use it tend to use it as a hint only, such as declaring global or class-level "constants": they aren't really immutable, since anyone can still assign to the name.
- There are existing idioms for avoiding mutable default values.
- The potential of optimizing code using frozendict in PyPy is unsure; a lot of other things would have to change first. The same holds for compile-time lookups in general.
- Multiple threads can agree by convention not to mutate a shared dict, there's no great need for enforcement. Multiple processes can't share dicts.
- Adding a security sandbox written in Python, even with a limited scope, is frowned upon by many, due to the inherent difficulty with ever proving that the sandbox is actually secure. Because of this we won't be adding one to the stdlib any time soon, so this use case falls outside the scope of a PEP.
On the other hand, exposing the existing read-only dict proxy as a built-in type sounds good to me. (It would need to be changed to allow calling the constructor.) GvR.
Update (2012-04-15): A new MappingProxyType type was added to the types module of Python 3.3.
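The type mentioned in the update can be demonstrated directly on Python 3.3+; note the proxy is a live view of the underlying dict, not a copy.

```python
# types.MappingProxyType: a read-only view of a dictionary (Python 3.3+).
import types

config = {"debug": False}
proxy = types.MappingProxyType(config)

print(proxy["debug"])        # reads work
try:
    proxy["debug"] = True    # writes raise TypeError
except TypeError as exc:
    print("read-only:", exc)

# The proxy is a view, not a copy: changes to the dict show through.
config["debug"] = True
print(proxy["debug"])
```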
Abstract
Add a new frozendict builtin type.
Rationale
A frozendict is a read-only mapping: a key cannot be added nor removed, and a key is always mapped to the same value. However, frozendict values may themselves be unhashable; a frozendict is hashable if and only if all of its values are hashable.
Use cases:
- Immutable global variable like a default configuration.
- Default value of a function parameter. Avoid the issue of mutable default arguments.
- Implement a cache: frozendict can be used to store function keywords. frozendict can be used as a key of a mapping or as a member of set.
- frozendict avoids the need for a lock when the frozendict is shared by multiple threads or processes, especially a hashable frozendict. It would also help prevent coroutines (generators + greenlets) from modifying the global state.
- frozendict lookups can be done at compile time instead of runtime because the mapping is read-only. frozendict can be used instead of a preprocessor to remove conditional code at compilation, like code specific to a debug build.
- frozendict helps to implement read-only object proxies for security modules. For example, it would be possible to use the frozendict type for the __builtins__ mapping or type.__dict__. This is possible because frozendict is compatible with the PyDict C API.
- frozendict avoids the need for a read-only proxy in some cases. frozendict is faster than a proxy because getting an item in a frozendict is a fast lookup whereas a proxy requires a function call.
Constraints
- frozendict has to implement the Mapping abstract base class
- frozendict keys and values can be unorderable
- a frozendict is hashable if all keys and values are hashable
- frozendict hash does not depend on the items creation order
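The constraints above can be illustrated with a minimal pure-Python sketch (not the C implementation the PEP proposes; the class name FrozenDict is purely illustrative). Hashing via a frozenset of items makes the hash independent of creation order, and the hash fails with TypeError exactly when a value is unhashable.

```python
# Minimal pure-Python sketch of the proposed type.
from collections.abc import Mapping

class FrozenDict(Mapping):
    def __init__(self, *args, **kwargs):
        self._data = dict(*args, **kwargs)
        self._hash = None

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

    def __hash__(self):
        # Cache the hash; frozenset makes it order-independent.
        if self._hash is None:
            self._hash = hash(frozenset(self._data.items()))
        return self._hash

d = FrozenDict(a=1, b=2)
print(d["a"], len(d))
print(hash(FrozenDict(a=1, b=2)) == hash(FrozenDict(b=2, a=1)))
```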
Implementation
- Add a PyFrozenDictObject structure based on PyDictObject with an extra "Py_hash_t hash;" field
- frozendict.__hash__() is implemented using hash(frozenset(self.items())) and caches the result in its private hash attribute
- Register frozendict as a collections.abc.Mapping
- frozendict can be used with PyDict_GetItem(), but PyDict_SetItem() and PyDict_DelItem() raise a TypeError
Recipe: hashable dict
To ensure that a frozendict is hashable, values can be checked before creating the frozendict:
import itertools
def hashabledict(*args, **kw):
# ensure that all values are hashable; positional arguments are
# expected to be (key, value) pairs
for key, value in itertools.chain(args, kw.items()):
if isinstance(value, (int, str, bytes, float, frozenset, complex)):
# avoid computing the hash (which may be slow) for builtin
# types known to be hashable for any value
continue
hash(value)
# don't check the keys: frozendict already checks them
return frozendict(args, **kw)
Objections
namedtuple may fit the requirements of a frozendict.
A namedtuple is not a mapping; it does not implement the Mapping abstract base class.
"frozendict can be implemented in Python using descriptors" and "frozendict just needs to be practically constant".
If frozendict is used to harden Python (security purpose), it must be implemented in C. A type implemented in C is also faster.
PEP 351 was rejected.
PEP 351 tried to freeze an object, and so may convert a mutable object to an immutable one (using a different type). frozendict doesn't convert anything: hash(frozendict) raises a TypeError if a value is not hashable. Freezing an object is not the purpose of this PEP.
Alternative: dictproxy
Python has a builtin dictproxy type, used by the type.__dict__ getter descriptor. This type is not public. dictproxy is a read-only view of a dictionary, but it is not a read-only mapping: if the underlying dictionary is modified, the dictproxy is modified too.
dictproxy objects can be created using ctypes and the Python C API; see for example the make dictproxy object via ctypes.pythonapi and type() (Python recipe 576540) [1] by Ikkei Shimomura. The recipe contains a test checking that a dictproxy is "mutable" (modifying the dictionary linked to the dictproxy).
However, dictproxy can be useful in some cases, where its mutable property is not an issue, to avoid copying the dictionary.
Existing implementations
Whitelist approach.
- Implementing an Immutable Dictionary (Python recipe 498072) by Aristotelis Mikropoulos. Similar to frozendict except that it is not truly read-only: it is possible to access its private internal dict. It does not implement __hash__ and has an implementation issue: __init__() can be called again to modify the mapping.
- PyWebmail contains an ImmutableDict type: webmail.utils.ImmutableDict. It is hashable if keys and values are hashable. It is not truly read-only: its internal dict is a public attribute.
- remember project: remember.dicts.FrozenDict. It is used to implement a cache: FrozenDict is used to store function callbacks. FrozenDict may be hashable. It has an extra supply_dict() class method to create a FrozenDict from a dict without copying the dict: the dict is stored as the internal dict. Implementation issues: __init__() can be called to modify the mapping, and the hash may differ depending on item creation order. The mapping is not truly read-only: the internal dict is accessible in Python.
Blacklist approach: inherit from dict and override write methods to raise an exception. It is not truly read-only: it is still possible to call dict methods on such "frozen dictionary" to modify it.
- brownie: brownie.datastructures.ImmutableDict. It is hashable if keys and values are hashable. The werkzeug project has the same code: werkzeug.datastructures.ImmutableDict. ImmutableDict is used for global constants (configuration options). The Flask project uses werkzeug's ImmutableDict for its default configuration.
- SQLAlchemy project: sqlalchemy.util.immutabledict. It is not hashable and has an extra method: union(). immutabledict is used as the default value for parameters of some functions expecting a mapping. Example: mapper_args=immutabledict() in SqlSoup.map().
- Frozen dictionaries (Python recipe 414283) by Oren Tirosh. It is hashable if keys and values are hashable. Included in the following projects:
- lingospot: frozendict/frozendict.py
- factor-graphics: frozendict type in python/fglib/util_ext_frozendict.py
- The gsakkis-utils project written by George Sakkis includes a frozendict type: datastructs.frozendict
- characters: scripts/python/frozendict.py. It is hashable. __init__() sets __init__ to None.
- Old NLTK (1.x): nltk.util.frozendict. Keys and values must be hashable. __init__() can be called twice to modify the mapping. frozendict is used to "freeze" an object.
Hashable dict: inherit from dict and just add a __hash__ method.
- pypy.rpython.lltypesystem.lltype.frozendict. It is hashable but does not prevent modification of the mapping.
- factor-graphics: hashabledict type in python/fglib/util_ext_frozendict.py
Links
- Issue #14162: PEP 416: Add a builtin frozendict type
- PEP 412: Key-Sharing Dictionary (issue #13903)
- PEP 351: The freeze protocol
- The case for immutable dictionaries; and the central misunderstanding of PEP 351
- make dictproxy object via ctypes.pythonapi and type() (Python recipe 576540) by Ikkei Shimomura.
- Python security modules implementing read-only object proxies using a C extension:
Copyright
This document has been placed in the public domain.
pep-0417 Including mock in the Standard Library
| PEP: | 417 |
|---|---|
| Title: | Including mock in the Standard Library |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Michael Foord <michael at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 12-Mar-2012 |
| Python-Version: | 3.3 |
| Post-History: | 12-Mar-2012 |
| Resolution: | http://mail.python.org/pipermail/python-dev/2012-March/117507.html |
Abstract
This PEP proposes adding the mock [1] testing library to the Python standard library as unittest.mock.
Rationale
Creating mock objects for testing is a common need in Python. Many developers create ad-hoc mocks, as needed, in their test suites. This is currently what we do in the Python test suite, where a standardised mock object library would be helpful.
There are many mock object libraries available for Python [2]. Of these, mock is overwhelmingly the most popular, with as many downloads on PyPI as the other mocking libraries combined.
An advantage of mock is that it is a mocking library and not a framework. It provides a configurable and flexible mock object, without being opinionated about how you write your tests. The mock api is now well battle-tested and stable.
mock also safely handles monkeypatching and unmonkeypatching objects within the scope of a test. This is hard to do safely, and many developers/projects mimic this functionality (often incorrectly). A standardised way to do this, handling the complexity of patching in the presence of the descriptor protocol (etc.), is useful. People are asking for a "patch" [3] feature for unittest. Doing this via mock.patch is preferable to re-implementing part of this functionality in unittest.
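The patching behaviour described above can be sketched briefly, using the proposed unittest.mock location (available since Python 3.3; the standalone mock library has the same API):

```python
# mock.patch replaces an object for the duration of a scope and
# restores the original afterwards, even on error.
from unittest import mock
import os

with mock.patch("os.getcwd", return_value="/fake/dir") as fake_cwd:
    # Inside the context, os.getcwd is a configurable mock object.
    print(os.getcwd())
    fake_cwd.assert_called_once_with()

# The patch is undone automatically when the context exits.
print(os.getcwd() != "/fake/dir")
```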
Background
Addition of mock to the Python standard library was discussed and agreed to at the Python Language Summit 2012.
Open Issues
As of release 0.8, which is current at the time of writing, mock is compatible with Python 2.4-3.2. Moving into the Python standard library will allow for the removal of some Python 2 specific "compatibility hacks".
mock 0.8 introduced a new feature, "auto-speccing", which obsoletes an older mock feature called "mocksignature". The "mocksignature" functionality can be removed from mock altogether prior to inclusion.
References
| [1] | mock library on PyPI |
| [2] | http://pypi.python.org/pypi?%3Aaction=search&term=mock&submit=search |
| [3] | http://bugs.python.org/issue11664 |
Copyright
This document has been placed in the public domain.
pep-0418 Add monotonic time, performance counter, and process time functions
| PEP: | 418 |
|---|---|
| Title: | Add monotonic time, performance counter, and process time functions |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Cameron Simpson <cs at zip.com.au>, Jim Jewett <jimjjewett at gmail.com>, Stephen J. Turnbull <stephen at xemacs.org>, Victor Stinner <victor.stinner at gmail.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 26-March-2012 |
| Python-Version: | 3.3 |
Contents
- Abstract
- Rationale
- Python functions
- Alternatives: API design
- Glossary
- Hardware clocks
- NTP adjustment
- Operating system time functions
- System Standby
- Footnotes
- Links
- Acceptance
- References
- Copyright
Abstract
This PEP proposes to add time.get_clock_info(name), time.monotonic(), time.perf_counter() and time.process_time() functions to Python 3.3.
Rationale
If a program uses the system time to schedule events or to implement a timeout, it may fail to run events at the right moment or stop the timeout too early or too late when the system time is changed manually or adjusted automatically by NTP. A monotonic clock should be used instead, as it is not affected by system time updates: time.monotonic().
To measure the performance of a function, time.clock() can be used, but it behaves very differently on Windows and on Unix. On Windows, time.clock() includes time elapsed during sleep, whereas it does not on Unix. time.clock() resolution is very good on Windows, but very bad on Unix. The new time.perf_counter() function should be used instead to always get the most precise performance counter with portable behaviour (e.g. it includes time spent during sleep).
Until now, Python did not directly provide a portable function to measure CPU time. time.clock() can be used on Unix, but it has poor resolution. resource.getrusage() or os.times() can also be used on Unix, but they require computing the sum of time spent in kernel space and user space. The new time.process_time() function acts as a portable counter that always measures CPU time (excluding time elapsed during sleep) and has the best available resolution.
Each operating system implements clocks and performance counters differently, and it is useful to know exactly which function is used and some properties of the clock like its resolution. The new time.get_clock_info() function gives access to all available information about each Python time function.
New functions:
- time.monotonic(): timeout and scheduling, not affected by system clock updates
- time.perf_counter(): benchmarking, most precise clock for short period
- time.process_time(): profiling, CPU time of the process
Users of new functions:
- time.monotonic(): concurrent.futures, multiprocessing, queue, subprocess, telnet and threading modules to implement timeout
- time.perf_counter(): trace and timeit modules, pybench program
- time.process_time(): profile module
- time.get_clock_info(): pybench program to display information about the timer like the resolution
The time.clock() function is deprecated because it is not portable: it behaves differently depending on the operating system. time.perf_counter() or time.process_time() should be used instead, depending on your requirements. time.clock() is marked as deprecated but is not planned for removal.
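The intended division of labour between the new clocks can be sketched with a short session (Python 3.3+):

```python
# Each clock serves a distinct purpose; only differences between
# consecutive calls are meaningful.
import time

t0 = time.perf_counter()   # high-resolution, for benchmarking
c0 = time.process_time()   # CPU time of the process, for profiling
m0 = time.monotonic()      # never goes backward, for timeouts

sum(i * i for i in range(100000))   # some work to measure

print(time.perf_counter() - t0 >= 0)
print(time.process_time() - c0 >= 0)
print(time.monotonic() - m0 >= 0)
```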
Limitations:
- The behaviour of clocks after a system suspend is not defined in the documentation of the new functions. The behaviour depends on the operating system: see the Monotonic Clocks section below. Some recent operating systems provide two clocks, one including time elapsed during system suspend, one not including this time. Most operating systems only provide one kind of clock.
- time.monotonic() and time.perf_counter() may or may not be adjusted. For example, CLOCK_MONOTONIC is slewed on Linux, whereas GetTickCount() is not adjusted on Windows. time.get_clock_info('monotonic')['adjustable'] can be used to check if the monotonic clock is adjustable or not.
- No time.thread_time() function is proposed by this PEP because it is not needed by the Python standard library, nor is it a commonly requested feature. Such a function would only be available on Windows and Linux. On Linux, it is possible to use time.clock_gettime(CLOCK_THREAD_CPUTIME_ID). On Windows, ctypes or another module can be used to call the GetThreadTimes() function.
Python functions
New Functions
time.get_clock_info(name)
Get information on the specified clock. Supported clock names:
- "clock": time.clock()
- "monotonic": time.monotonic()
- "perf_counter": time.perf_counter()
- "process_time": time.process_time()
- "time": time.time()
Return a time.clock_info object which has the following attributes:
- implementation (str): name of the underlying operating system function. Examples: "QueryPerformanceCounter()", "clock_gettime(CLOCK_REALTIME)".
- monotonic (bool): True if the clock cannot go backward.
- adjustable (bool): True if the clock can be changed automatically (e.g. by an NTP daemon) or manually by the system administrator, False otherwise.
- resolution (float): resolution in seconds of the clock.
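Reading these attributes is straightforward (Python 3.3+); the implementation string is platform-dependent:

```python
# Inspect metadata about the monotonic clock.
import time

info = time.get_clock_info("monotonic")
print(info.implementation)   # e.g. "clock_gettime(CLOCK_MONOTONIC)"
print(info.monotonic)        # True for this clock
print(info.resolution > 0)
```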
time.monotonic()
Monotonic clock, i.e. cannot go backward. It is not affected by system clock updates. The reference point of the returned value is undefined, so that only the difference between the results of consecutive calls is valid and is a number of seconds.
On Windows versions older than Vista, time.monotonic() detects GetTickCount() integer overflow (32 bits, roll-over after 49.7 days). It increases an internal epoch (reference time) by 2**32 each time an overflow is detected. The epoch is stored in process-local state, so the value of time.monotonic() may be different in two Python processes running for more than 49 days. On more recent versions of Windows and on other operating systems, time.monotonic() is system-wide.
Availability: Windows, Mac OS X, Linux, FreeBSD, OpenBSD, Solaris. Not available on GNU/Hurd.
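The timeout pattern time.monotonic() is intended for can be sketched as follows; wait_until() is an illustrative helper, not part of the proposal:

```python
# A deadline computed from time.monotonic() is immune to system
# clock changes during the wait.
import time

def wait_until(predicate, timeout, interval=0.01):
    """Poll predicate() until it is true or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

print(wait_until(lambda: True, timeout=1.0))
print(wait_until(lambda: False, timeout=0.05))
```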
Pseudo-code [2]:
if os.name == 'nt':
# GetTickCount64() requires Windows Vista, Server 2008 or later
if hasattr(_time, 'GetTickCount64'):
def monotonic():
return _time.GetTickCount64() * 1e-3
else:
def monotonic():
ticks = _time.GetTickCount()
if ticks < monotonic.last:
# Integer overflow detected
monotonic.delta += 2**32
monotonic.last = ticks
return (ticks + monotonic.delta) * 1e-3
monotonic.last = 0
monotonic.delta = 0
elif sys.platform == 'darwin':
def monotonic():
if monotonic.factor is None:
timebase = _time.mach_timebase_info()
monotonic.factor = timebase[0] / timebase[1] * 1e-9
return _time.mach_absolute_time() * monotonic.factor
monotonic.factor = None
elif hasattr(time, "clock_gettime") and hasattr(time, "CLOCK_HIGHRES"):
def monotonic():
return time.clock_gettime(time.CLOCK_HIGHRES)
elif hasattr(time, "clock_gettime") and hasattr(time, "CLOCK_MONOTONIC"):
def monotonic():
return time.clock_gettime(time.CLOCK_MONOTONIC)
On Windows, QueryPerformanceCounter() is not used even though it has a better resolution than GetTickCount(). It is not reliable and has too many issues.
time.perf_counter()
Performance counter with the highest available resolution to measure a short duration. It does include time elapsed during sleep and is system-wide. The reference point of the returned value is undefined, so that only the difference between the results of consecutive calls is valid and is a number of seconds.
It is available on all platforms.
Pseudo-code:
if os.name == 'nt':
def _win_perf_counter():
if _win_perf_counter.frequency is None:
_win_perf_counter.frequency = _time.QueryPerformanceFrequency()
return _time.QueryPerformanceCounter() / _win_perf_counter.frequency
_win_perf_counter.frequency = None
def perf_counter():
if perf_counter.use_performance_counter:
try:
return _win_perf_counter()
except OSError:
# QueryPerformanceFrequency() fails if the installed
# hardware does not support a high-resolution performance
# counter
perf_counter.use_performance_counter = False
if perf_counter.use_monotonic:
# The monotonic clock is preferred over the system time
try:
return time.monotonic()
except OSError:
perf_counter.use_monotonic = False
return time.time()
perf_counter.use_performance_counter = (os.name == 'nt')
perf_counter.use_monotonic = hasattr(time, 'monotonic')
time.process_time()
Sum of the system and user CPU time of the current process. It does not include time elapsed during sleep. It is process-wide by definition. The reference point of the returned value is undefined, so that only the difference between the results of consecutive calls is valid.
It is available on all platforms.
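The distinction from wall-clock time can be sketched as follows (Python 3.3+): time spent sleeping shows up in perf_counter() but not in process_time().

```python
# process_time() excludes time elapsed during sleep.
import time

w0, c0 = time.perf_counter(), time.process_time()
time.sleep(0.1)
wall = time.perf_counter() - w0
cpu = time.process_time() - c0

print(wall > cpu)   # the sleep is visible in wall time, not CPU time
```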
Pseudo-code [2]:
if os.name == 'nt':
def process_time():
handle = _time.GetCurrentProcess()
process_times = _time.GetProcessTimes(handle)
return (process_times['UserTime'] + process_times['KernelTime']) * 1e-7
else:
try:
import resource
except ImportError:
has_resource = False
else:
has_resource = True
def process_time():
if process_time.clock_id is not None:
try:
return time.clock_gettime(process_time.clock_id)
except OSError:
process_time.clock_id = None
if process_time.use_getrusage:
try:
usage = resource.getrusage(resource.RUSAGE_SELF)
return usage[0] + usage[1]
except OSError:
process_time.use_getrusage = False
if process_time.use_times:
try:
times = _time.times()
cpu_time = times.tms_utime + times.tms_stime
return cpu_time / process_time.ticks_per_seconds
except OSError:
process_time.use_times = False
return _time.clock()
if (hasattr(time, 'clock_gettime')
and hasattr(time, 'CLOCK_PROF')):
process_time.clock_id = time.CLOCK_PROF
elif (hasattr(time, 'clock_gettime')
and hasattr(time, 'CLOCK_PROCESS_CPUTIME_ID')):
process_time.clock_id = time.CLOCK_PROCESS_CPUTIME_ID
else:
process_time.clock_id = None
process_time.use_getrusage = has_resource
process_time.use_times = hasattr(_time, 'times')
if process_time.use_times:
# sysconf("SC_CLK_TCK"), or the HZ constant, or 60
process_time.ticks_per_seconds = _time.ticks_per_seconds
Existing Functions
time.time()
The system time, which is usually the civil time. It is system-wide by definition. It can be set manually by the system administrator or automatically by an NTP daemon.
It is available on all platforms and cannot fail.
Pseudo-code [2]:
if os.name == "nt":
def time():
return _time.GetSystemTimeAsFileTime()
else:
def time():
if hasattr(time, "clock_gettime"):
try:
return time.clock_gettime(time.CLOCK_REALTIME)
except OSError:
# CLOCK_REALTIME is not supported (unlikely)
pass
if hasattr(_time, "gettimeofday"):
try:
return _time.gettimeofday()
except OSError:
# gettimeofday() should not fail
pass
if hasattr(_time, "ftime"):
return _time.ftime()
else:
return _time.time()
time.sleep()
Suspend execution for the given number of seconds. The actual suspension time may be less than that requested because any caught signal will terminate the time.sleep() following execution of that signal's catching routine. Also, the suspension time may be longer than requested by an arbitrary amount because of the scheduling of other activity in the system.
Pseudo-code [2]:
try:
import select
except ImportError:
has_select = False
else:
has_select = hasattr(select, "select")
if has_select:
def sleep(seconds):
return select.select([], [], [], seconds)
elif hasattr(_time, "delay"):
def sleep(seconds):
milliseconds = int(seconds * 1000)
_time.delay(milliseconds)
elif os.name == "nt":
def sleep(seconds):
milliseconds = int(seconds * 1000)
win32api.ResetEvent(sleep.sigint_event)
win32api.WaitForSingleObject(sleep.sigint_event, milliseconds)
sleep.sigint_event = win32api.CreateEvent(NULL, TRUE, FALSE, FALSE)
# SetEvent(sleep.sigint_event) will be called by the signal handler of SIGINT
elif os.name == "os2":
def sleep(seconds):
milliseconds = int(seconds * 1000)
DosSleep(milliseconds)
else:
def sleep(seconds):
seconds = int(seconds)
_time.sleep(seconds)
Deprecated Function
time.clock()
On Unix, return the current processor time as a floating point number expressed in seconds. It is process-wide by definition. The resolution, and in fact the very definition of the meaning of "processor time", depends on that of the C function of the same name, but in any case, this is the function to use for benchmarking Python or timing algorithms.
On Windows, this function returns wall-clock seconds elapsed since the first call to this function, as a floating point number, based on the Win32 function QueryPerformanceCounter(). The resolution is typically better than one microsecond. It is system-wide.
Pseudo-code [2]:
if os.name == 'nt':
def clock():
try:
return _win_perf_counter()
except OSError:
# QueryPerformanceFrequency() fails if the installed
# hardware does not support a high-resolution performance
# counter
pass
return _time.clock()
else:
clock = _time.clock
Alternatives: API design
Other names for time.monotonic()
- time.counter()
- time.metronomic()
- time.seconds()
- time.steady(): "steady" is ambiguous: it means different things to different people. For example, on Linux, CLOCK_MONOTONIC is adjusted. If we use real time as the reference clock, we may say that CLOCK_MONOTONIC is steady. But CLOCK_MONOTONIC stops during system suspend, whereas real time includes any time spent in suspend.
- time.timeout_clock()
- time.wallclock(): time.monotonic() is not the system time aka the "wall clock", but a monotonic clock with an unspecified starting point.
The name "time.try_monotonic()" was also proposed for an older version of time.monotonic() which would fall back to the system time when no monotonic clock was available.
Other names for time.perf_counter()
- time.high_precision()
- time.highres()
- time.hires()
- time.performance_counter()
- time.timer()
Only expose operating system clocks
To not have to define high-level clocks, which is a difficult task, a simpler approach is to only expose operating system clocks. time.clock_gettime() and related clock identifiers were already added to Python 3.3 for example.
time.monotonic(): Fallback to system time
If no monotonic clock is available, time.monotonic() falls back to the system time.
Issues:
- It is hard to define such a function correctly in the documentation: Is it monotonic? Is it steady? Is it adjusted?
- Some users want to decide what to do when no monotonic clock is available: use another clock, display an error, or do something else.
Different APIs were proposed to define such a function.
One function with a flag: time.monotonic(fallback=True)
- time.monotonic(fallback=True) falls back to the system time if no monotonic clock is available or if the monotonic clock failed.
- time.monotonic(fallback=False) raises OSError if the monotonic clock fails and NotImplementedError if the system does not provide a monotonic clock.
A keyword argument that gets passed as a constant in the caller is usually poor API.
Raising NotImplementedError from a function call is uncommon in Python and should be avoided.
One time.monotonic() function, no flag
time.monotonic() returns (time: float, is_monotonic: bool).
An alternative is to use a function attribute: time.monotonic.is_monotonic. The attribute value would be None before the first call to time.monotonic().
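A minimal sketch of the function-attribute variant (the name monotonic_with_flag and its fallback logic are illustrative; this API was not adopted):

```python
import time

def monotonic_with_flag():
    """Return a clock value; record whether it came from a monotonic clock."""
    try:
        value = time.monotonic()          # the monotonic clock may be missing
        monotonic_with_flag.is_monotonic = True
    except OSError:
        value = time.time()               # fall back to the system time
        monotonic_with_flag.is_monotonic = False
    return value

# The attribute value is None before the first call.
monotonic_with_flag.is_monotonic = None
```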
Choosing the clock from a list of constraints
The PEP as proposed offers a few new clocks, but their guarantees are deliberately loose in order to offer useful clocks on different platforms. This inherently embeds policy in the calls, and the caller must thus choose a policy.
The "choose a clock" approach suggests an additional API to let callers implement their own policy if necessary by making most platform clocks available and letting the caller pick amongst them. The PEP's suggested clocks are still expected to be available for the common simple use cases.
To do this two facilities are needed: an enumeration of clocks, and metadata on the clocks to enable the user to evaluate their suitability.
The primary interface is a function that makes simple choices easy: the caller can use time.get_clock(*flags) with some combination of flags. This includes at least:
- time.MONOTONIC: clock cannot go backward
- time.STEADY: clock rate is steady
- time.ADJUSTED: clock may be adjusted, for example by NTP
- time.HIGHRES: clock with the highest resolution
It returns a clock object with a .now() method returning the current time. The clock object is annotated with metadata describing the clock feature set; its .flags field will contain at least all the requested flags.
time.get_clock() returns None if no matching clock is found and so calls can be chained using the or operator. Example of a simple policy decision:
    T = get_clock(MONOTONIC) or get_clock(STEADY) or get_clock()
    t = T.now()
The available clocks always at least include a wrapper for time.time(), so a final call with no flags can always be used to obtain a working clock.
Examples of flags of system clocks:
- QueryPerformanceCounter: MONOTONIC | HIGHRES
- GetTickCount: MONOTONIC | STEADY
- CLOCK_MONOTONIC: MONOTONIC | STEADY (or only MONOTONIC on Linux)
- CLOCK_MONOTONIC_RAW: MONOTONIC | STEADY
- gettimeofday(): (no flag)
The clock objects carry other metadata, including the clock flags (with additional feature flags beyond those listed above), the name of the underlying OS facility, and the clock precision.
time.get_clock() still chooses a single clock; an enumeration facility is also required. The most obvious method is to offer time.get_clocks() with the same signature as time.get_clock(), but returning a sequence of all clocks matching the requested flags. Requesting no flags would thus enumerate all available clocks, allowing the caller to make an arbitrary choice amongst them based on their metadata.
Example partial implementation: clockutils.py.
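A self-contained sketch of this API (the Clock class, the flag constants, and the clock list below are illustrative assumptions, not the clockutils.py code):

```python
import time

# Hypothetical feature flags and clock registry for the proposed API.
MONOTONIC, STEADY, ADJUSTED, HIGHRES = 1, 2, 4, 8

class Clock:
    def __init__(self, name, flags, func):
        self.name, self.flags, self._func = name, flags, func
    def now(self):
        return self._func()

_CLOCKS = [
    Clock("monotonic", MONOTONIC | STEADY, time.monotonic),
    Clock("system", 0, time.time),        # always-available fallback
]

def get_clock(*flags):
    """Return the first clock matching all requested flags, or None."""
    wanted = 0
    for flag in flags:
        wanted |= flag
    for clock in _CLOCKS:
        if clock.flags & wanted == wanted:
            return clock
    return None

def get_clocks(*flags):
    """Enumerate every clock matching all requested flags."""
    wanted = 0
    for flag in flags:
        wanted |= flag
    return [c for c in _CLOCKS if c.flags & wanted == wanted]

# Chaining with "or", as in the policy example above:
T = get_clock(MONOTONIC) or get_clock(STEADY) or get_clock()
t = T.now()
```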
Working around operating system bugs?
Should Python ensure that a monotonic clock is truly monotonic by computing the maximum with the clock value and the previous value?
Since it's relatively straightforward to cache the last value returned using a static variable, it might be interesting to use this to make sure that the values returned are indeed monotonic.
- Virtual machines provide less reliable clocks.
- QueryPerformanceCounter() has known bugs (only one is not fixed yet)
Python may only work around a specific known operating system bug: KB274323 [4] contains a code example to work around the bug (use GetTickCount() to detect a QueryPerformanceCounter() leap).
Issues with "correcting" non-monotonicities:
- if the clock is accidentally set forward by an hour and then back again, you wouldn't have a useful clock for an hour
- the cache is not shared between processes so different processes wouldn't see the same clock value
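A sketch of the caching approach, which illustrates both issues above (the cache is per-process, and a clock stepped backward appears frozen until it catches up):

```python
import time

_last = None                          # per-process cache of the last value

def forced_monotonic():
    """Return time.time(), clamped so it never goes backward."""
    global _last
    now = time.time()                 # possibly non-monotonic source
    if _last is not None and now < _last:
        now = _last                   # never go backward
    _last = now
    return now
```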
Glossary
| Accuracy: | The amount of deviation of measurements by a given instrument from true values. See also Accuracy and precision. Inaccuracy in clocks may be caused by lack of precision, drift, or an incorrect initial setting of the clock (e.g., timing of threads is inherently inaccurate because perfect synchronization in resetting counters is quite difficult). |
|---|---|
| Adjusted: | Resetting a clock to the correct time. This may be done either with a <Step> or by <Slewing>. |
| Civil Time: | Time of day; external to the system. 10:45:13am is a Civil time; 45 seconds is not. Provided by existing function time.localtime() and time.gmtime(). Not changed by this PEP. |
| Clock: | An instrument for measuring time. Different clocks have different characteristics; for example, a clock with nanosecond <precision> may start to <drift> after a few minutes, while a less precise clock may remain accurate for days. This PEP is primarily concerned with clocks which use a unit of seconds. |
| Counter: | A clock which increments each time a certain event occurs. A counter is strictly monotonic, but not a monotonic clock. It can be used to generate a unique (and ordered) timestamp, but these timestamps cannot be mapped to <civil time>; tick creation may well be bursty, with several advances in the same millisecond followed by several days without any advance. |
| CPU Time: | A measure of how much CPU effort has been spent on a certain task. CPU seconds are often normalized (so that a variable number can occur in the same actual second). CPU seconds can be important when profiling, but they do not map directly to user response time, nor are they directly comparable to (real time) seconds. |
| Drift: | The accumulated error against "true" time, as defined externally to the system. Drift may be due to imprecision, or to a difference between the average rate at which clock time advances and that of real time. |
| Epoch: | The reference point of a clock. For clocks providing <civil time>, this is often midnight as the day (and year) rolled over to January 1, 1970. For a <clock_monotonic> clock, the epoch may be undefined (represented as None). |
| Latency: | Delay. By the time a clock call returns, the <real time> has advanced, possibly by more than the precision of the clock. |
| Monotonic: | The characteristics expected of a monotonic clock in practice. Moving in at most one direction; for clocks, that direction is forward. The <clock> should also be <steady>, and should be convertible to a unit of seconds. The tradeoffs often include lack of a defined <epoch> or mapping to <Civil Time>. |
| Precision: | The amount of deviation among measurements of the same physical value by a single instrument. Imprecision in clocks may be caused by a fluctuation of the rate at which clock time advances relative to real time, including clock adjustment by slewing. |
| Process Time: | Time elapsed since the process began. It is typically measured in <CPU time> rather than <real time>, and typically does not advance while the process is suspended. |
| Real Time: | Time in the real world. This differs from <Civil time> in that it is not <adjusted>, but they should otherwise advance in lockstep. It is not related to the "real time" of "Real Time [Operating] Systems". It is sometimes called "wall clock time" to avoid that ambiguity; unfortunately, that introduces different ambiguities. |
| Resolution: | The smallest difference between two physical values that results in a different measurement by a given instrument. |
| Slew: | A slight change to a clock's speed, usually intended to correct <drift> with respect to an external authority. |
| Stability: | Persistence of accuracy. A measure of expected <drift>. |
| Steady: | A clock with high <stability> and relatively high <accuracy> and <precision>. In practice, it is often used to indicate a <clock_monotonic> clock, but places greater emphasis on the consistency of the duration between subsequent ticks. |
| Step: | An instantaneous change in the represented time. Instead of speeding or slowing the clock (<slew>), a single offset is permanently added. |
| System Time: | Time as represented by the Operating System. |
| Thread Time: | Time elapsed since the thread began. It is typically measured in <CPU time> rather than <real time>, and typically does not advance while the thread is idle. |
| Wallclock: | What the clock on the wall says. This is typically used as a synonym for <real time>; unfortunately, wall time is itself ambiguous. |
Hardware clocks
List of hardware clocks
- HPET: A High Precision Event Timer (HPET) chip consists of a 64-bit up-counter (main counter) counting at least at 10 MHz and a set of up to 256 comparators (at least 3). Each HPET can have up to 32 timers. HPET can cause around 3 seconds of drift per day.
- TSC (Time Stamp Counter): Historically, the TSC increased with every internal processor clock cycle, so its rate varied with CPU frequency scaling for power saving; on recent processors the rate is usually constant (even if the processor changes frequency) and usually equals the maximum processor frequency. Different cores may have different TSC values, and hibernating the system resets the TSC value. The RDTSC instruction can be used to read this counter.
- ACPI Power Management Timer: ACPI 24-bit timer with a frequency of 3.5 MHz (3,579,545 Hz).
- Cyclone: The Cyclone timer uses a 32-bit counter on IBM Extended X-Architecture (EXA) chipsets which include computers that use the IBM "Summit" series chipsets (ex: x440). This is available in IA32 and IA64 architectures.
- PIT (programmable interrupt timer): Intel 8253/8254 chipsets with a configurable frequency in range 18.2 Hz - 1.2 MHz. It uses a 16-bit counter.
- RTC (Real-time clock). Most RTCs use a crystal oscillator with a frequency of 32,768 Hz.
Linux clocksource
There were four implementations of timekeeping in the Linux kernel: UTIME (1996), timer wheel (1997), HRT (2001) and hrtimers (2007). The latter is the result of the "high-res-timers" project started by George Anzinger in 2001, with contributions by Thomas Gleixner and Douglas Niehaus. The hrtimers implementation was merged into Linux 2.6.21, released in 2007.
hrtimers supports various clock sources. It assigns a priority to each source to decide which one will be used. Linux supports the following clock sources:
- tsc
- hpet
- pit
- pmtmr: ACPI Power Management Timer
- cyclone
High-resolution timers are not supported on all hardware architectures. They are at least provided on x86/x86_64, ARM and PowerPC.
clock_getres() returns 1 nanosecond for CLOCK_REALTIME and CLOCK_MONOTONIC regardless of the underlying clock source. Read Re: clock_getres() and real resolution from Thomas Gleixner (9 Feb 2012) for an explanation.
The /sys/devices/system/clocksource/clocksource0 directory contains two useful files:
- available_clocksource: list of available clock sources
- current_clocksource: clock source currently used. It is possible to change the current clocksource by writing the name of a clocksource into this file.
/proc/timer_list contains the list of all hardware timers.
Read also the time(7) manual page: "overview of time and timers".
FreeBSD timecounter
kern.timecounter.choice lists available hardware clocks with their priority. The sysctl program can be used to change the timecounter. Example:
    # dmesg | grep Timecounter
    Timecounter "i8254" frequency 1193182 Hz quality 0
    Timecounter "ACPI-safe" frequency 3579545 Hz quality 850
    Timecounter "HPET" frequency 100000000 Hz quality 900
    Timecounter "TSC" frequency 3411154800 Hz quality 800
    Timecounters tick every 10.000 msec
    # sysctl kern.timecounter.choice
    kern.timecounter.choice: TSC(800) HPET(900) ACPI-safe(850) i8254(0) dummy(-1000000)
    # sysctl kern.timecounter.hardware="ACPI-fast"
    kern.timecounter.hardware: HPET -> ACPI-fast
Available clocks:
- "TSC": Time Stamp Counter of the processor
- "HPET": High Precision Event Timer
- "ACPI-fast": ACPI Power Management timer (fast mode)
- "ACPI-safe": ACPI Power Management timer (safe mode)
- "i8254": PIT with Intel 8254 chipset
The commit 222222 (May 2011) decreased ACPI-fast timecounter quality to 900 and increased HPET timecounter quality to 950: "HPET on modern platforms usually have better resolution and lower latency than ACPI timer".
Read Timecounters: Efficient and precise timekeeping in SMP kernels by Poul-Henning Kamp (2002) for the FreeBSD Project.
Performance
Reading a hardware clock has a cost. The following table compares the performance of different hardware clocks on Linux 3.3 with Intel Core i7-2600 at 3.40GHz (8 cores). The bench_time.c program was used to fill these tables.
| Function | TSC | ACPI PM | HPET |
|---|---|---|---|
| time() | 2 ns | 2 ns | 2 ns |
| CLOCK_REALTIME_COARSE | 10 ns | 10 ns | 10 ns |
| CLOCK_MONOTONIC_COARSE | 12 ns | 13 ns | 12 ns |
| CLOCK_THREAD_CPUTIME_ID | 134 ns | 135 ns | 135 ns |
| CLOCK_PROCESS_CPUTIME_ID | 127 ns | 129 ns | 129 ns |
| clock() | 146 ns | 146 ns | 143 ns |
| gettimeofday() | 23 ns | 726 ns | 637 ns |
| CLOCK_MONOTONIC_RAW | 31 ns | 716 ns | 607 ns |
| CLOCK_REALTIME | 27 ns | 707 ns | 629 ns |
| CLOCK_MONOTONIC | 27 ns | 723 ns | 635 ns |
FreeBSD 8.0 in kvm with hardware virtualization:
| Function | TSC | ACPI-Safe | HPET | i8254 |
|---|---|---|---|---|
| time() | 191 ns | 188 ns | 189 ns | 188 ns |
| CLOCK_SECOND | 187 ns | 184 ns | 187 ns | 183 ns |
| CLOCK_REALTIME_FAST | 189 ns | 180 ns | 187 ns | 190 ns |
| CLOCK_UPTIME_FAST | 191 ns | 185 ns | 186 ns | 196 ns |
| CLOCK_MONOTONIC_FAST | 188 ns | 187 ns | 188 ns | 189 ns |
| CLOCK_THREAD_CPUTIME_ID | 208 ns | 206 ns | 207 ns | 220 ns |
| CLOCK_VIRTUAL | 280 ns | 279 ns | 283 ns | 296 ns |
| CLOCK_PROF | 289 ns | 280 ns | 282 ns | 286 ns |
| clock() | 342 ns | 340 ns | 337 ns | 344 ns |
| CLOCK_UPTIME_PRECISE | 197 ns | 10380 ns | 4402 ns | 4097 ns |
| CLOCK_REALTIME | 196 ns | 10376 ns | 4337 ns | 4054 ns |
| CLOCK_MONOTONIC_PRECISE | 198 ns | 10493 ns | 4413 ns | 3958 ns |
| CLOCK_UPTIME | 197 ns | 10523 ns | 4458 ns | 4058 ns |
| gettimeofday() | 202 ns | 10524 ns | 4186 ns | 3962 ns |
| CLOCK_REALTIME_PRECISE | 197 ns | 10599 ns | 4394 ns | 4060 ns |
| CLOCK_MONOTONIC | 201 ns | 10766 ns | 4498 ns | 3943 ns |
Each function was called 100,000 times and CLOCK_MONOTONIC was used to get the time before and after. The benchmark was run 5 times, keeping the minimum time.
NTP adjustment
NTP has different methods to adjust a clock:
- "slewing": change the clock frequency to be slightly faster or slower (which is done with adjtime()). Since the slew rate is limited to 0.5 millisecond per second, each second of adjustment requires an amortization interval of 2000 seconds. Thus, an adjustment of many seconds can take hours or days to amortize.
- "stepping": jump by a large amount in a single discrete step (which is done with settimeofday())
By default, the time is slewed if the offset is less than 128 ms, but stepped otherwise.
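The amortization interval follows directly from the maximum slew rate:

```python
# NTP slewing corrects at most 0.5 ms per second of real time, so each
# second of offset needs an amortization interval of 2000 seconds.
def amortization_seconds(offset):
    """Real time needed to slew away a clock offset (both in seconds)."""
    return abs(offset) * 2000.0       # 1 / 0.0005 s of slew per second

print(amortization_seconds(1.0))      # one second of offset: 2000 s
print(amortization_seconds(120.0))    # two minutes: 240000 s, nearly 3 days
```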
Slewing is generally desirable (i.e. we should use CLOCK_MONOTONIC, not CLOCK_MONOTONIC_RAW) if one wishes to measure "real" time (and not a time-like object like CPU cycles). This is because the clock on the other end of the NTP connection from you is probably better at keeping time: hopefully that thirty-five thousand dollars of Cesium timekeeping goodness is doing something better than your PC's $3 quartz crystal, after all.
Get more detail in the documentation of the NTP daemon.
Operating system time functions
Monotonic Clocks
| Name | C Resolution | Adjusted | Include Sleep | Include Suspend |
|---|---|---|---|---|
| gethrtime() | 1 ns | No | Yes | Yes |
| CLOCK_HIGHRES | 1 ns | No | Yes | Yes |
| CLOCK_MONOTONIC | 1 ns | Slewed on Linux | Yes | No |
| CLOCK_MONOTONIC_COARSE | 1 ns | Slewed on Linux | Yes | No |
| CLOCK_MONOTONIC_RAW | 1 ns | No | Yes | No |
| CLOCK_BOOTTIME | 1 ns | ? | Yes | Yes |
| CLOCK_UPTIME | 1 ns | No | Yes | ? |
| mach_absolute_time() | 1 ns | No | Yes | No |
| QueryPerformanceCounter() | - | No | Yes | ? |
| GetTickCount[64]() | 1 ms | No | Yes | Yes |
| timeGetTime() | 1 ms | No | Yes | ? |
The "C Resolution" column is the resolution of the underlying C structure.
Examples of clock resolution on x86_64:
| Name | Operating system | OS Resolution | Python Resolution |
|---|---|---|---|
| QueryPerformanceCounter | Windows Seven | 10 ns | 10 ns |
| CLOCK_HIGHRES | SunOS 5.11 | 2 ns | 265 ns |
| CLOCK_MONOTONIC | Linux 3.0 | 1 ns | 322 ns |
| CLOCK_MONOTONIC_RAW | Linux 3.3 | 1 ns | 628 ns |
| CLOCK_BOOTTIME | Linux 3.3 | 1 ns | 628 ns |
| mach_absolute_time() | Mac OS 10.6 | 1 ns | 3 µs |
| CLOCK_MONOTONIC | FreeBSD 8.2 | 11 ns | 5 µs |
| CLOCK_MONOTONIC | OpenBSD 5.0 | 10 ms | 5 µs |
| CLOCK_UPTIME | FreeBSD 8.2 | 11 ns | 6 µs |
| CLOCK_MONOTONIC_COARSE | Linux 3.3 | 1 ms | 1 ms |
| CLOCK_MONOTONIC_COARSE | Linux 3.0 | 4 ms | 4 ms |
| GetTickCount64() | Windows Seven | 16 ms | 15 ms |
The "OS Resolution" is the resolution announced by the operating system. The "Python Resolution" is the smallest difference between two calls to the time function computed in Python using the clock_resolution.py program.
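The measurement method can be sketched as follows (a simplified stand-in for clock_resolution.py, not its actual code):

```python
import time

def python_resolution(clock, samples=10_000):
    """Smallest non-zero difference observed between successive clock calls."""
    smallest = None
    previous = clock()
    for _ in range(samples):
        now = clock()
        diff = now - previous
        if diff > 0 and (smallest is None or diff < smallest):
            smallest = diff
        previous = now
    return smallest

res = python_resolution(time.monotonic)
print(res)
```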
mach_absolute_time
Mac OS X provides a monotonic clock: mach_absolute_time(). It is based on absolute elapsed time since system boot. It is not adjusted and cannot be set.
mach_timebase_info() gives a fraction to convert the clock value to a number of nanoseconds. See also the Technical Q&A QA1398.
mach_absolute_time() stops during a sleep on a PowerPC CPU, but not on an Intel CPU: Different behaviour of mach_absolute_time() on i386/ppc.
CLOCK_MONOTONIC, CLOCK_MONOTONIC_RAW, CLOCK_BOOTTIME
CLOCK_MONOTONIC and CLOCK_MONOTONIC_RAW represent monotonic time since some unspecified starting point. They cannot be set. The resolution can be read using clock_getres().
Documentation: refer to the manual page of your operating system.
CLOCK_MONOTONIC is available at least on the following operating systems:
- DragonFly BSD, FreeBSD >= 5.0, OpenBSD, NetBSD
- Linux
- Solaris
The following operating systems don't support CLOCK_MONOTONIC:
- GNU/Hurd (see open issues/ clock_gettime)
- Mac OS X
- Windows
On Linux, NTP may adjust the CLOCK_MONOTONIC rate (slewed), but it cannot jump backward.
CLOCK_MONOTONIC_RAW is specific to Linux. It is similar to CLOCK_MONOTONIC, but provides access to a raw hardware-based time that is not subject to NTP adjustments. CLOCK_MONOTONIC_RAW requires Linux 2.6.28 or later.
Linux 2.6.39 and glibc 2.14 introduced a new clock: CLOCK_BOOTTIME. CLOCK_BOOTTIME is identical to CLOCK_MONOTONIC, except that it also includes any time spent in suspend. Read also Waking systems from suspend (March, 2011).
CLOCK_MONOTONIC stops while the machine is suspended.
Linux also provides CLOCK_MONOTONIC_COARSE since Linux 2.6.32. It is similar to CLOCK_MONOTONIC but less precise and faster.
clock_gettime() fails if the system does not support the specified clock, even if the standard C library supports it. For example, CLOCK_MONOTONIC_RAW requires a kernel version 2.6.28 or later.
Windows: QueryPerformanceCounter
High-resolution performance counter. It is monotonic. The frequency of the counter can be read using QueryPerformanceFrequency(). The resolution is 1 / QueryPerformanceFrequency().
It has a much higher resolution, but lower long-term precision than the GetTickCount() and timeGetTime() clocks; for example, it drifts compared to these low-precision clocks.
Hardware clocks used by QueryPerformanceCounter:
- Windows XP: RDTSC instruction of Intel processors, the clock frequency is the frequency of the processor (between 200 MHz and 3 GHz, usually greater than 1 GHz nowadays).
- Windows 2000: ACPI power management timer, frequency = 3,579,545 Hz. It can be forced through the "/usepmtimer" flag in boot.ini.
QueryPerformanceFrequency() should only be called once: the frequency will not change while the system is running. It fails if the installed hardware does not support a high-resolution performance counter.
QueryPerformanceCounter() cannot be adjusted: SetSystemTimeAdjustment() only adjusts the system time.
Bugs:
- The performance counter value may unexpectedly leap forward because of a hardware bug, see KB274323 [4].
- On VirtualBox, QueryPerformanceCounter() does not increment the high part every time the low part overflows, see Monotonic timers (2009).
- VirtualBox had a bug in its HPET virtualized device: QueryPerformanceCounter() jumped forward by approximately 42 seconds (issue #8707).
- Windows XP had a bug (see KB896256 [3]): on a multiprocessor computer, QueryPerformanceCounter() returned a different value for each processor. The bug was fixed in Windows XP SP2.
- Issues with processors with variable frequency: the frequency is changed depending on the workload to reduce power consumption.
- Chromium does not use QueryPerformanceCounter() on Athlon X2 CPUs (model 15) because "QueryPerformanceCounter is unreliable" (see base/time_win.cc in the Chromium source code).
Windows: GetTickCount(), GetTickCount64()
GetTickCount() and GetTickCount64() are monotonic, cannot fail and are not adjusted by SetSystemTimeAdjustment(). MSDN documentation: GetTickCount(), GetTickCount64(). The resolution can be read using GetSystemTimeAdjustment().
The elapsed time retrieved by GetTickCount() or GetTickCount64() includes time the system spends in sleep or hibernation.
GetTickCount64() was added to Windows Vista and Windows Server 2008.
It is possible to improve the precision using the undocumented NtSetTimerResolution() function. There are applications using this undocumented function, example: Timer Resolution.
WaitForSingleObject() uses the same timer as GetTickCount() with the same precision.
Windows: timeGetTime
The timeGetTime function retrieves the system time, in milliseconds. The system time is the time elapsed since Windows was started. Read the timeGetTime() documentation.
The return type of timeGetTime() is a 32-bit unsigned integer. Like GetTickCount(), timeGetTime() rolls over after 2^32 milliseconds (49.7 days).
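Because the counter is a 32-bit unsigned value, elapsed-time arithmetic must be done modulo 2^32 to survive a rollover; a small sketch:

```python
# A 32-bit tick counter such as timeGetTime() or GetTickCount() wraps
# after 2**32 ms; subtracting modulo 2**32 still yields a correct elapsed
# time as long as the interval is shorter than 49.7 days.
def elapsed_ms(start, end):
    return (end - start) & 0xFFFFFFFF

print(elapsed_ms(100, 250))                    # 150
# The counter wrapped between the two readings:
print(elapsed_ms(0xFFFFFFF0, 0x00000010))      # 32
```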
The elapsed time retrieved by timeGetTime() includes time the system spends in sleep.
The default precision of the timeGetTime function can be five milliseconds or more, depending on the machine.
timeBeginPeriod() can be used to increase the precision of timeGetTime() up to 1 millisecond, but it negatively affects power consumption. Calling timeBeginPeriod() also affects the granularity of some other timing calls, such as CreateWaitableTimer(), WaitForSingleObject() and Sleep().
Note
timeGetTime() and timeBeginPeriod() are part of the Windows multimedia library, so the program must be linked against winmm or load the library dynamically.
Solaris: CLOCK_HIGHRES
The Solaris OS provides CLOCK_HIGHRES, a nonadjustable, high-resolution clock. For timers created with a clockid_t value of CLOCK_HIGHRES, the system attempts to use an optimal hardware source, which may give close to nanosecond resolution.
The resolution of CLOCK_HIGHRES can be read using clock_getres().
Solaris: gethrtime
The gethrtime() function returns the current high-resolution real time. Time is expressed as nanoseconds since some arbitrary time in the past; it is not correlated in any way to the time of day, and thus is not subject to resetting or drifting by way of adjtime() or settimeofday(). The hires timer is ideally suited to performance measurement tasks, where cheap, accurate interval timing is required.
The linearity of gethrtime() is not preserved across a suspend-resume cycle (Bug 4272663).
Read the gethrtime() manual page of Solaris 11.
On Solaris, gethrtime() is the same as clock_gettime(CLOCK_MONOTONIC).
System Time
| Name | C Resolution | Include Sleep | Include Suspend |
|---|---|---|---|
| CLOCK_REALTIME | 1 ns | Yes | Yes |
| CLOCK_REALTIME_COARSE | 1 ns | Yes | Yes |
| GetSystemTimeAsFileTime | 100 ns | Yes | Yes |
| gettimeofday() | 1 µs | Yes | Yes |
| ftime() | 1 ms | Yes | Yes |
| time() | 1 sec | Yes | Yes |
The "C Resolution" column is the resolution of the underlying C structure.
Examples of clock resolution on x86_64:
| Name | Operating system | OS Resolution | Python Resolution |
|---|---|---|---|
| CLOCK_REALTIME | SunOS 5.11 | 10 ms | 238 ns |
| CLOCK_REALTIME | Linux 3.0 | 1 ns | 238 ns |
| gettimeofday() | Mac OS 10.6 | 1 µs | 4 µs |
| CLOCK_REALTIME | FreeBSD 8.2 | 11 ns | 6 µs |
| CLOCK_REALTIME | OpenBSD 5.0 | 10 ms | 5 µs |
| CLOCK_REALTIME_COARSE | Linux 3.3 | 1 ms | 1 ms |
| CLOCK_REALTIME_COARSE | Linux 3.0 | 4 ms | 4 ms |
| GetSystemTimeAsFileTime() | Windows Seven | 16 ms | 1 ms |
| ftime() | Windows Seven | - | 1 ms |
The "OS Resolution" is the resolution announced by the operating system. The "Python Resolution" is the smallest difference between two calls to the time function computed in Python using the clock_resolution.py program.
Windows: GetSystemTimeAsFileTime
The system time can be read using GetSystemTimeAsFileTime(), ftime() and time(). The resolution of the system time can be read using GetSystemTimeAdjustment().
Read the GetSystemTimeAsFileTime() documentation.
The system time can be set using SetSystemTime().
System time on UNIX
gettimeofday(), ftime(), time() and clock_gettime(CLOCK_REALTIME) return the system time. The resolution of CLOCK_REALTIME can be read using clock_getres().
The system time can be set using settimeofday() or clock_settime(CLOCK_REALTIME).
Linux also provides CLOCK_REALTIME_COARSE since Linux 2.6.32. It is similar to CLOCK_REALTIME but less precise and faster.
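These interfaces all read the same system clock, which can be checked from Python on a POSIX system:

```python
import time

# time.time() and clock_gettime(CLOCK_REALTIME) both report the system
# time; two back-to-back readings should agree closely (POSIX only).
if hasattr(time, "clock_gettime"):
    realtime = time.clock_gettime(time.CLOCK_REALTIME)
    system = time.time()
    assert abs(system - realtime) < 1.0
```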
Alexander Shishkin proposed an API for Linux to be notified when the system clock is changed: timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes (4th version of the API, March 2011). The API is not accepted yet, but CLOCK_BOOTTIME provides a similar feature.
Process Time
The process time cannot be set. It is not monotonic: the clocks stop while the process is idle.
| Name | C Resolution | Include Sleep | Include Suspend |
|---|---|---|---|
| GetProcessTimes() | 100 ns | No | No |
| CLOCK_PROCESS_CPUTIME_ID | 1 ns | No | No |
| getrusage(RUSAGE_SELF) | 1 µs | No | No |
| times() | - | No | No |
| clock() | - | Yes on Windows, No otherwise | No |
The "C Resolution" column is the resolution of the underlying C structure.
Examples of clock resolution on x86_64:
| Name | Operating system | OS Resolution | Python Resolution |
|---|---|---|---|
| CLOCK_PROCESS_CPUTIME_ID | Linux 3.3 | 1 ns | 1 ns |
| CLOCK_PROF | FreeBSD 8.2 | 10 ms | 1 µs |
| getrusage(RUSAGE_SELF) | FreeBSD 8.2 | - | 1 µs |
| getrusage(RUSAGE_SELF) | SunOS 5.11 | - | 1 µs |
| CLOCK_PROCESS_CPUTIME_ID | Linux 3.0 | 1 ns | 1 µs |
| getrusage(RUSAGE_SELF) | Mac OS 10.6 | - | 5 µs |
| clock() | Mac OS 10.6 | 1 µs | 5 µs |
| CLOCK_PROF | OpenBSD 5.0 | - | 5 µs |
| getrusage(RUSAGE_SELF) | Linux 3.0 | - | 4 ms |
| getrusage(RUSAGE_SELF) | OpenBSD 5.0 | - | 8 ms |
| clock() | FreeBSD 8.2 | 8 ms | 8 ms |
| clock() | Linux 3.0 | 1 µs | 10 ms |
| times() | Linux 3.0 | 10 ms | 10 ms |
| clock() | OpenBSD 5.0 | 10 ms | 10 ms |
| times() | OpenBSD 5.0 | 10 ms | 10 ms |
| times() | Mac OS 10.6 | 10 ms | 10 ms |
| clock() | SunOS 5.11 | 1 µs | 10 ms |
| times() | SunOS 5.11 | 1 µs | 10 ms |
| GetProcessTimes() | Windows Seven | 16 ms | 16 ms |
| clock() | Windows Seven | 1 ms | 1 ms |
The "OS Resolution" is the resolution announced by the operating system. The "Python Resolution" is the smallest difference between two calls to the time function computed in Python using the clock_resolution.py program.
Functions
- Windows: GetProcessTimes(). The resolution can be read using GetSystemTimeAdjustment().
- clock_gettime(CLOCK_PROCESS_CPUTIME_ID): High-resolution per-process timer from the CPU. The resolution can be read using clock_getres().
- clock(). The resolution is 1 / CLOCKS_PER_SEC.
- Windows: The elapsed wall-clock time since the start of the process (elapsed time in seconds times CLOCKS_PER_SEC). It includes time elapsed during sleep and can fail.
- UNIX: returns an approximation of processor time used by the program.
- getrusage(RUSAGE_SELF) returns a structure of resource usage for the current process. ru_utime is the user CPU time and ru_stime is the system CPU time.
- times(): structure of process times. The resolution is 1 / ticks_per_second, where ticks_per_second is sysconf(_SC_CLK_TCK) or the HZ constant.
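Several of these interfaces are reachable from Python (the resource module is POSIX-only):

```python
import os
import time

cpu = time.process_time()       # process-wide CPU time (user + system)
t = os.times()                  # wraps times(): .user and .system fields

try:
    import resource             # POSIX only
    usage = resource.getrusage(resource.RUSAGE_SELF)
    print(f"user={usage.ru_utime:.6f} s, system={usage.ru_stime:.6f} s")
except ImportError:
    pass

print(f"process_time={cpu:.6f} s, times.user={t.user:.6f} s")
```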
Python source code includes a portable library to get the process time (CPU time): Tools/pybench/systimes.py.
See also the QueryProcessCycleTime() function (sum of the cycle time of all threads) and clock_getcpuclockid().
Thread Time
The thread time cannot be set. It is not monotonic: the clocks stop while the thread is idle.
| Name | C Resolution | Include Sleep | Include Suspend |
|---|---|---|---|
| CLOCK_THREAD_CPUTIME_ID | 1 ns | Yes | Epoch changes |
| GetThreadTimes() | 100 ns | No | ? |
The "C Resolution" column is the resolution of the underlying C structure.
Examples of clock resolution on x86_64:
| Name | Operating system | OS Resolution | Python Resolution |
|---|---|---|---|
| CLOCK_THREAD_CPUTIME_ID | FreeBSD 8.2 | 1 µs | 1 µs |
| CLOCK_THREAD_CPUTIME_ID | Linux 3.3 | 1 ns | 649 ns |
| GetThreadTimes() | Windows Seven | 16 ms | 16 ms |
The "OS Resolution" is the resolution announced by the operating system. The "Python Resolution" is the smallest difference between two calls to the time function computed in Python using the clock_resolution.py program.
Functions
- Windows: GetThreadTimes(). The resolution can be read using GetSystemTimeAdjustment().
- clock_gettime(CLOCK_THREAD_CPUTIME_ID): Thread-specific CPU-time clock. It uses a number of CPU cycles, not a number of seconds. The resolution can be read using clock_getres().
See also the QueryThreadCycleTime() function (cycle time for the specified thread) and pthread_getcpuclockid().
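On a POSIX system the per-thread clock can be read directly:

```python
import time

# Read the per-thread CPU clock (POSIX only; requires
# CLOCK_THREAD_CPUTIME_ID support).
if hasattr(time, "CLOCK_THREAD_CPUTIME_ID"):
    t1 = time.clock_gettime(time.CLOCK_THREAD_CPUTIME_ID)
    sum(range(100_000))         # burn some CPU in this thread
    t2 = time.clock_gettime(time.CLOCK_THREAD_CPUTIME_ID)
    assert t2 >= t1             # thread CPU time only advances while running
```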
Windows: QueryUnbiasedInterruptTime
Gets the current unbiased interrupt time from the biased interrupt time and the current sleep bias amount. This time is not affected by power management sleep transitions.
The elapsed time retrieved by the QueryUnbiasedInterruptTime function includes only time that the system spends in the working state. QueryUnbiasedInterruptTime() is not monotonic.
QueryUnbiasedInterruptTime() was introduced in Windows 7.
See also QueryIdleProcessorCycleTime() function (cycle time for the idle thread of each processor)
Sleep
Suspend execution of the process for the given number of seconds. Sleep is not affected by system time updates, but is paused during system suspend. For example, if a process sleeps for 60 seconds and the system is suspended for 30 seconds in the middle of the sleep, the sleep lasts 90 seconds of real time.
Sleep can be interrupted by a signal: the function fails with EINTR.
| Name | C Resolution |
|---|---|
| nanosleep() | 1 ns |
| clock_nanosleep() | 1 ns |
| usleep() | 1 µs |
| delay() | 1 µs |
| sleep() | 1 sec |
Other functions:
| Name | C Resolution |
|---|---|
| sigtimedwait() | 1 ns |
| pthread_cond_timedwait() | 1 ns |
| sem_timedwait() | 1 ns |
| select() | 1 µs |
| epoll() | 1 ms |
| poll() | 1 ms |
| WaitForSingleObject() | 1 ms |
The "C Resolution" column is the resolution of the underlying C structure.
Functions
- sleep(seconds)
- usleep(microseconds)
- nanosleep(nanoseconds, remaining): Linux manpage of nanosleep()
- delay(milliseconds)
clock_nanosleep
clock_nanosleep(clock_id, flags, nanoseconds, remaining): Linux manpage of clock_nanosleep().
If flags is TIMER_ABSTIME, then request is interpreted as an absolute time as measured by the clock, clock_id. If request is less than or equal to the current value of the clock, then clock_nanosleep() returns immediately without suspending the calling thread.
POSIX.1 specifies that changing the value of the CLOCK_REALTIME clock via clock_settime(2) shall have no effect on a thread that is blocked on a relative clock_nanosleep().
select()
select(nfds, readfds, writefds, exceptfds, timeout).
Since Linux 2.6.28, select() uses high-resolution timers to handle the timeout. A process has a "slack" attribute to configure the precision of the timeout, the default slack is 50 microseconds. Before Linux 2.6.28, timeouts for select() were handled by the main timing subsystem at a jiffy-level resolution. Read also High- (but not too high-) resolution timeouts and Timer slack.
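The timeout behaviour described above can be observed from Python. The sketch below (a demonstration written for this document, not part of the PEP) times a select() call with a 10 ms timeout on a socket pair that never becomes readable, so the call always runs to its deadline; on a recent Linux kernel the overshoot beyond 10 ms is roughly the timer slack.

```python
import select
import socket
import time

# A connected pair of sockets; nothing is ever written, so select()
# must wait for the full timeout before returning empty lists.
r, w = socket.socketpair()

start = time.perf_counter()
select.select([r], [], [], 0.010)   # 10 ms timeout
elapsed = time.perf_counter() - start

r.close()
w.close()
```

On pre-2.6.28 kernels, where timeouts were rounded up to jiffies, the measured overshoot would be much larger.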
Other functions
- poll(), epoll()
- sigtimedwait(). POSIX: "If the Monotonic Clock option is supported, the CLOCK_MONOTONIC clock shall be used to measure the time interval specified by the timeout argument."
- pthread_cond_timedwait(), pthread_condattr_setclock(). "The default value of the clock attribute shall refer to the system clock."
- sem_timedwait(): "If the Timers option is supported, the timeout shall be based on the CLOCK_REALTIME clock. If the Timers option is not supported, the timeout shall be based on the system time as returned by the time() function. The precision of the timeout shall be the precision of the clock on which it is based."
- WaitForSingleObject(): uses the same timer as GetTickCount(), with the same precision.
System Standby
The ACPI power state "S3" is a system standby mode, also called "Suspend to RAM". RAM remains powered.
On Windows, the WM_POWERBROADCAST message is sent to Windows applications to notify them of power-management events (e.g. the power status has changed).
For Mac OS X, read Registering and unregistering for sleep and wake notifications (Technical Q&A QA1340).
Footnotes
| [2] | "_time" is a hypothetical module only used for the example. The time module is implemented in C and so there is no need for such a module. |
Links
Related Python issues:
- Issue #12822: NewGIL should use CLOCK_MONOTONIC if possible.
- Issue #14222: Use time.steady() to implement timeout
- Issue #14309: Deprecate time.clock()
- Issue #14397: Use GetTickCount/GetTickCount64 instead of QueryPerformanceCounter for monotonic clock
- Issue #14428: Implementation of the PEP 418
- Issue #14555: clock_gettime/settime/getres: Add more clock identifiers
Libraries exposing monotonic clocks:
- Java: System.nanoTime
- Qt library: QElapsedTimer
- glib library: g_get_monotonic_time() uses GetTickCount64()/GetTickCount() on Windows, clock_gettime(CLOCK_MONOTONIC) on UNIX, or falls back to the system clock
- python-monotonic-time (github)
- Monoclock.nano_count() uses clock_gettime(CLOCK_MONOTONIC) and returns a number of nanoseconds
- monotonic_clock by Thomas Habets
- Perl: Time::HiRes exposes clock_gettime(CLOCK_MONOTONIC)
- Ruby: AbsoluteTime.now: uses clock_gettime(CLOCK_MONOTONIC), mach_absolute_time() or gettimeofday(). The AbsoluteTime.monotonic? method indicates whether AbsoluteTime.now is monotonic or not.
- libpthread: POSIX thread library for Windows (clock.c)
- Boost.Chrono uses:
- system_clock:
- mac = gettimeofday()
- posix = clock_gettime(CLOCK_REALTIME)
- win = GetSystemTimeAsFileTime()
- steady_clock:
- mac = mach_absolute_time()
- posix = clock_gettime(CLOCK_MONOTONIC)
- win = QueryPerformanceCounter()
- high_resolution_clock:
- steady_clock if available, otherwise system_clock
Time:
- Twisted issue #2424: Add reactor option to start with monotonic clock
- gettimeofday() should never be used to measure time by Thomas Habets (2010-09-05)
- hrtimers - subsystem for high-resolution kernel timers
- C++ Timeout Specification by Lawrence Crowl (2010-08-19)
- Windows: Game Timing and Multicore Processors by Chuck Walbourn (December 2005)
- Implement a Continuously Updating, High-Resolution Time Provider for Windows by Johan Nilsson (March 2004)
- clockspeed uses a hardware tick counter to compensate for a persistently fast or slow system time, by D. J. Bernstein (1998)
- Retrieving system time lists hardware clocks and time functions with their resolution and epoch or range
- On Windows, the JavaScript runtime of Firefox interpolates GetSystemTimeAsFileTime() with QueryPerformanceCounter() to get a higher resolution. See Bug 363258 - bad millisecond resolution for (new Date).getTime() / Date.now() on Windows.
- When microseconds matter: How the IBM High Resolution Time Stamp Facility accurately measures itty bits of time, by W. Nathaniel Mills, III (Apr 2002)
- Win32 Performance Measurement Options by Matthew Wilson (May, 2003)
- Counter Availability and Characteristics for Feed-forward Based Synchronization by Timothy Broomhead, Julien Ridoux, Darryl Veitch (2009)
- System Management Interrupt (SMI) issues:
- System Management Interrupt Free Hardware by Keith Mannthey (2009)
- IBM Real-Time "SMI Free" mode driver by Keith Mannthey (Feb 2009)
- Fixing Realtime problems caused by SMI on Ubuntu
- [RFC] simple SMI detector by Jon Masters (Jan 2009)
- [PATCH 2.6.34-rc3] A nonintrusive SMI sniffer for x86 by Joe Korty (2010-04)
Acceptance
The PEP was accepted on 2012-04-28 by Guido van Rossum [1]. The PEP implementation has since been committed to the repository.
References
| [1] | http://mail.python.org/pipermail/python-dev/2012-April/119094.html |
| [3] | http://support.microsoft.com/?id=896256 |
| [4] | http://support.microsoft.com/?id=274323 |
Copyright
This document has been placed in the public domain.
pep-0419 Protecting cleanup statements from interruptions
| PEP: | 419 |
|---|---|
| Title: | Protecting cleanup statements from interruptions |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Paul Colomiets <paul at colomiets.name> |
| Status: | Deferred |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 06-Apr-2012 |
| Python-Version: | 3.3 |
Contents
Abstract
This PEP proposes a way to protect Python code from being interrupted inside a finally clause or during context manager cleanup.
PEP Deferral
Further exploration of the concepts covered in this PEP has been deferred for lack of a current champion interested in promoting the goals of the PEP and collecting and incorporating feedback, and with sufficient available time to do so effectively.
Rationale
Python has two nice ways to do cleanup. One is a finally statement and the other is a context manager (usually called using a with statement). However, neither is protected from interruption by KeyboardInterrupt or GeneratorExit caused by generator.throw(). For example:
lock.acquire()
try:
print('starting')
do_something()
finally:
print('finished')
lock.release()
If KeyboardInterrupt occurs just after the second print() call, the lock will not be released. Similarly, the following code using the with statement is affected:
from threading import Lock
class MyLock:
def __init__(self):
self._lock_impl = Lock()
def __enter__(self):
self._lock_impl.acquire()
print("LOCKED")
    def __exit__(self, exc_type, exc_value, traceback):
print("UNLOCKING")
self._lock_impl.release()
lock = MyLock()
with lock:
    do_something()
If KeyboardInterrupt occurs near any of the print() calls, the lock will never be released.
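The hazard can be simulated deterministically by raising an exception at the point where a real KeyboardInterrupt might land. This sketch (written for this document, with simulate_interrupt() standing in for a signal arriving inside __exit__) shows a variant of the MyLock class above leaving its lock permanently held:

```python
from threading import Lock

def simulate_interrupt():
    # Stand-in for a KeyboardInterrupt delivered inside __exit__().
    raise KeyboardInterrupt

class MyLock:
    def __init__(self):
        self._lock_impl = Lock()

    def __enter__(self):
        self._lock_impl.acquire()

    def __exit__(self, exc_type, exc_value, traceback):
        simulate_interrupt()        # "interrupted" before the release
        self._lock_impl.release()   # never reached

lock = MyLock()
try:
    with lock:
        pass
except KeyboardInterrupt:
    pass

# The lock is still held; any further acquire() would block forever.
```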
Coroutine Use Case
A similar case occurs with coroutines. Usually coroutine libraries want to interrupt the coroutine with a timeout. The generator.throw() method works for this use case, but there is no way of knowing if the coroutine is currently suspended inside a finally clause.
An example that uses yield-based coroutines follows. The code looks similar using any of the popular coroutine libraries Monocle [1], Bluelet [2], or Twisted [3].
def run_locked():
yield connection.sendall('LOCK')
try:
yield do_something()
yield do_something_else()
finally:
yield connection.sendall('UNLOCK')
with timeout(5):
yield run_locked()
In the example above, yield something means to pause executing the current coroutine and to execute coroutine something until it finishes execution. Therefore the coroutine library itself needs to maintain a stack of generators. The connection.sendall() call waits until the socket is writable and does a similar thing to what socket.sendall() does.
The with statement ensures that all code is executed within 5 seconds timeout. It does so by registering a callback in the main loop, which calls generator.throw() on the top-most frame in the coroutine stack when a timeout happens.
The greenlets extension works in a similar way, except that it doesn't need yield to enter a new stack frame. Otherwise considerations are similar.
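The problem can be demonstrated with plain generators: throwing a second time, while the generator is suspended at a yield inside its finally clause, aborts the cleanup. This sketch (an illustration written for this document, not taken from the PEP) records which cleanup steps actually ran:

```python
log = []

def coroutine():
    try:
        yield 'working'
    finally:
        log.append('cleanup started')
        yield 'cleaning'               # suspension point inside finally
        log.append('cleanup finished')

gen = coroutine()
next(gen)                    # start; suspended at the first yield
gen.throw(TimeoutError)      # finally begins; returns 'cleaning'
try:
    gen.throw(TimeoutError)  # lands inside finally: cleanup is aborted
except TimeoutError:
    pass
# log is now ['cleanup started'] -- 'cleanup finished' never ran
```

This is exactly the situation the f_in_cleanup flag is meant to detect before calling throw().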
Specification
Frame Flag 'f_in_cleanup'
A new flag on the frame object is proposed. It is set to True if this frame is currently executing a finally clause. Internally, the flag must be implemented as a counter of nested finally statements currently being executed.
The internal counter also needs to be incremented during execution of the SETUP_WITH and WITH_CLEANUP bytecodes, and decremented when execution of these bytecodes is finished. This also makes it possible to protect the __enter__() and __exit__() methods.
Function 'sys.setcleanuphook'
A new function for the sys module is proposed. This function sets a callback which is executed every time f_in_cleanup becomes false. Callbacks get a frame object as their sole argument, so that they can figure out where they are called from.
The setting is thread local and must be stored in the PyThreadState structure.
Inspect Module Enhancements
Two new functions are proposed for the inspect module: isframeincleanup() and getcleanupframe().
isframeincleanup(), given a frame or generator object as its sole argument, returns the value of the f_in_cleanup attribute of a frame itself or of the gi_frame attribute of a generator.
getcleanupframe(), given a frame object as its sole argument, returns the innermost frame which has a true value of f_in_cleanup, or None if no frames in the stack have a nonzero value for that attribute. It starts to inspect from the specified frame and walks to outer frames using f_back pointers, just like getouterframes() does.
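The walk performed by getcleanupframe() can be sketched in pure Python. Real frames do not carry f_in_cleanup (the attribute is only proposed here), so the sketch uses stand-in objects exposing just the attributes the walk needs:

```python
def getcleanupframe(frame):
    # Walk outwards via f_back, returning the innermost frame whose
    # proposed f_in_cleanup counter is nonzero, or None if none is.
    while frame is not None:
        if getattr(frame, 'f_in_cleanup', 0):
            return frame
        frame = frame.f_back
    return None

class FakeFrame:
    # Minimal stand-in for a frame object, for illustration only.
    def __init__(self, back, in_cleanup):
        self.f_back = back
        self.f_in_cleanup = in_cleanup

outer = FakeFrame(None, 1)   # a frame executing a finally clause
inner = FakeFrame(outer, 0)  # ordinary code called from the cleanup
```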
Example
An example implementation of a SIGINT handler that interrupts safely might look like:
import inspect, sys, functools
def sigint_handler(sig, frame):
if inspect.getcleanupframe(frame) is None:
raise KeyboardInterrupt()
sys.setcleanuphook(functools.partial(sigint_handler, 0))
A coroutine example is out of the scope of this document, because its implementation depends very much on the trampoline (or main loop) used by the coroutine library.
Unresolved Issues
Interruption Inside With Statement Expression
Given the statement
with open(filename):
do_something()
Python can be interrupted after open() is called, but before the SETUP_WITH bytecode is executed. There are two possible decisions:
- Protect with expressions. This would require another bytecode, since currently there is no way of recognizing the start of the with expression.
- Let the user write a wrapper if they consider it important for the use case. A safe wrapper might look like this:
class FileWrapper(object):

    def __init__(self, filename, mode):
        self.filename = filename
        self.mode = mode

    def __enter__(self):
        self.file = open(self.filename, self.mode)
        return self.file

    def __exit__(self, exc_type, exc_value, traceback):
        self.file.close()

Alternatively it can be written using the contextmanager() decorator:
from contextlib import contextmanager

@contextmanager
def open_wrapper(filename, mode):
    file = open(filename, mode)
    try:
        yield file
    finally:
        file.close()

This code is safe, as the first part of the generator (before yield) is executed inside the SETUP_WITH bytecode of the caller.
Exception Propagation
Sometimes a finally clause or an __enter__()/__exit__() method can raise an exception. Usually this is not a problem, since more important exceptions like KeyboardInterrupt or SystemExit should be raised instead. But it may be nice to be able to keep the original exception inside a __context__ attribute. So the cleanup hook signature may grow an exception argument:
def retry_sigint(frame, exception=None):
    if inspect.getcleanupframe(frame) is None:
        raise KeyboardInterrupt() from exception

sys.setcleanuphook(retry_sigint)
Note
There is no need for three arguments as in the __exit__ method, since exceptions carry a __traceback__ attribute in Python 3.
However, this will set the __cause__ for the exception, which is not exactly what's intended. So some hidden interpreter logic may be used to put a __context__ attribute on every exception raised in a cleanup hook.
Interruption Between Acquiring Resource and Try Block
The example from the first section is not totally safe. Let's take a closer look:
lock.acquire()
try:
do_something()
finally:
lock.release()
The problem might occur if the code is interrupted just after lock.acquire() is executed but before the try block is entered.
There is no way the code can be fixed unmodified. The actual fix depends very much on the use case. Usually code can be fixed using a with statement:
with lock:
do_something()
However, for coroutines one usually can't use the with statement because you need to yield for both the acquire and release operations. So the code might be rewritten like this:
try:
yield lock.acquire()
do_something()
finally:
yield lock.release()
The actual locking code might need extra support for this use case, but the implementation is usually trivial: in the finally clause, check whether the lock was actually acquired, and release it only if it was.
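The pattern can be sketched with a plain threading.Lock standing in for the coroutine lock (the class and method names here are invented for the example, not part of any proposed API):

```python
import threading

class SafeLock:
    # Tracks whether acquire() completed, so cleanup code can release
    # unconditionally without risking a "release unlocked lock" error.
    def __init__(self):
        self._lock = threading.Lock()
        self._acquired = False

    def acquire(self):
        self._lock.acquire()
        self._acquired = True

    def release_if_acquired(self):
        if self._acquired:
            self._acquired = False
            self._lock.release()

lock = SafeLock()
try:
    lock.acquire()
finally:
    lock.release_if_acquired()   # safe even if acquire() never ran
```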
Handling EINTR Inside a Finally
Even if a signal handler is prepared to check the f_in_cleanup flag, InterruptedError might be raised in the cleanup handler because the respective system call returned an EINTR error. The primary use cases are already prepared to handle this:
- POSIX mutexes never return EINTR
- Networking libraries are always prepared to handle EINTR
- Coroutine libraries are usually interrupted with the throw() method, not with a signal
The platform-specific function siginterrupt() might be used to remove the need to handle EINTR. However, it may have hardly predictable consequences; for example, a SIGINT handler is never called if the main thread is stuck inside an IO routine.
A better approach would be to have the code, which is usually used in cleanup handlers, be prepared to handle InterruptedError explicitly. An example of such code might be a file-based lock implementation.
signal.pthread_sigmask can be used to block signals inside cleanup handlers which can be interrupted with EINTR.
Setting Interruption Context Inside Finally Itself
Some coroutine libraries may need to set a timeout for the finally clause itself. For example:
try:
do_something()
finally:
with timeout(0.5):
try:
yield do_slow_cleanup()
finally:
yield do_fast_cleanup()
With current semantics, timeout will either protect the whole with block or nothing at all, depending on the implementation of each library. What the author intended is to treat do_slow_cleanup as ordinary code, and do_fast_cleanup as a cleanup (a non-interruptible one).
A similar case might occur when using greenlets or tasklets.
This case can be fixed by exposing f_in_cleanup as a counter, and by calling a cleanup hook on each decrement. A coroutine library may then remember the value at timeout start, and compare it on each hook execution.
But in practice, the example is considered to be too obscure to take into account.
Modifying KeyboardInterrupt
It should be decided if the default SIGINT handler should be modified to use the described mechanism. The initial proposition is to keep old behavior, for two reasons:
- Most applications do not care about cleanup on exit (either they have no external state, or they modify it in a crash-safe way).
- Cleanup may take too much time, not giving the user a chance to interrupt an application.
The latter case can be fixed by allowing an unsafe break if a SIGINT handler is called twice, but it seems not worth the complexity.
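The "second Ctrl-C breaks unconditionally" idea can be sketched as follows. The handler is shown but not installed, and the state dict is invented for the example:

```python
import signal

state = {'deferred': False}

def sigint_handler(signum, frame):
    if state['deferred']:
        # Second SIGINT: the user insists, so break even if unsafe.
        raise KeyboardInterrupt
    # First SIGINT: remember it and let cleanup code finish.
    state['deferred'] = True

# Installing it would be: signal.signal(signal.SIGINT, sigint_handler)
```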
Alternative Python Implementations Support
We consider f_in_cleanup an implementation detail. The actual implementation may pass some fake frame-like object to the signal handler and cleanup hook, and return it from getcleanupframe(). The only requirement is that the inspect module functions work as expected on these objects. For this reason, we also allow passing a generator object to the isframeincleanup() function, which removes the need to use the gi_frame attribute.
It might be necessary to specify that getcleanupframe() must return the same object that will be passed to cleanup hook at the next invocation.
Alternative Names
The original proposal had a f_in_finally frame attribute, as the original intention was to protect finally clauses. But as it grew up to protecting __enter__ and __exit__ methods too, the f_in_cleanup name seems better. Although the __enter__ method is not a cleanup routine, it at least relates to cleanup done by context managers.
setcleanuphook, isframeincleanup and getcleanupframe could be renamed to set_cleanup_hook, is_frame_in_cleanup and get_cleanup_frame, although the original spellings follow the naming convention of their respective modules.
Alternative Proposals
Propagating 'f_in_cleanup' Flag Automatically
This can make getcleanupframe() unnecessary. But for yield-based coroutines you need to propagate it yourself. Making it writable leads to somewhat unpredictable behavior of setcleanuphook().
Add Bytecodes 'INCR_CLEANUP', 'DECR_CLEANUP'
These bytecodes can be used to protect the expression inside the with statement, as well as making counter increments more explicit and easy to debug (visible inside a disassembly). Some middle ground might be chosen, like END_FINALLY and SETUP_WITH implicitly decrementing the counter (END_FINALLY is present at end of every with suite).
However, adding new bytecodes must be considered very carefully.
Expose 'f_in_cleanup' as a Counter
The original intention was to expose a minimum of needed functionality. However, as we consider the frame flag f_in_cleanup an implementation detail, we may expose it as a counter.
Similarly, if we have a counter we may need to have the cleanup hook called on every counter decrement. It's unlikely to have much performance impact as nested finally clauses are an uncommon case.
Add code object flag 'CO_CLEANUP'
As an alternative to setting the flag inside the SETUP_WITH and WITH_CLEANUP bytecodes, we can introduce a flag CO_CLEANUP. When the interpreter starts to execute code with CO_CLEANUP set, it sets f_in_cleanup for the whole function body. This flag is set for code objects of __enter__ and __exit__ special methods. Technically, it might be set on any function named __enter__ or __exit__.
This seems to be a less clear solution. It also covers the case where __enter__ and __exit__ are called manually. This may be accepted either as a feature or as an unnecessary side-effect (or, though unlikely, as a bug).
It may also impose a problem when __enter__ or __exit__ functions are implemented in C, as there is no code object to check for the f_in_cleanup flag.
Have Cleanup Callback on Frame Object Itself
The frame object may be extended to have a f_cleanup_callback member which is called when f_in_cleanup is reset to 0. This would help to register different callbacks to different coroutines.
Despite its apparent beauty, this solution doesn't add anything, as the two primary use cases are:
- Setting the callback in a signal handler. The callback is inherently a single one for this case.
- Use a single callback per loop for the coroutine use case. Here, in almost all cases, there is only one loop per thread.
No Cleanup Hook
The original proposal included no cleanup hook specification, as there are a few ways to achieve the same using current tools:
- Using sys.settrace() and the f_trace callback. This may interfere with debugging and has a big performance impact (although interruptions don't happen very often).
- Sleeping a bit more and trying again. For a coroutine library this is easy. For signals it may be achieved using signal.alarm.
Both methods are considered too impractical, so a way to catch exits from finally clauses is proposed.
References
| [1] | Monocle https://github.com/saucelabs/monocle |
| [2] | Bluelet https://github.com/sampsyo/bluelet |
| [3] | Twisted: inlineCallbacks http://twistedmatrix.com/documents/8.1.0/api/twisted.internet.defer.html |
| [4] | Original discussion http://mail.python.org/pipermail/python-ideas/2012-April/014705.html |
| [5] | Issue #14730: Implementation of the PEP 419 http://bugs.python.org/issue14730 |
Copyright
This document has been placed in the public domain.
pep-0420 Implicit Namespace Packages
| PEP: | 420 |
|---|---|
| Title: | Implicit Namespace Packages |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Eric V. Smith <eric at trueblade.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 19-Apr-2012 |
| Python-Version: | 3.3 |
| Post-History: | |
| Resolution: | http://mail.python.org/pipermail/python-dev/2012-May/119651.html |
Contents
Abstract
Namespace packages are a mechanism for splitting a single Python package across multiple directories on disk. In current Python versions, an algorithm to compute the package's __path__ must be formulated. With the enhancement proposed here, the import machinery itself will construct the list of directories that make up the package. This PEP builds upon previous work, documented in PEP 382 and PEP 402. Those PEPs have since been rejected in favor of this one. An implementation of this PEP is at [1].
Terminology
Within this PEP:
- "package" refers to Python packages as defined by Python's import statement.
- "distribution" refers to separately installable sets of Python modules as stored in the Python package index, and installed by distutils or setuptools.
- "vendor package" refers to groups of files installed by an operating system's packaging mechanism (e.g. Debian or Red Hat packages installed on Linux systems).
- "regular package" refers to packages as they are implemented in Python 3.2 and earlier.
- "portion" refers to a set of files in a single directory (possibly stored in a zip file) that contribute to a namespace package.
- "legacy portion" refers to a portion that uses __path__ manipulation in order to implement namespace packages.
This PEP defines a new type of package, the "namespace package".
Namespace packages today
Python currently provides pkgutil.extend_path to denote a package as a namespace package. The recommended way of using it is to put:
from pkgutil import extend_path
__path__ = extend_path(__path__, __name__)
in the package's __init__.py. Every distribution needs to provide the same contents in its __init__.py, so that extend_path is invoked independently of which portion of the package gets imported first. As a consequence, the package's __init__.py cannot practically define any names, as it depends on the order of the package fragments on sys.path to determine which portion is imported first. As a special feature, extend_path reads files named <packagename>.pkg, which allows declaration of additional portions.
setuptools provides a similar function named pkg_resources.declare_namespace that is used in the form:
import pkg_resources
pkg_resources.declare_namespace(__name__)
In the portion's __init__.py, no assignment to __path__ is necessary, as declare_namespace modifies the package __path__ through sys.modules. As a special feature, declare_namespace also supports zip files, and registers the package name internally so that future additions to sys.path by setuptools can properly add additional portions to each package.
setuptools allows declaring namespace packages in a distribution's setup.py, so that distribution developers don't need to put the magic __path__ modification into __init__.py themselves.
See PEP 402's "The Problem" section [2] for additional motivations for namespace packages. Note that PEP 402 has been rejected, but the motivating use cases are still valid.
Rationale
The current imperative approach to namespace packages has led to multiple slightly-incompatible mechanisms for providing namespace packages. For example, pkgutil supports *.pkg files; setuptools doesn't. Likewise, setuptools supports inspecting zip files, and supports adding portions to its _namespace_packages variable, whereas pkgutil doesn't.
Namespace packages are designed to support being split across multiple directories (and hence found via multiple sys.path entries). In this configuration, it doesn't matter if multiple portions all provide an __init__.py file, so long as each portion correctly initializes the namespace package. However, Linux distribution vendors (amongst others) prefer to combine the separate portions and install them all into the same file system directory. This creates a potential for conflict, as the portions are now attempting to provide the same file on the target system - something that is not allowed by many package managers. Allowing implicit namespace packages means that the requirement to provide an __init__.py file can be dropped completely, and affected portions can be installed into a common directory or split across multiple directories as distributions see fit.
A namespace package will not be constrained by a fixed __path__, computed from the parent path at namespace package creation time. Consider the standard library encodings package:
- Suppose that encodings becomes a namespace package.
- It sometimes gets imported during interpreter startup to initialize the standard io streams.
- An application modifies sys.path after startup and wants to contribute additional encodings from new path entries.
- An attempt is made to import an encoding from an encodings portion that is found on a path entry added in step 3.
If the import system was restricted to only finding portions along the value of sys.path that existed at the time the encodings namespace package was created, the additional paths added in step 3 would never be searched for the additional portions imported in step 4. In addition, if step 2 were sometimes skipped (due to some runtime flag or other condition), then the path items added in step 3 would indeed be used the first time a portion was imported. Thus this PEP requires that the list of path entries be dynamically computed when each portion is loaded. It is expected that the import machinery will do this efficiently by caching __path__ values and only refreshing them when it detects that the parent path has changed. In the case of a top-level package like encodings, this parent path would be sys.path.
Specification
Regular packages will continue to have an __init__.py and will reside in a single directory.
Namespace packages cannot contain an __init__.py. As a consequence, pkgutil.extend_path and pkg_resources.declare_namespace become obsolete for purposes of namespace package creation. There will be no marker file or directory for specifying a namespace package.
During import processing, the import machinery will continue to iterate over each directory in the parent path as it does in Python 3.2. While looking for a module or package named "foo", for each directory in the parent path:
- If <directory>/foo/__init__.py is found, a regular package is imported and returned.
- If not, but <directory>/foo.{py,pyc,so,pyd} is found, a module is imported and returned. The exact list of extensions varies by platform and whether the -O flag is specified. The list here is representative.
- If not, but <directory>/foo is found and is a directory, it is recorded and the scan continues with the next directory in the parent path.
- Otherwise the scan continues with the next directory in the parent path.
If the scan completes without returning a module or package, and at least one directory was recorded, then a namespace package is created. The new namespace package:
- Has a __path__ attribute set to an iterable of the path strings that were found and recorded during the scan.
- Does not have a __file__ attribute.
Note that if "import foo" is executed and "foo" is found as a namespace package (using the above rules), then "foo" is immediately created as a package. The creation of the namespace package is not deferred until a sub-level import occurs.
A namespace package is not fundamentally different from a regular package. It is just a different way of creating packages. Once a namespace package is created, there is no functional difference between it and a regular package.
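The algorithm above can be exercised on any Python 3.3+ interpreter. This sketch (with directory and package names invented for the demo) builds two portions in a temporary directory, with no __init__.py files, and imports them as one namespace package:

```python
import os
import sys
import tempfile

base = tempfile.mkdtemp()

# Two independently installable portions of the 'ns_demo' package.
for project, module in (('project1', 'one'), ('project2', 'two')):
    portion = os.path.join(base, project, 'ns_demo')
    os.makedirs(portion)                      # note: no __init__.py
    with open(os.path.join(portion, module + '.py'), 'w') as f:
        f.write('NAME = %r\n' % module)
    sys.path.append(os.path.join(base, project))

import ns_demo.one
import ns_demo.two

# ns_demo.__path__ spans both portions; ns_demo has no __file__.
```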
Dynamic path computation
The import machinery will behave as if a namespace package's __path__ is recomputed before each portion is loaded.
For performance reasons, it is expected that this will be achieved by detecting that the parent path has changed. If no change has taken place, then no __path__ recomputation is required. The implementation must ensure that changes to the contents of the parent path are detected, as well as detecting the replacement of the parent path with a new path entry list object.
Impact on import finders and loaders
PEP 302 defines "finders" that are called to search path elements. These finders' find_module methods return either a "loader" object or None.
For a finder to contribute to namespace packages, it must implement a new find_loader(fullname) method. fullname has the same meaning as for find_module. find_loader always returns a 2-tuple of (loader, <iterable-of-path-entries>). loader may be None, in which case <iterable-of-path-entries> (which may be empty) is added to the list of recorded path entries and path searching continues. If loader is not None, it is immediately used to load a module or regular package.
Even if loader is returned and is not None, <iterable-of-path-entries> must still contain the path entries for the package. This allows code such as pkgutil.extend_path() to compute path entries for packages that it does not load.
Note that multiple path entries per finder are allowed. This is to support the case where a finder discovers multiple namespace portions for a given fullname. Many finders will support only a single namespace package portion per find_loader call, in which case this iterable will contain only a single string.
The import machinery will call find_loader if it exists, else fall back to find_module. Legacy finders which implement find_module but not find_loader will be unable to contribute portions to a namespace package.
The specification expands PEP 302 loaders to include an optional method called module_repr() which, if present, is used to generate module object reprs. See the section below for further details.
Differences between namespace packages and regular packages
Namespace packages and regular packages are very similar. The differences are:
- Portions of namespace packages need not all come from the same directory structure, or even from the same loader. Regular packages are self-contained: all parts live in the same directory hierarchy.
- Namespace packages have no __file__ attribute.
- Namespace packages' __path__ attribute is a read-only iterable of strings, which is automatically updated when the parent path is modified.
- Namespace packages have no __init__.py module.
- Namespace packages have a different type of object for their __loader__ attribute.
Namespace packages in the standard library
It is possible, and this PEP explicitly allows, that parts of the standard library be implemented as namespace packages. When and if any standard library packages become namespace packages is outside the scope of this PEP.
Migrating from legacy namespace packages
As described above, prior to this PEP pkgutil.extend_path() was used by legacy portions to create namespace packages. Because it is likely not practical for all existing portions of a namespace package to be migrated to this PEP at once, extend_path() will be modified to also recognize PEP 420 namespace packages. This will allow some portions of a namespace to be legacy portions while others are migrated to PEP 420. These hybrid namespace packages will not have the dynamic path computation that normal namespace packages have, since extend_path() never provided this functionality in the past.
Packaging Implications
Multiple portions of a namespace package can be installed into the same directory, or into separate directories. For this section, suppose there are two portions which define "foo.bar" and "foo.baz". "foo" itself is a namespace package.
If these are installed in the same location, a single directory "foo" would be in a directory that is on sys.path. Inside "foo" would be two directories, "bar" and "baz". If "foo.bar" is removed (perhaps by an OS package manager), care must be taken not to remove the "foo/baz" or "foo" directories. Note that in this case "foo" will be a namespace package (because it lacks an __init__.py), even though all of its portions are in the same directory.
Note that "foo.bar" and "foo.baz" can be installed into the same "foo" directory because they will not have any files in common.
If the portions are installed in different locations, two different "foo" directories would be in directories that are on sys.path. "foo/bar" would be in one of these sys.path entries, and "foo/baz" would be in the other. Upon removal of "foo.bar", the "foo/bar" and corresponding "foo" directories can be completely removed. But "foo/baz" and its corresponding "foo" directory cannot be removed.
It is also possible to have the "foo.bar" portion installed in a directory on sys.path, and have the "foo.baz" portion provided in a zip file, also on sys.path.
Examples
Nested namespace packages
This example uses the following directory structure:
Lib/test/namespace_pkgs
    project1
        parent
            child
                one.py
    project2
        parent
            child
                two.py
Here, both parent and child are namespace packages: Portions of them exist in different directories, and they do not have __init__.py files.
Here we add the parent directories to sys.path, and show that the portions are correctly found:
>>> import sys
>>> sys.path += ['Lib/test/namespace_pkgs/project1', 'Lib/test/namespace_pkgs/project2']
>>> import parent.child.one
>>> parent.__path__
_NamespacePath(['Lib/test/namespace_pkgs/project1/parent', 'Lib/test/namespace_pkgs/project2/parent'])
>>> parent.child.__path__
_NamespacePath(['Lib/test/namespace_pkgs/project1/parent/child', 'Lib/test/namespace_pkgs/project2/parent/child'])
>>> import parent.child.two
>>>
Dynamic path computation
This example uses a similar directory structure, but adds a third portion:
Lib/test/namespace_pkgs
    project1
        parent
            child
                one.py
    project2
        parent
            child
                two.py
    project3
        parent
            child
                three.py
We add project1 and project2 to sys.path, then import parent.child.one and parent.child.two. Then we add project3 to sys.path, and when parent.child.three is imported, project3/parent is automatically added to parent.__path__:
# add the first two parent paths to sys.path
>>> import sys
>>> sys.path += ['Lib/test/namespace_pkgs/project1', 'Lib/test/namespace_pkgs/project2']
# parent.child.one can be imported, because project1 was added to sys.path:
>>> import parent.child.one
>>> parent.__path__
_NamespacePath(['Lib/test/namespace_pkgs/project1/parent', 'Lib/test/namespace_pkgs/project2/parent'])
# parent.child.__path__ contains project1/parent/child and project2/parent/child, but not project3/parent/child:
>>> parent.child.__path__
_NamespacePath(['Lib/test/namespace_pkgs/project1/parent/child', 'Lib/test/namespace_pkgs/project2/parent/child'])
# parent.child.two can be imported, because project2 was added to sys.path:
>>> import parent.child.two
# we cannot import parent.child.three, because project3 is not in the path:
>>> import parent.child.three
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<frozen importlib._bootstrap>", line 1286, in _find_and_load
File "<frozen importlib._bootstrap>", line 1250, in _find_and_load_unlocked
ImportError: No module named 'parent.child.three'
# now add project3 to sys.path:
>>> sys.path.append('Lib/test/namespace_pkgs/project3')
# and now parent.child.three can be imported:
>>> import parent.child.three
# project3/parent has been added to parent.__path__:
>>> parent.__path__
_NamespacePath(['Lib/test/namespace_pkgs/project1/parent', 'Lib/test/namespace_pkgs/project2/parent', 'Lib/test/namespace_pkgs/project3/parent'])
# and project3/parent/child has been added to parent.child.__path__
>>> parent.child.__path__
_NamespacePath(['Lib/test/namespace_pkgs/project1/parent/child', 'Lib/test/namespace_pkgs/project2/parent/child', 'Lib/test/namespace_pkgs/project3/parent/child'])
>>>
Discussion
At PyCon 2012, we had a discussion about namespace packages at which PEP 382 and PEP 402 were rejected, to be replaced by this PEP [3].
There is no intention to remove support of regular packages. If a developer knows that her package will never be a portion of a namespace package, then there is a performance advantage to it being a regular package (with an __init__.py). Creation and loading of a regular package can take place immediately when it is located along the path. With namespace packages, all entries in the path must be scanned before the package is created.
Note that an ImportWarning will no longer be raised for a directory lacking an __init__.py file. Such a directory will now be imported as a namespace package, whereas in prior Python versions an ImportWarning would be raised.
Nick Coghlan presented a list of his objections to this proposal [4]. They are:
- Implicit package directories go against the Zen of Python.
- Implicit package directories pose awkward backwards compatibility challenges.
- Implicit package directories introduce ambiguity into file system layouts.
- Implicit package directories will permanently entrench current newbie-hostile behavior in __main__.
Nick later gave a detailed response to his own objections [5].
The inclusion of namespace packages in the standard library was motivated by Martin v. Löwis, who wanted the encodings package to become a namespace package [6]. While this PEP allows for standard library packages to become namespaces, it defers a decision on encodings.
find_module versus find_loader
An early draft of this PEP specified a change to the find_module method in order to support namespace packages. It would be modified to return a string in the case where a namespace package portion was discovered.
However, this caused a problem with existing code outside of the standard library which calls find_module. Because this code would not be upgraded in concert with changes required by this PEP, it would fail when it would receive unexpected return values from find_module. Because of this incompatibility, this PEP now specifies that finders that want to provide namespace portions must implement the find_loader method, described above.
The use case for supporting multiple portions per find_loader call is given in [7].
Dynamic path computation
Guido raised a concern that automatic dynamic path computation was an unnecessary feature [8]. Later in that thread, PJ Eby and Nick Coghlan presented arguments as to why dynamic computation would minimize surprise to Python users. The conclusion of that discussion has been included in this PEP's Rationale section.
An earlier version of this PEP required that dynamic path computation could only take effect if the parent path object were modified in-place. That is, this would work:
sys.path.append('new-dir')
But this would not:
sys.path = sys.path + ['new-dir']
In the same thread [8], it was pointed out that this restriction is not required. If the parent path is looked up by name instead of by holding a reference to it, then there is no restriction on how the parent path is modified or replaced. For a top-level namespace package, the lookup would be the module named "sys" then its attribute "path". For a namespace package nested inside a package foo, the lookup would be for the module named "foo" then its attribute "__path__".
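The by-name lookup can be demonstrated directly. In the sketch below (the package name ns_parent_demo is invented), sys.path is rebound rather than mutated in place, and the namespace package's __path__ still picks up the new entry on the next access:

```python
import os
import sys
import tempfile

roots = []
for _ in range(2):
    root = tempfile.mkdtemp()
    os.makedirs(os.path.join(root, "ns_parent_demo"))  # no __init__.py
    roots.append(root)

sys.path.append(roots[0])
import ns_parent_demo
print(len(list(ns_parent_demo.__path__)))   # 1

# Rebind sys.path entirely (not in-place); the parent path is looked
# up by the names "sys" then "path", so the replacement is still seen.
sys.path = sys.path + [roots[1]]
print(len(list(ns_parent_demo.__path__)))   # 2
```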
Module reprs
Previously, module reprs were hard coded based on assumptions about a module's __file__ attribute. If this attribute existed and was a string, it was assumed to be a file system path, and the module object's repr would include this in its value. The only exception was that PEP 302 reserved missing __file__ attributes to built-in modules, and in CPython, this assumption was baked into the module object's implementation. Because of this restriction, some modules contained contrived __file__ values that did not reflect file system paths, and which could cause unexpected problems later (e.g. os.path.join() on a non-path __file__ would return gibberish).
This PEP relaxes this constraint, and leaves the setting of __file__ to the purview of the loader producing the module. Loaders may opt to leave __file__ unset if no file system path is appropriate. Loaders may also set additional reserved attributes on the module if useful. This means that the definitive way to determine the origin of a module is to check its __loader__ attribute.
For example, namespace packages as described in this PEP will have no __file__ attribute because no corresponding file exists. In order to provide flexibility and descriptiveness in the reprs of such modules, a new optional protocol is added to PEP 302 loaders. Loaders can implement a module_repr() method which takes a single argument, the module object. This method should return the string to be used verbatim as the repr of the module. The rules for producing a module repr are now standardized as:
- If the module has a __loader__ and that loader has a module_repr() method, call it with a single argument, which is the module object. The value returned is used as the module's repr.
- If an exception occurs in module_repr(), the exception is caught and discarded, and the calculation of the module's repr continues as if module_repr() did not exist.
- If the module has a __file__ attribute, this is used as part of the module's repr.
- If the module has no __file__ but does have a __loader__, then the loader's repr is used as part of the module's repr.
- Otherwise, just use the module's __name__ in the repr.
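The five rules above can be sketched as a standalone function. This is an illustrative reimplementation, not CPython's actual code:

```python
import types


def compute_module_repr(module):
    """Sketch of the standardized module repr rules (illustrative only)."""
    loader = getattr(module, "__loader__", None)
    if loader is not None and hasattr(loader, "module_repr"):
        try:
            return loader.module_repr(module)
        except Exception:
            pass  # proceed as if module_repr() did not exist
    name = module.__name__
    filename = getattr(module, "__file__", None)
    if filename is not None:
        return "<module {!r} from {!r}>".format(name, filename)
    if loader is not None:
        return "<module {!r} ({!r})>".format(name, loader)
    return "<module {!r}>".format(name)


m = types.ModuleType("foo")
print(compute_module_repr(m))       # <module 'foo'>
m.__file__ = "zippy:/de/do/dah"
print(compute_module_repr(m))       # <module 'foo' from 'zippy:/de/do/dah'>
```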
Here is a snippet showing how a namespace module's repr is calculated from its loader:
class NamespaceLoader:
@classmethod
def module_repr(cls, module):
return "<module '{}' (namespace)>".format(module.__name__)
Built-in module reprs would no longer need to be hard-coded, but instead would come from their loader as well:
class BuiltinImporter:
@classmethod
def module_repr(cls, module):
return "<module '{}' (built-in)>".format(module.__name__)
Here are some example reprs of different types of modules with different sets of the related attributes:
>>> import email
>>> email
<module 'email' from '/home/barry/projects/python/pep-420/Lib/email/__init__.py'>
>>> m = type(email)('foo')
>>> m
<module 'foo'>
>>> m.__file__ = 'zippy:/de/do/dah'
>>> m
<module 'foo' from 'zippy:/de/do/dah'>
>>> class Loader: pass
...
>>> m.__loader__ = Loader
>>> del m.__file__
>>> m
<module 'foo' (<class '__main__.Loader'>)>
>>> class NewLoader:
... @classmethod
... def module_repr(cls, module):
... return '<mystery module!>'
...
>>> m.__loader__ = NewLoader
>>> m
<mystery module!>
>>>
References
| [1] | PEP 420 branch (http://hg.python.org/features/pep-420) |
| [2] | PEP 402's description of use cases for namespace packages (http://www.python.org/dev/peps/pep-0402/#the-problem) |
| [3] | PyCon 2012 Namespace Package discussion outcome (http://mail.python.org/pipermail/import-sig/2012-March/000421.html) |
| [4] | Nick Coghlan's objection to the lack of marker files or directories (http://mail.python.org/pipermail/import-sig/2012-March/000423.html) |
| [5] | Nick Coghlan's response to his initial objections (http://mail.python.org/pipermail/import-sig/2012-April/000464.html) |
| [6] | Martin v. Löwis's suggestion to make encodings a namespace package (http://mail.python.org/pipermail/import-sig/2012-May/000540.html) |
| [7] | Use case for multiple portions per find_loader call (http://mail.python.org/pipermail/import-sig/2012-May/000585.html) |
| [8] | (1, 2) Discussion about dynamic path computation (http://mail.python.org/pipermail/python-dev/2012-May/119560.html) |
Copyright
This document has been placed in the public domain.
pep-0421 Adding sys.implementation
| PEP: | 421 |
|---|---|
| Title: | Adding sys.implementation |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Eric Snow <ericsnowcurrently at gmail.com> |
| BDFL-Delegate: | Barry Warsaw |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 26-April-2012 |
| Post-History: | 26-April-2012 |
| Resolution: | http://mail.python.org/pipermail/python-dev/2012-May/119683.html |
Contents
Abstract
This PEP introduces a new attribute for the sys module: sys.implementation. The attribute holds consolidated information about the implementation of the running interpreter. Thus sys.implementation is the source to which the standard library may look for implementation-specific information.
The proposal in this PEP is in line with a broader emphasis on making Python friendlier to alternate implementations. It describes the new variable and the constraints on what that variable contains. The PEP also explains some immediate use cases for sys.implementation.
Motivation
For a number of years now, the distinction between Python-the-language and CPython (the reference implementation) has been growing. Most of this change is due to the emergence of Jython, IronPython, and PyPy as viable alternate implementations of Python.
Consider, however, the nearly two decades of CPython-centric Python (i.e. most of its existence). That focus has understandably contributed to quite a few CPython-specific artifacts both in the standard library and exposed in the interpreter. Though the core developers have made an effort in recent years to address this, quite a few of the artifacts remain.
Part of the solution is presented in this PEP: a single namespace in which to consolidate implementation specifics. This will help focus efforts to differentiate the implementation specifics from the language. Additionally, it will foster a multiple-implementation mindset.
Proposal
We will add a new attribute to the sys module, called sys.implementation, as an object with attribute-access (as opposed to a mapping). It will contain implementation-specific information.
The attributes of this object will remain fixed during interpreter execution and through the course of an implementation version. This ensures that behaviors depending on attributes of sys.implementation do not change between versions.
The object has each of the attributes described in the Required Attributes section below. Those attribute names will never start with an underscore. The standard library and the language definition will rely only on those required attributes.
This proposal takes a conservative approach in requiring only a small number of attributes. As more become appropriate, they may be added with discretion, as described in Adding New Required Attributes.
While this PEP places no other constraints on sys.implementation, it also recommends that no one rely on capabilities outside those described here. The only exception to that recommendation is for attributes starting with an underscore. Implementers may use those as appropriate to store per-implementation data.
Required Attributes
These are attributes in sys.implementation on which the standard library and language definition will rely, meaning implementers must define them:
- name
- A lower-case identifier representing the implementation. Examples include 'pypy', 'jython', 'ironpython', and 'cpython'.
- version
- The version of the implementation, as opposed to the version of the language it implements. This value conforms to the format described in Version Format.
- hexversion
- The version of the implementation in the same hexadecimal format as sys.hexversion.
- cache_tag
- A string used for the PEP 3147 cache tag [12]. It would normally be a composite of the name and version (e.g. 'cpython-33' for CPython 3.3). However, an implementation may explicitly use a different cache tag. If cache_tag is set to None, it indicates that module caching should be disabled.
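On any conforming interpreter the required attributes can be inspected directly. A minimal sanity check, runnable on CPython 3.3+:

```python
import sys

impl = sys.implementation

# name is a lower-case identifier such as 'cpython' or 'pypy'.
print(impl.name)
assert impl.name == impl.name.lower()

# hexversion mirrors the sys.hexversion encoding, for the implementation.
print(hex(impl.hexversion))

# cache_tag is either a string like 'cpython-33' or None (caching disabled).
assert impl.cache_tag is None or isinstance(impl.cache_tag, str)
```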
Adding New Required Attributes
In time more required attributes will be added to sys.implementation. However, each must have a meaningful use case across all Python implementations in order to be considered. This is made most clear by a use case in the standard library or language specification.
All proposals for new required attributes will go through the normal PEP process. Such a PEP need not be long, just long enough. It will need to sufficiently spell out the rationale for the new attribute, its use cases, and the impact it will have on the various Python implementations.
Version Format
A main point of sys.implementation is to contain information that will be used internally in the standard library. In order to facilitate the usefulness of the version attribute, its value should be in a consistent format across implementations.
As such, the format of sys.implementation.version will follow that of sys.version_info, which is effectively a named tuple. It is a familiar format and generally consistent with normal version format conventions.
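Because the format matches sys.version_info, the same named-tuple fields are available; for example:

```python
import sys

v = sys.implementation.version

# Same shape as sys.version_info:
# (major, minor, micro, releaselevel, serial)
print(tuple(v))
assert (v.major, v.minor) == (v[0], v[1])
assert v.releaselevel in ("alpha", "beta", "candidate", "final")
```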
Rationale
The status quo for implementation-specific information gives us that information in a more fragile, harder to maintain way. It is spread out over different modules or inferred from other information, as we see with platform.python_implementation().
This PEP is the main alternative to that approach. It consolidates the implementation-specific information into a single namespace and makes explicit that which was implicit.
Type Considerations
It's very easy to get bogged down in discussions about the type of sys.implementation. However, its purpose is to support the standard library and language definition. As such, there isn't much that really matters regarding its type, as opposed to a feature that would be more generally used. Thus characteristics like immutability and sequence-ness have been disregarded.
The only real choice has been between an object with attribute access and a mapping with item access. This PEP espouses dotted access to reflect the relatively fixed nature of the namespace.
Non-Required Attributes
Earlier versions of this PEP included a required attribute called metadata that held any non-required, per-implementation data [17]. However, this proved to be an unnecessary addition considering the purpose of sys.implementation.
Ultimately, non-required attributes are virtually ignored in this PEP. They have no impact other than that careless use may collide with future required attributes. That, however, is but a marginal concern for sys.implementation.
Why a Part of sys?
The sys module holds the new namespace because sys is the depot for interpreter-centric variables and functions. Many implementation-specific attributes are already found in sys.
Why Strict Constraints on Any of the Values?
As already noted in Version Format, values in sys.implementation are intended for use by the standard library. Constraining those values, essentially specifying an API for them, allows them to be used consistently, regardless of how they are otherwise implemented. However, care should be taken not to over-specify the constraints.
Discussion
The topic of sys.implementation came up on the python-ideas list in 2009, where the reception was broadly positive [1]. I revived the discussion recently while working on a pure-python imp.get_tag() [2]. Discussion has been ongoing [3]. The messages in issue #14673 [19] are also relevant.
A good part of the recent discussion centered on the type to use for sys.implementation.
Use-cases
platform.python_implementation()
"explicit is better than implicit"
The platform module determines the Python implementation by looking for clues in a couple of different sys variables [11]. However, this approach is fragile, requiring changes to the standard library each time an implementation changes. Beyond that, support in platform is limited to those implementations that core developers have blessed by special-casing them in the platform module.
With sys.implementation the various implementations would explicitly set the values in their own version of the sys module.
Another concern is that the platform module is part of the stdlib, which ideally should minimize implementation details such as those that would be moved to sys.implementation.
Any overlap between sys.implementation and the platform module would simply defer to sys.implementation (with the same interface in platform wrapping it).
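A hypothetical sketch of such a wrapper follows; the mapping from lower-case identifiers to the capitalized forms historically returned by platform.python_implementation() is invented for illustration:

```python
import sys

# Invented mapping from sys.implementation.name to a display form.
_DISPLAY_NAMES = {
    "cpython": "CPython",
    "pypy": "PyPy",
    "jython": "Jython",
    "ironpython": "IronPython",
}


def python_implementation():
    """Defer to sys.implementation instead of guessing from sys variables."""
    name = sys.implementation.name
    return _DISPLAY_NAMES.get(name, name)


print(python_implementation())
```

The point of the sketch is the direction of the dependency: platform would simply read the value each implementation sets, with no special-casing required for new implementations beyond an optional display name.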
Cache Tag Generation in Frozen Importlib
PEP 3147 defined the use of a module cache and cache tags for file names. The importlib bootstrap code, frozen into the Python binary as of 3.3, uses the cache tags during the import process. Part of the project to bootstrap importlib has been to clean code out of Python/import.c [21] that did not need to be there any longer.
The cache tag defined in Python/import.c was hard-coded as "cpython" followed by the major and minor version numbers (e.g. "cpython-33") [12]. For importlib the options are either hard-coding it in the same way, or guessing the implementation in the same way that platform.python_implementation() does.
As long as the hard-coded tag is limited to CPython-specific code, it is livable. However, inasmuch as other Python implementations use the importlib code to work with the module cache, a hard-coded tag would become a problem.
Directly using the platform module in this case is a non-starter. Any module used in the importlib bootstrap must be built-in or frozen, neither of which apply to the platform module. This is the point that led to the recent interest in sys.implementation.
Regardless of the outcome for the implementation name used, another problem relates to the version used in the cache tag. That version is likely to be the implementation version rather than the language version. However, the implementation version is not readily identified anywhere in the standard library.
Implementation-Specific Tests
Currently there are a number of implementation-specific tests in the test suite under Lib/test. The test support module (Lib/test/support.py [20]) provides some functionality for dealing with these tests. However, like the platform module, test.support must do some guessing that sys.implementation would render unnecessary.
Jython's os.name Hack
In Jython, os.name is set to 'java' to accommodate special treatment of the java environment in the standard library [15] [16]. Unfortunately it masks the os name that would otherwise go there. sys.implementation would help obviate the need for this special case. Currently Jython sets os._name for the normal os.name value.
The Problem With sys.(version|version_info|hexversion)
Earlier versions of this PEP made the mistake of calling sys.version_info (and friends) the version of the Python language, in contrast to the implementation. However, this is not the case. Instead, it is the version of the CPython implementation. Incidentally, the first two components of sys.version_info (major and minor) also reflect the version of the language definition.
As Barry Warsaw noted, the "semantics of sys.version_info have been sufficiently squishy in the past" [18]. With sys.implementation we have the opportunity to improve this situation by first establishing an explicit location for the version of the implementation.
This PEP makes no other effort to directly clarify the semantics of sys.version_info. Regardless, having an explicit version for the implementation will definitely help to clarify the distinction from the language version.
Feedback From Other Python Implementers
IronPython
Jeff Hardy responded to a request for feedback [4]. He said, "I'll probably add it the day after it's approved" [6]. He also gave useful feedback on both the type of sys.implementation and on the metadata attribute (which has since been removed from the PEP).
Jython
In 2009 Frank Wierzbicki said this (relative to Jython implementing the required attributes) [8]:
Speaking for Jython, so far it looks like something we would adopt soonish after it was accepted (it looks pretty useful to me).
PyPy
Some of the PyPy developers have responded to a request for feedback [9]. Armin Rigo said the following [10]:
For myself, I can only say that it looks like a good idea, which we will happily adhere to when we migrate to Python 3.3.
He also expressed support for keeping the required list small. Both Armin and Laura Creighton indicated that an effort to better catalog Python's implementation would be welcome. Such an effort, for which this PEP is a small start, will be considered separately.
Past Efforts
PEP 3139
PEP 3139, from 2008, recommended a clean-up of the sys module in part by extracting implementation-specific variables and functions into a separate module. PEP 421 is a less ambitious version of that idea. While PEP 3139 was rejected, its goals are reflected in PEP 421 to a large extent, though with a much lighter approach.
The Bigger Picture
It's worth noting again that this PEP is a small part of a larger on-going effort to identify the implementation-specific parts of Python and mitigate their impact on alternate implementations.
sys.implementation is a focal point for implementation-specific data, acting as a nexus for cooperation between the language, the standard library, and the different implementations. As time goes by it is feasible that sys.implementation will assume current attributes of sys and other builtin/stdlib modules, where appropriate. In this way, it is a PEP 3139-lite, but starting as small as possible.
However, as already noted, many other efforts predate sys.implementation. Neither is it necessarily a major part of the effort. Rather, consider it as part of the infrastructure of the effort to make Python friendlier to alternate implementations.
Alternatives
Since the single-namespace-under-sys approach is relatively straightforward, no alternatives have been considered for this PEP.
Examples of Other Attributes
These are examples only and not part of the proposal. Most of them were suggested during previous discussions, but did not fit into the goals of this PEP. (See Adding New Required Attributes if they get you excited.)
- common_name
- The case-sensitive name by which the implementation is known.
- vcs_url
- A URL for the main VCS repository for the implementation project.
- vcs_revision_id
- A value that identifies the VCS revision of the implementation.
- build_toolchain
- The tools used to build the interpreter.
- build_date
- The timestamp of when the interpreter was built.
- homepage
- The URL of the implementation's website.
- site_prefix
- The preferred site prefix for the implementation.
- runtime
- The run-time environment in which the interpreter is running, as in "Common Language Runtime" (.NET CLR) or "Java Runtime Executable".
- gc_type
- The type of garbage collection used, like "reference counting" or "mark and sweep".
Open Issues
Currently none.
Implementation
The implementation of this PEP is covered in issue #14673 [19].
References
| [1] | The 2009 sys.implementation discussion: http://mail.python.org/pipermail/python-dev/2009-October/092893.html |
| [2] | The initial 2012 discussion: http://mail.python.org/pipermail/python-ideas/2012-March/014555.html (and http://mail.python.org/pipermail/python-ideas/2012-April/014878.html) |
| [3] | Feedback on the PEP: http://mail.python.org/pipermail/python-ideas/2012-April/014954.html |
| [4] | Feedback from the IronPython developers: http://mail.python.org/pipermail/ironpython-users/2012-May/015980.html |
| [5] | (2009) Dino Viehland offers his opinion: http://mail.python.org/pipermail/python-dev/2009-October/092894.html |
| [6] | (2012) Jeff Hardy offers his opinion: http://mail.python.org/pipermail/ironpython-users/2012-May/015981.html |
| [7] | Feedback from the Jython developers: ??? |
| [8] | (2009) Frank Wierzbicki offers his opinion: http://mail.python.org/pipermail/python-dev/2009-October/092974.html |
| [9] | Feedback from the PyPy developers: http://mail.python.org/pipermail/pypy-dev/2012-May/009883.html |
| [10] | (2012) Armin Rigo offers his opinion: http://mail.python.org/pipermail/pypy-dev/2012-May/009884.html |
| [11] | The platform code which divines the implementation name: http://hg.python.org/cpython/file/2f563908ebc5/Lib/platform.py#l1247 |
| [12] | (1, 2) The definition for cache tags in PEP 3147: http://www.python.org/dev/peps/pep-3147/#id53 |
| [13] | The original implementation of the cache tag in CPython: http://hg.python.org/cpython/file/2f563908ebc5/Python/import.c#l121 |
| [14] | Examples of implementation-specific handling in test.support: * http://hg.python.org/cpython/file/2f563908ebc5/Lib/test/support.py#l509 * http://hg.python.org/cpython/file/2f563908ebc5/Lib/test/support.py#l1246 * http://hg.python.org/cpython/file/2f563908ebc5/Lib/test/support.py#l1252 * http://hg.python.org/cpython/file/2f563908ebc5/Lib/test/support.py#l1275 |
| [15] | The standard library entry for os.name: http://docs.python.org/3.3/library/os.html#os.name |
| [16] | The use of os.name as 'java' in the stdlib test suite. http://hg.python.org/cpython/file/2f563908ebc5/Lib/test/support.py#l512 |
| [17] | Nick Coghlan's proposal for sys.implementation.metadata: http://mail.python.org/pipermail/python-ideas/2012-May/014984.html |
| [18] | Feedback from Barry Warsaw: http://mail.python.org/pipermail/python-dev/2012-May/119374.html |
| [19] | (1, 2) http://bugs.python.org/issue14673 |
| [20] | http://hg.python.org/cpython/file/2f563908ebc5/Lib/test/support.py |
| [21] | http://hg.python.org/cpython/file/2f563908ebc5/Python/import.c |
Copyright
This document has been placed in the public domain.
pep-0422 Simpler customisation of class creation
| PEP: | 422 |
|---|---|
| Title: | Simpler customisation of class creation |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Nick Coghlan <ncoghlan at gmail.com>, Daniel Urban <urban.dani+py at gmail.com> |
| Status: | Withdrawn |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 5-Jun-2012 |
| Python-Version: | 3.5 |
| Post-History: | 5-Jun-2012, 10-Feb-2013 |
Contents
Abstract
Currently, customising class creation requires the use of a custom metaclass. This custom metaclass then persists for the entire lifecycle of the class, creating the potential for spurious metaclass conflicts.
This PEP proposes to instead support a wide range of customisation scenarios through a new namespace parameter in the class header, and a new __autodecorate__ hook in the class body.
The new mechanism should be easier to understand and use than implementing a custom metaclass, and thus should provide a gentler introduction to the full power of Python's metaclass machinery.
PEP Withdrawal
This proposal has been withdrawn in favour of Martin Teichmann's proposal in PEP 487, which achieves the same goals through a simpler, easier to use __init_subclass__ hook that simply isn't invoked for the base class that defines the hook.
Background
For an already created class cls, the term "metaclass" has a clear meaning: it is the value of type(cls).
During class creation, it has another meaning: it is also used to refer to the metaclass hint that may be provided as part of the class definition. While in many cases these two meanings end up referring to one and the same object, there are two situations where that is not the case:
- If the metaclass hint refers to an instance of type, then it is considered as a candidate metaclass along with the metaclasses of all of the parents of the class being defined. If a more appropriate metaclass is found amongst the candidates, then it will be used instead of the one given in the metaclass hint.
- Otherwise, an explicit metaclass hint is assumed to be a factory function and is called directly to create the class object. In this case, the final metaclass will be determined by the factory function definition. In the typical case (where the factory function just calls type, or, in Python 3.3 or later, types.new_class) the actual metaclass is then determined based on the parent classes.
It is notable that only the actual metaclass is inherited - a factory function used as a metaclass hook sees only the class currently being defined, and is not invoked for any subclasses.
In Python 3, the metaclass hint is provided using the metaclass=Meta keyword syntax in the class header. This allows the __prepare__ method on the metaclass to be used to create the locals() namespace used during execution of the class body (for example, specifying the use of collections.OrderedDict instead of a regular dict).
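The __prepare__ mechanism described above can be sketched as follows (the OrderedMeta metaclass and its member_order attribute are illustrative, not part of any standard API):

```python
import collections

class OrderedMeta(type):
    # Illustrative metaclass: the class body executes in an OrderedDict,
    # so the definition order of attributes can be recorded.
    @classmethod
    def __prepare__(meta, name, bases, **kwds):
        return collections.OrderedDict()

    def __new__(meta, name, bases, ns, **kwds):
        cls = super().__new__(meta, name, bases, dict(ns), **kwds)
        # Record the order in which non-dunder attributes were defined
        cls.member_order = [k for k in ns if not k.startswith('__')]
        return cls

class Example(metaclass=OrderedMeta):
    b = 2
    a = 1

print(Example.member_order)  # ['b', 'a']
```
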
In Python 2, there was no __prepare__ method (that API was added for Python 3 by PEP 3115). Instead, a class body could set the __metaclass__ attribute, and the class creation process would extract that value from the class namespace to use as the metaclass hint. There is published code [1] that makes use of this feature.
Another new feature in Python 3 is the zero-argument form of the super() builtin, introduced by PEP 3135. This feature uses an implicit __class__ reference to the class being defined to replace the "by name" references required in Python 2. Just as code invoked during execution of a Python 2 metaclass could not call methods that referenced the class by name (as the name had not yet been bound in the containing scope), similarly, Python 3 metaclasses cannot call methods that rely on the implicit __class__ reference (as it is not populated until after the metaclass has returned control to the class creation machinery).
Finally, when a class uses a custom metaclass, it can pose additional challenges to the use of multiple inheritance, as a new class cannot inherit from parent classes with unrelated metaclasses. This means that it is impossible to add a metaclass to an already published class: such an addition is a backwards incompatible change due to the risk of metaclass conflicts.
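The conflict described above is easy to reproduce today; combining two base classes with unrelated metaclasses fails at class definition time (the class names here are illustrative):

```python
class MetaA(type):
    pass

class MetaB(type):
    pass

class A(metaclass=MetaA):
    pass

class B(metaclass=MetaB):
    pass

# Combining A and B fails: neither MetaA nor MetaB is a subclass of
# the other, so no suitable metaclass can be determined.
try:
    class C(A, B):
        pass
except TypeError as exc:
    conflict = exc

print(conflict)  # metaclass conflict: ...
```
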
Proposal
This PEP proposes that a new mechanism to customise class creation be added to Python 3.4 that meets the following criteria:
- Integrates nicely with class inheritance structures (including mixins and multiple inheritance)
- Integrates nicely with the implicit __class__ reference and zero-argument super() syntax introduced by PEP 3135
- Can be added to an existing base class without a significant risk of introducing backwards compatibility problems
- Restores the ability for class namespaces to have some influence on the class creation process (above and beyond populating the namespace itself), but potentially without the full flexibility of the Python 2 style __metaclass__ hook
One mechanism that can achieve this goal is to add a new implicit class decoration hook, modelled directly on the existing explicit class decorators, but defined in the class body or in a parent class, rather than being part of the class definition header.
Specifically, it is proposed that class definitions be able to provide a class initialisation hook as follows:
class Example:
    def __autodecorate__(cls):
        # This is invoked after the class is created, but before any
        # explicit decorators are called
        # The usual super() mechanisms are used to correctly support
        # multiple inheritance. The class decorator style signature helps
        # ensure that invoking the parent class is as simple as possible.
        cls = super().__autodecorate__()
        return cls
To simplify the cooperative multiple inheritance case, object will gain a default implementation of the hook that returns the class unmodified:
class object:
    def __autodecorate__(cls):
        return cls
If a metaclass wishes to block implicit class decoration for some reason, it must arrange for cls.__autodecorate__ to trigger AttributeError.
If present on the created object, this new hook will be called by the class creation machinery after the __class__ reference has been initialised. For types.new_class(), it will be called as the last step before returning the created class object. __autodecorate__ is implicitly converted to a class method when the class is created (prior to the hook being invoked).
Note, that when __autodecorate__ is called, the name of the class is not yet bound to the new class object. As a consequence, the two argument form of super() cannot be used to call methods (e.g., super(Example, cls) wouldn't work in the example above). However, the zero argument form of super() works as expected, since the __class__ reference is already initialised.
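Since __autodecorate__ was never implemented, the intended semantics can only be approximated today. The following sketch stands in for the proposed class creation machinery with an explicit decorator; under this PEP, the @classmethod marker and the decorator application would both be implicit:

```python
def autodecorate(cls):
    # Explicit stand-in for the proposed machinery: invoke the hook
    # after the class object exists and __class__ is initialised.
    return cls.__autodecorate__()

class Base:
    @classmethod
    def __autodecorate__(cls):
        # Default behaviour (what object would provide under the PEP):
        # return the class unmodified.
        return cls

@autodecorate
class Example(Base):
    @classmethod
    def __autodecorate__(cls):
        cls.decorated = True
        # Zero-argument super() works here because the decorator runs
        # after class creation, when __class__ is already bound.
        return super().__autodecorate__()

print(Example.decorated)  # True
```
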
This general proposal is not a new idea (it was first suggested for inclusion in the language definition more than 10 years ago [2], and a similar mechanism has long been supported by Zope's ExtensionClass [3]), but the situation has changed sufficiently in recent years that the idea is worth reconsidering for inclusion as a native language feature.
In addition, the introduction of the metaclass __prepare__ method in PEP 3115 allows a further enhancement that was not possible in Python 2: this PEP also proposes that type.__prepare__ be updated to accept a factory function as a namespace keyword-only argument. If present, the value provided as the namespace argument will be called without arguments to create the result of type.__prepare__ instead of using a freshly created dictionary instance. For example, the following will use an ordered dictionary as the class namespace:
class OrderedExample(namespace=collections.OrderedDict):
    def __autodecorate__(cls):
        # cls.__dict__ is still a read-only proxy to the class namespace,
        # but the underlying storage is an OrderedDict instance
        ...
Note
This PEP, along with the existing ability to use __prepare__ to share a single namespace amongst multiple class objects, highlights a possible issue with the attribute lookup caching: when the underlying mapping is updated by other means, the attribute lookup cache is not invalidated correctly (this is a key part of the reason class __dict__ attributes produce a read-only view of the underlying storage).
Since the optimisation provided by that cache is highly desirable, the use of a preexisting namespace as the class namespace may need to be declared as officially unsupported (since the observed behaviour is rather strange when the caches get out of sync).
Key Benefits
Easier use of custom namespaces for a class
Currently, to use a different type (such as collections.OrderedDict) for a class namespace, or to use a pre-populated namespace, it is necessary to write and use a custom metaclass. With this PEP, using a custom namespace becomes as simple as specifying an appropriate factory function in the class header.
Easier inheritance of definition time behaviour
Understanding Python's metaclasses requires a deep understanding of the type system and the class construction process. This is legitimately seen as challenging, due to the need to keep multiple moving parts (the code, the metaclass hint, the actual metaclass, the class object, instances of the class object) clearly distinct in your mind. Even when you know the rules, it's still easy to make a mistake if you're not being extremely careful. An earlier version of this PEP actually included such a mistake: it stated "subclass of type" for a constraint that is actually "instance of type".
Understanding the proposed implicit class decoration hook only requires understanding decorators and ordinary method inheritance, which isn't quite as daunting a task. The new hook provides a more gradual path towards understanding all of the phases involved in the class definition process.
Reduced chance of metaclass conflicts
One of the big issues that makes library authors reluctant to use metaclasses (even when they would be appropriate) is the risk of metaclass conflicts. These occur whenever two unrelated metaclasses are used by the desired parents of a class definition. This risk also makes it very difficult to add a metaclass to a class that has previously been published without one.
By contrast, adding an __autodecorate__ method to an existing type poses a similar level of risk to adding an __init__ method: technically, there is a risk of breaking poorly implemented subclasses, but when that occurs, it is recognised as a bug in the subclass rather than the library author breaching backwards compatibility guarantees. In fact, due to the constrained signature of __autodecorate__, the risk in this case is actually even lower than in the case of __init__.
Integrates cleanly with PEP 3135
Unlike code that runs as part of the metaclass, code that runs as part of the new hook will be able to freely invoke class methods that rely on the implicit __class__ reference introduced by PEP 3135, including methods that use the zero argument form of super().
Replaces many use cases for dynamic setting of __metaclass__
For use cases that don't involve completely replacing the defined class, Python 2 code that dynamically set __metaclass__ can now dynamically set __autodecorate__ instead. For more advanced use cases, introduction of an explicit metaclass (possibly made available as a required base class) will still be necessary in order to support Python 3.
Design Notes
Determining if the class being decorated is the base class
In the body of an __autodecorate__ method, as in any other class method, __class__ will be bound to the class declaring the method, while the value passed in may be a subclass.
This makes it relatively straightforward to skip processing the base class if necessary:
class Example:
    def __autodecorate__(cls):
        cls = super().__autodecorate__()
        # Don't process the base class
        if cls is __class__:
            return cls
        # Process subclasses here
        ...
        return cls
Replacing a class with a different kind of object
As an implicit decorator, __autodecorate__ is able to relatively easily replace the defined class with a different kind of object. Technically custom metaclasses and even __new__ methods can already do this implicitly, but the decorator model makes such code much easier to understand and implement.
class BuildDict:
    def __autodecorate__(cls):
        cls = super().__autodecorate__()
        # Don't process the base class
        if cls is __class__:
            return cls
        # Convert subclasses to ordinary dictionaries
        return cls.__dict__.copy()
It's not clear why anyone would ever do this implicitly based on inheritance rather than just using an explicit decorator, but the possibility seems worth noting.
Open Questions
Is the namespace concept worth the extra complexity?
Unlike the new __autodecorate__ hook, the proposed namespace keyword argument is not automatically inherited by subclasses. Given the way this proposal is currently written, the only way to get a special namespace used consistently in subclasses is still to write a custom metaclass with a suitable __prepare__ implementation.
Changing the custom namespace factory to also be inherited would significantly increase the complexity of this proposal, and introduce a number of the same potential base class conflict issues as arise with the use of custom metaclasses.
Eric Snow has put forward a separate proposal to instead make the execution namespace for class bodies an ordered dictionary by default, and capture the class attribute definition order for future reference as an attribute (e.g. __definition_order__) on the class object.
Eric's suggested approach may be a better choice for a new default behaviour for type that combines well with the proposed __autodecorate__ hook, leaving the more complex configurable namespace factory idea to a custom metaclass like the one shown below.
New Ways of Using Classes
The new namespace keyword in the class header enables a number of interesting options for controlling the way a class is initialised, including some aspects of the object models of both JavaScript and Ruby.
All of the examples below are actually possible today through the use of a custom metaclass:
class CustomNamespace(type):
    @classmethod
    def __prepare__(meta, name, bases, *, namespace=None, **kwds):
        parent_namespace = super().__prepare__(name, bases, **kwds)
        return namespace() if namespace is not None else parent_namespace

    def __new__(meta, name, bases, ns, *, namespace=None, **kwds):
        return super().__new__(meta, name, bases, ns, **kwds)

    def __init__(cls, name, bases, ns, *, namespace=None, **kwds):
        return super().__init__(name, bases, ns, **kwds)
The advantage of implementing the new keyword directly in type.__prepare__ is that the only persistent effect is then the change in the underlying storage of the class attributes. The metaclass of the class remains unchanged, eliminating many of the drawbacks typically associated with these kinds of customisations.
Order preserving classes
class OrderedClass(namespace=collections.OrderedDict):
    a = 1
    b = 2
    c = 3
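This example is runnable today by naming the metaclass explicitly in the class header (the CustomNamespace metaclass is reproduced from above so the sketch is self-contained):

```python
import collections

class CustomNamespace(type):
    # Metaclass that accepts a namespace factory keyword, as shown
    # earlier in this PEP.
    @classmethod
    def __prepare__(meta, name, bases, *, namespace=None, **kwds):
        parent_namespace = super().__prepare__(name, bases, **kwds)
        return namespace() if namespace is not None else parent_namespace

    def __new__(meta, name, bases, ns, *, namespace=None, **kwds):
        return super().__new__(meta, name, bases, ns, **kwds)

    def __init__(cls, name, bases, ns, *, namespace=None, **kwds):
        return super().__init__(name, bases, ns, **kwds)

class OrderedClass(metaclass=CustomNamespace,
                   namespace=collections.OrderedDict):
    a = 1
    b = 2
    c = 3

print(OrderedClass.a, OrderedClass.b, OrderedClass.c)  # 1 2 3
```

The only difference from the PEP's proposal is the explicit metaclass= argument, which is exactly the persistent effect the proposed type.__prepare__ change would avoid.
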
Prepopulated namespaces
seed_data = dict(a=1, b=2, c=3)
class PrepopulatedClass(namespace=seed_data.copy):
    pass
Cloning a prototype class
class NewClass(namespace=Prototype.__dict__.copy):
    pass
Extending a class
Note
Just because the PEP makes it possible to do this relatively cleanly doesn't mean anyone should do this!
from collections.abc import MutableMapping

# The MutableMapping + dict combination should give something that
# generally behaves correctly as a mapping, while still being accepted
# as a class namespace
class ClassNamespace(MutableMapping, dict):
    def __init__(self, cls):
        self._cls = cls

    def __len__(self):
        return len(dir(self._cls))

    def __iter__(self):
        for attr in dir(self._cls):
            yield attr

    def __contains__(self, attr):
        return hasattr(self._cls, attr)

    def __getitem__(self, attr):
        return getattr(self._cls, attr)

    def __setitem__(self, attr, value):
        setattr(self._cls, attr, value)

    def __delitem__(self, attr):
        delattr(self._cls, attr)

def extend(cls):
    return lambda: ClassNamespace(cls)

class Example:
    pass

class ExtendedExample(namespace=extend(Example)):
    a = 1
    b = 2
    c = 3
>>> Example.a, Example.b, Example.c
(1, 2, 3)
Rejected Design Options
Calling __autodecorate__ from type.__init__
Calling the new hook automatically from type.__init__ would achieve most of the goals of this PEP. However, using that approach would mean that __autodecorate__ implementations would be unable to call any methods that relied on the __class__ reference (or used the zero-argument form of super()), and could not make use of those features themselves.
The current design instead ensures that the implicit decorator hook is able to do anything an explicit decorator can do by running it after the initial class creation is already complete.
Calling the automatic decoration hook __init_class__
Earlier versions of the PEP used the name __init_class__ for the name of the new hook. There were three significant problems with this name:
- it was hard to remember if the correct spelling was __init_class__ or __class_init__
- the use of "init" in the name suggested the signature should match that of type.__init__, which is not the case
- the use of "init" in the name suggested the method would be run as part of initial class object creation, which is not the case
The new name __autodecorate__ was chosen to make it clear that the new initialisation hook is most usefully thought of as an implicitly invoked class decorator, rather than as being like an __init__ method.
Requiring an explicit decorator on __autodecorate__
Originally, this PEP required the explicit use of @classmethod on the __autodecorate__ method. It was made implicit since there's no sensible interpretation for leaving it out, and that case would need to be detected anyway in order to give a useful error message.
This decision was reinforced after noticing that the user experience of defining __prepare__ and forgetting the @classmethod method decorator is singularly incomprehensible (particularly since PEP 3115 documents it as an ordinary method, and the current documentation doesn't explicitly say anything one way or the other).
Making __autodecorate__ implicitly static, like __new__
While it accepts the class to be instantiated as the first argument, __new__ is actually implicitly treated as a static method rather than as a class method. This allows it to be readily extracted from its defining class and called directly on a subclass, rather than being coupled to the class object it is retrieved from.
Such behaviour initially appears to be potentially useful for the new __autodecorate__ hook, as it would allow __autodecorate__ methods to readily be used as explicit decorators on other classes.
However, that apparent support would be an illusion as it would only work correctly if invoked on a subclass, in which case the method can just as readily be retrieved from the subclass and called that way. Unlike __new__, there's no issue with potentially changing method signatures at different points in the inheritance chain.
Passing in the namespace directly rather than a factory function
At one point, this PEP proposed that the class namespace be passed directly as a keyword argument, rather than passing a factory function. However, this encourages an unsupported behaviour (that is, passing the same namespace to multiple classes, or retaining direct write access to a mapping used as a class namespace), so the API was switched to the factory function version.
Reference Implementation
A reference implementation for __autodecorate__ has been posted to the issue tracker [4]. It uses the original __init_class__ naming, does not yet allow the implicit decorator to replace the class with a different object, and does not implement the suggested namespace parameter for type.__prepare__.
TODO
- address the 5 points in http://mail.python.org/pipermail/python-dev/2013-February/123970.html
References
| [1] | http://mail.python.org/pipermail/python-dev/2012-June/119878.html |
| [2] | http://mail.python.org/pipermail/python-dev/2001-November/018651.html |
| [3] | http://docs.zope.org/zope_secrets/extensionclass.html |
| [4] | http://bugs.python.org/issue17044 |
Copyright
This document has been placed in the public domain.
pep-0423 Naming conventions and recipes related to packaging
| PEP: | 423 |
|---|---|
| Title: | Naming conventions and recipes related to packaging |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Benoit Bryon <benoit at marmelune.net> |
| Discussions-To: | <distutils-sig at python.org> |
| Status: | Deferred |
| Type: | Informational |
| Content-Type: | text/x-rst |
| Created: | 24-May-2012 |
| Post-History: |
Contents
- Abstract
- PEP Deferral
- Terminology
- Relationship with other PEPs
- Overview
- If in doubt, ask
- Top-level namespace relates to code ownership
- Use a single name
- Follow PEP 8 for syntax of package and module names
- Pick memorable names
- Pick meaningful names
- Use packaging metadata
- Avoid deep nesting
- Conventions for communities or related projects
- Register names with PyPI
- Recipes
- References
- Copyright
Abstract
This document deals with:
- names of Python projects,
- names of Python packages or modules being distributed,
- namespace packages.
It provides guidelines and recipes for distribution authors:
- new projects should follow the guidelines below.
- existing projects should be aware of these guidelines and can follow specific recipes for existing projects.
PEP Deferral
Further consideration of this PEP has been deferred at least until after PEP 426 (package metadata 2.0) and related updates have been resolved.
Terminology
The reference is the packaging terminology in the Python documentation [1].
Relationship with other PEPs
- PEP 8 [2] deals with the code style guide, including names of Python packages and modules. It covers the syntax of package/module names.
- PEP 345 [3] deals with packaging metadata, and defines the name argument of the packaging.core.setup() function.
- PEP 420 [4] deals with namespace packages. It brings support for namespace packages to the Python core. Previously, namespace packages were implemented by external libraries.
- PEP 3108 [5] deals with the transition between Python 2.x and Python 3.x as applied to the standard library: some modules are to be deleted, some to be renamed. It points out that naming conventions matter, and it is an example of a transition plan.
Overview
Here is a summarized list of guidelines you should follow to choose names:
- understand and respect namespace ownership.
- if your project is related to another project or community:
- search for conventions in the main project's documentation, because projects should organize community contributions.
- follow specific project or related community conventions, if any.
- if there is no convention, follow a standard naming pattern.
- make sure your project name is unique, i.e. avoid duplicates:
- make sure distributed package and module names are unique, unless you explicitly want to distribute alternatives to existing packages or modules. Using the same value for the package/module name and the project name is the recommended way to achieve this.
- distribute only one package or module at a time, unless you know what you are doing. It makes it possible to apply the "use a single name" rule, and thus make names consistent.
- make it easy to discover and remember your project:
- avoid deep nesting. Flat things are easier to use and remember than nested ones:
- one or two namespace levels are recommended, because they are almost always enough.
- even if not recommended, three levels are, de facto, a common case.
- in most cases, you should not need more than three levels.
- follow PEP 8 for syntax of package and module names.
- if you followed specific conventions, or if your project is intended to receive contributions from the community, organize community contributions.
- if still in doubt, ask.
If in doubt, ask
If you feel unsure after reading this document, ask the Python community [6] on IRC or on a mailing list.
Top-level namespace relates to code ownership
This helps avoid clashes between project names.
Ownership could be:
- an individual. Example: gp.fileupload [7] is owned and maintained by Gael Pasgrimaud.
- an organization.
Examples:
- zest.releaser [8] is owned and maintained by Zest Software.
- Django [9] is owned and maintained by the Django Software Foundation.
- a group or community. Example: sphinx [10] is maintained by developers of the Sphinx project, not only by its author, Georg Brandl.
- a group or community related to another package. Example: collective.recaptcha [12] is owned by its author, David Glick of Groundwire, but the "collective" namespace is owned by the Plone community.
Respect ownership
Understand the purpose of namespace before you use it.
Don't plug into a namespace you don't own, unless explicitly authorized.
As an example, don't plug into the "django.contrib" namespace, because it is managed by Django's core contributors.
Exceptions can be defined by project authors. See Organize community contributions below.
Also, this rule applies to non-Python projects.
As an example, don't use "apache" as a top-level namespace: "Apache" is the name of an existing project (in the case of "Apache", it is also a trademark).
Private (including closed-source) projects use a namespace
... because private projects are owned by somebody. So apply the ownership rule.
For internal/customer projects, use your company name as the namespace.
This rule applies to closed-source projects.
As an example, if you are creating a "climbing" project for the "Python Sport" company: use the "pythonsport.climbing" name, even if it is closed source.
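The layout implied by names like "pythonsport.climbing" can be sketched with PEP 420 implicit namespace packages (Python 3.3+): two independent distributions share the "pythonsport" namespace simply by omitting __init__.py in the namespace directory. All project and module names here are the hypothetical examples used in this document:

```python
import os
import sys
import tempfile
import importlib

# Build two "distributions" that both contribute to the shared
# "pythonsport" namespace (no __init__.py in the namespace directory).
root = tempfile.mkdtemp()
for dist, module in [("dist1", "climbing"), ("dist2", "forestmap")]:
    pkg_dir = os.path.join(root, dist, "pythonsport")
    os.makedirs(pkg_dir)
    with open(os.path.join(pkg_dir, module + ".py"), "w") as f:
        f.write("NAME = %r\n" % module)
    sys.path.insert(0, os.path.join(root, dist))

# Both submodules import through the single shared namespace.
climbing = importlib.import_module("pythonsport.climbing")
forestmap = importlib.import_module("pythonsport.forestmap")
print(climbing.NAME, forestmap.NAME)  # climbing forestmap
```
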
Individual projects use a namespace
... because they are owned by individuals. So apply the ownership rule.
There is no shame in releasing a project as open source even if it has an "internal" or "individual" name.
If the project comes to a point where the author wants to change ownership (i.e. the project no longer belongs to an individual), keep in mind it is easy to rename the project.
Community-owned projects can avoid namespace packages
If your project is generic enough (i.e. it is not a contrib to another product or framework), you can avoid namespace packages. The base condition is generally that your project is owned by a group (i.e. the development team) which is dedicated to this project.
Only use a "shared" namespace if you really intend the code to be community owned.
As an example, the sphinx [10] project belongs to the Sphinx development team. There is no need to have some "sphinx" namespace package with only one "sphinx.sphinx" project inside.
In doubt, use an individual/organization namespace
If your project is really experimental, the best choice is to use an individual or organization namespace:
- it allows projects to be released early.
- it won't block a name if the project is abandoned.
- it doesn't block future changes. When a project becomes mature and there is no reason to keep individual ownership, it remains possible to rename the project.
Use a single name
Distribute only one package (or only one module) per project, and use the package (or module) name as the project name.
It avoids possible confusion between the project name and the distributed package or module name.
It makes the name consistent.
It is explicit: when one sees the project name, one can guess the package/module name, and vice versa.
It also limits implicit clashes between package/module names. By using a single name, when you register a project name to PyPI [11], you also perform a basic package/module name availability verification.
As an example, pipeline [13], python-pipeline [14] and django-pipeline [15] all distribute a package or module called "pipeline". So installing two of them leads to errors. This issue wouldn't have occurred if these distributions used a single name.
Yes:
- Package name: "kheops.pyramid", i.e. import kheops.pyramid
- Project name: "kheops.pyramid", i.e. pip install kheops.pyramid
No:
- Package name: "kheops"
- Project name: "KheopsPyramid"
Note
For historical reasons, PyPI [11] contains many distributions where project and distributed package/module names differ.
Multiple packages/modules should be rare
Technically, Python distributions can provide multiple packages and/or modules. See setup script reference [16] for details.
Some distributions actually do. As an example, setuptools [17] and distribute [18] are both declaring "pkg_resources", "easy_install" and "site" modules in addition to respective "setuptools" and "distribute" packages.
Consider this use case as exceptional. In most cases, you don't need this feature. So a distribution should provide only one package or module at a time.
Distinct names should be rare
A notable exception to the Use a single name rule is when you explicitly need distinct names.
As an example, the Pillow [19] project provides an alternative to the original PIL [20] distribution. Both projects distribute a "PIL" package.
Consider this use case as exceptional. In most cases, you don't need this feature. So a distributed package name should be equal to project name.
Follow PEP 8 for syntax of package and module names
PEP 8 [2] applies to names of Python packages and modules.
If you Use a single name, PEP 8 [2] also applies to project names. The exceptions are namespace packages, where dots are required in project name.
Pick memorable names
One important thing about a project name is that it be memorable.
As an example, celery [21] is not a meaningful name. At first, it is not obvious that it deals with message queuing. But it is memorable, partly because it can be used to feed a RabbitMQ [22] server.
Pick meaningful names
Ask yourself "how would I describe in one sentence what this name is for?", and then "could anyone have guessed that by looking at the name?".
As an example, DateUtils [23] is a meaningful name. It is obvious that it deals with utilities for dates.
When you are using namespaces, try to make each part meaningful.
Use packaging metadata
Consider project names as unique identifiers on PyPI:
- it is important that these identifiers remain human-readable.
- it is even better when these identifiers are meaningful.
- but the primary purpose of identifiers is not to classify or describe projects.
Classifiers and keywords metadata are made for categorization of distributions. Summary and description metadata are meant to describe the project.
As an example, there is a "Framework :: Twisted [24]" classifier. Even though project names are quite heterogeneous (they don't follow a particular pattern), we can still get the full list through the classifier.
In order to Organize community contributions, conventions about names and namespaces matter, but conventions about metadata matter even more.
As an example, we can find Plone portlets in many places:
- plone.portlet.*
- collective.portlet.*
- collective.portlets.*
- collective.*.portlets
- some vendor-related projects such as "quintagroup.portlet.cumulus"
- and even projects where "portlet" pattern doesn't appear in the name.
Even though the Plone community has conventions, using the name to categorize distributions is inappropriate. It's impossible to get the full list of distributions that provide portlets for Plone by filtering on names. But it would be possible if all these distributions used the "Framework :: Plone" classifier and the "portlet" keyword.
Avoid deep nesting
The Zen of Python [25] says "Flat is better than nested".
Two levels is almost always enough
Don't define everything in deeply nested hierarchies: you will end up with projects and packages like "pythonsport.common.maps.forest". This type of name is both verbose and cumbersome (e.g. if you have many imports from the package).
Furthermore, big hierarchies tend to break down over time as the boundaries between different packages blur.
The consensus is that two levels of nesting are preferred.
For example, we have plone.principalsource instead of plone.source.principal or something like that. The name is shorter, the package structure is simpler, and there would be very little to gain from having three levels of nesting here. It would be impractical to try to put all "core Plone" sources (a source is kind of vocabulary) into the plone.source.* namespace, in part because some sources are part of other packages, and in part because sources already exist in other places. Had we made a new namespace, it would be inconsistently used from the start.
Yes: "pyranha"
Yes: "pythonsport.climbing"
Yes: "pythonsport.forestmap"
No: "pythonsport.maps.forest"
Use only one level for ownership
Don't use 3 levels to set individual/organization ownership in a community namespace.
As an example, let's consider:
- you are plugging into a community namespace, such as "collective".
- and you want to add a more restrictive "ownership" level, to avoid clashes inside the community.
In such a case, you'd better use the most restrictive ownership level as first level.
As an example, where "collective" is a major community namespace that "gergovie" belongs to, and "vercingetorix" is the name of the author of "gergovie":
No: "collective.vercingetorix.gergovie"
Yes: "vercingetorix.gergovie"
Don't use more than 3 levels
Technically, you can create deeply nested hierarchies. However, in most cases, you shouldn't need it.
Note
Even communities where namespaces are standard don't use more than 3 levels.
Register names with PyPI
PyPI [11] is the central place for distributions in the Python community. So, it is also the place to register project and package names.
See Registering with the Package Index [27] for details.
Recipes
The following recipes will help you follow the guidelines and conventions above.
How to check for name availability?
Before you choose a project name, make sure it hasn't already been registered in the following locations:
As an example, you could also check various locations such as popular code hosting services, but keep in mind that PyPI is the only place where you can register names in the Python community.
That's why it is important you register names with PyPI.
Also make sure the names of distributed packages or modules haven't already been registered:
- in the Python Standard Library [28].
- inside projects at PyPI. There is currently no helper for that. Notice that the more projects follow the use a single name rule, the easier the verification becomes.
- you may ask the community.
The use a single name rule also helps you avoid clashes with package names: if a project name is available, then the package name has good chances to be available too.
How to rename a project?
Renaming a project is possible, but keep in mind that it will cause some confusion. So pay particular attention to the README and documentation, so that users understand what happened.
- First of all, do not remove legacy distributions from PyPI, because some users may still be using them.
- Copy the legacy project, then change names (project and package/module). Pay attention to, at least:
- packaging files,
- folder name that contains source files,
- documentation, including README,
- import statements in code.
- Add Obsoletes-Dist metadata to the new distribution in the setup.cfg file. See PEP 345 about Obsoletes-Dist [29] and the setup.cfg specification [30].
- Release a new version of the renamed project, then publish it.
- Edit legacy project:
- add dependency to new project,
- drop everything except packaging stuff,
- add the Development Status :: 7 - Inactive classifier in setup script,
- publish a new release.
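The legacy project's final release can be sketched as follows. The project names are the hypothetical ones from the namespace example earlier, and the keyword arguments shown are the standard setuptools ones:

```python
# Illustrative packaging metadata for the *legacy* project after renaming.
legacy_metadata = dict(
    name='collective.vercingetorix.gergovie',     # hypothetical legacy name
    version='1.1',                                # new, empty release
    install_requires=['vercingetorix.gergovie'],  # "redirect" to the new name
    classifiers=['Development Status :: 7 - Inactive'],
)
# In a real setup script, these keyword arguments would be passed to
# setuptools.setup(**legacy_metadata).
```

Installing this empty release pulls in the renamed project as a dependency, which is exactly the "redirect" behaviour described above.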
So, users of the legacy package:
- can continue using the legacy distributions at a deprecated version,
- can upgrade to the latest version of the legacy distribution, which is empty...
- ... and automatically download new distribution as a dependency of the legacy one.
Users who discover the legacy project see it is inactive.
Improved handling of renamed projects on PyPI
If many projects follow the renaming recipe above, then many legacy distributions will have the following characteristics:
- Development Status :: 7 - Inactive classifier.
- the latest version is empty, except for packaging stuff.
- the latest version "redirects" to another distribution, e.g. it has a single dependency on the renamed project.
- referenced as Obsoletes-Dist in a newer distribution.
So it will be possible to detect renamed projects and improve readability on PyPI, so that users can focus on active distributions. But this feature is not required now; there is no urgency, and it won't be covered in this document.
How to apply naming guidelines on existing projects?
There is no obligation for existing projects to be renamed. The choice is left to project authors and maintainers for obvious reasons.
However, project authors are invited to:
- at least, document the current naming.
- then plan and promote migration.
- optionally, actually rename the existing project or distributed packages/modules.
Document current naming
The important thing, at first, is to document your current choices:
- Ask yourself "why did I choose the current name?", then document it.
- If there are differences from the guidelines provided in this document, tell your users.
- If possible, create issues in the project's bugtracker, at least for the record. You are then free to resolve them later, or mark them as "wontfix".
Projects that are meant to receive contributions from the community should also organize those contributions.
Promote migrations
Every Python developer should migrate whenever possible, or promote the migrations in their respective communities.
Apply these guidelines to your own projects, and the community will see that it is safe.
In particular, "leaders" such as authors of popular projects are influential; they have power and, thus, responsibility over communities.
Apply these guidelines to popular projects, and communities will adopt the conventions too.
Projects should promote migrations when they release a new (major) version, particularly if this version introduces support for Python 3.x, the new standard library packaging, or namespace packages.
Opportunity
As Python 3.3 is being developed:
- many projects are not yet Python 3.x compatible, including "big" products and frameworks. This means that many projects will have to migrate to support Python 3.x.
- packaging (a.k.a. distutils2) is on the starting blocks. When it is released, projects will be invited to migrate to the new packaging.
- PEP 420 [4] brings official support for namespace packages to Python.
This means that most active projects should migrate within the next year(s) to support Python 3.x, the new packaging, or namespace packages.
Such an opportunity is unique and won't come again soon! So let's introduce and promote naming conventions as soon as possible (i.e. now).
References
Additional background:
- Martin Aspeli's article about names [31]. Some parts of this document are quotes from this article.
- in development official packaging documentation [32].
- The Hitchhiker's Guide to Packaging [33], which has an empty placeholder for "naming specification".
References and footnotes:
| [1] | http://docs.python.org/dev/packaging/introduction.html#general-python-terminology |
| [2] | (1, 2, 3) http://www.python.org/dev/peps/pep-0008/#package-and-module-names |
| [3] | http://www.python.org/dev/peps/pep-0345/ |
| [4] | (1, 2) http://www.python.org/dev/peps/pep-0420/ |
| [5] | http://www.python.org/dev/peps/pep-3108/ |
| [6] | http://www.python.org/community/ |
| [7] | http://pypi.python.org/pypi/gp.fileupload/ |
| [8] | http://pypi.python.org/pypi/zest.releaser/ |
| [9] | http://djangoproject.com/ |
| [10] | (1, 2) http://sphinx.pocoo.org |
| [11] | (1, 2, 3, 4) http://pypi.python.org |
| [12] | http://pypi.python.org/pypi/collective.recaptcha/ |
| [13] | http://pypi.python.org/pypi/pipeline/ |
| [14] | http://pypi.python.org/pypi/python-pipeline/ |
| [15] | http://pypi.python.org/pypi/django-pipeline/ |
| [16] | http://docs.python.org/dev/packaging/setupscript.html |
| [17] | http://pypi.python.org/pypi/setuptools |
| [18] | http://packages.python.org/distribute/ |
| [19] | http://pypi.python.org/pypi/Pillow/ |
| [20] | http://pypi.python.org/pypi/PIL/ |
| [21] | http://pypi.python.org/pypi/celery/ |
| [22] | http://www.rabbitmq.com |
| [23] | http://pypi.python.org/pypi/DateUtils/ |
| [24] | http://pypi.python.org/pypi?:action=browse&show=all&c=525 |
| [25] | http://www.python.org/dev/peps/pep-0020/ |
| [26] | http://plone.org/community/develop |
| [27] | http://docs.python.org/dev/packaging/packageindex.html |
| [28] | http://docs.python.org/library/index.html |
| [29] | http://www.python.org/dev/peps/pep-0345/#obsoletes-dist-multiple-use |
| [30] | http://docs.python.org/dev/packaging/setupcfg.html |
| [31] | http://www.martinaspeli.net/articles/the-naming-of-things-package-names-and-namespaces |
| [32] | http://docs.python.org/dev/packaging/ |
| [33] | http://guide.python-distribute.org/specification.html#naming-specification |
Copyright
This document has been placed in the public domain.
pep-0424 A method for exposing a length hint
| PEP: | 424 |
|---|---|
| Title: | A method for exposing a length hint |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Alex Gaynor <alex.gaynor at gmail.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 14-July-2012 |
| Python-Version: | 3.4 |
| Post-History: | http://mail.python.org/pipermail/python-dev/2012-July/120920.html |
Contents
Abstract
CPython currently defines a __length_hint__ method on several types, such as various iterators. This method is then used by various other functions (such as list) to presize lists based on the estimate returned by __length_hint__. Types which are not sized, and thus should not define __len__, can then define __length_hint__, to allow estimating or computing a size (such as many iterators).
Specification
This PEP formally documents __length_hint__ for other interpreters and non-standard-library Python modules to implement.
__length_hint__ must return an integer (else a TypeError is raised) or NotImplemented, and is not required to be accurate. It may return a value that is either larger or smaller than the actual size of the container. A return value of NotImplemented indicates that there is no finite length estimate. It must not return a negative value (else a ValueError is raised).
In addition, a new function operator.length_hint is added, with the following semantics (which define how __length_hint__ should be used):
def length_hint(obj, default=0):
    """Return an estimate of the number of items in obj.

    This is useful for presizing containers when building from an
    iterable.

    If the object supports len(), the result will be
    exact. Otherwise, it may over- or under-estimate by an
    arbitrary amount. The result will be an integer >= 0.
    """
    try:
        return len(obj)
    except TypeError:
        try:
            get_hint = type(obj).__length_hint__
        except AttributeError:
            return default
        try:
            hint = get_hint(obj)
        except TypeError:
            return default
        if hint is NotImplemented:
            return default
        if not isinstance(hint, int):
            raise TypeError("Length hint must be an integer, not %r" %
                            type(hint))
        if hint < 0:
            raise ValueError("__length_hint__() should return >= 0")
        return hint
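As a usage sketch, a hypothetical iterator can expose __length_hint__ and be queried through operator.length_hint (available since Python 3.4); the Countdown class below is illustrative, not part of the specification:

```python
import operator

class Countdown:
    """Hypothetical iterator that knows roughly how many items remain."""
    def __init__(self, n):
        self.n = n
    def __iter__(self):
        return self
    def __next__(self):
        if self.n <= 0:
            raise StopIteration
        self.n -= 1
        return self.n
    def __length_hint__(self):
        return self.n

operator.length_hint(Countdown(5))   # 5 -- taken from __length_hint__
operator.length_hint(object(), 42)   # 42 -- falls back to the default
```

This is how list() can presize its internal buffer before consuming such an iterator.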
Rationale
Being able to pre-allocate lists based on the expected size, as estimated by __length_hint__, can be a significant optimization. CPython has been observed to run some code faster than PyPy, purely because of this optimization being present.
Copyright
This document has been placed into the public domain.
pep-0425 Compatibility Tags for Built Distributions
| PEP: | 425 |
|---|---|
| Title: | Compatibility Tags for Built Distributions |
| Version: | $Revision$ |
| Last-Modified: | 07-Aug-2012 |
| Author: | Daniel Holth <dholth at gmail.com> |
| BDFL-Delegate: | Nick Coghlan <ncoghlan@gmail.com> |
| Status: | Accepted |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 27-Jul-2012 |
| Python-Version: | 3.4 |
| Post-History: | 8-Aug-2012, 18-Oct-2012, 15-Feb-2013 |
| Resolution: | http://mail.python.org/pipermail/python-dev/2013-February/124116.html |
Contents
Abstract
This PEP specifies a tagging system to indicate with which versions of Python a built or binary distribution is compatible. A set of three tags indicate which Python implementation and language version, ABI, and platform a built distribution requires. The tags are terse because they will be included in filenames.
PEP Acceptance
This PEP was accepted by Nick Coghlan on 17th February, 2013.
Rationale
Today "python setup.py bdist" generates the same filename on PyPy and CPython, but an incompatible archive, making it inconvenient to share built distributions in the same folder or index. Instead, built distributions should have a file naming convention that includes enough information to decide whether or not a particular archive is compatible with a particular implementation.
Previous efforts come from a time where CPython was the only important implementation and the ABI was the same as the Python language release. This specification improves upon the older schemes by including the Python implementation, language version, ABI, and platform as a set of tags.
By comparing the tags it supports with the tags listed by the distribution, an installer can make an educated decision about whether to download a particular built distribution without having to read its full metadata.
Overview
The tag format is {python tag}-{abi tag}-{platform tag}
- python tag
- ‘py27’, ‘cp33’
- abi tag
- ‘cp32dmu’, ‘none’
- platform tag
- ‘linux_x86_64’, ‘any’
For example, the tag py27-none-any indicates compatibility with Python 2.7 (any Python 2.7 implementation) with no ABI requirement, on any platform.
Use
The wheel built package format includes these tags in its filenames, of the form {distribution}-{version}(-{build tag})?-{python tag}-{abi tag}-{platform tag}.whl. Other package formats may have their own conventions.
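That naming convention can be split apart mechanically. The sketch below assumes the simple case where neither the distribution name nor the version contains a hyphen; the function name and the beaglevote example are illustrative:

```python
def parse_wheel_filename(filename):
    """Split a wheel filename into its components (illustrative sketch;
    assumes the distribution name and version contain no hyphens)."""
    fields = filename[:-len('.whl')].split('-')
    if len(fields) == 5:    # no optional build tag
        distribution, version, python_tag, abi_tag, platform_tag = fields
        build_tag = None
    else:                   # 6 fields: optional build tag present
        distribution, version, build_tag, python_tag, abi_tag, platform_tag = fields
    return distribution, version, build_tag, python_tag, abi_tag, platform_tag

parse_wheel_filename('beaglevote-1.2.0-py2.py3-none-any.whl')
# ('beaglevote', '1.2.0', None, 'py2.py3', 'none', 'any')
```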
Details
Python Tag
The Python tag indicates the implementation and version required by a distribution. Major implementations have abbreviated codes, initially:
- py: Generic Python (does not require implementation-specific features)
- cp: CPython
- ip: IronPython
- pp: PyPy
- jy: Jython
Other Python implementations should use sys.implementation.name.
The version is py_version_nodot. CPython gets away with no dot, but if one is needed the underscore _ is used instead. PyPy should probably use its own versions here: pp18, pp19.
The version can be just the major version 2 or 3 py2, py3 for many pure-Python distributions.
Importantly, major-version-only tags like py2 and py3 are not shorthand for py20 and py30. Instead, these tags mean the packager intentionally released a cross-version-compatible distribution.
A single-source Python 2/3 compatible distribution can use the compound tag py2.py3. See Compressed Tag Sets, below.
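Deriving the tag for the running interpreter can be sketched as follows. The abbreviation table above is assumed, and only the implementations it lists are mapped; everything else falls back to sys.implementation.name as the spec suggests:

```python
import sys

# Abbreviations from the table above; other implementations keep their
# sys.implementation.name as-is.
_ABBREVS = {'cpython': 'cp', 'ironpython': 'ip', 'pypy': 'pp', 'jython': 'jy'}

def python_tag():
    """Return e.g. 'cp33' for CPython 3.3 (illustrative sketch)."""
    name = sys.implementation.name  # available since Python 3.3
    abbrev = _ABBREVS.get(name, name)
    return '%s%d%d' % (abbrev, sys.version_info[0], sys.version_info[1])
```

A packager shipping a pure-Python, cross-version distribution would of course override this per-interpreter default with py2, py3 or py2.py3.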
ABI Tag
The ABI tag indicates which Python ABI is required by any included extension modules. For implementation-specific ABIs, the implementation is abbreviated in the same way as the Python Tag, e.g. cp33d would be the CPython 3.3 ABI with debugging.
The CPython stable ABI is abi3 as in the shared library suffix.
Implementations with a very unstable ABI may use the first 6 bytes (as 8 base64-encoded characters) of the SHA-256 hash of their source code revision and compiler flags, etc., but will probably not have a great need to distribute binary distributions. Each implementation's community may decide how best to use the ABI tag.
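The suggested hashing scheme can be sketched as follows; the function and parameter names are illustrative, and the exact hash input is left up to each implementation:

```python
import base64
import hashlib

def unstable_abi_tag(source_revision, compiler_flags):
    """First 6 bytes of a SHA-256 digest, rendered as 8 base64 characters
    (a sketch of the scheme suggested above; inputs are illustrative)."""
    material = (source_revision + ' ' + compiler_flags).encode('utf-8')
    digest = hashlib.sha256(material).digest()
    # 6 bytes = 48 bits, which encodes to exactly 8 base64 characters.
    return base64.urlsafe_b64encode(digest[:6]).decode('ascii')

unstable_abi_tag('abc123', '-O2')  # always 8 characters long
```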
Platform Tag
The platform tag is simply distutils.util.get_platform() with all hyphens - and periods . replaced with underscore _.
- win32
- linux_i386
- linux_x86_64
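That normalization can be computed directly; the sketch below uses sysconfig.get_platform(), which returns the same string as the distutils function:

```python
import sysconfig

def platform_tag():
    """Platform tag per the rule above: hyphens and periods in
    get_platform() are replaced with underscores."""
    return sysconfig.get_platform().replace('-', '_').replace('.', '_')
```

For example, on a 64-bit Linux system this yields something like linux_x86_64.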
Use
The tags are used by installers to decide which built distribution (if any) to download from a list of potential built distributions. The installer maintains a list of (pyver, abi, arch) tuples that it will support. If the built distribution's tag is in the list, then it can be installed.
It is recommended that installers try to choose the most feature complete built distribution available (the one most specific to the installation environment) by default before falling back to pure Python versions published for older Python releases. Installers are also recommended to provide a way to configure and re-order the list of allowed compatibility tags; for example, a user might accept only the *-none-any tags to only download built packages that advertise themselves as being pure Python.
Another desirable installer feature might be to include "re-compile from source if possible" as more preferable than some of the compatible but legacy pre-built options.
This example list is for an installer running under CPython 3.3 on a linux_x86_64 system. It is in order from most-preferred (a distribution with a compiled extension module, built for the current version of Python) to least-preferred (a pure-Python distribution built with an older version of Python):
- cp33-cp33m-linux_x86_64
- cp33-abi3-linux_x86_64
- cp3-abi3-linux_x86_64
- cp33-none-linux_x86_64*
- cp3-none-linux_x86_64*
- py33-none-linux_x86_64*
- py3-none-linux_x86_64*
- cp33-none-any
- cp3-none-any
- py33-none-any
- py3-none-any
- py32-none-any
- py31-none-any
- py30-none-any
- Built distributions may be platform specific for reasons other than C extensions, such as by including a native executable invoked as a subprocess.
Sometimes there will be more than one supported built distribution for a particular version of a package. For example, a packager could release a package tagged cp33-abi3-linux_x86_64 that contains an optional C extension and the same distribution tagged py3-none-any that does not. The index of the tag in the supported tags list breaks the tie, and the package with the C extension is installed in preference to the package without because that tag appears first in the list.
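The tie-breaking rule can be sketched as follows; the function name and the supported-tags list are illustrative:

```python
def pick_distribution(supported_tags, available):
    """Choose the available tag that appears earliest in supported_tags.

    supported_tags: the installer's ordered list, most preferred first.
    available: the tags offered by the index for one release.
    Returns the chosen tag, or None if nothing is compatible.
    """
    ranking = {tag: i for i, tag in enumerate(supported_tags)}
    candidates = [tag for tag in available if tag in ranking]
    if not candidates:
        return None
    return min(candidates, key=ranking.__getitem__)

supported = ['cp33-cp33m-linux_x86_64', 'cp33-abi3-linux_x86_64', 'py3-none-any']
pick_distribution(supported, ['py3-none-any', 'cp33-cp33m-linux_x86_64'])
# 'cp33-cp33m-linux_x86_64' -- the C extension build wins
```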
Compressed Tag Sets
To allow for compact filenames of bdists that work with more than one compatibility tag triple, each tag in a filename can instead be a '.'-separated, sorted, set of tags. For example, pip, a pure-Python package that is written to run under Python 2 and 3 with the same source code, could distribute a bdist with the tag py2.py3-none-any. The full list of simple tags is:
for x in pytag.split('.'):
    for y in abitag.split('.'):
        for z in archtag.split('.'):
            yield '-'.join((x, y, z))
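Wrapped into a runnable function, the expansion above looks like this (the function name is illustrative):

```python
def expand_tags(compressed):
    """Yield every simple tag covered by a compressed tag triple,
    e.g. 'py2.py3-none-any'."""
    pytag, abitag, archtag = compressed.split('-')
    for x in pytag.split('.'):
        for y in abitag.split('.'):
            for z in archtag.split('.'):
                yield '-'.join((x, y, z))

list(expand_tags('py2.py3-none-any'))
# ['py2-none-any', 'py3-none-any']
```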
A bdist format that implements this scheme should include the expanded tags in bdist-specific metadata. This compression scheme can generate large numbers of unsupported tags and "impossible" tags that are supported by no Python implementation e.g. "cp33-cp31u-win64", so use it sparingly.
FAQ
- What tags are used by default?
- Tools should use the most-preferred architecture dependent tag e.g. cp33-cp33m-win32 or the most-preferred pure python tag e.g. py33-none-any by default. If the packager overrides the default it indicates that they intended to provide cross-Python compatibility.
- What tag do I use if my distribution uses a feature exclusive to the newest version of Python?
- Compatibility tags aid installers in selecting the most compatible build of a single version of a distribution. For example, when there is no Python 3.3 compatible build of beaglevote-1.2.0 (it uses a Python 3.4 exclusive feature) it may still use the py3-none-any tag instead of the py34-none-any tag. A Python 3.3 user must combine other qualifiers, such as a requirement for the older release beaglevote-1.1.0 that does not use the new feature, to get a compatible build.
- Why isn't there a . in the Python version number?
- CPython has lasted 20+ years without a 3-digit major release. This should continue for some time. Other implementations may use _ as a delimiter, since both - and . delimit the surrounding filename.
- Why normalise hyphens and other non-alphanumeric characters to underscores?
- To avoid conflicting with the "." and "-" characters that separate components of the filename, and for better compatibility with the widest range of filesystem limitations for filenames (including being usable in URL paths without quoting).
- Why not use special character <X> rather than "." or "-"?
- Either because that character is inconvenient or potentially confusing in some contexts (for example, "+" must be quoted in URLs, "~" is used to denote the user's home directory in POSIX), or because the advantages weren't sufficiently compelling to justify changing the existing reference implementation for the wheel format defined in PEP 427 (for example, using "," rather than "." to separate components in a compressed tag).
- Who will maintain the registry of abbreviated implementations?
- New two-letter abbreviations can be requested on the python-dev mailing list. As a rule of thumb, abbreviations are reserved for the current 4 most prominent implementations.
- Does the compatibility tag go into METADATA or PKG-INFO?
- No. The compatibility tag is part of the built distribution's metadata. METADATA / PKG-INFO should be valid for an entire distribution, not a single build of that distribution.
- Why didn't you mention my favorite Python implementation?
- The abbreviated tags facilitate sharing compiled Python code in a public index. Your Python implementation can use this specification too, but with longer tags. Recall that all "pure Python" built distributions just use 'py'.
- Why is the ABI tag (the second tag) sometimes "none" in the reference implementation?
- Since Python 2 does not have an easy way to get to the SOABI (the concept comes from newer versions of Python 3) the reference implementation at the time of writing guesses "none". Ideally it would detect "py27(d|m|u)" analogous to newer versions of Python, but in the meantime "none" is a good enough way to say "don't know".
References
| [1] | Egg Filename-Embedded Metadata (http://peak.telecommunity.com/DevCenter/EggFormats#filename-embedded-metadata) |
| [2] | Creating Built Distributions (http://docs.python.org/distutils/builtdist.html) |
| [3] | PEP 3147 -- PYC Repository Directories (http://www.python.org/dev/peps/pep-3147/) |
Acknowledgements
The author thanks Paul Moore, Nick Coghlan, Mark Abramowitz, and Mr. Michele Lacchia for their valuable help and advice.
Copyright
This document has been placed in the public domain.
pep-0426 Metadata for Python Software Packages 2.0
| PEP: | 426 |
|---|---|
| Title: | Metadata for Python Software Packages 2.0 |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Nick Coghlan <ncoghlan at gmail.com>, Daniel Holth <dholth at gmail.com>, Donald Stufft <donald at stufft.io> |
| BDFL-Delegate: | Nick Coghlan <ncoghlan@gmail.com> |
| Discussions-To: | Distutils SIG <distutils-sig at python.org> |
| Status: | Draft |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Requires: | 440 |
| Created: | 30 Aug 2012 |
| Post-History: | 14 Nov 2012, 5 Feb 2013, 7 Feb 2013, 9 Feb 2013, 27 May 2013, 20 Jun 2013, 23 Jun 2013, 14 Jul 2013, 21 Dec 2013 |
| Replaces: | 345 |
Contents
- Abstract
- Purpose
- A Note on Time Frames
- Development, Distribution and Deployment of Python Software
- Metadata format
- Core metadata
- Source code metadata
- Semantic dependencies
- Metadata Extensions
- Extras (optional dependencies)
- Environment markers
- Updating the metadata specification
- Appendix A: Conversion notes for legacy metadata
- Appendix B: Mapping dependency declarations to an RPM SPEC file
- Appendix C: Summary of differences from PEP 345
- Metadata-Version semantics
- Switching to a JSON compatible format
- Changing the version scheme
- Source labels
- Support for different kinds of dependencies
- Support for optional dependencies for distributions
- Support for metadata extensions
- Changes to environment markers
- Updated contact information
- Changes to project URLs
- Changes to platform support
- Updated obsolescence mechanism
- Included text documents
- Appendix D: Deferred features
- Appendix E: Rejected features
- Separate lists for conditional and unconditional dependencies
- Disallowing underscores in distribution names
- Allowing the use of Unicode in distribution names
- Single list for conditional and unconditional dependencies
- Depending on source labels
- Alternative dependencies
- Compatible release comparisons in environment markers
- Conditional provides
- References
- Copyright
Abstract
This PEP describes a mechanism for publishing and exchanging metadata related to Python distributions. It includes specifics of the field names, and their semantics and usage.
This document specifies version 2.0 of the metadata format. Version 1.0 is specified in PEP 241. Version 1.1 is specified in PEP 314. Version 1.2 is specified in PEP 345.
Version 2.0 of the metadata format migrates from a custom key-value format to a JSON-compatible in-memory representation.
This version also adds fields designed to make third-party packaging of Python software easier, defines a formal extension mechanism, and adds support for optional dependencies. Finally, this version addresses several issues with the previous iteration of the standard version identification scheme.
Note
"I" in this doc refers to Nick Coghlan. Daniel and Donald either wrote or contributed to earlier versions, and have been providing feedback as this JSON-based rewrite has taken shape. Daniel and Donald have also been vetting the proposal as we go to ensure it is practical to implement for both clients and index servers.
Metadata 2.0 represents a major upgrade to the Python packaging ecosystem, and attempts to incorporate experience gained over the 15 years(!) since distutils was first added to the standard library. Some of that is just incorporating existing practices from setuptools/pip/etc, some of it is copying from other distribution systems (like Linux distros or other development language communities) and some of it is attempting to solve problems which haven't yet been well solved by anyone (like supporting clean conversion of Python source packages to distro policy compliant source packages for at least Debian and Fedora, and perhaps other platform specific distribution systems).
There will eventually be a suite of PEPs covering various aspects of the metadata 2.0 format and related systems:
- this PEP, covering the core metadata format
- PEP 440, covering the versioning identification and selection scheme
- PEP 459, covering several standard extensions
- a yet-to-be-written PEP to define v2.0 of the sdist format
- an updated wheel PEP (v1.1) to add pydist.json (and possibly convert the wheel metadata file from Key:Value to JSON)
- an updated installation database PEP to add pydist.json
- a PEP to standardise the expected command line interface for setup.py as an interface to an application's build system (rather than requiring that the build system support the distutils command system)
It's going to take a while to work through all of these and make them a reality. The main change from our last attempt at this is that we're trying to design the different pieces so we can implement them independently of each other, without requiring users to switch to a whole new tool chain (although they may have to upgrade their existing ones to start enjoying the benefits in their own work).
Many of the inline notes in this version of the PEP are there to aid reviewers that are familiar with the old metadata standards. Before this version is finalised, most of that content will be moved down to the "rationale" section at the end of the document, as it would otherwise be an irrelevant distraction for future readers.
Purpose
The purpose of this PEP is to define a common metadata interchange format for communication between software publication tools and software integration tools in the Python ecosystem. One key aim is to support full dependency analysis in that ecosystem without requiring the execution of arbitrary Python code by those doing the analysis. Another aim is to encourage good software distribution practices by default, while continuing to support the current practices of almost all existing users of the Python Package Index (both publishers and integrators). Finally, the aim is to support an upgrade path from the existing setuptools defined dependency and entry point metadata formats that is transparent to end users.
The design draws on the Python community's 15 years of experience with distutils based software distribution, and incorporates ideas and concepts from other distribution systems, including Python's setuptools, pip and other projects, Ruby's gems, Perl's CPAN, Node.js's npm, PHP's composer and Linux packaging systems such as RPM and APT.
While the specifics of this format are aimed at the Python ecosystem, some of the ideas may also be useful in the future evolution of other dependency management ecosystems.
A Note on Time Frames
There's a lot of work going on in the Python packaging space at the moment. In the near term (up until the release of Python 3.4), those efforts are focused on the existing metadata standards, both those defined in Python Enhancement Proposals, and the de facto standards defined by the setuptools project.
This PEP is about setting out a longer term goal for the ecosystem that captures those existing capabilities in a format that is easier to work with. There are still a number of key open questions (mostly related to source based distribution), and those won't be able to receive proper attention from the development community until the other near term concerns have been resolved.
At this point in time, the PEP is quite possibly still overengineered, as we're still trying to make sure we have all the use cases covered. The "transparent upgrade path from setuptools" goal brings in a lot of required functionality though, and then the aim of supporting automated creation of policy compliant downstream packages for Linux distributions adds more. However, we've at least reached the point where we're taking a critical look at the core metadata, and are pushing as much functionality out to standard metadata extensions as we can.
Development, Distribution and Deployment of Python Software
The metadata design in this PEP is based on a particular conceptual model of the software development and distribution process. This model consists of the following phases:
- Software development: this phase involves working with a source checkout for a particular application to add features and fix bugs. It is expected that developers in this phase will need to be able to build the software, run the software's automated test suite, run project specific utility scripts and publish the software.
- Software publication: this phase involves taking the developed software and making it available for use by software integrators. This includes creating the descriptive metadata defined in this PEP, as well as making the software available (typically by uploading it to an index server).
- Software integration: this phase involves taking published software components and combining them into a coherent, integrated system. This may be done directly using Python specific cross-platform tools, or it may be handled through conversion to development language neutral platform specific packaging systems.
- Software deployment: this phase involves taking integrated software components and deploying them on to the target system where the software will actually execute.
The publication and integration phases are collectively referred to as the distribution phase, and the individual software components distributed in that phase are formally referred to as "distributions", but are more colloquially known as "packages" (relying on context to disambiguate them from the "module with submodules" kind of Python package).
The exact details of these phases will vary greatly for particular use cases. Deploying a web application to a public Platform-as-a-Service provider, publishing a new release of a web framework or scientific library, creating an integrated Linux distribution or upgrading a custom application running in a secure enclave are all situations this metadata design should be able to handle.
The complexity of the metadata described in this PEP thus arises directly from the actual complexities associated with software development, distribution and deployment in a wide range of scenarios.
Supporting definitions
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
"Projects" are software components that are made available for integration. Projects include Python libraries, frameworks, scripts, plugins, applications, collections of data or other resources, and various combinations thereof. Public Python projects are typically registered on the Python Package Index [3].
"Releases" are uniquely identified snapshots of a project.
"Distributions" are the packaged files which are used to publish and distribute a release.
Depending on context, "package" may refer to either a distribution, or to an importable Python module that has a __path__ attribute and hence may also have importable submodules.
"Source archive" and "VCS checkout" both refer to the raw source code for a release, prior to creation of an sdist or binary archive.
An "sdist" is a publication format providing the distribution metadata and any source files that are essential to creating a binary archive for the distribution. Creating a binary archive from an sdist requires that the appropriate build tools be available on the system.
"Binary archives" only require that prebuilt files be moved to the correct location on the target system. As Python is a dynamically bound cross-platform language, many so-called "binary" archives will contain only pure Python source code.
"Contributors" are individuals and organizations that work together to develop a software component.
"Publishers" are individuals and organizations that make software components available for integration (typically by uploading distributions to an index server)
"Integrators" are individuals and organizations that incorporate published distributions as components of an application or larger system.
"Build tools" are automated tools intended to run on development systems, producing source and binary distribution archives. Build tools may also be invoked by integration tools in order to build software distributed as sdists rather than prebuilt binary archives.
"Index servers" are active distribution registries which publish version and dependency metadata and place constraints on the permitted metadata.
"Public index servers" are index servers which allow distribution uploads from untrusted third parties. The Python Package Index [3] is a public index server.
"Publication tools" are automated tools intended to run on development systems and upload source and binary distribution archives to index servers.
"Integration tools" are automated tools that consume the metadata and distribution archives published by an index server or other designated source, and make use of them in some fashion, such as installing them or converting them to a platform specific packaging format.
"Installation tools" are integration tools specifically intended to run on deployment targets, consuming source and binary distribution archives from an index server or other designated location and deploying them to the target system.
"Automated tools" is a collective term covering build tools, index servers, publication tools, integration tools and any other software that produces or consumes distribution version and dependency metadata.
"Legacy metadata" refers to earlier versions of this metadata specification, along with the supporting metadata file formats defined by the setuptools project.
"Distro" is used as the preferred term for Linux distributions, to help avoid confusion with the Python-specific meaning of the term "distribution".
"Dist" is the preferred abbreviation for "distributions" in the sense defined in this PEP.
"Qualified name" is a dotted Python identifier. For imported modules and packages, the qualified name is available as the __name__ attribute, while for functions and classes it is available as the __qualname__ attribute.
A "fully qualified name" uniquely locates an object in the Python module namespace. For imported modules and packages, it is the same as the qualified name. For other Python objects, the fully qualified name consists of the qualified name of the containing module or package, a colon (:) and the qualified name of the object relative to the containing module or package.
A "prefixed name" starts with a qualified name, but is not necessarily a qualified name - it may contain additional dot separated segments which are not valid identifiers.
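The distinction between qualified and fully qualified names can be illustrated with a short sketch (using the standard library's json.decoder module purely as a convenient example):

```python
import json.decoder

# For imported modules and packages, the qualified name is __name__.
assert json.decoder.__name__ == "json.decoder"

# For functions and classes, the qualified name is __qualname__.
assert json.decoder.JSONDecoder.decode.__qualname__ == "JSONDecoder.decode"

def fully_qualified_name(obj):
    """Module (or package) qualified name, a colon, then the object's
    qualified name relative to that module."""
    return f"{obj.__module__}:{obj.__qualname__}"

print(fully_qualified_name(json.decoder.JSONDecoder))
# json.decoder:JSONDecoder
```

A prefixed name such as python.details then simply starts with a qualified name (python) followed by further dot-separated segments.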
Integration and deployment of distributions
The primary purpose of the distribution metadata is to support integration and deployment of distributions as part of larger applications and systems.
Integration and deployment can in turn be broken down into further substeps.
- Build: the build step is the process of turning a VCS checkout, source archive or sdist into a binary archive. Dependencies must be available in order to build and create a binary archive of the distribution (including any documentation that is installed on target systems).
- Installation: the installation step involves getting the distribution and all of its runtime dependencies onto the target system. In this step, the distribution may already be on the system (when upgrading or reinstalling) or else it may be a completely new installation.
- Runtime: this is normal usage of a distribution after it has been installed on the target system.
These three steps may all occur directly on the target system. Alternatively the build step may be separated out by using binary archives provided by the publisher of the distribution, or by creating the binary archives on a separate system prior to deployment. The advantage of the latter approach is that it minimizes the dependencies that need to be installed on deployment targets (as the build dependencies will be needed only on the build systems).
The published metadata for distributions SHOULD allow integrators, with the aid of build and integration tools, to:
- obtain the original source code that was used to create a distribution
- identify and retrieve the dependencies (if any) required to use a distribution
- identify and retrieve the dependencies (if any) required to build a distribution from source
- identify and retrieve the dependencies (if any) required to run a distribution's test suite
- find resources on using and contributing to the project
- access sufficiently rich metadata to support contacting distribution publishers through appropriate channels, as well as finding distributions that are relevant to particular problems
Development and publication of distributions
The secondary purpose of the distribution metadata is to support effective collaboration amongst software contributors and publishers during the development phase.
The published metadata for distributions SHOULD allow contributors and publishers, with the aid of build and publication tools, to:
- perform all the same activities needed to effectively integrate and deploy the distribution
- identify and retrieve the additional dependencies needed to develop and publish the distribution
- specify the dependencies (if any) required to use the distribution
- specify the dependencies (if any) required to build the distribution from source
- specify the dependencies (if any) required to run the distribution's test suite
- specify the additional dependencies (if any) required to develop and publish the distribution
Standard build system
Note
The standard build system currently described in the PEP is a draft based on existing practices for projects using distutils or setuptools as their build system (or other projects, like d2to1, that expose a setup.py file for backwards compatibility with existing tools).
The specification doesn't currently cover expected argument support for the commands, which is a limitation that needs to be addressed before the PEP can be considered ready for acceptance.
It is also possible that the "meta build system" will be separated out into a distinct PEP in the coming months (similar to the separation of the versioning and requirement specification standard out to PEP 440).
If a suitable API can be worked out, then it may even be possible to switch to a more declarative API for build system specification.
Both development and integration of distributions rely on the ability to build extension modules and perform other operations in a distribution-independent manner.
The current iteration of the metadata relies on the distutils/setuptools commands system to support these necessary development and integration activities:
- python setup.py dist_info: generate distribution metadata in place given a source archive or VCS checkout
- python setup.py sdist: create an sdist from a source archive or VCS checkout
- python setup.py build_ext --inplace: build extension modules in place given an sdist, source archive or VCS checkout
- python setup.py test: run the distribution's test suite in place given an sdist, source archive or VCS checkout
- python setup.py bdist_wheel: create a binary archive from an sdist, source archive or VCS checkout
Metadata format
The format defined in this PEP is an in-memory representation of Python distribution metadata as a string-keyed dictionary. Permitted values for individual entries are strings, lists of strings, and additional nested string-keyed dictionaries.
Except where otherwise noted, dictionary keys in distribution metadata MUST be valid Python identifiers in order to support attribute based metadata access APIs.
The individual field descriptions show examples of the key name and value as they would be serialised as part of a JSON mapping.
The fields identified as core metadata are required. Automated tools MUST NOT accept distributions with missing core metadata as valid Python distributions.
All other fields are optional. Automated tools MUST operate correctly if a distribution does not provide them, except for those operations which specifically require the omitted fields.
Automated tools MUST NOT insert dummy data for missing fields. If a valid value is not provided for a required field then the metadata and the associated distribution MUST be rejected as invalid. If a valid value is not provided for an optional field, that field MUST be omitted entirely. Automated tools MAY automatically derive valid values from other information sources (such as a version control system).
Automated tools, especially public index servers, MAY impose additional length restrictions on metadata beyond those enumerated in this PEP. Such limits SHOULD be imposed where necessary to protect the integrity of a service, based on the available resources and the service provider's judgment of reasonable metadata capacity requirements.
Metadata files
The information defined in this PEP is serialised to pydist.json files for some use cases. These are files containing UTF-8 encoded JSON metadata.
Each metadata file consists of a single serialised mapping, with fields as described in this PEP. When serialising metadata, automated tools SHOULD lexically sort any keys and list elements in order to simplify reviews of any changes.
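The serialisation rule above can be sketched as follows (a minimal illustration with a truncated metadata mapping; real tools would serialise the full field set):

```python
import json

# A minimal metadata mapping (most fields omitted for brevity).
metadata = {
    "metadata_version": "2.0",
    "name": "ComfyChair",
    "version": "1.0a2",
    "extras": ["warmup"],
}

# Lexically sorting keys and list elements keeps the serialised output
# stable, which simplifies review of any metadata changes.
serialised = json.dumps(
    {k: sorted(v) if isinstance(v, list) else v for k, v in metadata.items()},
    sort_keys=True,
    ensure_ascii=False,
)
print(serialised)
```

The resulting pydist.json content is byte-for-byte reproducible for a given input mapping, so diffs between releases reflect only genuine metadata changes.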
There are three standard locations for these metadata files:
- as a {distribution}-{version}.dist-info/pydist.json file in an sdist source distribution archive
- as a {distribution}-{version}.dist-info/pydist.json file in a wheel binary distribution archive
- as a {distribution}-{version}.dist-info/pydist.json file in a local Python installation database
Note
These locations are to be confirmed, since they depend on the definition of sdist 2.0 and the revised installation database standard. There will also be a wheel 1.1 format update after this PEP is approved that mandates provision of 2.0+ metadata.
Note that these metadata files SHOULD NOT be processed if the version of the containing location is too low to indicate that they are valid. Specifically, unversioned sdist archives, unversioned installation database directories and version 1.0 of the wheel specification do not cover pydist.json files.
Other tools involved in Python distribution MAY also use this format.
As JSON files are generally awkward to edit by hand, it is RECOMMENDED that these metadata files be generated by build tools based on other input formats (such as setup.py) rather than being used directly as a data input format. Generating the metadata as part of the publication process also helps to deal with version specific fields (including the source URL and the version field itself).
For backwards compatibility with older installation tools, metadata 2.0 files MAY be distributed alongside legacy metadata.
Index servers MAY allow distributions to be uploaded and installation tools MAY allow distributions to be installed with only legacy metadata.
Automated tools MAY attempt to automatically translate legacy metadata to the format described in this PEP. Advice for doing so effectively is given in Appendix A.
Metadata validation
A jsonschema description of the distribution metadata is available.
This schema does NOT currently handle validation of some of the more complex string fields (instead treating them as opaque strings).
Except where otherwise noted, all URL fields in the metadata MUST comply with RFC 3986.
Note
The current version of the schema file covers the previous draft of the PEP, and has not yet been updated for the split into the essential dependency resolution metadata and multiple standard extensions.
Core metadata
This section specifies the core metadata fields that are required for every Python distribution.
Publication tools MUST ensure at least these fields are present when publishing a distribution.
Index servers MUST ensure at least these fields are present in the metadata when distributions are uploaded.
Installation tools MUST refuse to install distributions with one or more of these fields missing by default, but MAY allow users to force such an installation to occur.
Metadata version
Version of the file format; "2.0" is the only legal value.
Automated tools consuming metadata SHOULD warn if metadata_version is greater than the highest version they support, and MUST fail if metadata_version has a greater major version than the highest version they support (as described in PEP 440, the major version is the value before the first dot).
For broader compatibility, build tools MAY choose to produce distribution metadata using the lowest metadata version that includes all of the needed fields.
Example:
"metadata_version": "2.0"
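The warn-on-minor, fail-on-major rule can be sketched as below (an illustrative helper; the exact warning and failure behaviour is left to each tool):

```python
import warnings

def check_metadata_version(declared, supported="2.0"):
    """Fail on a newer major version; warn on any other newer version."""
    declared_parts = tuple(int(p) for p in declared.split("."))
    supported_parts = tuple(int(p) for p in supported.split("."))
    if declared_parts[0] > supported_parts[0]:
        raise ValueError(f"Unsupported metadata major version: {declared!r}")
    if declared_parts > supported_parts:
        warnings.warn(f"Metadata version {declared} is newer than "
                      f"supported version {supported}")

check_metadata_version("2.0")   # accepted silently
check_metadata_version("2.1")   # warns, but proceeds
# check_metadata_version("3.0") would raise ValueError
```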
Generator
Name (and optional version) of the program that generated the file, if any. A manually produced file would omit this field.
Example:
"generator": "setuptools (0.9)"
Name
The name of the distribution.
As distribution names are used as part of URLs, filenames, and command line parameters, and must also interoperate with other packaging systems, the permitted characters are constrained to:
- ASCII letters ([a-zA-Z])
- ASCII digits ([0-9])
- underscores (_)
- hyphens (-)
- periods (.)
Distribution names MUST start and end with an ASCII letter or digit.
Automated tools MUST reject non-compliant names.
All comparisons of distribution names MUST be case insensitive, and MUST consider hyphens and underscores to be equivalent.
Index servers MAY consider "confusable" characters (as defined by the Unicode Consortium in TR39: Unicode Security Mechanisms) to be equivalent.
Index servers that permit arbitrary distribution name registrations from untrusted sources SHOULD consider confusable characters to be equivalent when registering new distributions (and hence reject them as duplicates).
Integration tools MUST NOT silently accept a confusable alternate spelling as matching a requested distribution name.
At time of writing, the characters in the ASCII subset designated as confusables by the Unicode Consortium are:
- 1 (DIGIT ONE), l (LATIN SMALL LETTER L), and I (LATIN CAPITAL LETTER I)
- 0 (DIGIT ZERO), and O (LATIN CAPITAL LETTER O)
Example:
"name": "ComfyChair"
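The character rules and comparison semantics above can be sketched as follows (confusable-character handling is omitted, as it is index-server policy rather than a simple character mapping):

```python
import re

# Names must start and end with an ASCII letter or digit, with only
# letters, digits, '_', '-', and '.' permitted in between.
VALID_NAME = re.compile(r"^[A-Za-z0-9]([A-Za-z0-9._-]*[A-Za-z0-9])?$")

def normalise_name(name):
    """Comparison key: case insensitive, with hyphens and underscores
    treated as equivalent."""
    return re.sub(r"[-_]", "-", name.lower())

assert VALID_NAME.match("ComfyChair")
assert not VALID_NAME.match("-bad-name")   # bad leading character
assert normalise_name("Comfy_Chair") == normalise_name("comfy-chair")
```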
Version
The distribution's public or local version identifier, as defined in PEP 440. Version identifiers are designed for consumption by automated tools and support a variety of flexible version specification mechanisms (see PEP 440 for details).
Version identifiers MUST comply with the format defined in PEP 440.
Version identifiers MUST be unique within each project.
Index servers MAY place restrictions on the use of local version identifiers as described in PEP 440.
Example:
"version": "1.0a2"
Summary
A short summary of what the distribution does.
This field SHOULD contain fewer than 512 characters and MUST contain fewer than 2048.
This field SHOULD NOT contain any line breaks.
A more complete description SHOULD be included as a separate file in the sdist for the distribution. Refer to the python-details extension in PEP 459 for more information.
Example:
"summary": "A module that is more fiendish than soft cushions."
Source code metadata
This section specifies fields that provide identifying details for the source code used to produce this distribution.
All of these fields are optional. Automated tools MUST operate correctly if a distribution does not provide them, including failing cleanly when an operation depending on one of these fields is requested.
Source labels
Source labels are text strings with minimal defined semantics. They are intended to allow the original source code to be unambiguously identified, even if an integrator has applied additional local modifications to a particular distribution.
To ensure source labels can be readily incorporated as part of file names and URLs, and to avoid formatting inconsistencies in hexadecimal hash representations, they MUST be limited to the following set of permitted characters:
- Lowercase ASCII letters ([a-z])
- ASCII digits ([0-9])
- underscores (_)
- hyphens (-)
- periods (.)
- plus signs (+)
Source labels MUST start and end with an ASCII letter or digit.
A source label for a project MUST NOT match any defined version for that project. This restriction ensures that there is no ambiguity between version identifiers and source labels.
Examples:
"source_label": "1.0.0-alpha.1"
"source_label": "1.3.7+build.11.e0f985a"
"source_label": "v1.8.1.301.ga0df26f"
"source_label": "2013.02.17.dev123"
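A validation sketch for the rules above (the character constraints plus the no-collision-with-versions rule):

```python
import re

# Permitted: lowercase letters, digits, '_', '-', '.', '+';
# must start and end with a letter or digit.
SOURCE_LABEL = re.compile(r"^[a-z0-9]([a-z0-9._+-]*[a-z0-9])?$")

def is_valid_source_label(label, known_versions):
    """A label must match the character rules and must not collide with
    any defined version for the same project."""
    return bool(SOURCE_LABEL.match(label)) and label not in known_versions

assert is_valid_source_label("1.3.7+build.11.e0f985a", {"1.3.7"})
assert not is_valid_source_label("1.3.7", {"1.3.7"})   # clashes with a version
assert not is_valid_source_label("_private", set())    # bad leading character
```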
Source URL
A string containing a full URL where the source for this specific version of the distribution can be downloaded.
Source URLs MUST be unique within each project. This means that the URL can't be something like "https://github.com/pypa/pip/archive/master.zip", but instead must be "https://github.com/pypa/pip/archive/1.3.1.zip".
The source URL MUST reference either a source archive or a tag or specific commit in an online version control system that permits creation of a suitable VCS checkout. It is intended primarily for integrators that wish to recreate the distribution from the original source form.
All source URL references SHOULD specify a secure transport mechanism (such as https) AND include an expected hash value in the URL for verification purposes. If a source URL is specified without any hash information, with hash information that the tool doesn't understand, or with a selected hash algorithm that the tool considers too weak to trust, automated tools SHOULD at least emit a warning and MAY refuse to rely on the URL. If such a source URL also uses an insecure transport, automated tools SHOULD NOT rely on the URL.
It is RECOMMENDED that only hashes which are unconditionally provided by the latest version of the standard library's hashlib module be used for source archive hashes. At time of writing, that list consists of 'md5', 'sha1', 'sha224', 'sha256', 'sha384', and 'sha512'.
For source archive references, an expected hash value may be specified by including a <hash-algorithm>=<expected-hash> entry as part of the URL fragment.
For version control references, the VCS+protocol scheme SHOULD be used to identify both the version control system and the secure transport, and a version control system with hash based commit identifiers SHOULD be used. Automated tools MAY omit warnings about missing hashes for version control systems that do not provide hash based commit identifiers.
To handle version control systems that do not support including commit or tag references directly in the URL, that information may be appended to the end of the URL using the @<commit-hash> or the @<tag>#<commit-hash> notation.
Note
This isn't quite the same as the existing VCS reference notation supported by pip. Firstly, the distribution name is moved in front rather than embedded as part of the URL. Secondly, the commit hash is included even when retrieving based on a tag, in order to meet the requirement above that every link should include a hash to make things harder to forge (creating a malicious repo with a particular tag is easy, creating one with a specific hash, less so).
Example:
"source_url": "https://github.com/pypa/pip/archive/1.3.1.zip#sha1=da9234ee9982d4bbb3c72346a6de940a148ea686"
"source_url": "git+https://github.com/pypa/pip.git@1.3.1#7921be1537eac1e97bc40179a57f0349c2aee67d"
"source_url": "git+https://github.com/pypa/pip.git@7921be1537eac1e97bc40179a57f0349c2aee67d"
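Extracting the expected hash from a source archive URL fragment, per the <hash-algorithm>=<expected-hash> convention above, might look like this (a sketch; VCS-style @<tag>#<commit-hash> references would need separate handling):

```python
import hashlib
from urllib.parse import urlsplit

def expected_hash(source_url):
    """Return the (algorithm, hexdigest) pair from a source archive URL
    fragment, or None if no recognisable hash is present."""
    fragment = urlsplit(source_url).fragment
    algorithm, sep, digest = fragment.partition("=")
    if sep and algorithm in hashlib.algorithms_guaranteed:
        return algorithm, digest
    return None

url = ("https://github.com/pypa/pip/archive/1.3.1.zip"
       "#sha1=da9234ee9982d4bbb3c72346a6de940a148ea686")
print(expected_hash(url))
# ('sha1', 'da9234ee9982d4bbb3c72346a6de940a148ea686')
```

A tool that does not recognise the named algorithm, or finds no fragment at all, would then fall back to the warn-or-refuse behaviour described above.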
Semantic dependencies
Dependency metadata allows distributions to make use of functionality provided by other distributions, without needing to bundle copies of those distributions.
Semantic dependencies allow publishers to indicate not only which other distributions are needed, but also why they're needed. This additional information allows integrators to install just the dependencies they need for specific activities, making it easier to minimise installation footprints in constrained environments (regardless of the reasons for those constraints).
Distributions may declare five different kinds of dependency:
- Runtime dependencies: other distributions that are needed to actually use this distribution (but are not considered subdistributions).
- "Meta" dependencies: subdistributions that are grouped together into a single larger metadistribution for ease of reference and installation.
- Test dependencies: other distributions that are needed to run the automated test suite for this distribution (but are not needed just to use it).
- Build dependencies: other distributions that are needed to build this distribution.
- Development dependencies: other distributions that are needed when working on this distribution (but do not fit into one of the other dependency categories).
Within each of these categories, distributions may also declare "Extras". Extras are dependencies that may be needed for some optional functionality, or which are otherwise complementary to the distribution.
Dependency management is heavily dependent on the version identification and specification scheme defined in PEP 440.
All of these fields are optional. Automated tools MUST operate correctly if a distribution does not provide them, by assuming that a missing field indicates "Not applicable for this distribution".
Dependency specifiers
While many dependencies will be needed to use a distribution at all, others are needed only on particular platforms or only when particular optional features of the distribution are needed. To handle this, dependency specifiers are mappings with the following subfields:
- requires: a list of requirement specifiers needed to satisfy the dependency
- extra: the name of a set of optional dependencies that are requested and installed together. See Extras (optional dependencies) for details.
- environment: an environment marker defining the environment that needs these dependencies. See Environment markers for details.
requires is the only required subfield. When it is the only subfield, the dependencies are said to be unconditional. If extra or environment is specified, then the dependencies are conditional.
All three fields may be supplied, indicating that the dependencies are needed only when the named extra is requested in a particular environment.
Automated tools MUST combine related dependency specifiers (those with common values for extra and environment) into a single specifier listing multiple requirements when serialising metadata or passing it to an install hook.
Despite this required normalisation, the same extra name or environment marker MAY appear in multiple conditional dependencies. This may happen, for example, if an extra itself only needs some of its dependencies in specific environments. It is only the combination of extras and environment markers that is required to be unique in a list of dependency specifiers.
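The required normalisation can be sketched as a merge keyed on the (extra, environment) pair (an illustrative helper, not a complete serialiser):

```python
from collections import defaultdict

def combine_specifiers(specifiers):
    """Merge dependency specifiers sharing the same (extra, environment)
    combination into a single specifier listing all their requirements."""
    merged = defaultdict(list)
    for spec in specifiers:
        key = (spec.get("extra"), spec.get("environment"))
        merged[key].extend(spec["requires"])
    result = []
    for (extra, environment), requires in merged.items():
        combined = {"requires": sorted(requires)}
        if extra is not None:
            combined["extra"] = extra
        if environment is not None:
            combined["environment"] = environment
        result.append(combined)
    return result

print(combine_specifiers([
    {"requires": ["SciPy"]},
    {"requires": ["PasteDeploy"]},
    {"requires": ["SoftCushions"], "extra": "warmup"},
]))
# [{'requires': ['PasteDeploy', 'SciPy']},
#  {'requires': ['SoftCushions'], 'extra': 'warmup'}]
```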
Any extras referenced from a dependency specifier MUST be named in the Extras field for this distribution. This helps avoid typographical errors and also makes it straightforward to identify the available extras without scanning the full set of dependencies.
Requirement specifiers
Individual requirements are defined as strings containing a distribution name (as found in the name field). The distribution name may be followed by an extras specifier (enclosed in square brackets) and by a version specifier or direct reference.
Whitespace is permitted between the distribution name and an opening square bracket or parenthesis. Whitespace is also permitted between a closing square bracket and the version specifier.
See Extras (optional dependencies) for details on extras and PEP 440 for details on version specifiers and direct references.
The distribution names should correspond to names as found on the Python Package Index [3]; while these names are often the same as the module names as accessed with import x, this is not always the case (especially for distributions that provide multiple top level modules or packages).
Example requirement specifiers:
"Flask"
"Django"
"Pyramid"
"SciPy ~= 0.12"
"ComfyChair[warmup]"
"ComfyChair[warmup] > 0.1"
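A simplified parser for the forms shown above (an illustrative sketch only; real tools should use a full PEP 440-aware parser for the version specifier portion):

```python
import re

# name, optional [extras], optional version specifier or direct reference
REQUIREMENT = re.compile(
    r"^(?P<name>[A-Za-z0-9][A-Za-z0-9._-]*)"   # distribution name
    r"\s*(?:\[(?P<extras>[^\]]+)\])?"          # optional extras specifier
    r"\s*(?P<specifier>.*)$"                   # remainder: version specifier
)

def parse_requirement(text):
    match = REQUIREMENT.match(text.strip())
    extras = match.group("extras")
    return {
        "name": match.group("name"),
        "extras": [e.strip() for e in extras.split(",")] if extras else [],
        "specifier": match.group("specifier").strip(),
    }

print(parse_requirement("ComfyChair[warmup] > 0.1"))
# {'name': 'ComfyChair', 'extras': ['warmup'], 'specifier': '> 0.1'}
```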
Mapping dependencies to development and distribution activities
The different categories of dependency are based on the various distribution and development activities identified above, and govern which dependencies should be installed for the specified activities:
Implied runtime dependencies:
- run_requires
- meta_requires
Implied build dependencies:
- build_requires
- If running the distribution's test suite as part of the build process, request the :run:, :meta:, and :test: extras to also install:
- run_requires
- meta_requires
- test_requires
Implied development and publication dependencies:
- run_requires
- meta_requires
- build_requires
- test_requires
- dev_requires
The notation described in Extras (optional dependencies) SHOULD be used to determine exactly what gets installed for various operations.
Installation tools SHOULD report an error if dependencies cannot be satisfied, MUST at least emit a warning, and MAY allow the user to force the installation to proceed regardless.
See Appendix B for an overview of mapping these dependencies to an RPM spec file.
Extras
A list of optional sets of dependencies that may be used to define conditional dependencies in dependency fields. See Extras (optional dependencies) for details.
The names of extras MUST abide by the same restrictions as those for distribution names.
Example:
"extras": ["warmup"]
Run requires
A list of other distributions needed to actually run this distribution.
Automated tools MUST NOT allow strict version matching clauses or direct references in this field - if permitted at all, such clauses should appear in meta_requires instead.
Example:
"run_requires": [
{
"requires": ["SciPy", "PasteDeploy", "zope.interface > 3.5.0"]
},
{
"requires": ["pywin32 > 1.0"],
"environment": "sys_platform == 'win32'"
},
{
"requires": ["SoftCushions"],
"extra": "warmup"
}
]
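Selecting the requirements that apply in a given situation can be sketched as below. Environment markers are evaluated with eval() against a restricted namespace purely for illustration; real tools use a dedicated marker parser.

```python
def active_requirements(specifiers, environment, requested_extras=()):
    """Collect the requirements whose extra (if any) was requested and
    whose environment marker (if any) evaluates true."""
    selected = []
    for spec in specifiers:
        extra = spec.get("extra")
        if extra is not None and extra not in requested_extras:
            continue
        marker = spec.get("environment")
        if marker is not None and not eval(marker, {"__builtins__": {}},
                                           dict(environment)):
            continue
        selected.extend(spec["requires"])
    return selected

run_requires = [
    {"requires": ["SciPy", "PasteDeploy", "zope.interface > 3.5.0"]},
    {"requires": ["pywin32 > 1.0"], "environment": "sys_platform == 'win32'"},
    {"requires": ["SoftCushions"], "extra": "warmup"},
]
print(active_requirements(run_requires, {"sys_platform": "linux"}, ["warmup"]))
# ['SciPy', 'PasteDeploy', 'zope.interface > 3.5.0', 'SoftCushions']
```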
Meta requires
An abbreviation of "metadistribution requires". This is a list of subdistributions that can easily be installed and used together by depending on this metadistribution.
In this field, automated tools:
- MUST allow strict version matching
- MUST NOT allow more permissive version specifiers.
- MAY allow direct references
Public index servers SHOULD NOT allow the use of direct references in uploaded distributions. Direct references are intended primarily as a tool for software integrators rather than publishers.
Distributions that rely on direct references to platform specific binary archives SHOULD define appropriate constraints in their supports_environments field.
Example:
"meta_requires": [
{
"requires": ["ComfyUpholstery == 1.0a2",
"ComfySeatCushion == 1.0a2"]
},
{
"requires": ["CupOfTeaAtEleven == 1.0a2"],
"environment": "'linux' in sys_platform"
}
]
Test requires
A list of other distributions needed in order to run the automated tests for this distribution.
Automated tools MAY disallow strict version matching clauses and direct references in this field and SHOULD at least emit a warning for such clauses.
Public index servers SHOULD NOT allow strict version matching clauses or direct references in this field.
Example:
"test_requires": [
{
"requires": ["unittest2"]
},
{
"requires": ["pywin32 > 1.0"],
"environment": "sys_platform == 'win32'"
},
{
"requires": ["CompressPadding"],
"extra": "warmup"
}
]
Build requires
A list of other distributions needed when this distribution is being built (creating a binary archive from an sdist, source archive or VCS checkout).
Note that while these are build dependencies for the distribution being built, the installation is a deployment scenario for the dependencies.
Automated tools MAY disallow strict version matching clauses and direct references in this field and SHOULD at least emit a warning for such clauses.
Public index servers SHOULD NOT allow strict version matching clauses or direct references in this field.
Example:
"build_requires": [
{
"requires": ["setuptools >= 0.7"]
},
{
"requires": ["pywin32 > 1.0"],
"environment": "sys_platform == 'win32'"
},
{
"requires": ["cython"],
"extra": "c-accelerators"
}
]
Dev requires
A list of any additional distributions needed during development of this distribution that aren't already covered by the deployment and build dependencies.
Additional dependencies that may be listed in this field include:
- tools needed to create an sdist from a source archive or VCS checkout
- tools needed to generate project documentation that is published online rather than distributed along with the rest of the software
Automated tools MAY disallow strict version matching clauses and direct references in this field and SHOULD at least emit a warning for such clauses.
Public index servers SHOULD NOT allow strict version matching clauses or direct references in this field.
Example:
"dev_requires": [
{
"requires": ["hgtools", "sphinx >= 1.0"]
},
{
"requires": ["pywin32 > 1.0"],
"environment": "sys_platform == 'win32'"
}
]
Provides
A list of strings naming additional dependency requirements that are satisfied by installing this distribution. These strings must be of the form Name or Name (Version), as for the requires field.
While dependencies are usually resolved based on distribution names and versions, a distribution may provide additional names explicitly in the provides field.
For example, this may be used to indicate that multiple projects have been merged into and replaced by a single distribution or to indicate that this project is a substitute for another.
For instance, with distribute merged back into setuptools, the merged project is able to include a "provides": ["distribute"] entry to satisfy any projects that require the now obsolete distribution's name.
To avoid malicious hijacking of names, when interpreting metadata retrieved from a public index server, automated tools MUST NOT pay any attention to "provides" entries that do not correspond to a published distribution.
However, to appropriately handle project forks and mergers, automated tools MUST accept "provides" entries that name other distributions when the entry is retrieved from a local installation database or when there is a corresponding "obsoleted_by" entry in the metadata for the named distribution.
A distribution may wish to depend on a "virtual" project name, which does not correspond to any separately distributed project: such a name might be used to indicate an abstract capability which could be supplied by one of multiple projects. For example, multiple projects might supply PostgreSQL bindings for use with SQLAlchemy: each project might declare that it provides sqlalchemy-postgresql-bindings, allowing other projects to depend only on having at least one of them installed.
To handle this case in a way that doesn't allow for name hijacking, the authors of the distribution that first defines the virtual dependency should create a project on the public index server with the corresponding name, and depend on the specific distribution that should be used if no other provider is already installed. This also has the benefit of publishing the default provider in a way that automated tools will understand.
A version declaration may be supplied as part of an entry in the provides field and must follow the rules described in PEP 440. The distribution's version identifier will be implied if none is specified.
Example:
"provides": ["AnotherProject (3.4)", "virtual-package"]
Obsoleted by
A string that indicates that this project is no longer being developed. The named project provides a substitute or replacement.
A version declaration may be supplied and must follow the rules described in PEP 440.
An inactive project may be explicitly indicated by setting this field to None (which is serialised as null in JSON as usual).
Automated tools SHOULD report a warning when installing an obsolete project.
Possible uses for this field include handling project name changes and project mergers.
For instance, with distribute merging back into setuptools, a new version of distribute may be released that depends on the new version of setuptools, and also explicitly indicates that distribute itself is now obsolete.
Note that without a corresponding provides, there is no expectation that the replacement project will be a "drop-in" replacement for the obsolete project - at the very least, upgrading to the new distribution is likely to require changes to import statements.
Examples:
"name": "BadName", "obsoleted_by": "AcceptableName"

"name": "distribute", "obsoleted_by": "setuptools >= 0.7"
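The SHOULD-level installer warning described above can be sketched in Python. The metadata dictionary shape follows the examples; the helper name is hypothetical:

```python
def obsolescence_message(metadata):
    """Return a warning message for an obsolete project, or None.

    *metadata* is the parsed pydist metadata as a Python dictionary.
    A missing obsoleted_by field means the project is still active;
    an explicit None (JSON null) means the project is inactive with
    no named successor.
    """
    if "obsoleted_by" not in metadata:
        return None
    replacement = metadata["obsoleted_by"]
    name = metadata.get("name", "<unknown>")
    if replacement is None:
        return "%s is no longer being developed" % name
    return "%s is no longer being developed; use %s instead" % (name, replacement)
```

For instance, `obsolescence_message({"name": "distribute", "obsoleted_by": "setuptools >= 0.7"})` produces a message directing users to the new version of setuptools.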
Metadata Extensions
Extensions to the metadata MAY be present in a mapping under the extensions key. The keys MUST be valid prefixed names, while the values MUST themselves be nested mappings.
Two key names are reserved and MUST NOT be used by extensions, except as described below:
- extension_version
- installer_must_handle
The following example shows the python.details and python.commands standard extensions from PEP 459:
"extensions" : {
"python.details": {
"license": "GPL version 3, excluding DRM provisions",
"keywords": [
"comfy", "chair", "cushions", "too silly", "monty python"
],
"classifiers": [
"Development Status :: 4 - Beta",
"Environment :: Console (Text Based)",
"License :: OSI Approved :: GNU General Public License v3 (GPLv3)"
],
"document_names": {
"description": "README.rst",
"license": "LICENSE.rst",
"changelog": "NEWS"
}
},
"python.commands": {
"wrap_console": [{"chair": "chair:run_cli"}],
"wrap_gui": [{"chair-gui": "chair:run_gui"}],
"prebuilt": ["reduniforms"]
}
}
Extension names are defined by distributions that will then make use of the additional published metadata in some way.
To reduce the chance of name conflicts, extension names SHOULD use a prefix that corresponds to a module name in the distribution that defines the meaning of the extension. This practice will also make it easier to find authoritative documentation for metadata extensions.
Metadata extensions allow development tools to record information in the metadata that may be useful during later phases of distribution, but is not essential for dependency resolution or building the software.
Extension versioning
Extensions MUST be versioned, using the extension_version key. However, if this key is omitted, then the implied version is 1.0.
Automated tools consuming extension metadata SHOULD warn if extension_version is greater than the highest version they support, and MUST fail if extension_version has a greater major version than the highest version they support (as described in PEP 440, the major version is the value before the first dot).
For broader compatibility, build tools MAY choose to produce extension metadata using the lowest metadata version that includes all of the needed fields.
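The version-handling rules above can be sketched as follows. This is a simplified illustration that assumes plain dotted release versions (it does not handle the full PEP 440 syntax), and the function name is hypothetical:

```python
import warnings

def check_extension_version(declared, highest_supported):
    """Apply the extension_version handling rules (a sketch).

    Both arguments are dotted version strings; the major version is
    the value before the first dot, as described in PEP 440.
    """
    def parts(version):
        # Assumes simple release versions like "1.0" or "2.3.1"
        return tuple(int(p) for p in version.split("."))
    if parts(declared)[0] > parts(highest_supported)[0]:
        # MUST fail on a higher major version
        raise RuntimeError("unsupported extension version: " + declared)
    if parts(declared) > parts(highest_supported):
        # SHOULD warn on a newer (but same-major) version
        warnings.warn("extension version %s is newer than %s"
                      % (declared, highest_supported))
```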
Required extension handling
A project may consider correct handling of some extensions to be essential to correct installation of the software. This is indicated by setting the installer_must_handle field to true. Setting it to false or omitting it altogether indicates that processing the extension when installing the distribution is not considered mandatory by the developers.
Installation tools MUST fail if installer_must_handle is set to true for an extension and the tool does not have any ability to process that particular extension (whether directly or through a tool-specific plugin system).
If an installation tool encounters a required extension it doesn't understand when attempting to install from a wheel archive, it MAY fall back on attempting to install from source rather than failing entirely.
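A minimal sketch of the MUST-level check for required extensions (the function name is hypothetical; *handled* stands in for whatever registry of directly supported extensions and tool-specific plugins an installer maintains):

```python
def check_required_extensions(metadata, handled):
    """Fail on required extensions the tool cannot process (a sketch).

    *metadata* is the parsed pydist metadata as a dictionary and
    *handled* is the set of extension names this installation tool
    (or its plugins) is able to process.
    """
    for name, extension in metadata.get("extensions", {}).items():
        # installer_must_handle defaults to false when omitted
        if extension.get("installer_must_handle", False) and name not in handled:
            raise RuntimeError("cannot process required extension: " + name)
```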
Extras (optional dependencies)
Extras are additional dependencies that enable an optional aspect of the distribution, often corresponding to a try: import optional_dependency ... block in the code. To support the use of the distribution with or without the optional dependencies they are listed separately from the distribution's core dependencies and must be requested explicitly, either in the dependency specifications of another distribution, or else when issuing a command to an installation tool.
Note that installation of extras is not tracked directly by installation tools: extras are merely a convenient way to indicate a set of dependencies that is needed to provide some optional functionality of the distribution. If selective installation of components is desired, then multiple distributions must be defined rather than relying on the extras system.
The names of extras MUST abide by the same restrictions as those for distribution names.
Example of a distribution with optional dependencies:
"name": "ComfyChair",
"extras": ["warmup", "c-accelerators"],
"run_requires": [
    {
        "requires": ["SoftCushions"],
        "extra": "warmup"
    }
],
"build_requires": [
    {
        "requires": ["cython"],
        "extra": "c-accelerators"
    }
]
Other distributions require the additional dependencies by placing the relevant extra names inside square brackets after the distribution name when specifying the dependency.
Extra specifications MUST allow the following additional syntax:
- Multiple extras can be requested by separating them with a comma within the brackets.
- The following special extras request processing of the corresponding lists of dependencies:
  - :meta: (meta_requires)
  - :run: (run_requires)
  - :test: (test_requires)
  - :build: (build_requires)
  - :dev: (dev_requires)
  - :*: (process all dependency lists)
- The * character as an extra is a wild card that enables all of the entries defined in the distribution's extras field.
- Extras may be explicitly excluded by prefixing their name with a - character (this is useful in conjunction with * to exclude only particular extras that are definitely not wanted, while enabling all others).
- The - character as an extra specification indicates that the distribution itself should NOT be installed, and also disables the normally implied processing of :meta: and :run: dependencies (those may still be requested explicitly using the appropriate extra specifications).
Command line based installation tools SHOULD support this same syntax to allow extras to be requested explicitly.
The full set of dependency requirements is then based on the top level dependencies, along with those of any requested extras.
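The expansion rules for an extras specification can be sketched in Python. This is a non-normative illustration of the bullet points above (the function name and return shape are hypothetical); the comma-separated entries are assumed to have already been extracted from inside the square brackets:

```python
def expand_extras(spec, defined_extras):
    """Expand an extras specification like "-,:build:,*" (a sketch).

    Returns a tuple (install_dist, dependency_lists, extras):
    whether the distribution itself should be installed, which
    dependency lists to process, and which named extras are enabled.
    """
    install_dist = True
    # :meta: and :run: processing is normally implied
    dep_lists = {"meta_requires", "run_requires"}
    all_lists = {"meta_requires", "run_requires", "build_requires",
                 "test_requires", "dev_requires"}
    extras, excluded = set(), set()
    for entry in spec.split(","):
        entry = entry.strip()
        if entry == "-":
            # Don't install the distribution; drop the implied lists
            install_dist = False
            dep_lists -= {"meta_requires", "run_requires"}
        elif entry == "*":
            # Wild card: enable every entry in the extras field
            extras |= set(defined_extras)
        elif entry == ":*:":
            dep_lists |= all_lists
        elif entry.startswith(":") and entry.endswith(":"):
            dep_lists.add(entry[1:-1] + "_requires")
        elif entry.startswith("-"):
            excluded.add(entry[1:])
        else:
            extras.add(entry)
    return install_dist, dep_lists, extras - excluded
```

Under these assumptions, `expand_extras("-,:build:,*", ["warmup", "c-accelerators"])` yields the build dependencies with all extras and no installation of the distribution itself, matching the pip example below.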
Dependency examples (showing just the requires subfield):
"requires": ["ComfyChair[warmup]"]
-> requires ``ComfyChair`` and ``SoftCushions``
"requires": ["ComfyChair[*]"]
-> requires ``ComfyChair`` and ``SoftCushions``, but will also
pick up any new extras defined in later versions
Command line examples:
pip install ComfyChair
-> installs ComfyChair with applicable :meta: and :run: dependencies
pip install ComfyChair[*]
-> as above, but also installs all extra dependencies
pip install ComfyChair[-,:build:,*]
-> installs just the build dependencies with all extras
pip install ComfyChair[-,:build:,:run:,:meta:,:test:,*]
-> as above, but also installs dependencies needed to run the tests
pip install ComfyChair[-,:*:,*]
-> installs the full set of development dependencies, but avoids
installing ComfyChair itself
Environment markers
An environment marker describes a condition about the current execution environment. They are used to indicate when certain dependencies are only required in particular environments, and to indicate supported platforms for distributions with additional constraints beyond the availability of a Python runtime.
Here are some examples of such markers:
"sys_platform == 'win32'"
"platform_machine == 'i386'"
"python_version == '2.4' or python_version == '2.5'"
"'linux' in sys_platform"
And here's an example of some conditional metadata for a distribution that requires PyWin32 both at runtime and buildtime when using Windows:
"name": "ComfyChair",
"run_requires": [
    {
        "requires": ["pywin32 > 1.0"],
        "environment": "sys_platform == 'win32'"
    }
],
"build_requires": [
    {
        "requires": ["pywin32 > 1.0"],
        "environment": "sys_platform == 'win32'"
    }
]
The micro-language behind this is a simple subset of Python: it compares only strings, with the == and in operators (and their opposites), and with the ability to combine expressions. Parentheses are supported for grouping.
The pseudo-grammar is
MARKER: EXPR [(and|or) EXPR]*
EXPR: ("(" MARKER ")") | (SUBEXPR [CMPOP SUBEXPR])
CMPOP: (==|!=|<|>|<=|>=|in|not in)
where SUBEXPR is either a Python string (such as '2.4', or 'win32') or one of the following marker variables:
- python_version: '{0.major}.{0.minor}'.format(sys.version_info)
- python_full_version: see definition below
- os_name: os.name
- sys_platform: sys.platform
- platform_release: platform.release()
- platform_version: platform.version()
- platform_machine: platform.machine()
- platform_python_implementation: platform.python_implementation()
- implementation_name: sys.implementation.name
- implementation_version: see definition below
If a particular value is not available (such as the sys.implementation subattributes in versions of Python prior to 3.3), the corresponding marker variable MUST be considered equivalent to the empty string.
Note that all subexpressions are restricted to strings or one of the marker variable names (which refer to string values), meaning that it is not possible to use other sequences like tuples or lists on the right side of the in and not in operators.
Chaining of comparison operations is permitted using the normal Python semantics of an implied and.
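Because the marker micro-language is a strict subset of Python expressions, an evaluator can lean on Python's own parser rather than implementing the pseudo-grammar directly. The following is a hedged, non-normative sketch (the function name is hypothetical); per the rules above, missing marker variables evaluate to the empty string, and comparison chaining uses an implied and:

```python
import ast
import operator

# Comparison operators permitted by the marker subset
_OPS = {
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
    ast.Lt: operator.lt, ast.Gt: operator.gt,
    ast.LtE: operator.le, ast.GtE: operator.ge,
    ast.In: lambda a, b: a in b,
    ast.NotIn: lambda a, b: a not in b,
}

def evaluate_marker(marker, env):
    """Evaluate an environment marker against a variable mapping (a sketch)."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.BoolOp):
            values = [ev(v) for v in node.values]
            return all(values) if isinstance(node.op, ast.And) else any(values)
        if isinstance(node, ast.Compare):
            # Chained comparisons combine with an implied "and"
            left, result = ev(node.left), True
            for op, comparator in zip(node.ops, node.comparators):
                right = ev(comparator)
                result = result and _OPS[type(op)](left, right)
                left = right
            return result
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            return node.value
        if isinstance(node, ast.Name):
            # Unavailable marker variables are the empty string
            return env.get(node.id, "")
        raise ValueError("not part of the marker subset: %r" % node)
    return ev(ast.parse(marker, mode="eval"))
```

Parentheses for grouping come for free from the parser; anything outside the subset (tuples, lists, attribute access, function calls) is rejected.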
The python_full_version and implementation_version marker variables are derived from sys.version_info and sys.implementation.version respectively, in accordance with the following algorithm:
def format_full_version(info):
    version = '{0.major}.{0.minor}.{0.micro}'.format(info)
    kind = info.releaselevel
    if kind != 'final':
        version += kind[0] + str(info.serial)
    return version

python_full_version = format_full_version(sys.version_info)
implementation_version = format_full_version(sys.implementation.version)
python_full_version will typically correspond to the leading segment of sys.version.
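Putting the variable definitions together, the full marker environment for the running interpreter might be collected as follows. This is a sketch (the default_environment name is hypothetical) that reuses the format_full_version algorithm above; unavailable values fall back to the empty string as required:

```python
import os
import platform
import sys

def format_full_version(info):
    version = '{0.major}.{0.minor}.{0.micro}'.format(info)
    kind = info.releaselevel
    if kind != 'final':
        version += kind[0] + str(info.serial)
    return version

def default_environment():
    """Collect the marker variables for the running interpreter (a sketch)."""
    if hasattr(sys, "implementation"):
        implementation_name = sys.implementation.name
        implementation_version = format_full_version(sys.implementation.version)
    else:
        # Prior to Python 3.3 these values are unavailable
        implementation_name = implementation_version = ""
    return {
        "python_version": "{0}.{1}".format(*sys.version_info[:2]),
        "python_full_version": format_full_version(sys.version_info),
        "os_name": os.name,
        "sys_platform": sys.platform,
        "platform_release": platform.release(),
        "platform_version": platform.version(),
        "platform_machine": platform.machine(),
        "platform_python_implementation": platform.python_implementation(),
        "implementation_name": implementation_name,
        "implementation_version": implementation_version,
    }
```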
Updating the metadata specification
The metadata specification may be updated with clarifications without requiring a new PEP or a change to the metadata version.
Changing the meaning of existing fields or adding new features (other than through the extension mechanism) requires a new metadata version defined in a new PEP.
Appendix A: Conversion notes for legacy metadata
The reference implementations for converting from legacy metadata to metadata 2.0 are:
- the wheel project, which adds the bdist_wheel command to setuptools
- the Warehouse project, which will eventually be migrated to the Python Packaging Authority as the next generation Python Package Index implementation
- the distlib project which is derived from the core packaging infrastructure created for the distutils2 project
Note
These tools have yet to be updated for the switch to standard extensions for several fields.
While it is expected that there may be some edge cases where manual intervention is needed for clean conversion, the specification has been designed to allow fully automated conversion of almost all projects on PyPI.
Metadata conversion (especially on the part of the index server) is a necessary step to allow installation and analysis tools to start benefiting from the new metadata format, without having to wait for developers to upgrade to newer build systems.
Appendix B: Mapping dependency declarations to an RPM SPEC file
As an example of mapping this PEP to Linux distro packages, assume an example project without any extras defined is split into 2 RPMs in a SPEC file: example and example-devel.
The meta_requires and run_requires dependencies would be mapped to the Requires dependencies for the "example" RPM (a mapping from environment markers relevant to Linux to SPEC file conditions would also allow those to be handled correctly).
The build_requires dependencies would be mapped to the BuildRequires dependencies for the "example" RPM.
All defined dependencies relevant to Linux, including those in dev_requires and test_requires would become Requires dependencies for the "example-devel" RPM.
A documentation toolchain dependency like Sphinx would either go in build_requires (for example, if man pages were included in the built distribution) or in dev_requires (for example, if the documentation is published solely through ReadTheDocs or the project website). This would be enough to allow an automated converter to map it to an appropriate dependency in the spec file.
If the project did define any extras, those could be mapped to additional virtual RPMs with appropriate BuildRequires and Requires entries based on the details of the dependency specifications. Alternatively, they could be mapped to other system package manager features (such as package lists in yum).
Other system package managers may have other options for dealing with extras (Debian packagers, for example, would have the option to map them to "Recommended" or "Suggested" package entries).
The metadata extension format should also allow distribution specific hints to be included in the upstream project metadata without needing to manually duplicate any of the upstream metadata in a distribution specific format.
Appendix C: Summary of differences from PEP 345
- Metadata-Version is now 2.0, with semantics specified for handling version changes
- The increasingly complex ad hoc "Key: Value" format has been replaced by a more structured JSON-compatible format that is easily represented as Python dictionaries, strings, and lists
- Most fields are now optional and filling in dummy data for omitted fields is explicitly disallowed
- Explicit permission for in-place clarifications without releasing a new version of the specification
- The PEP now attempts to provide more of an explanation of why the fields exist and how they are intended to be used, rather than being a simple description of the permitted contents
- Changed the version scheme to be based on PEP 440 rather than PEP 386
- Added the source label mechanism as described in PEP 440
- Support for different kinds of dependencies
- The "Extras" optional dependency mechanism
- A well-defined metadata extension mechanism, and migration of any fields not needed for dependency resolution to standard extensions.
- Clarify and simplify various aspects of environment markers:
- allow use of parentheses for grouping in the pseudo-grammar
- consistently use underscores instead of periods in the variable names
- allow ordered string comparisons and chained comparisons
- New constraint mechanism to define supported environments and ensure compatibility between independently built binary components at installation time
- Updated obsolescence mechanism
- More flexible system for defining contact points and contributors
- Defined a recommended set of project URLs
- Identification of supporting documents in the dist-info directory:
- Allows markup formats to be indicated through file extensions
- Standardises the common practice of taking the description from README
- Also supports inclusion of license files and changelogs
- With all due respect to Charles Schulz and Peanuts, many of the examples have been updated to be more thematically appropriate [4] for Python ;)
The rationale for major changes is given in the following sections.
Metadata-Version semantics
The semantics of major and minor version increments are now specified, and follow the same model as the format version semantics specified for the wheel format in PEP 427: minor version increments must behave reasonably when processed by a tool that only understand earlier metadata versions with the same major version, while major version increments may include changes that are not compatible with existing tools.
The major version number of the specification has been incremented accordingly, as PEP 426 metadata obviously cannot be interpreted in accordance with earlier metadata specifications.
Whenever the major version number of the specification is incremented, it is expected that deployment will take some time, as either metadata consuming tools must be updated before other tools can safely start producing the new format, or else the sdist and wheel formats, along with the installation database definition, will need to be updated to support provision of multiple versions of the metadata in parallel.
Existing tools won't abide by this guideline until they're updated to support the new metadata standard, so the new semantics will first take effect for a hypothetical 2.x -> 3.0 transition. For the 1.x -> 2.0 transition, we will use the approach where tools continue to produce the existing supplementary files (such as entry_points.txt) in addition to any equivalents specified using the new features of the standard metadata format (including the formal extension mechanism).
Switching to a JSON compatible format
The old "Key:Value" format was becoming increasingly limiting, with various complexities like parsers needing to know which fields were permitted to occur more than once, which fields supported the environment marker syntax (with an optional ";" to separate the value from the marker) and eventually even the option to embed arbitrary JSON inside particular subfields.
The old serialisation format also wasn't amenable to easy conversion to standard Python data structures for use in the new install hook APIs, or in future extensions to the importer APIs to allow them to provide information for inclusion in the installation database.
Accordingly, we've taken the step of switching to a JSON-compatible metadata format. This works better for APIs and is much easier for tools to parse and generate correctly. Changing the name of the metadata file also makes it easy to distribute 1.x and 2.x metadata in parallel, greatly simplifying several aspects of the migration to the new metadata format.
The specific choice of pydist.json as the preferred file name relates to the fact that the metadata described in these files applies to the distribution as a whole, rather than to any particular build. Additional metadata formats may be defined in the future to hold information that can only be determined after building a binary distribution for a particular target environment.
Changing the version scheme
See PEP 440 for a detailed rationale for the various changes made to the versioning scheme.
Source labels
The new source label support is intended to make it clearer that the constraints on public version identifiers are there primarily to aid in the creation of reliable automated dependency analysis tools. Projects are free to use whatever versioning scheme they like internally, so long as they are able to translate it to something the dependency analysis tools will understand.
Source labels also make it straightforward to record specific details of a version, like a hash or tag name that allows the release to be reconstructed from the project version control system.
Support for different kinds of dependencies
The separation of the five different kinds of dependency allows a distribution to indicate whether a dependency is needed specifically to develop, build, test or use the distribution.
To allow for metadistributions like PyObjC, while still actively discouraging overly strict dependency specifications, the separate meta dependency fields are used to separate out those dependencies where exact version specifications are appropriate.
The advantage of having these distinctions supported in the upstream Python-specific metadata is that even if a project doesn't care about these distinctions itself, it may be more amenable to patches from downstream redistributors that separate the fields appropriately. Over time, this should allow much greater control over where and when particular dependencies end up being installed.
The names for the dependency fields have been deliberately chosen to avoid conflicting with the existing terminology in setuptools and previous versions of the metadata standard. Specifically, the names requires, install_requires and setup_requires are not used, which will hopefully reduce confusion when converting legacy metadata to the new standard.
Support for optional dependencies for distributions
The new extras system allows distributions to declare optional behaviour, and to use the dependency fields to indicate when particular dependencies are needed only to support that behaviour. It is derived from the equivalent system that is already in widespread use as part of setuptools and allows that aspect of the legacy setuptools metadata to be accurately represented in the new metadata format.
The additions to the extras syntax relative to setuptools are defined to make it easier to express the various possible combinations of dependencies, in particular those associated with build systems (with optional support for running the test suite) and development systems.
Support for metadata extensions
The new extension mechanism effectively allows sections of the metadata namespace to be delegated to other distributions, while preserving a standard overall metadata format for ease of processing by distribution tools that do not support a particular extension.
It also works well in combination with the new build_requires field to allow a distribution to depend on tools which do know how to handle the chosen extension, and the new extras mechanism, allowing support for particular extensions to be provided as optional features.
Possible future uses for extensions include declaration of plugins for other distributions and hints for automatic conversion to Linux system packages.
The ability to declare an extension as required is included primarily to allow the definition of the metadata hooks extension to be deferred until some time after the initial adoption of the metadata 2.0 specification. If a distribution needs a postinstall hook to run in order to complete the installation successfully, then earlier versions of tools should fall back to installing from source rather than installing from a wheel file and then failing to run the expected postinstall hook.
Changes to environment markers
There are three substantive changes to environment markers in this version:
- platform_release was added, as it provides more useful information than platform_version on at least Linux and Mac OS X (specifically, it provides details of the running kernel version)
- ordered comparison of strings is allowed, as this is more useful for setting minimum and maximum versions where conditional dependencies are needed or where a platform is supported
- comparison chaining is explicitly allowed, as this becomes useful in the presence of ordered comparisons
The other changes to environment markers are just clarifications and simplifications to make them easier to use.
The arbitrariness of the choice of . and _ in the different variables was addressed by standardising on _ (as these are all predefined variables rather than live references into the Python module namespace).
The use of parentheses for grouping was explicitly noted to address some underspecified behaviour in the previous version of the specification.
Updated contact information
This feature is provided by the python.project and python.integrator extensions in PEP 459.
The switch to JSON made it possible to provide a more flexible system for defining multiple contact points for a project, as well as listing other contributors.
The type concept allows for preservation of the distinction between the original author of a project, and a lead maintainer that takes over at a later date.
Changes to project URLs
This feature is provided by the python.project and python.integrator extensions in PEP 459.
In addition to allowing arbitrary strings as project URL labels, the new metadata standard also defines a recommended set of four URL labels for a distribution's home page, documentation, source control and issue tracker.
Changes to platform support
This feature is provided by the python.constraints extension in PEP 459.
The new environment marker system makes it possible to define supported platforms in a way that is actually amenable to automated processing. This has been used to replace several older fields with poorly defined semantics.
The constraints mechanism also allows additional information to be conveyed through metadata extensions and then checked for consistency at install time.
For the moment, the old Requires-External field has been removed entirely. The metadata extension mechanism will hopefully prove to be a more useful replacement.
Updated obsolescence mechanism
The marker to indicate when a project is obsolete and should be replaced has been moved to the obsolete project (the new obsoleted_by field), replacing the previous marker on the replacement project (the removed Obsoletes-Dist field).
This should allow distribution tools to more easily warn users of obsolete projects and their suggested replacements.
The Obsoletes-Dist header is removed rather than deprecated as it is not widely supported, and so removing it does not present any significant barrier to tools and projects adopting the new metadata format.
Included text documents
This feature is provided by the python.details extension in PEP 459.
Currently, PyPI attempts to determine the description's markup format by rendering it as reStructuredText, and if that fails, treating it as plain text.
Furthermore, many projects simply read their long description in from an existing README file in setup.py. The popularity of this practice is only expected to increase, as many online version control systems (including both GitHub and BitBucket) automatically display such files on the landing page for the project.
Standardising on the inclusion of the long description as a separate file in the dist-info directory allows this to be simplified:
- An existing file can just be copied into the dist-info directory as part of creating the sdist
- The expected markup format can be determined by inspecting the file extension of the specified path
Allowing the intended format to be stated explicitly in the path allows the format guessing to be removed and more informative error reports to be provided to users when a rendering error occurs.
This is especially helpful since PyPI applies additional restrictions to the rendering process for security reasons, thus a description that renders correctly on a developer's system may still fail to render on the server.
The document naming system used to achieve this then makes it relatively straightforward to allow declaration of alternative markup formats like HTML, Markdown and AsciiDoc through the use of appropriate file extensions, as well as to define similar included documents for the project's license and changelog.
Grouping the included document names into a single top level field gives automated tools the option of treating them as arbitrary documents without worrying about their contents.
Requiring that the included documents be added to the dist-info metadata directory means that the complete metadata for the distribution can be extracted from an sdist or binary archive simply by extracting that directory, without needing to check for references to other files in the sdist.
Appendix D: Deferred features
Several potentially useful features have been deliberately deferred in order to better prioritise our efforts in migrating to the new metadata standard. These all reflect information that may be nice to have in the new metadata, but which can be readily added in metadata 2.1 without breaking any use cases already supported by metadata 2.0.
Once the pypi, setuptools, pip, wheel and distlib projects support creation and consumption of metadata 2.0, then we may revisit the creation of metadata 2.1 with some or all of these additional features.
MIME type registration
At some point after acceptance of the PEP, we may submit the following MIME type registration requests to IANA:
- Full metadata: application/vnd.python.pydist+json
- Essential dependency resolution metadata: application/vnd.python.pydist-dependencies+json
It's even possible we may be able to just register the vnd.python namespace under the banner of the PSF rather than having to register the individual subformats.
String methods in environment markers
Supporting at least ".startswith" and ".endswith" string methods in environment markers would allow some conditions to be written more naturally. For example, "sys_platform.startswith('win')" would be a more intuitive way to mark Windows-specific dependencies: "'win' in sys_platform" is incorrect thanks to cygwin, and the fact that 64-bit Windows still shows up as win32 is more than a little strange.
Support for metadata hooks
While a draft proposal for a metadata hook system has been created, that proposal is not part of the initial set of standard metadata extensions in PEP 459.
A metadata hook system would allow the wheel format to fully replace direct installation on deployment targets, by allowing projects to explicitly define code that should be executed following installation from a wheel file.
This may be something relatively simple, like the two line refresh of the Twisted plugin caches that the Twisted developers recommend for any project that provides Twisted plugins, to more complex platform dependent behaviour, potentially in conjunction with appropriate metadata extensions and supports_environments entries.
For example, upstream declaration of external dependencies for various Linux distributions in a distribution neutral format may be supported by defining an appropriate metadata extension that is read by a postinstall hook and converted into an appropriate invocation of the system package manager. Other operations (such as registering COM DLLs on Windows, registering services for automatic startup on any platform, or altering firewall settings) may need to be undertaken with elevated privileges, meaning they cannot be deferred to implicit execution on first use of the distribution.
For the time being, any such system is being left to the realm of tool specific metadata extensions. This does mean that affected projects may choose not to publish wheel files, instead continuing to rely on source distributions until the relevant extension is well defined and widely supported.
Metabuild system
This version of the metadata specification continues to use setup.py and the distutils command syntax to invoke build and test related operations on a source archive or VCS checkout.
It may be desirable to replace these in the future with tool independent entry points that support:
- Generating the metadata file on a development system
- Generating an sdist on a development system
- Generating a binary archive on a build system
- Running the test suite on a built (but not installed) distribution
Metadata 2.0 deliberately focuses on wheel based installation, leaving sdist, source archive, and VCS checkout based installation to use the existing setup.py based distutils command interface.
In the meantime, the above operations will be handled through the distutils/setuptools command system:
- python setup.py dist_info
- python setup.py sdist
- python setup.py build_ext --inplace
- python setup.py test
- python setup.py bdist_wheel
The following metabuild hooks may be defined in metadata 2.1 to cover these operations without relying on setup.py:
- make_dist_info: generate the sdist's dist_info directory
- make_sdist: create the contents of an sdist
- build_dist: create the contents of a binary wheel archive from an unpacked sdist
- test_built_dist: run the test suite for a built distribution
Tentative signatures have been designed for those hooks, but in order to better focus initial development efforts on the integration and installation use cases, they will not be pursued further until metadata 2.1:
def make_dist_info(source_dir, info_dir):
    """Generate the contents of dist_info for an sdist archive

    *source_dir* points to a source checkout or unpacked tarball

    *info_dir* is the destination where the sdist metadata files should
    be written

    Returns the distribution metadata as a dictionary.
    """

def make_sdist(source_dir, contents_dir, info_dir):
    """Generate the contents of an sdist archive

    *source_dir* points to a source checkout or unpacked tarball

    *contents_dir* is the destination where the sdist contents should be
    written (note that archiving the contents is the responsibility of
    the metabuild tool rather than the hook function)

    *info_dir* is the destination where the sdist metadata files should
    be written

    Returns the distribution metadata as a dictionary.
    """

def build_dist(sdist_dir, built_dir, info_dir, compatibility=None):
    """Generate the contents of a binary wheel archive

    *sdist_dir* points to an unpacked sdist

    *built_dir* is the destination where the wheel contents should be
    written (note that archiving the contents is the responsibility of
    the metabuild tool rather than the hook function)

    *info_dir* is the destination where the wheel metadata files should
    be written

    *compatibility* is an optional PEP 425 compatibility tag indicating
    the desired target compatibility for the build. If the tag cannot
    be satisfied, the hook should throw ``ValueError``.

    Returns the actual compatibility tag for the build
    """

def test_built_dist(sdist_dir, built_dir, info_dir):
    """Check a built (but not installed) distribution works as expected

    *sdist_dir* points to an unpacked sdist

    *built_dir* points to a platform appropriate unpacked wheel archive
    (which may be missing the wheel metadata directory)

    *info_dir* points to the appropriate wheel metadata directory

    Requires that the distribution's test dependencies be installed
    (indicated by the ``:test:`` extra).

    Returns ``True`` if the check passes, ``False`` otherwise.
    """
As with the existing install hooks, checking for extras would be done using the same import based checks as are used for runtime extras. That way it doesn't matter if the additional dependencies were requested explicitly or just happen to be available on the system.
There are still a number of open questions with this design, such as whether a single build hook is sufficient to cover both "build for testing" and "prep for deployment", as well as various complexities like support for cross-compilation of binaries, specification of target platforms and Python versions when creating wheel files, etc.
Opting to retain the status quo for now allows us to make progress on improved metadata publication and binary installation support, rather than having to delay that awaiting the creation of a viable metabuild framework.
Appendix E: Rejected features
The following features have been explicitly considered and rejected as introducing too much additional complexity for too small a gain in expressiveness.
Separate lists for conditional and unconditional dependencies
Earlier versions of this PEP used separate lists for conditional and unconditional dependencies. This turned out to be annoying to handle in automated tools and removing it also made the PEP and metadata schema substantially shorter, suggesting it was actually harder to explain as well.
Disallowing underscores in distribution names
Debian doesn't actually permit underscores in names, but that seems unduly restrictive for this spec given the common practice of using valid Python identifiers as Python distribution names. A Debian side policy of converting underscores to hyphens seems easy enough to implement (and the requirement to consider hyphens and underscores as equivalent ensures that doing so won't introduce any conflicts).
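The hyphen/underscore equivalence mentioned above amounts to a one-line normalization step. The normalize_name helper below is purely illustrative (it is not defined by this specification):

```python
import re

def normalize_name(name):
    # Hypothetical helper: collapse runs of hyphens and underscores to a
    # single hyphen and lowercase the result, so that names which the spec
    # requires to be treated as equivalent compare equal as strings.
    return re.sub(r"[-_]+", "-", name).lower()
```

Under this scheme, comparing normalized forms treats "my_distribution" and "my-distribution" as the same project, which is what makes a Debian-side underscore-to-hyphen conversion conflict-free.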
Allowing the use of Unicode in distribution names
This PEP deliberately avoids following Python 3 down the path of arbitrary Unicode identifiers, as the security implications of doing so are substantially worse in the software distribution use case (it opens up far more interesting attack vectors than mere code obfuscation).
In addition, the existing tools really only work properly if you restrict names to ASCII and changing that would require a lot of work for all the automated tools in the chain.
It may be reasonable to revisit this question at some point in the (distant) future, but setting up a more reliable software distribution system is challenging enough without adding more general Unicode identifier support into the mix.
Single list for conditional and unconditional dependencies
It's technically possible to store the conditional and unconditional dependencies of each kind in a single list and switch the handling based on the entry type (string or mapping).
However, the current two-list design (*requires vs. *may-require) seems easier to understand and work with, since only the conditional dependencies need to be checked against the requested extras list and the target installation environment.
Depending on source labels
There is no mechanism to express a dependency on a source label - they are included in the metadata for internal project reference only. Instead, dependencies must be expressed in terms of either public versions or else direct URL references.
Alternative dependencies
An earlier draft of this PEP considered allowing lists in place of the usual strings in dependency specifications to indicate that there are multiple ways to satisfy a dependency.
If at least one of the individual dependencies was already available, then the entire dependency would be considered satisfied, otherwise the first entry would be added to the dependency set.
Alternative dependency specification example:
["Pillow", "PIL"]
["mysql", "psycopg2 >= 4", "sqlite3"]
However, neither of the given examples is particularly compelling, since Pillow/PIL style forks aren't common, and the database driver use case would arguably be better served by an SQL Alchemy defined "supported database driver" metadata extension where a project depends on SQL Alchemy, and then declares in the extension which database drivers are checked for compatibility by the upstream project (similar to the advisory supports_environments field in the main metadata).
We're also getting better support for "virtual provides" in this version of the metadata standard, so this may end up being an installer and index server problem to better track and publish those.
Compatible release comparisons in environment markers
PEP 440 defines a rich syntax for version comparisons that could potentially be useful with python_version and python_full_version in environment markers. However, allowing the full syntax would mean environment markers are no longer a Python subset, while allowing only some of the comparisons would introduce yet another special case to handle.
Given that environment markers are only used in cases where a higher level "or" is implied by the metadata structure, it seems easier to require the use of multiple comparisons against specific Python versions for the rare cases where this would be useful.
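Because environment markers are a restricted Python subset, the effect of spelling out specific versions can be sketched by evaluating a marker against a candidate environment. The marker text and evaluate_marker helper here are illustrative only:

```python
def evaluate_marker(marker, environment):
    # Environment markers are a restricted Python expression subset, so a
    # sketch (NOT a hardened implementation) can evaluate one with eval()
    # against the marker variables, with builtins disabled.
    return eval(marker, {"__builtins__": {}}, dict(environment))

# Explicit comparisons standing in for a compatible release clause
# such as python_version ~= "2.6":
marker = "python_version == '2.6' or python_version == '2.7'"
```

A real tool would use a proper marker parser rather than eval(), but the "higher level or" structure of the metadata is why a handful of explicit comparisons suffices here.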
Conditional provides
Under the revised metadata design, conditional "provides" based on runtime features or the environment would go in a separate "may_provide" field. However, it isn't clear there's any use case for doing that, so the idea is rejected unless someone can present a compelling use case (and even then the idea won't be reconsidered until metadata 2.1 at the earliest).
References
This document specifies version 2.0 of the metadata format. Version 1.0 is specified in PEP 241. Version 1.1 is specified in PEP 314. Version 1.2 is specified in PEP 345.
The initial attempt at a standardised version scheme, along with the justifications for needing such a standard can be found in PEP 386.
| [1] | reStructuredText markup: http://docutils.sourceforge.net/ |
| [2] | PEP 301: http://www.python.org/dev/peps/pep-0301/ |
| [3] | (1, 2, 3) http://pypi.python.org/pypi/ |
| [4] | https://www.youtube.com/watch?v=CSe38dzJYkY |
Copyright
This document has been placed in the public domain.
pep-0427 The Wheel Binary Package Format 1.0
| PEP: | 427 |
|---|---|
| Title: | The Wheel Binary Package Format 1.0 |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Daniel Holth <dholth at gmail.com> |
| BDFL-Delegate: | Nick Coghlan <ncoghlan@gmail.com> |
| Discussions-To: | <distutils-sig at python.org> |
| Status: | Accepted |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 20-Sep-2012 |
| Post-History: | 18-Oct-2012, 15-Feb-2013 |
| Resolution: | http://mail.python.org/pipermail/python-dev/2013-February/124103.html |
Contents
- Abstract
- PEP Acceptance
- Rationale
- Details
- FAQ
- Wheel defines a .data directory. Should I put all my data there?
- Why does wheel include attached signatures?
- Why does wheel allow JWS signatures?
- Why does wheel also allow S/MIME signatures?
- What's the deal with "purelib" vs. "platlib"?
- Is it possible to import Python code directly from a wheel file?
- References
- Appendix
- Copyright
Abstract
This PEP describes a built-package format for Python called "wheel".
A wheel is a ZIP-format archive with a specially formatted file name and the .whl extension. It contains a single distribution nearly as it would be installed according to PEP 376 with a particular installation scheme. Although a specialized installer is recommended, a wheel file may be installed by simply unpacking into site-packages with the standard 'unzip' tool while preserving enough information to spread its contents out onto their final paths at any later time.
PEP Acceptance
This PEP was accepted, and the defined wheel version updated to 1.0, by Nick Coghlan on 16th February, 2013 [1].
Rationale
Python needs a package format that is easier to install than sdist. Python's sdist packages are defined by and require the distutils and setuptools build systems, running arbitrary code to build, install, and re-compile code just so it can be installed into a new virtualenv. This conflation of build and install is slow, hard to maintain, and hinders innovation in both build systems and installers.
Wheel attempts to remedy these problems by providing a simpler interface between the build system and the installer. The wheel binary package format frees installers from having to know about the build system, saves time by amortizing compile time over many installations, and removes the need to install a build system in the target environment.
Details
Installing a wheel 'distribution-1.0-py32-none-any.whl'
Wheel installation notionally consists of two phases:
- Unpack.
- Parse distribution-1.0.dist-info/WHEEL.
- Check that installer is compatible with Wheel-Version. Warn if minor version is greater, abort if major version is greater.
- If Root-Is-Purelib == 'true', unpack archive into purelib (site-packages).
- Else unpack archive into platlib (site-packages).
- Spread.
- Unpacked archive includes distribution-1.0.dist-info/ and (if there is data) distribution-1.0.data/.
- Move each subtree of distribution-1.0.data/ onto its destination path. Each subdirectory of distribution-1.0.data/ is a key into a dict of destination directories, such as distribution-1.0.data/(purelib|platlib|headers|scripts|data). The initially supported paths are taken from distutils.command.install.
- If applicable, update scripts starting with #!python to point to the correct interpreter.
- Update distribution-1.0.dist-info/RECORD with the installed paths.
- Remove empty distribution-1.0.data directory.
- Compile any installed .py to .pyc. (Uninstallers should be smart enough to remove .pyc even if it is not mentioned in RECORD.)
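The "Unpack" phase above can be sketched with the standard zipfile module. This is a minimal illustration, not a production installer: it omits the "Spread" phase, script rewriting, RECORD updates, and bytecode compilation, and the dist-info directory name is passed in rather than derived from the filename:

```python
import zipfile

def unpack_wheel(wheel_path, purelib, platlib, distinfo):
    """Sketch of wheel unpacking: parse WHEEL, check the version, extract."""
    with zipfile.ZipFile(wheel_path) as zf:
        # Parse the key: value pairs in {distribution}-{version}.dist-info/WHEEL.
        wheel_meta = zf.read(distinfo + "/WHEEL").decode("utf-8")
        headers = dict(line.split(": ", 1)
                       for line in wheel_meta.splitlines() if line)
        # Abort if the major version is greater than we support (1.x here).
        major = int(headers["Wheel-Version"].split(".")[0])
        if major > 1:
            raise ValueError("unsupported Wheel-Version: %s"
                             % headers["Wheel-Version"])
        # Root-Is-Purelib selects the destination site-packages directory.
        target = purelib if headers["Root-Is-Purelib"] == "true" else platlib
        zf.extractall(target)
        return target
```

Note that repeated keys such as Tag would collapse in the naive dict used here; a real installer keeps them as a list.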
Recommended installer features
- Rewrite #!python.
In wheel, scripts are packaged in {distribution}-{version}.data/scripts/. If the first line of a file in scripts/ starts with exactly b'#!python', rewrite to point to the correct interpreter. Unix installers may need to add the +x bit to these files if the archive was created on Windows.
The b'#!pythonw' convention is also allowed; it indicates a GUI script rather than a console script.
- Generate script wrappers.
- In wheel, scripts packaged on Unix systems will certainly not have accompanying .exe wrappers. Windows installers may want to add them during install.
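The #!python rewrite can be sketched as follows, assuming the target interpreter path is already known (a real installer also handles the b'#!pythonw' variant and the executable bit):

```python
def rewrite_shebang(script_path, interpreter):
    # Replace an exact b'#!python' first line with the real interpreter path.
    with open(script_path, "rb") as f:
        data = f.read()
    first, sep, rest = data.partition(b"\n")
    if first == b"#!python":
        with open(script_path, "wb") as f:
            f.write(b"#!" + interpreter.encode("utf-8") + sep + rest)
        return True
    return False
```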
Recommended archiver features
- Place .dist-info at the end of the archive.
- Archivers are encouraged to place the .dist-info files physically at the end of the archive. This enables some potentially interesting ZIP tricks including the ability to amend the metadata without rewriting the entire archive.
File Format
File name convention
The wheel filename is {distribution}-{version}(-{build tag})?-{python tag}-{abi tag}-{platform tag}.whl.
- distribution
- Distribution name, e.g. 'django', 'pyramid'.
- version
- Distribution version, e.g. 1.0.
- build tag
- Optional build number. Must start with a digit. A tie breaker if two wheels have the same version. Sort as the empty string if unspecified, else sort the initial digits as a number, and the remainder lexicographically.
- language implementation and version tag
- E.g. 'py27', 'py2', 'py3'.
- abi tag
- E.g. 'cp33m', 'abi3', 'none'.
- platform tag
- E.g. 'linux_x86_64', 'any'.
For example, distribution-1.0-1-py27-none-any.whl is the first build of a package called 'distribution', and is compatible with Python 2.7 (any Python 2.7 implementation), with no ABI (pure Python), on any CPU architecture.
The last three components of the filename before the extension are called "compatibility tags." The compatibility tags express the package's basic interpreter requirements and are detailed in PEP 425.
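Because the escaping convention (next section) guarantees that no component contains a hyphen, the filename can be split naively on hyphens, and the build tag ordering described above reduces to a small sort key. Both helpers are illustrative sketches, not part of the specification:

```python
import re

def parse_wheel_filename(filename):
    # {distribution}-{version}(-{build tag})?-{python tag}-{abi tag}-{platform tag}.whl
    stem = filename[:-len(".whl")]
    parts = stem.split("-")
    if len(parts) == 6:
        return tuple(parts)
    dist, version, python, abi, platform = parts
    return (dist, version, "", python, abi, platform)

def build_tag_key(build):
    # Empty build tag sorts first; otherwise the leading digits compare
    # numerically and the remainder compares lexicographically.
    if not build:
        return ()
    digits, rest = re.match(r"(\d+)(.*)", build).groups()
    return (int(digits), rest)
```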
Escaping and Unicode
Each component of the filename is escaped by replacing runs of non-alphanumeric characters with an underscore _:
re.sub(r"[^\w\d.]+", "_", distribution, flags=re.UNICODE)
The archive filename is Unicode. It will be some time before the tools are updated to support non-ASCII filenames, but they are supported in this specification.
The filenames inside the archive are encoded as UTF-8. Although some ZIP clients in common use do not properly display UTF-8 filenames, the encoding is supported by both the ZIP specification and Python's zipfile.
File contents
The contents of a wheel file, where {distribution} is replaced with the name of the package, e.g. beaglevote and {version} is replaced with its version, e.g. 1.0.0, consist of:
/, the root of the archive, contains all files to be installed in purelib or platlib as specified in WHEEL. purelib and platlib are usually both site-packages.
{distribution}-{version}.dist-info/ contains metadata.
{distribution}-{version}.data/ contains one subdirectory for each non-empty install scheme key not already covered, where the subdirectory name is an index into a dictionary of install paths (e.g. data, scripts, include, purelib, platlib).
Python scripts must appear in scripts and begin with exactly b'#!python' in order to enjoy script wrapper generation and #!python rewriting at install time. They may have any or no extension.
{distribution}-{version}.dist-info/METADATA is Metadata version 1.1 or greater format metadata.
{distribution}-{version}.dist-info/WHEEL is metadata about the archive itself in the same basic key: value format:
Wheel-Version: 1.0
Generator: bdist_wheel 1.0
Root-Is-Purelib: true
Tag: py2-none-any
Tag: py3-none-any
Build: 1
Wheel-Version is the version number of the Wheel specification.
Generator is the name and optionally the version of the software that produced the archive.
Root-Is-Purelib is true if the top level directory of the archive should be installed into purelib; otherwise the root should be installed into platlib.
Tag is the wheel's expanded compatibility tags; in the example the filename would contain py2.py3-none-any.
Build is the build number and is omitted if there is no build number.
A wheel installer should warn if Wheel-Version is greater than the version it supports, and must fail if Wheel-Version has a greater major version than the version it supports.
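That warn-on-minor, fail-on-major rule can be captured in a few lines. The supported version here is assumed to be 1.0:

```python
import warnings

SUPPORTED_WHEEL_VERSION = (1, 0)  # assumed installer capability

def check_wheel_version(wheel_version):
    # Fail on a newer major version, warn on a newer minor version.
    parts = tuple(int(x) for x in wheel_version.split("."))
    if parts[0] > SUPPORTED_WHEEL_VERSION[0]:
        raise ValueError("Wheel-Version %s not supported" % wheel_version)
    if parts > SUPPORTED_WHEEL_VERSION:
        warnings.warn("Wheel-Version %s newer than supported %d.%d"
                      % ((wheel_version,) + SUPPORTED_WHEEL_VERSION))
```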
Wheel, being an installation format that is intended to work across multiple versions of Python, does not generally include .pyc files.
Wheel does not contain setup.py or setup.cfg.
This version of the wheel specification is based on the distutils install schemes and does not define how to install files to other locations. The layout offers a superset of the functionality provided by the existing wininst and egg binary formats.
The .dist-info directory
- Wheel .dist-info directories include at a minimum METADATA, WHEEL, and RECORD.
- METADATA is the package metadata, the same format as PKG-INFO as found at the root of sdists.
- WHEEL is the wheel metadata specific to a build of the package.
- RECORD is a list of (almost) all the files in the wheel and their secure hashes. Unlike PEP 376, every file except RECORD, which cannot contain a hash of itself, must include its hash. The hash algorithm must be sha256 or better; specifically, md5 and sha1 are not permitted, as signed wheel files rely on the strong hashes in RECORD to validate the integrity of the archive.
- PEP 376's INSTALLER and REQUESTED are not included in the archive.
- RECORD.jws is used for digital signatures. It is not mentioned in RECORD.
- RECORD.p7s is allowed as a courtesy to anyone who would prefer to use S/MIME signatures to secure their wheel files. It is not mentioned in RECORD.
- During extraction, wheel installers verify all the hashes in RECORD against the file contents. Apart from RECORD and its signatures, installation will fail if any file in the archive is not both mentioned and correctly hashed in RECORD.
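The hash format (detailed under "Signed wheel files" below) is a sha256 digest in urlsafe base64 with the padding stripped. Verifying a single RECORD entry against file contents can be sketched as follows; record_hash and verify_record_entry are illustrative names:

```python
import base64
import hashlib

def record_hash(data):
    # sha256 digest, urlsafe-base64 encoded with trailing '=' stripped,
    # in the "digestname=..." form used as RECORD's second column.
    digest = hashlib.sha256(data).digest()
    return "sha256=" + base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")

def verify_record_entry(entry, contents):
    # entry is one CSV row from RECORD: path,hash,size
    path, expected_hash, size = entry.rsplit(",", 2)
    if path.endswith("/RECORD"):
        return True  # RECORD cannot contain its own hash
    return expected_hash == record_hash(contents) and int(size) == len(contents)
```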
The .data directory
Any file that is not normally installed inside site-packages goes into the .data directory, named as the .dist-info directory but with the .data/ extension:
distribution-1.0.dist-info/
distribution-1.0.data/
The .data directory contains subdirectories with the scripts, headers, documentation and so forth from the distribution. During installation the contents of these subdirectories are moved onto their destination paths.
Signed wheel files
Wheel files include an extended RECORD that enables digital signatures. PEP 376's RECORD is altered to include a secure hash, digestname=urlsafe_b64encode_nopad(digest) (urlsafe base64 encoding with no trailing = characters), as the second column instead of an md5sum. All possible entries are hashed, including any generated files such as .pyc files, but not RECORD, which cannot contain its own hash. For example:
file.py,sha256=AVTFPZpEKzuHr7OvQZmhaU3LvwKz06AJw8mT_pNh2yI,3144
distribution-1.0.dist-info/RECORD,,
The signature file(s) RECORD.jws and RECORD.p7s are not mentioned in RECORD at all since they can only be added after RECORD is generated. Every other file in the archive must have a correct hash in RECORD or the installation will fail.
If JSON web signatures are used, one or more JSON Web Signature JSON Serialization (JWS-JS) signatures is stored in a file RECORD.jws adjacent to RECORD. JWS is used to sign RECORD by including the SHA-256 hash of RECORD as the signature's JSON payload:
{ "hash": "sha256=ADD-r2urObZHcxBW3Cr-vDCu5RJwT4CaRTHiFmbcIYY" }
(The hash value is the same format used in RECORD.)
If RECORD.p7s is used, it must contain a detached S/MIME format signature of RECORD.
A wheel installer is not required to understand digital signatures but MUST verify the hashes in RECORD against the extracted file contents. When the installer checks file hashes against RECORD, a separate signature checker only needs to establish that RECORD matches the signature.
Comparison to .egg
- Wheel is an installation format; egg is importable. Wheel archives do not need to include .pyc and are less tied to a specific Python version or implementation. Wheel can install (pure Python) packages built with previous versions of Python so you don't always have to wait for the packager to catch up.
- Wheel uses .dist-info directories; egg uses .egg-info. Wheel is compatible with the new world of Python packaging and the new concepts it brings.
- Wheel has a richer file naming convention for today's multi-implementation world. A single wheel archive can indicate its compatibility with a number of Python language versions and implementations, ABIs, and system architectures. Historically the ABI has been specific to a CPython release; wheel is ready for the stable ABI.
- Wheel is lossless. The first wheel implementation, bdist_wheel, always generates egg-info and then converts it to a .whl. It is also possible to convert existing eggs and bdist_wininst distributions.
- Wheel is versioned. Every wheel file contains the version of the wheel specification and the implementation that packaged it. Hopefully the next migration can simply be to Wheel 2.0.
- Wheel is a reference to the other Python.
FAQ
Wheel defines a .data directory. Should I put all my data there?
This specification does not have an opinion on how you should organize your code. The .data directory is just a place for any files that are not normally installed inside site-packages or on the PYTHONPATH. In other words, you may continue to use pkgutil.get_data(package, resource) even though those files will usually not be distributed in wheel's .data directory.
Why does wheel include attached signatures?
Attached signatures are more convenient than detached signatures because they travel with the archive. Since only the individual files are signed, the archive can be recompressed without invalidating the signature or individual files can be verified without having to download the whole archive.
Why does wheel allow JWS signatures?
The JOSE specifications of which JWS is a part are designed to be easy to implement, a feature that is also one of wheel's primary design goals. JWS yields a useful, concise pure-Python implementation.
Why does wheel also allow S/MIME signatures?
S/MIME signatures are allowed for users who need or want to use existing public key infrastructure with wheel.
Signed packages are only a basic building block in a secure package update system. Wheel only provides the building block.
What's the deal with "purelib" vs. "platlib"?
Wheel preserves the "purelib" vs. "platlib" distinction, which is significant on some platforms. For example, Fedora installs pure Python packages to '/usr/lib/pythonX.Y/site-packages' and platform dependent packages to '/usr/lib64/pythonX.Y/site-packages'.
A wheel with "Root-Is-Purelib: false" with all its files in {name}-{version}.data/purelib is equivalent to a wheel with "Root-Is-Purelib: true" with those same files in the root, and it is legal to have files in both the "purelib" and "platlib" categories.
In practice a wheel should have only one of "purelib" or "platlib" depending on whether it is pure Python or not, and those files should be at the root with the appropriate setting given for "Root-Is-Purelib".
Is it possible to import Python code directly from a wheel file?
Technically, due to the combination of supporting installation via simple extraction and using an archive format that is compatible with zipimport, a subset of wheel files do support being placed directly on sys.path. However, while this behaviour is a natural consequence of the format design, actually relying on it is generally discouraged.
Firstly, wheel is designed primarily as a distribution format, so skipping the installation step also means deliberately avoiding any reliance on features that assume full installation (such as being able to use standard tools like pip and virtualenv to capture and manage dependencies in a way that can be properly tracked for auditing and security update purposes, or integrating fully with the standard build machinery for C extensions by publishing header files in the appropriate place).
Secondly, while some Python software is written to support running directly from a zip archive, it is still common for code to be written assuming it has been fully installed. When that assumption is broken by trying to run the software from a zip archive, the failures can often be obscure and hard to diagnose (especially when they occur in third party libraries). The two most common sources of problems with this are the fact that importing C extensions from a zip archive is not supported by CPython (since doing so is not supported directly by the dynamic loading machinery on any platform) and that when running from a zip archive the __file__ attribute no longer refers to an ordinary filesystem path, but to a combination path that includes both the location of the zip archive on the filesystem and the relative path to the module inside the archive. Even when software correctly uses the abstract resource APIs internally, interfacing with external components may still require the availability of an actual on-disk file.
Like metaclasses, monkeypatching and metapath importers, if you're not already sure you need to take advantage of this feature, you almost certainly don't need it. If you do decide to use it anyway, be aware that many projects will require a failure to be reproduced with a fully installed package before accepting it as a genuine bug.
References
| [1] | PEP acceptance (http://mail.python.org/pipermail/python-dev/2013-February/124103.html) |
Appendix
Example urlsafe-base64-nopad implementation:
# urlsafe-base64-nopad for Python 3
import base64

def urlsafe_b64encode_nopad(data):
    return base64.urlsafe_b64encode(data).rstrip(b'=')

def urlsafe_b64decode_nopad(data):
    pad = b'=' * (-len(data) % 4)
    return base64.urlsafe_b64decode(data + pad)
Copyright
This document has been placed into the public domain.
pep-0428 The pathlib module -- object-oriented filesystem paths
| PEP: | 428 |
|---|---|
| Title: | The pathlib module -- object-oriented filesystem paths |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Antoine Pitrou <solipsis at pitrou.net> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 30-July-2012 |
| Python-Version: | 3.4 |
| Post-History: | http://mail.python.org/pipermail/python-ideas/2012-October/016338.html |
| Resolution: | https://mail.python.org/pipermail/python-dev/2013-November/130424.html |
Contents
Abstract
This PEP proposes the inclusion of a third-party module, pathlib [1], in the standard library. The inclusion is proposed under the provisional label, as described in PEP 411. Therefore, API changes can be done, either as part of the PEP process, or after acceptance in the standard library (and until the provisional label is removed).
The aim of this library is to provide a simple hierarchy of classes to handle filesystem paths and the common operations users do over them.
Implementation
The implementation of this proposal is tracked in the pep428 branch of pathlib's Mercurial repository [6].
Why an object-oriented API
The rationale to represent filesystem paths using dedicated classes is the same as for other kinds of stateless objects, such as dates, times or IP addresses. Python has been slowly moving away from strictly replicating the C language's APIs to providing better, more helpful abstractions around all kinds of common functionality. Even if this PEP isn't accepted, it is likely that another form of filesystem handling abstraction will be adopted one day into the standard library.
Indeed, many people will prefer handling dates and times using the high-level objects provided by the datetime module, rather than using numeric timestamps and the time module API. Moreover, using a dedicated class makes it possible to enable desirable behaviours by default, for example the case insensitivity of Windows paths.
Proposal
Class hierarchy
The pathlib [1] module implements a simple hierarchy of classes:
+----------+
| |
---------| PurePath |--------
| | | |
| +----------+ |
| | |
| | |
v | v
+---------------+ | +-----------------+
| | | | |
| PurePosixPath | | | PureWindowsPath |
| | | | |
+---------------+ | +-----------------+
| v |
| +------+ |
| | | |
| -------| Path |------ |
| | | | | |
| | +------+ | |
| | | |
| | | |
v v v v
+-----------+ +-------------+
| | | |
| PosixPath | | WindowsPath |
| | | |
+-----------+ +-------------+
This hierarchy divides path classes along two dimensions:
- a path class can be either pure or concrete: pure classes support only operations that don't need to do any actual I/O, which are most path manipulation operations; concrete classes support all the operations of pure classes, plus operations that do I/O.
- a path class is of a given flavour according to the kind of operating system paths it represents. pathlib [1] implements two flavours: Windows paths for the filesystem semantics embodied in Windows systems, POSIX paths for other systems.
Any pure class can be instantiated on any system: for example, you can manipulate PurePosixPath objects under Windows, PureWindowsPath objects under Unix, and so on. However, concrete classes can only be instantiated on a matching system: indeed, it would be error-prone to start doing I/O with WindowsPath objects under Unix, or vice-versa.
Furthermore, there are two base classes which also act as system-dependent factories: PurePath will instantiate either a PurePosixPath or a PureWindowsPath depending on the operating system. Similarly, Path will instantiate either a PosixPath or a WindowsPath.
It is expected that, in most uses, using the Path class is adequate, which is why it has the shortest name of all.
No confusion with builtins
In this proposal, the path classes do not derive from a builtin type. This contrasts with some other Path class proposals which were derived from str. They also do not pretend to implement the sequence protocol: if you want a path to act as a sequence, you have to look up a dedicated attribute (the parts attribute).
Not behaving like one of the basic builtin types also minimizes the potential for confusion if a path is combined by accident with genuine builtin types.
Immutability
Path objects are immutable, which makes them hashable and also prevents a class of programming errors.
Sane behaviour
Little of the functionality from os.path is reused. Many os.path functions are tied by backwards compatibility to confusing or plain wrong behaviour (for example, the fact that os.path.abspath() simplifies ".." path components without resolving symlinks first).
Comparisons
Paths of the same flavour are comparable and orderable, whether pure or not:
>>> PurePosixPath('a') == PurePosixPath('b')
False
>>> PurePosixPath('a') < PurePosixPath('b')
True
>>> PurePosixPath('a') == PosixPath('a')
True
Comparing and ordering Windows path objects is case-insensitive:
>>> PureWindowsPath('a') == PureWindowsPath('A')
True
Paths of different flavours always compare unequal, and cannot be ordered:
>>> PurePosixPath('a') == PureWindowsPath('a')
False
>>> PurePosixPath('a') < PureWindowsPath('a')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unorderable types: PurePosixPath() < PureWindowsPath()
Paths compare unequal to, and are not orderable with, instances of builtin types (such as str) and any other types.
Useful notations
The API tries to provide useful notations all the while avoiding magic. Some examples:
>>> p = Path('/home/antoine/pathlib/setup.py')
>>> p.name
'setup.py'
>>> p.suffix
'.py'
>>> p.root
'/'
>>> p.parts
('/', 'home', 'antoine', 'pathlib', 'setup.py')
>>> p.relative_to('/home/antoine')
PosixPath('pathlib/setup.py')
>>> p.exists()
True
Pure paths API
The philosophy of the PurePath API is to provide a consistent array of useful path manipulation operations, without exposing a hodge-podge of functions like os.path does.
Definitions
First a couple of conventions:
- All paths can have a drive and a root. For POSIX paths, the drive is always empty.
- A relative path has neither drive nor root.
- A POSIX path is absolute if it has a root. A Windows path is absolute if it has both a drive and a root. A Windows UNC path (e.g. \\host\share\myfile.txt) always has a drive and a root (here, \\host\share and \, respectively).
- A path which has either a drive or a root is said to be anchored. Its anchor is the concatenation of the drive and root. Under POSIX, "anchored" is the same as "absolute".
Construction
We will present construction and joining together since they expose similar semantics.
The simplest way to construct a path is to pass it its string representation:
>>> PurePath('setup.py')
PurePosixPath('setup.py')
Extraneous path separators and "." components are eliminated:
>>> PurePath('a///b/c/./d/')
PurePosixPath('a/b/c/d')
If you pass several arguments, they will be automatically joined:
>>> PurePath('docs', 'Makefile')
PurePosixPath('docs/Makefile')
Joining semantics are similar to os.path.join, in that anchored paths ignore the information from the previously joined components:
>>> PurePath('/etc', '/usr', 'bin')
PurePosixPath('/usr/bin')
However, with Windows paths, the drive is retained as necessary:
>>> PureWindowsPath('c:/foo', '/Windows')
PureWindowsPath('c:/Windows')
>>> PureWindowsPath('c:/foo', 'd:')
PureWindowsPath('d:')
Also, path separators are normalized to the platform default:
>>> PureWindowsPath('a/b') == PureWindowsPath('a\\b')
True
Extraneous path separators and "." components are eliminated, but not ".." components:
>>> PurePosixPath('a//b/./c/')
PurePosixPath('a/b/c')
>>> PurePosixPath('a/../b')
PurePosixPath('a/../b')
Multiple leading slashes are treated differently depending on the path flavour. They are always retained on Windows paths (because of the UNC notation):
>>> PureWindowsPath('//some/path')
PureWindowsPath('//some/path/')
On POSIX, they are collapsed except if there are exactly two leading slashes, which is a special case in the POSIX specification on pathname resolution [7] (this is also necessary for Cygwin compatibility):
>>> PurePosixPath('///some/path')
PurePosixPath('/some/path')
>>> PurePosixPath('//some/path')
PurePosixPath('//some/path')
Calling the constructor without any argument creates a path object pointing to the logical "current directory" (without looking up its absolute path, which is the job of the cwd() classmethod on concrete paths):
>>> PurePosixPath()
PurePosixPath('.')
Representing
To represent a path (e.g. to pass it to third-party libraries), just call str() on it:
>>> p = PurePath('/home/antoine/pathlib/setup.py')
>>> str(p)
'/home/antoine/pathlib/setup.py'
>>> p = PureWindowsPath('c:/windows')
>>> str(p)
'c:\\windows'
To force the string representation with forward slashes, use the as_posix() method:
>>> p.as_posix()
'c:/windows'
To get the bytes representation (which might be useful under Unix systems), call bytes() on it, which internally uses os.fsencode():
>>> bytes(p)
b'/home/antoine/pathlib/setup.py'
To represent the path as a file: URI, call the as_uri() method:
>>> p = PurePosixPath('/etc/passwd')
>>> p.as_uri()
'file:///etc/passwd'
>>> p = PureWindowsPath('c:/Windows')
>>> p.as_uri()
'file:///c:/Windows'
The repr() of a path always uses forward slashes, even under Windows, for readability and to remind users that forward slashes are ok:
>>> p = PureWindowsPath('c:/Windows')
>>> p
PureWindowsPath('c:/Windows')
Properties
Several simple properties are provided on every path (each can be empty):
>>> p = PureWindowsPath('c:/Downloads/pathlib.tar.gz')
>>> p.drive
'c:'
>>> p.root
'\\'
>>> p.anchor
'c:\\'
>>> p.name
'pathlib.tar.gz'
>>> p.stem
'pathlib.tar'
>>> p.suffix
'.gz'
>>> p.suffixes
['.tar', '.gz']
Deriving new paths
Joining
A path can be joined with another using the / operator:
>>> p = PurePosixPath('foo')
>>> p / 'bar'
PurePosixPath('foo/bar')
>>> p / PurePosixPath('bar')
PurePosixPath('foo/bar')
>>> 'bar' / p
PurePosixPath('bar/foo')
As with the constructor, multiple path components can be specified, either collapsed or separately:
>>> p / 'bar/xyzzy'
PurePosixPath('foo/bar/xyzzy')
>>> p / 'bar' / 'xyzzy'
PurePosixPath('foo/bar/xyzzy')
A joinpath() method is also provided, with the same behaviour:
>>> p.joinpath('Python')
PurePosixPath('foo/Python')
Changing the path's final component
The with_name() method returns a new path, with the name changed:
>>> p = PureWindowsPath('c:/Downloads/pathlib.tar.gz')
>>> p.with_name('setup.py')
PureWindowsPath('c:/Downloads/setup.py')
It fails with a ValueError if the path doesn't have an actual name:
>>> p = PureWindowsPath('c:/')
>>> p.with_name('setup.py')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pathlib.py", line 875, in with_name
raise ValueError("%r has an empty name" % (self,))
ValueError: PureWindowsPath('c:/') has an empty name
>>> p.name
''
The with_suffix() method returns a new path with the suffix changed. However, if the path has no suffix, the new suffix is added:
>>> p = PureWindowsPath('c:/Downloads/pathlib.tar.gz')
>>> p.with_suffix('.bz2')
PureWindowsPath('c:/Downloads/pathlib.tar.bz2')
>>> p = PureWindowsPath('README')
>>> p.with_suffix('.bz2')
PureWindowsPath('README.bz2')
Making the path relative
The relative_to() method computes the relative difference of a path to another:
>>> PurePosixPath('/usr/bin/python').relative_to('/usr')
PurePosixPath('bin/python')
ValueError is raised if the method cannot return a meaningful value:
>>> PurePosixPath('/usr/bin/python').relative_to('/etc')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pathlib.py", line 926, in relative_to
.format(str(self), str(formatted)))
ValueError: '/usr/bin/python' does not start with '/etc'
Sequence-like access
The parts property returns a tuple providing read-only sequence access to a path's components:
>>> p = PurePosixPath('/etc/init.d')
>>> p.parts
('/', 'etc', 'init.d')
Windows paths handle the drive and the root as a single path component:
>>> p = PureWindowsPath('c:/setup.py')
>>> p.parts
('c:\\', 'setup.py')
(separating them would be wrong, since C: is not the parent of C:\\).
The parent property returns the logical parent of the path:
>>> p = PureWindowsPath('c:/python33/bin/python.exe')
>>> p.parent
PureWindowsPath('c:/python33/bin')
The parents property returns an immutable sequence of the path's logical ancestors:
>>> p = PureWindowsPath('c:/python33/bin/python.exe')
>>> len(p.parents)
3
>>> p.parents[0]
PureWindowsPath('c:/python33/bin')
>>> p.parents[1]
PureWindowsPath('c:/python33')
>>> p.parents[2]
PureWindowsPath('c:/')
Querying
is_relative() returns True if the path is relative (see definition above), False otherwise.
is_reserved() returns True if a Windows path is a reserved path such as CON or NUL. It always returns False for POSIX paths.
match() matches the path against a glob pattern. It operates on individual parts and matches from the right:
>>> p = PurePosixPath('/usr/bin')
>>> p.match('/usr/b*')
True
>>> p.match('usr/b*')
True
>>> p.match('b*')
True
>>> p.match('/u*')
False
This behaviour respects the following expectations:
- A simple pattern such as "*.py" matches arbitrarily long paths as long as the last part matches, e.g. "/usr/foo/bar.py".
- Longer patterns can be used as well for more complex matching, e.g. "/usr/foo/*.py" matches "/usr/foo/bar.py".
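Both expectations can be checked against the released pathlib module:

```python
from pathlib import PurePosixPath

p = PurePosixPath('/usr/foo/bar.py')
print(p.match('*.py'))           # True: only the last part needs to match
print(p.match('foo/*.py'))       # True: matching proceeds from the right
print(p.match('/usr/foo/*.py'))  # True: a fully anchored pattern also works
print(p.match('/foo/*.py'))      # False: an anchored pattern must match the whole path
```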
Concrete paths API
In addition to the operations of the pure API, concrete paths provide methods which actually access the filesystem to query or mutate information.
Constructing
The classmethod cwd() creates a path object pointing to the current working directory in absolute form:
>>> Path.cwd()
PosixPath('/home/antoine/pathlib')
File metadata
The stat() method returns the file's stat() result; similarly, lstat() returns the file's lstat() result (which is different iff the file is a symbolic link):
>>> p.stat()
posix.stat_result(st_mode=33277, st_ino=7483155, st_dev=2053, st_nlink=1, st_uid=500, st_gid=500, st_size=928, st_atime=1343597970, st_mtime=1328287308, st_ctime=1343597964)
Higher-level methods help examine the kind of the file:
>>> p.exists()
True
>>> p.is_file()
True
>>> p.is_dir()
False
>>> p.is_symlink()
False
>>> p.is_socket()
False
>>> p.is_fifo()
False
>>> p.is_block_device()
False
>>> p.is_char_device()
False
The file owner and group names (rather than numeric ids) are queried through corresponding methods:
>>> p = Path('/etc/shadow')
>>> p.owner()
'root'
>>> p.group()
'shadow'
Path resolution
The resolve() method makes a path absolute, resolving any symlink on the way (like the POSIX realpath() call). It is the only operation which will remove ".." path components. On Windows, this method will also take care to return the canonical path (with the right casing).
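A short sketch of this behaviour on a POSIX system (the pure API leaves ".." components untouched; only resolve() removes them):

```python
from pathlib import Path

# The pure API never collapses "..":
print(Path('/usr/..'))            # prints /usr/..
# resolve() removes it by consulting the filesystem:
print(Path('/usr/..').resolve())  # prints /

# The result is always an absolute path:
print(Path('.').resolve().is_absolute())  # True
```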
Directory walking
Simple (non-recursive) directory access is done by calling the iterdir() method, which returns an iterator over the child paths:
>>> p = Path('docs')
>>> for child in p.iterdir(): child
...
PosixPath('docs/conf.py')
PosixPath('docs/_templates')
PosixPath('docs/make.bat')
PosixPath('docs/index.rst')
PosixPath('docs/_build')
PosixPath('docs/_static')
PosixPath('docs/Makefile')
This allows simple filtering through list comprehensions:
>>> p = Path('.')
>>> [child for child in p.iterdir() if child.is_dir()]
[PosixPath('.hg'), PosixPath('docs'), PosixPath('dist'), PosixPath('__pycache__'), PosixPath('build')]
Simple and recursive globbing is also provided:
>>> for child in p.glob('**/*.py'): child
...
PosixPath('test_pathlib.py')
PosixPath('setup.py')
PosixPath('pathlib.py')
PosixPath('docs/conf.py')
PosixPath('build/lib/pathlib.py')
File opening
The open() method provides a file opening API similar to the builtin open() method:
>>> p = Path('setup.py')
>>> with p.open() as f: f.readline()
...
'#!/usr/bin/env python3\n'
Filesystem modification
Several common filesystem operations are provided as methods: touch(), mkdir(), rename(), replace(), unlink(), rmdir(), chmod(), lchmod(), symlink_to(). More operations could be provided, for example some of the functionality of the shutil module.
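A quick round-trip through several of these methods, run inside a temporary directory so the sketch is self-contained and cleans up after itself:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp) / 'subdir'
    d.mkdir()                      # create a directory
    f = d / 'hello.txt'
    f.touch()                      # create an empty file
    f.chmod(0o600)                 # change its permission bits
    f.rename(d / 'renamed.txt')    # rename the file
    print((d / 'renamed.txt').exists())  # True
    (d / 'renamed.txt').unlink()   # delete the file
    d.rmdir()                      # delete the now-empty directory
```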
Detailed documentation of the proposed API can be found at the pathlib docs [8].
Discussion
Division operator
The division operator came out first in a poll [9] about the path joining operator. Initial versions of pathlib [1] used square brackets (i.e. __getitem__) instead.
joinpath()
The joinpath() method was initially called join(), but several people objected that it could be confused with str.join() which has different semantics. Therefore it was renamed to joinpath().
Case-sensitivity
Windows users consider filesystem paths to be case-insensitive and expect path objects to observe that characteristic, even though in some rare situations some foreign filesystem mounts may be case-sensitive under Windows.
In the words of one commenter,
"If glob("*.py") failed to find SETUP.PY on Windows, that would be a usability disaster".
—Paul Moore in https://mail.python.org/pipermail/python-dev/2013-April/125254.html
References
| [1] | (1, 2, 3, 4) http://pypi.python.org/pypi/pathlib/ |
| [2] | https://github.com/jaraco/path.py |
| [3] | http://twistedmatrix.com/documents/current/api/twisted.python.filepath.FilePath.html |
| [4] | http://wiki.python.org/moin/AlternativePathClass |
| [5] | https://bitbucket.org/sluggo/unipath/overview |
| [6] | https://bitbucket.org/pitrou/pathlib/ |
| [7] | http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap04.html#tag_04_11 |
| [8] | https://pathlib.readthedocs.org/en/pep428/ |
| [9] | https://mail.python.org/pipermail/python-ideas/2012-October/016544.html |
Copyright
This document has been placed into the public domain.
pep-0429 Python 3.4 Release Schedule
| PEP: | 429 |
|---|---|
| Title: | Python 3.4 Release Schedule |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Larry Hastings <larry at hastings.org> |
| Status: | Active |
| Type: | Informational |
| Content-Type: | text/x-rst |
| Created: | 17-Oct-2012 |
| Python-Version: | 3.4 |
Contents
Abstract
This document describes the development and release schedule for Python 3.4. The schedule primarily concerns itself with PEP-sized items.
Release Manager and Crew
- 3.4 Release Manager: Larry Hastings
- Windows installers: Martin v. Löwis
- Mac installers: Ned Deily
- Documentation: Georg Brandl
Release Schedule
The releases:
- 3.4.0 alpha 1: August 3, 2013
- 3.4.0 alpha 2: September 9, 2013
- 3.4.0 alpha 3: September 29, 2013
- 3.4.0 alpha 4: October 20, 2013
- 3.4.0 beta 1: November 24, 2013
- 3.4.0 beta 2: January 5, 2014
- 3.4.0 beta 3: January 26, 2014
- 3.4.0 candidate 1: February 10, 2014
- 3.4.0 candidate 2: February 23, 2014
- 3.4.0 candidate 3: March 9, 2014
- 3.4.0 final: March 16, 2014
(Beta 1 was also "feature freeze"--no new features beyond this point.)
3.4.1 schedule
- 3.4.1 candidate 1: May 5, 2014
- 3.4.1 final: May 18, 2014
3.4.2 schedule
- 3.4.2 candidate 1: September 22, 2014
- 3.4.2 final: October 6, 2014
3.4.3 schedule
- 3.4.3 candidate 1: February 8, 2015
- 3.4.3 final: February 25, 2015
Features for 3.4
Implemented / Final PEPs:
- PEP 428, a "pathlib" module providing object-oriented filesystem paths
- PEP 435, a standardized "enum" module
- PEP 436, a build enhancement that will help generate introspection information for builtins
- PEP 442, improved semantics for object finalization
- PEP 443, adding single-dispatch generic functions to the standard library
- PEP 445, a new C API for implementing custom memory allocators
- PEP 446, changing file descriptors to not be inherited by default in subprocesses
- PEP 450, a new "statistics" module
- PEP 451, standardizing module metadata for Python's module import system
- PEP 453, a bundled installer for the pip package manager
- PEP 454, a new "tracemalloc" module for tracing Python memory allocations
- PEP 456, a new hash algorithm for Python strings and binary data
- PEP 3154, a new and improved protocol for pickled objects
- PEP 3156, a new "asyncio" module, a new framework for asynchronous I/O
Deferred to post-3.4:
Copyright
This document has been placed in the public domain.
pep-0430 Migrating to Python 3 as the default online documentation
| PEP: | 430 |
|---|---|
| Title: | Migrating to Python 3 as the default online documentation |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Nick Coghlan <ncoghlan at gmail.com> |
| BDFL-Delegate: | Georg Brandl |
| Status: | Final |
| Type: | Informational |
| Content-Type: | text/x-rst |
| Created: | 27-Oct-2012 |
Contents
Abstract
This PEP proposes a strategy for migrating the default version of the Python documentation presented to users of Python when accessing docs.python.org from 2.7 to Python 3.3.
It proposes a backwards compatible scheme that preserves the meaning of existing deep links into the Python 2 documentation, while still presenting the Python 3 documentation by default, and presenting the Python 2 and 3 documentation in a way that avoids making the Python 3 documentation look like a second-class citizen.
Background
With the transition of the overall Python ecosystem from Python 2 to Python 3 still in progress, one question which arises periodically [1, 2] is when and how to handle the change from providing the Python 2 documentation as the default version displayed at the docs.python.org root URL to providing the Python 3 documentation.
Key Concerns
There are a couple of key concerns that any migration proposal needs to address.
Don't Confuse Beginners
Many beginners learn Python through third party resources. These resources, not all of which are online, may link into the python.org online documentation for additional background and details.
Importantly, even when the online documentation is updated, the "version added" and "version changed" tags usually provide enough information for users to adjust appropriately for the specific version they are using.
While deep links into the python.org documentation may occasionally break within the Python 2 series, this is very rare.
Migrating to Python 3 is a very different matter. Many links would break due to renames and removals, and the "version added" and "version changed" information for the Python 2 series is completely absent.
Don't Break Useful Resources
There are many useful Python resources out there, such as the mailing list archives on python.org and question-and-answer sites like Stack Overflow, where links are highly unlikely to be updated, no matter how much notice is provided.
Old posts and answers to questions all currently link to docs.python.org expecting to get the Python 2 documentation at unqualified URLs. Links from answers that relate to Python 3 are explicitly qualified with /py3k/ in the path component.
Proposal
This PEP (based on an idea originally put forward back in May [3]) is to not migrate the Python 2 specific deep links at all, and instead adopt a scheme where all URLs presented to users on docs.python.org are qualified appropriately with the relevant release series.
Visitors to the root URL at http://docs.python.org will be automatically redirected to http://docs.python.org/3/, but links deeper in the version-specific hierarchy, such as to http://docs.python.org/library/os, will instead be redirected to a Python 2 specific link such as http://docs.python.org/2/library/os.
The specific subpaths which will be redirected to explicitly qualified paths for the Python 2 docs are:
- /c-api/
- /distutils/
- /extending/
- /faq/
- /howto/
- /library/
- /reference/
- /tutorial/
- /using/
- /whatsnew/
- /about.html
- /bugs.html
- /contents.html
- /copyright.html
- /license.html
- /genindex.html
- /glossary.html
- /py-modindex.html
- /search.html
The existing /py3k/ subpath will be redirected to the new /3/ subpath.
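The redirect rules above can be sketched as a small pure function. This is a hypothetical helper for illustration only, not part of any python.org infrastructure code; the prefix list is taken directly from the PEP text:

```python
# Subpaths that redirect to explicitly qualified Python 2 paths (from the PEP):
PY2_SUBPATHS = (
    '/c-api/', '/distutils/', '/extending/', '/faq/', '/howto/',
    '/library/', '/reference/', '/tutorial/', '/using/', '/whatsnew/',
    '/about.html', '/bugs.html', '/contents.html', '/copyright.html',
    '/license.html', '/genindex.html', '/glossary.html',
    '/py-modindex.html', '/search.html',
)

def redirect(path: str) -> str:
    """Map an unqualified docs.python.org path to a qualified one."""
    if path in ('', '/'):
        return '/3/'                            # root goes to Python 3
    if path.startswith('/py3k/'):
        return '/3/' + path[len('/py3k/'):]     # legacy /py3k/ alias
    if path.startswith(PY2_SUBPATHS):
        return '/2' + path                      # legacy deep links stay Python 2
    return path  # already qualified (e.g. /2.7/..., /dev/...)

print(redirect('/library/os'))       # /2/library/os
print(redirect('/'))                 # /3/
print(redirect('/py3k/library/os'))  # /3/library/os
```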
Presented URLs
With this scheme, the following URLs would be presented to users after resolution of any aliasing and rewriting rules:
- http://docs.python.org/x/*
- http://docs.python.org/x.y/*
- http://docs.python.org/dev/*
- http://docs.python.org/release/x.y.z/*
- http://docs.python.org/devguide
The /x/ URLs mean "give me the latest documentation for a released version in this release series". It will draw the documentation from the relevant maintenance branch in source control (this will always be the 2.7 branch for Python 2 and is currently 3.3 for Python 3). Differences relative to previous versions in the release series will be available through "version added" and "version changed" markers.
The /x.y/ URLs mean "give me the latest documentation for this release". It will draw the documentation from the relevant maintenance branch in source control (or the default branch for the currently in development version). It differs from the status quo in that the URLs will actually remain available in the user's browser for easy copy and pasting. (Currently, references to specific versions that are not the latest in their release series will resolve to a stable URL for a specific maintenance version in the "release" hierarchy, while the current latest version in the release series resolves to the release series URL. This makes it hard to get a "latest version specific URL", since it is always necessary to construct them manually).
The /dev/ URL means the documentation for the default branch in source control.
The /release/x.y.z/ URLs will refer to the documentation of those releases, exactly as it was at the time of the release.
The developer's guide is not version specific, and thus retains its own stable /devguide/ URL.
Rationale
There is some desire to switch the unqualified references to mean Python 3 as a sign of confidence in Python 3. Such a move would either break a lot of things, or else involve an awful lot of work to avoid breaking things.
I believe we can get much the same effect without breaking the world by:
- Deprecating the use of unqualified references to the online documentation (while promising to preserve the meaning of such references indefinitely)
- Updating all python.org and python-dev controlled links to use qualified references (excluding archived email)
- Redirecting visitors to the root of http://docs.python.org to http://docs.python.org/3.x
Most importantly, because this scheme doesn't alter the behaviour of any existing deep links, it could be implemented with a significantly shorter warning period than would be required for a scheme that risked breaking deep links, or started to redirect unqualified links to Python 3. The only part of the scheme which would require any warning at all is the step of redirecting the "http://docs.python.org/" landing page to the Python 3.3 documentation.
Namespaces are one honking great idea - let's do more of those.
Note that the approach described in this PEP gives two ways to access the content of the default branch: as /dev/ or using the appropriate /x.y/ reference. This is deliberate, as the default branch is referenced for two different purposes:
- to provide additional information when discussing an upcoming feature of the next release (a /x.y/ URL is appropriate)
- to provide a stable destination for developers to access the documentation of the next feature release, regardless of the version (a /dev/ URL is appropriate)
Implementation
The URLs on docs.python.org are controlled by the python.org infrastructure team rather than through the CPython source repo, so acceptance and implementation of the ideas in this PEP will be up to the team.
References
| [1] | May 2012 discussion (http://mail.python.org/pipermail/python-dev/2012-May/119524.html) |
| [2] | October 2012 discussion (http://mail.python.org/pipermail/python-ideas/2012-October/017406.html) |
| [3] | Using a "/latest/" path prefix (http://mail.python.org/pipermail/python-dev/2012-May/119567.html) |
Copyright
This document has been placed in the public domain.
pep-0431 Time zone support improvements
| PEP: | 431 |
|---|---|
| Title: | Time zone support improvements |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Lennart Regebro <regebro at gmail.com> |
| BDFL-Delegate: | Barry Warsaw <barry@python.org> |
| Status: | Draft |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 11-Dec-2012 |
| Post-History: | 11-Dec-2012, 28-Dec-2012, 28-Jan-2013 |
Abstract
This PEP proposes the implementation of concrete time zone support in the Python standard library, and also improvements to the time zone API to deal with ambiguous time specifications during DST changes.
Proposal
Concrete time zone support
The time zone support in Python has no concrete implementation in the standard library outside of a tzinfo baseclass that supports fixed offsets. To properly support time zones you need to include a database of all time zones, both current and historical, including daylight saving changes. But such information changes frequently, so even if we include the latest information in a Python release, that information would be outdated just a few months later.
Time zone support has therefore only been available through two third-party modules, pytz and dateutil, both of which include and wrap the "zoneinfo" database. This database, also called "tz" or "the Olson database", is the de facto standard time zone database, and it is included in most Unix and Unix-like operating systems, including OS X.
This gives us the opportunity to include the code that supports the zoneinfo data in the standard library, but by default use the operating system's copy of the data, which typically will be kept updated by the updating mechanism of the operating system or distribution.
For those who have an operating system that does not include the zoneinfo database, for example Windows, the Python source distribution will include a copy of the zoneinfo database, and a distribution containing the latest zoneinfo database will also be available at the Python Package Index, so it can be easily installed with the Python packaging tools such as easy_install or pip. This could also be done on Unices that are no longer receiving updates and therefore have an outdated database.
With such a mechanism Python would have full time zone support in the standard library on any platform, and a simple package installation would provide an updated time zone database on those platforms where the zoneinfo database isn't included, such as Windows, or on platforms where OS updates are no longer provided.
The time zone support will be implemented by making the datetime module into a package, and adding time zone support to datetime based on Stuart Bishop's pytz module.
Getting the local time zone
On Unix there is no standard way of finding the name of the time zone that is being used. All the information that is available is the time zone abbreviations, such as EST and PDT, but many of those abbreviations are ambiguous and therefore you can't rely on them to figure out which time zone you are located in.
There is however a standard for finding the compiled time zone information since it's located in /etc/localtime. Therefore it is possible to create a local time zone object with the correct time zone information even though you don't know the name of the time zone. A function in datetime should be provided to return the local time zone.
The support for this will be made by integrating Lennart Regebro's tzlocal module into the new datetime module.
For Windows it will look up the local Windows time zone name, and use a mapping between Windows time zone names and zoneinfo time zone names provided by the Unicode consortium to convert that to a zoneinfo time zone.
The mapping should be updated before each major or bugfix release, scripts for doing so will be provided in the Tools/ directory.
Ambiguous times
When changing over from daylight savings time (DST) the clock is turned back one hour. This means that the times during that hour happen twice, once with DST and then once without DST. Similarly, when changing to daylight savings time, one hour goes missing.
The current time zone API can not differentiate between the two ambiguous times during a change from DST. For example, in Stockholm the time of 2012-10-28 02:00:00 happens twice, both at UTC 2012-10-28 00:00:00 and also at UTC 2012-10-28 01:00:00.
The current time zone API can not disambiguate this and therefore it's unclear which time should be returned:
# This could be either 00:00 or 01:00 UTC:
>>> dt = datetime(2012, 10, 28, 2, 0, tzinfo=zoneinfo('Europe/Stockholm'))
# But we can not specify which:
>>> dt.astimezone(zoneinfo('UTC'))
datetime.datetime(2012, 10, 28, 1, 0, tzinfo=<UTC>)
pytz solved this problem by adding is_dst parameters to several methods of the tzinfo objects to make it possible to disambiguate times when this is desired.
This PEP proposes to add these is_dst parameters to the relevant methods of the datetime API, and therefore add this functionality directly to datetime. This is likely the hardest part of this PEP, as it involves updating the C version of the datetime library with this functionality, which means writing new code rather than just reorganizing existing external libraries.
Implementation API
The zoneinfo database
The latest version of the zoneinfo database should exist in the Lib/tzdata directory of the Python source control system. This copy of the database should be updated before every Python feature and bug-fix release, but not for releases of Python versions that are in security-fix-only-mode.
Scripts to update the database will be provided in Tools/, and the release instructions will be updated to include this update.
New configure options --enable-internal-timezone-database and --disable-internal-timezone-database will be implemented to enable and disable the installation of this database when installing from source. A source install will default to installing them.
Binary installers for systems that have a system-provided zoneinfo database may skip installing the included database since it would never be used for these platforms. For other platforms, for example Windows, binary installers must install the included database.
Changes in the datetime-module
The public API of the new time zone support contains one new class, one new function, one new exception and four new collections. In addition to this, several methods on the datetime object get a new is_dst parameter.
New class dsttimezone
This class provides a concrete implementation of the tzinfo base class that implements DST support.
New function zoneinfo(name=None, db_path=None)
This function takes a name argument that must be a string specifying a valid zoneinfo time zone, e.g. "US/Eastern", "Europe/Warsaw" or "Etc/GMT". If not given, the local time zone will be looked up. If an invalid zone name is given, or the local time zone can not be retrieved, the function raises UnknownTimeZoneError.
The function also takes an optional path to the location of the zoneinfo database which should be used. If not specified, the function will look for databases in the following order:
- Check if the tzdata-update module is installed, and then use that database.
- Use the database in /usr/share/zoneinfo, if it exists.
- Use the Python-provided database in Lib/tzdata.
If no database is found an UnknownTimeZoneError or subclass thereof will be raised with a message explaining that no zoneinfo database can be found, but that you can install one with the tzdata-update package.
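The lookup order above can be sketched as a small function. This is illustration only: the module name `tzdata_update` and its `db_path` attribute are hypothetical (the PEP does not pin down the update package's API), and a plain KeyError stands in for the proposed UnknownTimeZoneError, which the PEP specifies as a KeyError subclass:

```python
import os

def locate_zoneinfo_database(db_path=None, bundled='Lib/tzdata'):
    """Return the first available zoneinfo database location, in PEP order."""
    if db_path is not None:
        return db_path                  # explicit path wins
    try:
        import tzdata_update            # hypothetical pip-installable package
    except ImportError:
        pass
    else:
        return tzdata_update.db_path    # hypothetical attribute name
    if os.path.isdir('/usr/share/zoneinfo'):
        return '/usr/share/zoneinfo'    # the system database
    if os.path.isdir(bundled):
        return bundled                  # the copy shipped with Python itself
    raise KeyError('no zoneinfo database found; '
                   'install the tzdata-update package')
```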
New parameter is_dst
A new is_dst parameter is added to several methods to handle time ambiguity during DST changeovers.
- tzinfo.utcoffset(dt, is_dst=False)
- tzinfo.dst(dt, is_dst=False)
- tzinfo.tzname(dt, is_dst=False)
- datetime.astimezone(tz, is_dst=False)
The is_dst parameter can be False (default), True, or None.
False will specify that the given datetime should be interpreted as not happening during daylight savings time, i.e. that the time specified is after the change from DST. This is the default, to preserve existing behavior.
True will specify that the given datetime should be interpreted as happening during daylight savings time, i.e. that the time specified is before the change from DST.
None will raise an AmbiguousTimeError exception if the time specified was during a DST change over. It will also raise a NonExistentTimeError if a time is specified during the "missing time" in a change to DST.
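For comparison, Python eventually addressed the same ambiguity through PEP 495's fold attribute rather than an is_dst parameter. With the modern stdlib zoneinfo module (Python 3.9+), the PEP's Stockholm example disambiguates like this:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

tz = ZoneInfo('Europe/Stockholm')
# 02:00 local time happens twice on 2012-10-28; fold selects which one:
first = datetime(2012, 10, 28, 2, 0, tzinfo=tz)           # fold=0: DST still active
second = datetime(2012, 10, 28, 2, 0, fold=1, tzinfo=tz)  # fold=1: after the change
print(first.astimezone(timezone.utc))   # 2012-10-28 00:00:00+00:00
print(second.astimezone(timezone.utc))  # 2012-10-28 01:00:00+00:00
```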
New exceptions
UnknownTimeZoneError
This exception is a subclass of KeyError and raised when giving a time zone specification that can't be found:
>>> datetime.zoneinfo('Europe/New_York')
Traceback (most recent call last):
...
UnknownTimeZoneError: There is no time zone called 'Europe/New_York'
InvalidTimeError
This exception serves as a base for AmbiguousTimeError and NonExistentTimeError, to enable you to trap these two separately. It will subclass from ValueError, so that you can catch these errors together with inputs like the 29th of February 2011.
AmbiguousTimeError
This exception is raised when giving a datetime specification that is ambiguous while setting is_dst to None:
>>> datetime(2012, 10, 28, 2, 0, tzinfo=zoneinfo('Europe/Stockholm'), is_dst=None)
Traceback (most recent call last):
...
AmbiguousTimeError: 2012-10-28 02:00:00 is ambiguous in time zone Europe/Stockholm
NonExistentTimeError
This exception is raised when giving a datetime specification for a time that due to daylight saving does not exist, while setting is_dst to None:
>>> datetime(2012, 3, 25, 2, 0, tzinfo=zoneinfo('Europe/Stockholm'), is_dst=None)
Traceback (most recent call last):
...
NonExistentTimeError: 2012-03-25 02:00:00 does not exist in time zone Europe/Stockholm
New collections
- all_timezones is the exhaustive list of the time zone names that can be used, listed alphabetically.
- common_timezones is a list of useful, current time zones, listed alphabetically.
The tzdata-update-package
The zoneinfo database will be packaged for easy installation with easy_install/pip/buildout. This package will not install any Python code, and will not contain any Python code except that which is needed for installation.
It will be kept updated with the same tools as the internal database, but released whenever the zoneinfo-database is updated, and use the same version schema.
Differences from the pytz API
- pytz has the functions localize() and normalize() to work around that tzinfo doesn't have is_dst. When is_dst is implemented directly in datetime.tzinfo they are no longer needed.
- The timezone() function is called zoneinfo() to avoid clashing with the timezone class introduced in Python 3.2.
- zoneinfo() will return the local time zone if called without arguments.
- The class pytz.StaticTzInfo is there to provide the is_dst support for static time zones. When is_dst support is included in datetime.tzinfo it is no longer needed.
- InvalidTimeError subclasses from ValueError.
Copyright
This document has been placed in the public domain.
pep-0432 Simplifying the CPython startup sequence
| PEP: | 432 |
|---|---|
| Title: | Simplifying the CPython startup sequence |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Nick Coghlan <ncoghlan at gmail.com> |
| Status: | Draft |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 28-Dec-2012 |
| Python-Version: | 3.5 |
| Post-History: | 28-Dec-2012, 2-Jan-2013 |
Contents
- Abstract
- Proposal
- Background
- Key Concerns
- Required Configuration Settings
- Design Details
- Interpreter Initialization Phases
- Main Execution Phases
- Invocation of Phases
- Pre-Initialization Phase
- Determining the remaining configuration settings
- Supported configuration settings
- Completing the interpreter initialization
- Preparing the main module
- Executing the main module
- Internal Storage of Configuration Data
- Creating and Configuring Subinterpreters
- Stable ABI
- Build time configuration
- Backwards Compatibility
- A System Python Executable
- Open Questions
- Implementation
- The Status Quo
- References
- Copyright
Abstract
This PEP proposes a mechanism for simplifying the startup sequence for CPython, making it easier to modify the initialization behaviour of the reference interpreter executable, as well as making it easier to control CPython's startup behaviour when creating an alternate executable or embedding it as a Python execution engine inside a larger application.
Note: TBC = To Be Confirmed, TBD = To Be Determined. The appropriate resolution for most of these should become clearer as the reference implementation is developed.
Proposal
This PEP proposes that CPython move to an explicit multi-phase initialization process, where a preliminary interpreter is put in place with limited OS interaction capabilities early in the startup sequence. This essential core remains in place while all of the configuration settings are determined, until a final configuration call takes those settings and finishes bootstrapping the interpreter immediately before locating and executing the main module.
In the new design, the interpreter will move through the following well-defined phases during the initialization sequence:
- Pre-Initialization - no interpreter available
- Initializing - interpreter partially available
- Initialized - interpreter available, __main__ related metadata incomplete
With the interpreter itself fully initialised, main module execution will then proceed through two phases:
- Main Preparation - __main__ related metadata populated
- Main Execution - bytecode executing in the __main__ module namespace
(Embedding applications may choose not to use the Main Preparation and Execution phases)
As a concrete use case to help guide any design changes, and to solve a known problem where the appropriate defaults for system utilities differ from those for running user scripts, this PEP also proposes the creation and distribution of a separate system Python (pysystem) executable which, by default, ignores user site directories and environment variables, and does not implicitly set sys.path[0] based on the current directory or the script being executed (it will, however, still support virtual environments).
To keep the implementation complexity under control, this PEP does not propose wholesale changes to the way the interpreter state is accessed at runtime. Changing the order in which the existing initialization steps occur in order to make the startup sequence easier to maintain is already a substantial change, and attempting to make those other changes at the same time will make the change significantly more invasive and much harder to review. However, such proposals may be suitable topics for follow-on PEPs or patches - one key benefit of this PEP is decreasing the coupling between the internal storage model and the configuration interface, so such changes should be easier once this PEP has been implemented.
Background
Over time, CPython's initialization sequence has become progressively more complicated, offering more options, as well as performing more complex tasks (such as configuring the Unicode settings for OS interfaces in Python 3 [10], bootstrapping a pure Python implementation of the import system, and implementing an isolated mode more suitable for system applications that run with elevated privileges [6]).
Much of this complexity is formally accessible only through the Py_Main and Py_Initialize APIs, offering embedding applications little opportunity for customisation. This creeping complexity also makes life difficult for maintainers, as much of the configuration needs to take place prior to the Py_Initialize call, meaning much of the Python C API cannot be used safely.
A number of proposals are on the table for even more sophisticated startup behaviour, such as better control over sys.path initialization (easily adding additional directories on the command line in a cross-platform fashion [7], as well as controlling the configuration of sys.path[0] [8]), and easier configuration of utilities like coverage tracing when launching Python subprocesses [9].
Rather than continuing to bolt such behaviour onto an already complicated system, this PEP proposes to start simplifying the status quo by introducing a more structured startup sequence, with the aim of making these further feature requests easier to implement.
Key Concerns
There are a couple of key concerns that any change to the startup sequence needs to take into account.
Maintainability
The current CPython startup sequence is difficult to understand, and even more difficult to modify. It is not clear what state the interpreter is in while much of the initialization code executes, leading to behaviour such as lists, dictionaries and Unicode values being created prior to the call to Py_Initialize when the -X or -W options are used [1].
By moving to an explicitly multi-phase startup sequence, developers should only need to understand which features are not available in the core bootstrapping phase, as the vast majority of the configuration process will now take place during that phase.
By basing the new design on a combination of C structures and Python data types, it should also be easier to modify the system in the future to add new configuration options.
Performance
CPython is used heavily to run short scripts where the runtime is dominated by the interpreter initialization time. Any changes to the startup sequence should minimise their impact on the startup overhead.
Experience with the importlib migration suggests that the startup time is dominated by IO operations. However, to monitor the impact of any changes, a simple benchmark can be used to check how long it takes to start and then tear down the interpreter:
python3 -m timeit -s "from subprocess import call" "call(['./python', '-c', 'pass'])"
Current numbers on my system for Python 3.5 (using the 3.4 subprocess and timeit modules to execute the check, all with non-debug builds):
$ python3 -m timeit -s "from subprocess import call" "call(['./python', '-c', 'pass'])"
10 loops, best of 3: 18.2 msec per loop
This PEP is not expected to have any significant effect on the startup time, as it is aimed primarily at reordering the existing initialization sequence, without making substantial changes to the individual steps.
However, if this simple check suggests that the proposed changes to the initialization sequence may pose a performance problem, then a more sophisticated microbenchmark will be developed to assist in investigation.
Required Configuration Settings
A comprehensive configuration scheme requires that an embedding application be able to control the following aspects of the final interpreter state:
- Whether or not to use randomised hashes (and if used, potentially specify a specific random seed)
- Whether or not to enable the import system (required by CPython's build process when freezing the importlib._bootstrap bytecode)
- The "Where is Python located?" elements in the sys module:
  - sys.executable
  - sys.base_exec_prefix
  - sys.base_prefix
  - sys.exec_prefix
  - sys.prefix
- The path searched for imports from the filesystem (and other path hooks):
  - sys.path
- The command line arguments seen by the interpreter:
  - sys.argv
- The filesystem encoding used by:
  - sys.getfilesystemencoding
  - os.fsencode
  - os.fsdecode
- The IO encoding (if any) and the buffering used by:
  - sys.stdin
  - sys.stdout
  - sys.stderr
- The initial warning system state:
  - sys.warnoptions
- Arbitrary extended options (e.g. to automatically enable faulthandler):
  - sys._xoptions
- Whether or not to implicitly cache bytecode files:
  - sys.dont_write_bytecode
- Whether or not to enforce correct case in filenames on case-insensitive platforms:
  - os.environ["PYTHONCASEOK"]
- The other settings exposed to Python code in sys.flags:
- debug (Enable debugging output in the pgen parser)
- inspect (Enter interactive interpreter after __main__ terminates)
- interactive (Treat stdin as a tty)
- optimize (__debug__ status, write .pyc or .pyo, strip doc strings)
- no_user_site (don't add the user site directory to sys.path)
- no_site (don't implicitly import site during startup)
- ignore_environment (whether environment vars are used during config)
- verbose (enable all sorts of random output)
- bytes_warning (warnings/errors for implicit str/bytes interaction)
- quiet (disable banner output even if verbose is also enabled or stdin is a tty and the interpreter is launched in interactive mode)
- Whether or not CPython's signal handlers should be installed
- What code (if any) should be executed as __main__:
- Nothing (just create an empty module)
- A filesystem path referring to a Python script (source or bytecode)
- A filesystem path referring to a valid sys.path entry (typically a directory or zipfile)
- A given string (equivalent to the "-c" option)
- A module or package (equivalent to the "-m" option)
- Standard input as a script (i.e. a non-interactive stream)
- Standard input as an interactive interpreter session
<TBD: Did I miss anything?>
Note that this just covers settings that are currently configurable in some manner when using the main CPython executable. While this PEP aims to make adding additional configuration settings easier in the future, it deliberately avoids adding any new settings of its own (except where such additional settings arise naturally in the course of migrating existing settings to the new structure).
Design Details
(Note: details here are still very much in flux, but preliminary feedback is appreciated anyway)
The main theme of this proposal is to create the interpreter state for the main interpreter much earlier in the startup process. This will allow most of the CPython API to be used during the remainder of the initialization process, potentially simplifying a number of operations that currently need to rely on basic C functionality rather than being able to use the richer data structures provided by the CPython C API.
In the following, the term "embedding application" also covers the standard CPython command line application.
Interpreter Initialization Phases
Three distinct interpreter initialisation phases are proposed:
- Pre-Initialization:
- no interpreter is available.
- Py_IsInitializing() returns 0
- Py_IsInitialized() returns 0
- The embedding application determines the settings required to create the main interpreter and moves to the next phase by calling Py_BeginInitialization.
- Initializing:
- the main interpreter is available, but only partially configured.
- Py_IsInitializing() returns 1
- Py_IsInitialized() returns 0
- The embedding application determines and applies the settings required to complete the initialization process by calling Py_ReadConfig and Py_EndInitialization.
- Initialized:
- the main interpreter is available and fully operational, but __main__ related metadata is incomplete
- Py_IsInitializing() returns 0
- Py_IsInitialized() returns 1
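The two query functions can be read as a small state machine over the three phases. The sketch below is stand-alone stub code, not CPython itself; the lowercase names are invented stand-ins for the proposed APIs (noted in the comments), and it simply encodes the flag values listed for each phase:

```c
#include <assert.h>

/* Stand-in for the proposed phase tracking: one enum value per phase. */
enum phase { PRE_INITIALIZATION, INITIALIZING, INITIALIZED };
static enum phase current_phase = PRE_INITIALIZATION;

/* Mirrors Py_IsInitializing(): true only during the Initializing phase. */
static int is_initializing(void) { return current_phase == INITIALIZING; }

/* Mirrors Py_IsInitialized(): true only once initialization completes. */
static int is_initialized(void) { return current_phase == INITIALIZED; }

/* Mirrors Py_BeginInitialization(): calling it again once initialization
 * is underway or complete is specified as a fatal error, modelled here
 * with assert(). */
static void begin_initialization(void) {
    assert(!is_initializing() && !is_initialized());
    current_phase = INITIALIZING;
}

/* Mirrors Py_EndInitialization(): only legal from the Initializing phase. */
static void end_initialization(void) {
    assert(is_initializing() && !is_initialized());
    current_phase = INITIALIZED;
}
```

The key property the sketch captures is that the two flags are never true at the same time, so embedding code can always tell exactly which phase it is in.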
Main Execution Phases
After initializing the interpreter, the embedding application may continue on to execute code in the __main__ module namespace.
- Main Preparation:
- subphase of Initialized (not separately identified at runtime)
- fully populates __main__ related metadata
- may execute code in __main__ namespace (e.g. PYTHONSTARTUP)
- invoked as PyRun_PrepareMain
- Main Execution:
- subphase of Initialized (not separately identified at runtime)
- user supplied bytecode is being executed in the __main__ namespace
- invoked as PyRun_ExecMain
Invocation of Phases
All listed phases will be used by the standard CPython interpreter and the proposed System Python interpreter. Other embedding applications may choose to skip the step of executing code in the __main__ namespace.
An embedding application may still continue to leave initialization almost entirely under CPython's control by using the existing Py_Initialize API. Alternatively, if an embedding application wants greater control over CPython's initial state, it will be able to use the new, finer grained API, which allows the embedding application greater control over the initialization process:
/* Phase 1: Pre-Initialization */
PyCoreConfig core_config = PyCoreConfig_INIT;
PyConfig config = PyConfig_INIT;
/* Easily control the core configuration */
core_config.ignore_environment = 1; /* Ignore environment variables */
core_config.use_hash_seed = 0;      /* Full hash randomisation */
Py_BeginInitialization(&core_config);
/* Phase 2: Initialization */
/* Optionally preconfigure some settings here - they will then be
 * used to derive other settings */
Py_ReadConfig(&config);
/* Can completely override derived settings here */
Py_EndInitialization(&config);
/* Phase 3: Initialized */
/* If an embedding application has no real concept of a main module
 * it can just stop the initialization process here.
 * Alternatively, it can launch __main__ via the PyRun_*Main functions.
 */
Pre-Initialization Phase
The pre-initialization phase is where an embedding application determines the settings which are absolutely required before the interpreter can be initialized at all. Currently, the primary configuration settings in this category are those related to the randomised hash algorithm - the hash algorithms must be consistent for the lifetime of the process, and so they must be in place before the core interpreter is created.
The specific settings needed are a flag indicating whether or not to use a specific seed value for the randomised hashes, and if so, the specific value for the seed (a seed value of zero disables randomised hashing). In addition, due to the possible use of PYTHONHASHSEED in configuring the hash randomisation, the question of whether or not to consider environment variables must also be addressed early. Finally, to support the CPython build process, an option is offered to completely disable the import system.
The proposed API for this step in the startup sequence is:
void Py_BeginInitialization(const PyCoreConfig *config);
Like Py_Initialize, this part of the new API treats initialization failures as fatal errors. While that's still not particularly embedding friendly, the operations in this step really shouldn't be failing, and changing them to return error codes instead of aborting would be an even larger task than the one already being proposed.
The new PyCoreConfig struct holds the settings required for preliminary configuration:
/* Note: if changing anything in PyCoreConfig, also update
* PyCoreConfig_INIT */
typedef struct {
int ignore_environment; /* -E switch, -I switch */
int use_hash_seed; /* PYTHONHASHSEED */
unsigned long hash_seed; /* PYTHONHASHSEED */
int _disable_importlib; /* Needed by freeze_importlib */
} PyCoreConfig;
#define PyCoreConfig_INIT {0, -1, 0, 0}
The core configuration settings pointer may be NULL, in which case the default values from PyCoreConfig_INIT are used.
The PyCoreConfig_INIT macro is designed to allow easy initialization of a struct instance with sensible defaults:
PyCoreConfig core_config = PyCoreConfig_INIT;
ignore_environment controls the processing of all Python related environment variables. If the flag is zero, then environment variables are processed normally. Otherwise, all Python-specific environment variables are considered undefined (exceptions may be made for some OS specific environment variables, such as those used on Mac OS X to communicate between the App bundle and the main Python binary).
use_hash_seed controls the configuration of the randomised hash algorithm. If it is zero, then randomised hashes with a random seed will be used. If it is positive, then the value in hash_seed will be used to seed the random number generator. If hash_seed is zero in this case, then randomised hashing is disabled completely.
If use_hash_seed is negative (and ignore_environment is zero), then CPython will inspect the PYTHONHASHSEED environment variable. If the environment variable is not set, is set to the empty string, or to the value "random", then randomised hashes with a random seed will be used. If the environment variable is set to the string "0", the randomised hashing will be disabled. Otherwise, the hash seed is expected to be a string representation of an integer in the range [0, 4294967295].
To make it easier for embedding applications to use the PYTHONHASHSEED processing with a different data source, the following helper function will be added to the C API:
int Py_ReadHashSeed(char *seed_text,
int *use_hash_seed,
unsigned long *hash_seed);
This function accepts a seed string in seed_text and converts it to the appropriate flag and seed values. If seed_text is NULL, the empty string or the value "random", both use_hash_seed and hash_seed will be set to zero. Otherwise, use_hash_seed will be set to 1 and the seed text will be interpreted as an integer and reported as hash_seed. On success the function will return zero. A non-zero return value indicates an error (most likely in the conversion to an integer).
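Those parsing rules can be exercised in isolation. The following is a stand-alone reconstruction in plain C of the behaviour just described, not CPython's implementation; the function name read_hash_seed and its exact error handling are assumptions of the sketch:

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Stand-alone reconstruction of the rules described for Py_ReadHashSeed:
 * NULL / "" / "random" request a random seed, anything else must be a
 * decimal integer in [0, 4294967295]. Returns 0 on success, -1 on error. */
static int
read_hash_seed(const char *seed_text, int *use_hash_seed,
               unsigned long *hash_seed)
{
    /* NULL, the empty string and "random" all mean "use a random seed". */
    if (seed_text == NULL || seed_text[0] == '\0' ||
            strcmp(seed_text, "random") == 0) {
        *use_hash_seed = 0;
        *hash_seed = 0;
        return 0;
    }
    /* Anything else must be a non-negative decimal integer. */
    if (!isdigit((unsigned char)seed_text[0]))
        return -1;
    errno = 0;
    char *end = NULL;
    unsigned long value = strtoul(seed_text, &end, 10);
    if (errno != 0 || *end != '\0' || value > 4294967295UL)
        return -1;  /* conversion error or out of range */
    *use_hash_seed = 1;
    *hash_seed = value;  /* a seed of zero disables randomisation */
    return 0;
}
```

Note that "0" parses successfully (use_hash_seed = 1, hash_seed = 0), which per the semantics above disables hash randomisation entirely rather than seeding it.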
The _disable_importlib setting is used as part of the CPython build process to create an interpreter with no import capability at all. It is considered private to the CPython development team (hence the leading underscore), as the only known use case is to permit compiler changes that invalidate the previously frozen bytecode for importlib._bootstrap without breaking the build process.
The aim is to keep this initial level of configuration as small as possible in order to keep the bootstrapping environment consistent across different embedding applications. If we can create a valid interpreter state without the setting, then the setting should go in the configuration passed to Py_EndInitialization() rather than in the core configuration.
A new query API will allow code to determine if the interpreter is in the bootstrapping state between the creation of the interpreter state and the completion of the bulk of the initialization process:
int Py_IsInitializing();
Attempting to call Py_BeginInitialization() again when Py_IsInitializing() or Py_IsInitialized() is true is a fatal error.
While in the initializing state, the interpreter should be fully functional except that:
- compilation is not allowed (as the parser and compiler are not yet configured properly)
- creation of subinterpreters is not allowed
- creation of additional thread states is not allowed
- The following attributes in the sys module are all either missing or None:
  - sys.path
  - sys.argv
  - sys.executable
  - sys.base_exec_prefix
  - sys.base_prefix
  - sys.exec_prefix
  - sys.prefix
  - sys.warnoptions
  - sys.flags
  - sys.dont_write_bytecode
  - sys.stdin
  - sys.stdout
- The filesystem encoding is not yet defined
- The IO encoding is not yet defined
- CPython signal handlers are not yet installed
- only builtin and frozen modules may be imported (due to above limitations)
- sys.stderr is set to a temporary IO object using unbuffered binary mode
- The warnings module is not yet initialized
- The __main__ module does not yet exist
<TBD: identify any other notable missing functionality>
The main things made available by this step will be the core Python datatypes, in particular dictionaries, lists and strings. This allows them to be used safely for all of the remaining configuration steps (unlike the status quo).
In addition, the current thread will possess a valid Python thread state, allowing any further configuration data to be stored on the interpreter object rather than in C process globals.
Any call to Py_BeginInitialization() must have a matching call to Py_Finalize(). It is acceptable to skip calling Py_EndInitialization() in between (e.g. if attempting to read the configuration settings fails).
Determining the remaining configuration settings
The next step in the initialization sequence is to determine the full settings needed to complete the process. No changes are made to the interpreter state at this point. The core API for this step is:
int Py_ReadConfig(PyConfig *config);
The config argument should be a pointer to a config struct (which may be a temporary one stored on the C stack). For any already configured value (i.e. non-NULL pointer or non-negative numeric value), CPython will sanity check the supplied value, but otherwise accept it as correct.
A struct is used rather than a Python dictionary as the struct is easier to work with from C, the list of supported fields is fixed for a given CPython version and only a read-only view needs to be exposed to Python code (which is relatively straightforward, thanks to the infrastructure already put in place to expose sys.implementation).
Unlike Py_Initialize and Py_BeginInitialization, this call will raise an exception and report an error return rather than exhibiting fatal errors if a problem is found with the config data.
Any supported configuration setting which is not already set will be populated appropriately in the supplied configuration struct. The default configuration can be overridden entirely by setting the value before calling Py_ReadConfig. The provided value will then also be used in calculating any other settings derived from that value.
Alternatively, settings may be overridden after the Py_ReadConfig call (this can be useful if an embedding application wants to adjust a setting rather than replace it completely, such as removing sys.path[0]).
Merely reading the configuration has no effect on the interpreter state: it only modifies the passed in configuration struct. The settings are not applied to the running interpreter until the Py_EndInitialization call (see below).
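The "only fill in what isn't already set" behaviour can be illustrated with a cut-down stand-in struct. Only the NULL / -1 "not set" convention below comes from the PEP; the MiniConfig type, its fields, and the default values are invented for this sketch:

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for PyConfig, keeping only the "not set" convention:
 * NULL for pointer fields, -1 for numeric flags. */
typedef struct {
    const char *program_name;   /* pointer field: NULL == not set */
    int dont_write_bytecode;    /* numeric flag: -1 == not set */
} MiniConfig;

#define MINI_CONFIG_INIT { NULL, -1 }

/* Stand-in for Py_ReadConfig: populate only the fields that are still
 * unset, leaving any value the embedding application preconfigured
 * strictly alone. No "interpreter state" is touched here at all. */
static int read_config(MiniConfig *config) {
    if (config->program_name == NULL)
        config->program_name = "python3";  /* invented default */
    if (config->dont_write_bytecode < 0)
        config->dont_write_bytecode = 0;   /* default: do write bytecode */
    return 0;
}
```

An embedding application can thus either preconfigure a field before the read call (and have it feed into derived settings), or adjust the populated value afterwards, before the struct is finally applied.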
Supported configuration settings
The new PyConfig struct holds the settings required to complete the interpreter configuration. All fields are either pointers to Python data types (not set == NULL) or numeric flags (not set == -1):
/* Note: if changing anything in PyConfig, also update PyConfig_INIT */
typedef struct {
/* Argument processing */
PyListObject *raw_argv;
PyListObject *argv;
PyListObject *warnoptions; /* -W switch, PYTHONWARNINGS */
PyDictObject *xoptions; /* -X switch */
/* Filesystem locations */
PyUnicodeObject *program_name;
PyUnicodeObject *executable;
PyUnicodeObject *prefix; /* PYTHONHOME */
PyUnicodeObject *exec_prefix; /* PYTHONHOME */
PyUnicodeObject *base_prefix; /* pyvenv.cfg */
PyUnicodeObject *base_exec_prefix; /* pyvenv.cfg */
/* Site module */
int enable_site_config; /* -S switch (inverted) */
int no_user_site; /* -s switch, PYTHONNOUSERSITE */
/* Import configuration */
int dont_write_bytecode; /* -B switch, PYTHONDONTWRITEBYTECODE */
int ignore_module_case; /* PYTHONCASEOK */
PyListObject *import_path; /* PYTHONPATH (etc) */
/* Standard streams */
int use_unbuffered_io; /* -u switch, PYTHONUNBUFFEREDIO */
PyUnicodeObject *stdin_encoding; /* PYTHONIOENCODING */
PyUnicodeObject *stdin_errors; /* PYTHONIOENCODING */
PyUnicodeObject *stdout_encoding; /* PYTHONIOENCODING */
PyUnicodeObject *stdout_errors; /* PYTHONIOENCODING */
PyUnicodeObject *stderr_encoding; /* PYTHONIOENCODING */
PyUnicodeObject *stderr_errors; /* PYTHONIOENCODING */
/* Filesystem access */
PyUnicodeObject *fs_encoding;
/* Debugging output */
int debug_parser; /* -d switch, PYTHONDEBUG */
int verbosity; /* -v switch */
/* Code generation */
int bytes_warnings; /* -b switch */
int optimize; /* -O switch */
/* Signal handling */
int install_signal_handlers;
/* Implicit execution */
PyUnicodeObject *startup_file; /* PYTHONSTARTUP */
/* Main module
*
* If prepare_main is set, at most one of the main_* settings should
* be set before calling PyRun_PrepareMain (Py_ReadConfig will
* set one of them based on the command line arguments if prepare_main
* is non-zero when that API is called).
*/
int prepare_main;
PyUnicodeObject *main_source; /* -c switch */
PyUnicodeObject *main_path; /* filesystem path */
PyUnicodeObject *main_module; /* -m switch */
PyCodeObject *main_code; /* Run directly from a code object */
PyObject *main_stream; /* Run from stream */
int run_implicit_code; /* Run implicit code during prep */
/* Interactive main
*
* Note: Settings related to interactive mode are very much in flux.
*/
PyObject *prompt_stream; /* Output interactive prompt */
int show_banner; /* -q switch (inverted) */
int inspect_main; /* -i switch, PYTHONINSPECT */
} PyConfig;
/* Struct initialization is pretty ugly in C89. Avoiding this mess would
* be the most attractive aspect of using a PyDictObject* instead... */
#define _PyArgConfig_INIT NULL, NULL, NULL, NULL
#define _PyLocationConfig_INIT NULL, NULL, NULL, NULL, NULL, NULL
#define _PySiteConfig_INIT -1, -1
#define _PyImportConfig_INIT -1, -1, NULL
#define _PyStreamConfig_INIT -1, NULL, NULL, NULL, NULL, NULL, NULL
#define _PyFilesystemConfig_INIT NULL
#define _PyDebuggingConfig_INIT -1, -1, -1
#define _PyCodeGenConfig_INIT -1, -1
#define _PySignalConfig_INIT -1
#define _PyImplicitConfig_INIT NULL
#define _PyMainConfig_INIT -1, NULL, NULL, NULL, NULL, NULL, -1
#define _PyInteractiveConfig_INIT NULL, -1, -1
#define PyConfig_INIT {_PyArgConfig_INIT, _PyLocationConfig_INIT, \
    _PySiteConfig_INIT, _PyImportConfig_INIT, \
    _PyStreamConfig_INIT, _PyFilesystemConfig_INIT, \
    _PyDebuggingConfig_INIT, _PyCodeGenConfig_INIT, \
    _PySignalConfig_INIT, _PyImplicitConfig_INIT, \
    _PyMainConfig_INIT, _PyInteractiveConfig_INIT}
<TBD: did I miss anything?>
Completing the interpreter initialization
The final step in the initialization process is to actually put the configuration settings into effect and finish bootstrapping the interpreter up to full operation:
int Py_EndInitialization(const PyConfig *config);
Like Py_ReadConfig, this call will raise an exception and report an error return rather than exhibiting fatal errors if a problem is found with the config data.
All configuration settings are required - the configuration struct should always be passed through Py_ReadConfig() to ensure it is fully populated.
After a successful call, Py_IsInitializing() will be false, while Py_IsInitialized() will become true. The caveats described above for the interpreter during the initialization phase will no longer hold.
Attempting to call Py_EndInitialization() again when Py_IsInitializing() is false or Py_IsInitialized() is true is an error.
However, some metadata related to the __main__ module may still be incomplete:
- sys.argv[0] may not yet have its final value:
  - it will be -m when executing a module or package with CPython
  - it will be the same as sys.path[0] rather than the location of the __main__ module when executing a valid sys.path entry (typically a zipfile or directory)
  - otherwise, it will be accurate:
    - the script name if running an ordinary script
    - -c if executing a supplied string
    - "-" or the empty string if running from stdin
- the metadata in the __main__ module will still indicate it is a builtin module
This function will normally implicitly import site as its final operation (after Py_IsInitialized() is already set). Clearing the "enable_site_config" flag in the configuration settings will disable this behaviour, as well as eliminating any side effects on global state if import site is later explicitly executed in the process.
Preparing the main module
This subphase completes the population of the __main__ module related metadata, without actually starting execution of the __main__ module code.
It is handled by calling the following API:
int PyRun_PrepareMain();
The actual processing is driven by the main related settings stored in the interpreter state as part of the configuration struct.
If prepare_main is zero, this call does nothing.
If all of main_source, main_path, main_module, main_stream and main_code are NULL, this call does nothing.
If more than one of main_source, main_path, main_module, main_stream or main_code are set, RuntimeError will be reported.
If main_code is already set, then this call does nothing.
If main_stream is set, and run_implicit_code is also set, then the file identified in startup_file will be read, compiled and executed in the __main__ namespace.
If main_source, main_path or main_module are set, then this call will take whatever steps are needed to populate main_code:
- For main_source, the supplied string will be compiled and saved to main_code.
- For main_path:
- if the supplied path is recognised as a valid sys.path entry, it is inserted as sys.path[0], main_module is set to __main__ and processing continues as for main_module below.
- otherwise, path is read as a CPython bytecode file
- if that fails, it is read as a Python source file and compiled
- in the latter two cases, the code object is saved to main_code and __main__.__file__ is set appropriately
- For main_module:
- any parent package is imported
- the loader for the module is determined
- if the loader indicates the module is a package, add .__main__ to the end of main_module and try again (if the final name segment is already .__main__ then fail immediately)
- once the module source code is located, save the compiled module code as main_code and populate the following attributes in __main__ appropriately: __name__, __loader__, __file__, __cached__, __package__.
(Note: the behaviour described in this section isn't new, it's a write-up of the current behaviour of the CPython interpreter adjusted for the new configuration system)
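The mutual-exclusion rule among the main_* settings amounts to a few lines of checking. This is a stand-alone sketch of just that check, with the setting types collapsed to opaque pointers and the MainSettings/check_main_settings names invented for the example; it is not the real PyRun_PrepareMain:

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down view of the main-related settings: only "set or not set"
 * matters for the conflict check, so every field is an opaque pointer. */
typedef struct {
    int prepare_main;
    const void *main_source;
    const void *main_path;
    const void *main_module;
    const void *main_code;
    const void *main_stream;
} MainSettings;

/* Returns 0 when the call would do nothing or has exactly one way of
 * populating __main__; -1 stands in for reporting RuntimeError. */
static int check_main_settings(const MainSettings *cfg) {
    int n_set;
    if (!cfg->prepare_main)
        return 0;  /* prepare_main is zero: the call does nothing */
    n_set = (cfg->main_source != NULL) + (cfg->main_path != NULL)
          + (cfg->main_module != NULL) + (cfg->main_code != NULL)
          + (cfg->main_stream != NULL);
    return (n_set > 1) ? -1 : 0;  /* more than one setting: conflict */
}
```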
Executing the main module
This subphase covers the execution of the actual __main__ module code.
It is handled by calling the following API:
int PyRun_ExecMain();
The actual processing is driven by the main related settings stored in the interpreter state as part of the configuration struct.
If both main_stream and main_code are NULL, this call does nothing.
If both main_stream and main_code are set, RuntimeError will be reported.
If main_stream and prompt_stream are both set, main execution will be delegated to a new API:
int PyRun_InteractiveMain(PyObject *input, PyObject* output);
If main_stream is set and prompt_stream is NULL, main execution will be delegated to a new API:
int PyRun_StreamInMain(PyObject *input);
If main_code is set, main execution will be delegated to a new API:
int PyRun_CodeInMain(PyCodeObject *code);
After execution of main completes, if inspect_main is set, or the PYTHONINSPECT environment variable has been set, then PyRun_ExecMain will invoke PyRun_InteractiveMain(sys.__stdin__, sys.__stdout__).
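The delegation rules above reduce to a pure dispatch function. In the following sketch the enum values and pick_exec_target name are invented for illustration, with comments noting which proposed API each case corresponds to:

```c
#include <assert.h>
#include <stddef.h>

/* Which call PyRun_ExecMain would delegate to, per the rules above. */
enum exec_target {
    EXEC_NOTHING,      /* neither main_stream nor main_code is set */
    EXEC_CONFLICT,     /* both are set: RuntimeError */
    EXEC_INTERACTIVE,  /* PyRun_InteractiveMain(input, output) */
    EXEC_STREAM,       /* PyRun_StreamInMain(input) */
    EXEC_CODE          /* PyRun_CodeInMain(code) */
};

/* Pure dispatch logic; streams and code are reduced to opaque pointers. */
static enum exec_target
pick_exec_target(const void *main_stream, const void *main_code,
                 const void *prompt_stream)
{
    if (main_stream == NULL && main_code == NULL)
        return EXEC_NOTHING;
    if (main_stream != NULL && main_code != NULL)
        return EXEC_CONFLICT;
    if (main_stream != NULL)
        return (prompt_stream != NULL) ? EXEC_INTERACTIVE : EXEC_STREAM;
    return EXEC_CODE;
}
```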
Internal Storage of Configuration Data
The interpreter state will be updated to include details of the configuration settings supplied during initialization by extending the interpreter state object with an embedded copy of the PyCoreConfig and PyConfig structs.
For debugging purposes, the configuration settings will be exposed as a sys._configuration simple namespace (similar to sys.flags and sys.implementation). Field names will match those in the configuration structs, except for hash_seed, which will be deliberately excluded.
An underscored attribute is chosen deliberately, as these configuration settings are part of the CPython implementation, rather than part of the Python language definition. If settings are needed to support cross-implementation compatibility in the standard library, then those should be agreed with the other implementations and exposed as new required attributes on sys.implementation, as described in PEP 421.
These are snapshots of the initial configuration settings. They are not modified by the interpreter during runtime (except as noted above).
Creating and Configuring Subinterpreters
As the new configuration settings are stored in the interpreter state, they need to be initialised when a new subinterpreter is created. This turns out to be trickier than one might think due to the possibility of PyThreadState_Swap(NULL) (a case which is fortunately exercised by CPython's own embedding tests, allowing this problem to be detected during development).
To provide a straightforward solution for this case, the PEP proposes to add a new API:
PyInterpreterState *PyInterpreterState_Main();
This will be a counterpart to PyInterpreterState_Head(), reporting the oldest currently existing interpreter rather than the newest. If Py_NewInterpreter() is called from a thread with an existing thread state, then the interpreter configuration for that thread will be used when initialising the new subinterpreter. If there is no current thread state, the configuration from PyInterpreterState_Main() will be used.
While the existing PyInterpreterState_Head() API could be used instead, that reference changes as subinterpreters are created and destroyed, while PyInterpreterState_Main() will always refer to the initial interpreter state created in Py_BeginInitialization().
A new constraint is also added to the embedding API: attempting to delete the main interpreter while subinterpreters still exist will now be a fatal error.
Stable ABI
Most of the APIs proposed in this PEP are excluded from the stable ABI, as embedding a Python interpreter involves a much higher degree of coupling than merely writing an extension.
The only newly exposed API that will be part of the stable ABI is the Py_IsInitializing() query.
Build time configuration
This PEP makes no changes to the handling of build time configuration settings, and thus has no effect on the contents of sys.implementation or the result of sysconfig.get_config_vars().
Backwards Compatibility
Backwards compatibility will be preserved primarily by ensuring that Py_ReadConfig() interrogates all the previously defined configuration settings stored in global variables and environment variables, and that Py_EndInitialization() writes affected settings back to the relevant locations.
One acknowledged incompatibility is that some environment variables which are currently read lazily may instead be read once during interpreter initialization. As the PEP matures, these will be discussed in more detail on a case-by-case basis. The environment variables which are currently known to be looked up dynamically are:
- PYTHONCASEOK: writing to os.environ['PYTHONCASEOK'] will no longer dynamically alter the interpreter's handling of filename case differences on import (TBC)
- PYTHONINSPECT: os.environ['PYTHONINSPECT'] will still be checked after execution of the __main__ module terminates
The Py_Initialize() style of initialization will continue to be supported. It will use (at least some elements of) the new API internally, but will continue to exhibit the same behaviour as it does today, ensuring that sys.argv is not populated until a subsequent PySys_SetArgv call. All APIs that currently support being called prior to Py_Initialize() will continue to do so, and will also support being called prior to Py_BeginInitialization().
To minimise unnecessary code churn, and to ensure that backwards compatibility is well tested, the main CPython executable may continue to use some elements of the old style initialization API. (very much TBC)
A System Python Executable
When executing system utilities with administrative access to a system, many of the default behaviours of CPython are undesirable, as they may allow untrusted code to execute with elevated privileges. The most problematic aspects are that user site directories are enabled, that environment variables are trusted, and that the directory containing the executed file is placed at the beginning of the import path.
Issue 16499 [6] proposes adding a -I option to change the behaviour of the normal CPython executable, but this is a hard to discover solution (and adds yet another option to an already complex CLI). This PEP proposes to instead add a separate pysystem executable.
Currently, providing a separate executable with different default behaviour would be prohibitively hard to maintain. One of the goals of this PEP is to make it possible to replace much of the hard to maintain bootstrapping code with more normal CPython code, as well as making it easier for a separate application to make use of key components of Py_Main. Including this change in the PEP is designed to help avoid acceptance of a design that sounds good in theory but proves to be problematic in practice.
Cleanly supporting this kind of "alternate CLI" is the main reason for the proposed changes to better expose the core logic for deciding between the different execution modes supported by CPython:
- script execution
- directory/zipfile execution
- command execution ("-c" switch)
- module or package execution ("-m" switch)
- execution from stdin (non-interactive)
- interactive stdin
Actually implementing this may also reveal the need for some better argument parsing infrastructure for use during the initializing phase.
Open Questions
- Error details for Py_ReadConfiguration and Py_EndInitialization (these should become clear as the implementation progresses)
- Should there be Py_PreparingMain() and Py_RunningMain() query APIs?
- Should the answer to Py_IsInitialized() be exposed via the sys module?
- Is initialisation of the PyConfig struct too unwieldy to be maintainable? Would a Python dictionary be a better choice, despite being harder to work with from C code?
- Would it be better to manage the flag variables in PyConfig as Python integers or as "negative means false, positive means true, zero means not set" so the struct can be initialized with a simple memset(&config, 0, sizeof(*config)), eliminating the need to update both PyConfig and PyConfig_INIT when adding new fields?
- The name of the new system Python executable is a bikeshed waiting to be painted. The 3 options considered so far are spython, pysystem and python-minimal. The PEP text reflects my current preferred choice (pysystem).
Implementation
The reference implementation is being developed as a feature branch in my BitBucket sandbox [2]. Pull requests to fix the inevitably broken Windows builds are welcome, but the basic design is still in too much flux for other pull requests to be feasible just yet. Once the overall design settles down and it's a matter of migrating individual settings over to the new design, that level of collaboration should become more practical.
As the number of application binaries created by the build process is now four, the reference implementation also creates a new top level "Apps" directory in the CPython source tree. The source files for the main python binary and the new pysystem binary will be located in that directory. The source files for the _freeze_importlib binary and the _testembed binary have been moved out of the Modules directory (which is intended for CPython builtin and extension modules) and into the Tools directory.
The Status Quo
The current mechanisms for configuring the interpreter have accumulated in a fairly ad hoc fashion over the past 20+ years, leading to a rather inconsistent interface with varying levels of documentation.
(Note: some of the info below could probably be cleaned up and added to the C API documentation for at least 3.3 - it's all CPython specific, so it doesn't belong in the language reference)
Ignoring Environment Variables
The -E command line option allows all environment variables to be ignored when initializing the Python interpreter. An embedding application can enable this behaviour by setting Py_IgnoreEnvironmentFlag before calling Py_Initialize().
In the CPython source code, the Py_GETENV macro implicitly checks this flag, and always produces NULL if it is set.
<TBD: I believe PYTHONCASEOK is checked regardless of this setting > <TBD: Does -E also ignore Windows registry keys? >
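The effect of -E can be observed from Python itself. A minimal sketch, using sys.flags.dont_write_bytecode as an example of an environment-controlled setting:

```python
import os
import subprocess
import sys

code = "import sys; print(sys.flags.dont_write_bytecode)"
env = dict(os.environ, PYTHONDONTWRITEBYTECODE="1")

# Without -E, the environment variable is honoured...
normal = subprocess.check_output(
    [sys.executable, "-c", code], env=env, text=True).strip()
# ...while with -E, all PYTHON* environment variables are ignored.
isolated = subprocess.check_output(
    [sys.executable, "-E", "-c", code], env=env, text=True).strip()
```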
Randomised Hashing
The randomised hashing is controlled via the -R command line option (in releases prior to 3.3), as well as the PYTHONHASHSEED environment variable.
In Python 3.3, only the environment variable remains relevant. It can be used to disable randomised hashing (by using a seed value of 0) or else to force a specific hash value (e.g. for repeatability of testing, or to share hash values between processes).
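The repeatability use case can be checked from Python. A minimal sketch, spawning two fresh interpreters with the same seed and comparing their hash values (the helper name is illustrative):

```python
import os
import subprocess
import sys

def hash_in_subprocess(seed):
    # Computes hash('example') in a fresh interpreter with the given
    # PYTHONHASHSEED value.
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output(
        [sys.executable, "-c", "print(hash('example'))"], env=env)
    return int(out)

# With a fixed seed, separate processes agree on hash values.
repeat_a = hash_in_subprocess(42)
repeat_b = hash_in_subprocess(42)
```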
However, embedding applications must use the Py_HashRandomizationFlag to explicitly request hash randomisation (CPython sets it in Py_Main() rather than in Py_Initialize()).
The new configuration API should make it straightforward for an embedding application to reuse the PYTHONHASHSEED processing with a text based configuration setting provided by other means (e.g. a config file or separate environment variable).
Locating Python and the standard library
The location of the Python binary and the standard library is influenced by several elements. The algorithm used to perform the calculation is not documented anywhere other than in the source code [3] [4]. Even that description is incomplete, as it was not updated for the virtual environment support added in Python 3.3 (detailed in PEP 405).
These calculations are affected by the following function calls (made prior to calling Py_Initialize()) and environment variables:
- Py_SetProgramName()
- Py_SetPythonHome()
- PYTHONHOME
The filesystem is also inspected for pyvenv.cfg files (see PEP 405) or, failing that, a lib/os.py (Windows) or lib/python$VERSION/os.py file.
The build time settings for PREFIX and EXEC_PREFIX are also relevant, as are some registry settings on Windows. The hardcoded fallbacks are based on the layout of the CPython source tree and build output when working in a source checkout.
Configuring sys.path
An embedding application may call Py_SetPath() prior to Py_Initialize() to completely override the calculation of sys.path. It is not straightforward to only allow some of the calculations, as modifying sys.path after initialization is already complete means those modifications will not be in effect when standard library modules are imported during the startup sequence.
If Py_SetPath() is not used prior to the first call to Py_GetPath() (implicit in Py_Initialize()), then it builds on the location data calculations above to calculate suitable path entries, along with the PYTHONPATH environment variable.
<TBD: On Windows, there's also a bunch of stuff to do with the registry>
The site module, which is implicitly imported at startup (unless disabled via the -S option) adds additional paths to this initial set of paths, as described in its documentation [5].
The -s command line option can be used to exclude the user site directory from the list of directories added. Embedding applications can control this by setting the Py_NoUserSiteDirectory global variable.
The following commands can be used to check the default path configurations for a given Python executable on a given system:
- ./python -c "import sys, pprint; pprint.pprint(sys.path)" - standard configuration
- ./python -s -c "import sys, pprint; pprint.pprint(sys.path)" - user site directory disabled
- ./python -S -c "import sys, pprint; pprint.pprint(sys.path)" - all site path modifications disabled
(Note: you can see similar information using -m site instead of -c, but this is slightly misleading as it calls os.path.abspath on all of the path entries, making relative path entries look absolute. Using the site module also causes problems in the last case, as on Python versions prior to 3.3, explicitly importing site will carry out the path modifications -S avoids, while on 3.3+ combining -m site with -S currently fails)
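The three commands above can also be run from a single script. A sketch using subprocess to collect each configuration (the helper name is illustrative):

```python
import ast
import subprocess
import sys

def get_sys_path(*options):
    # Runs this interpreter with the given options and parses the
    # printed sys.path list back into a Python list.
    cmd = [sys.executable, *options, "-c", "import sys; print(sys.path)"]
    return ast.literal_eval(subprocess.check_output(cmd, text=True))

standard = get_sys_path()          # standard configuration
no_user_site = get_sys_path("-s")  # user site directory disabled
no_site = get_sys_path("-S")       # all site path modifications disabled
```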
The calculation of sys.path[0] is comparatively straightforward:
- For an ordinary script (Python source or compiled bytecode), sys.path[0] will be the directory containing the script.
- For a valid sys.path entry (typically a zipfile or directory), sys.path[0] will be that path.
- For an interactive session, running from stdin or when using the -c or -m switches, sys.path[0] will be the empty string, which the import system interprets as allowing imports from the current directory.
Configuring sys.argv
Unlike most other settings discussed in this PEP, sys.argv is not set implicitly by Py_Initialize(). Instead, it must be set via an explicit call to PySys_SetArgv().
CPython calls this in Py_Main() after calling Py_Initialize(). The calculation of sys.argv[1:] is straightforward: they're the command line arguments passed after the script name or the argument to the -c or -m options.
The calculation of sys.argv[0] is a little more complicated:
- For an ordinary script (source or bytecode), it will be the script name
- For a sys.path entry (typically a zipfile or directory) it will initially be the zipfile or directory name, but will later be changed by the runpy module to the full path to the imported __main__ module.
- For a module specified with the -m switch, it will initially be the string "-m", but will later be changed by the runpy module to the full path to the executed module.
- For a package specified with the -m switch, it will initially be the string "-m", but will later be changed by the runpy module to the full path to the executed __main__ submodule of the package.
- For a command executed with -c, it will be the string "-c"
- For explicitly requested input from stdin, it will be the string "-"
- Otherwise, it will be the empty string
Embedding applications must call PySys_SetArgv themselves. The CPython logic for doing so is part of Py_Main() and is not exposed separately. However, the runpy module does provide roughly equivalent logic in runpy.run_module and runpy.run_path.
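For illustration, a minimal run of a throwaway script through runpy.run_path (note that, unlike the -m machinery in Py_Main, the public runpy functions execute the code but do not themselves modify sys.argv):

```python
import os
import runpy
import tempfile

# Write a throwaway script, then execute it the way runpy would.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("executed_as = __name__\n")
    path = f.name
try:
    ns = runpy.run_path(path)  # run_name defaults to '<run_path>'
finally:
    os.unlink(path)
```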
Other configuration settings
TBD: Cover the initialization of the following in more detail:
- Completely disabling the import system
- The initial warning system state:
  - sys.warnoptions
  - (-W option, PYTHONWARNINGS)
- Arbitrary extended options (e.g. to automatically enable faulthandler):
  - sys._xoptions
  - (-X option)
- The filesystem encoding used by:
  - sys.getfilesystemencoding
  - os.fsencode
  - os.fsdecode
- The IO encoding and buffering used by:
  - sys.stdin
  - sys.stdout
  - sys.stderr
  - (-u option, PYTHONIOENCODING, PYTHONUNBUFFEREDIO)
- Whether or not to implicitly cache bytecode files:
  - sys.dont_write_bytecode
  - (-B option, PYTHONDONTWRITEBYTECODE)
- Whether or not to enforce correct case in filenames on case-insensitive platforms:
  - os.environ["PYTHONCASEOK"]
- The other settings exposed to Python code in sys.flags:
- debug (Enable debugging output in the pgen parser)
- inspect (Enter interactive interpreter after __main__ terminates)
- interactive (Treat stdin as a tty)
- optimize (__debug__ status, write .pyc or .pyo, strip doc strings)
- no_user_site (don't add the user site directory to sys.path)
- no_site (don't implicitly import site during startup)
- ignore_environment (whether environment vars are used during config)
- verbose (enable all sorts of random output)
- bytes_warning (warnings/errors for implicit str/bytes interaction)
- quiet (disable banner output even if verbose is also enabled or stdin is a tty and the interpreter is launched in interactive mode)
- Whether or not CPython's signal handlers should be installed
Much of the configuration of CPython is currently handled through C level global variables:
- Py_BytesWarningFlag (-b option)
- Py_DebugFlag (-d option)
- Py_InspectFlag (-i option, PYTHONINSPECT)
- Py_InteractiveFlag (property of stdin, cannot be overridden)
- Py_OptimizeFlag (-O option, PYTHONOPTIMIZE)
- Py_DontWriteBytecodeFlag (-B option, PYTHONDONTWRITEBYTECODE)
- Py_NoUserSiteDirectory (-s option, PYTHONNOUSERSITE)
- Py_NoSiteFlag (-S option)
- Py_UnbufferedStdioFlag (-u option, PYTHONUNBUFFEREDIO)
- Py_VerboseFlag (-v option, PYTHONVERBOSE)
For the above variables, the conversion of command line options and environment variables to C global variables is handled by Py_Main, so each embedding application must set those appropriately in order to change them from their defaults.
Some configuration can only be provided as OS level environment variables:
- PYTHONSTARTUP
- PYTHONCASEOK
- PYTHONIOENCODING
The Py_InitializeEx() API also accepts a boolean flag to indicate whether or not CPython's signal handlers should be installed.
Finally, some interactive behaviour (such as printing the introductory banner) is triggered only when standard input is reported as a terminal connection by the operating system.
TBD: Document how the "-x" option is handled (skips processing of the first comment line in the main script)
Also see detailed sequence of operations notes at [1]
References
| [1] | CPython interpreter initialization notes (http://wiki.python.org/moin/CPythonInterpreterInitialization) |
| [2] | BitBucket Sandbox (https://bitbucket.org/ncoghlan/cpython_sandbox/compare/pep432_modular_bootstrap..default#commits) |
| [3] | *nix getpath implementation (http://hg.python.org/cpython/file/default/Modules/getpath.c) |
| [4] | Windows getpath implementation (http://hg.python.org/cpython/file/default/PC/getpathp.c) |
| [5] | Site module documentation (http://docs.python.org/3/library/site.html) |
| [6] | Proposed CLI option for isolated mode (http://bugs.python.org/issue16499) |
| [7] | Adding to sys.path on the command line (http://mail.python.org/pipermail/python-ideas/2010-October/008299.html) (http://mail.python.org/pipermail/python-ideas/2012-September/016128.html) |
| [8] | Control sys.path[0] initialisation (http://bugs.python.org/issue13475) |
| [9] | Enabling code coverage in subprocesses when testing (http://bugs.python.org/issue14803) |
| [10] | Problems with PYTHONIOENCODING in Blender (http://bugs.python.org/issue16129) |
Copyright
This document has been placed in the public domain.
pep-0433 Easier suppression of file descriptor inheritance
| PEP: | 433 |
|---|---|
| Title: | Easier suppression of file descriptor inheritance |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Victor Stinner <victor.stinner at gmail.com> |
| Status: | Superseded |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 10-Jan-2013 |
| Python-Version: | 3.4 |
| Superseded-By: | 446 |
Contents
Abstract
Add a new optional cloexec parameter on functions creating file descriptors, add different ways to change default values of this parameter, and add four new functions:
- os.get_cloexec(fd)
- os.set_cloexec(fd, cloexec=True)
- sys.getdefaultcloexec()
- sys.setdefaultcloexec(cloexec)
Rationale
A file descriptor has a close-on-exec flag which indicates if the file descriptor will be inherited or not.
On UNIX, if the close-on-exec flag is set, the file descriptor is not inherited: it will be closed at the execution of child processes; otherwise the file descriptor is inherited by child processes.
On Windows, if the close-on-exec flag is set, the file descriptor is not inherited; the file descriptor is inherited by child processes if the close-on-exec flag is cleared and if CreateProcess() is called with the bInheritHandles parameter set to TRUE (when subprocess.Popen is created with close_fds=False for example). Windows does not have a "close-on-exec" flag but an inheritance flag, which is just the opposite value. For example, setting the close-on-exec flag means clearing the HANDLE_FLAG_INHERIT flag of a handle.
Status in Python 3.3
On UNIX, the subprocess module closes file descriptors greater than 2 by default since Python 3.2 [1]. All file descriptors created by the parent process are automatically closed in the child process.
xmlrpc.server.SimpleXMLRPCServer sets the close-on-exec flag of the listening socket, but the parent class socketserver.TCPServer does not set this flag.
There are other cases where a subprocess is created or a new program is executed without file descriptors being closed: the os.spawn*() and os.exec*() function families, and third party modules calling exec() or fork() + exec(). In these cases, file descriptors are shared between the parent and child processes, which is usually unexpected and causes various issues.
This PEP proposes to continue the work started with the change in the subprocess in Python 3.2, to fix the issue in any code, and not just code using subprocess.
Inherited file descriptors issues
Closing the file descriptor in the parent process does not close the related resource (file, socket, ...) because it is still open in the child process.
The listening socket of TCPServer is not closed on exec(): the child process is able to accept connections from new clients; if the parent closes the listening socket and creates a new listening socket on the same address, it would get an "address already in use" error.
Not closing file descriptors can lead to resource exhaustion: even if the parent closes all files, creating a new file descriptor may fail with "too many files" because files are still open in the child process.
See also the following issues:
- Issue #2320: Race condition in subprocess using stdin (2008)
- Issue #3006: subprocess.Popen causes socket to remain open after close (2008)
- Issue #7213: subprocess leaks open file descriptors between Popen instances causing hangs (2009)
- Issue #12786: subprocess wait() hangs when stdin is closed (2011)
Security
Leaking file descriptors is a major security vulnerability. An untrusted child process can read sensitive data like passwords and take control of the parent process through leaked file descriptors. It is for example a known vulnerability to escape from a chroot.
See also the CERT recommendation: FIO42-C. Ensure files are properly closed when they are no longer needed.
Examples of vulnerabilities:
- OpenSSH Security Advisory: portable-keysign-rand-helper.adv (April 2011)
- CWE-403: Exposure of File Descriptor to Unintended Control Sphere (2008)
- Hijacking Apache https by mod_php (Dec 2003)
- Apache: Apr should set FD_CLOEXEC if APR_FOPEN_NOCLEANUP is not set (fixed in 2009)
- PHP: system() (and similar) don't cleanup opened handles of Apache (not fixed in January 2013)
Atomicity
Using fcntl() to set the close-on-exec flag is not safe in a multithreaded application. If a thread calls fork() and exec() between the creation of the file descriptor and the call to fcntl(fd, F_SETFD, new_flags), the file descriptor will be inherited by the child process. Modern operating systems offer functions to set the flag during the creation of the file descriptor, which avoids the race condition.
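The non-atomic fcntl() sequence described above looks like this in Python (a sketch; because the flag is set after the descriptor already exists, it is only safe in single-threaded code):

```python
import fcntl
import os

r, w = os.pipe()
flags = fcntl.fcntl(r, fcntl.F_GETFD)                     # 1st call: read flags
fcntl.fcntl(r, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)   # 2nd call: set them
# A fork() + exec() on another thread between os.pipe() and F_SETFD
# would leak the descriptor into the child process.
flag_set = bool(fcntl.fcntl(r, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)
os.close(r)
os.close(w)
```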
Portability
Python 3.2 added the socket.SOCK_CLOEXEC flag; Python 3.3 added the os.O_CLOEXEC flag and the os.pipe2() function. It is already possible in Python 3.3 to atomically set the close-on-exec flag when opening a file and when creating a pipe or socket.
The problem is that these flags and functions are not portable: only recent versions of operating systems support them. The O_CLOEXEC and SOCK_CLOEXEC flags are ignored by old Linux versions, so the FD_CLOEXEC flag must be checked using fcntl(fd, F_GETFD). If the kernel ignores the O_CLOEXEC or SOCK_CLOEXEC flag, a call to fcntl(fd, F_SETFD, flags) is required to set the close-on-exec flag.
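The fallback logic can be sketched in Python: open with O_CLOEXEC, then verify via F_GETFD whether the kernel actually honoured the flag, and fall back to fcntl() if it did not:

```python
import fcntl
import os

fd = os.open(os.devnull, os.O_RDONLY | os.O_CLOEXEC)
honoured = bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)
if not honoured:
    # Old kernels silently ignore O_CLOEXEC; fall back to the (non-atomic)
    # fcntl() sequence to set the flag anyway.
    fcntl.fcntl(fd, fcntl.F_SETFD,
                fcntl.fcntl(fd, fcntl.F_GETFD) | fcntl.FD_CLOEXEC)
final_state = bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)
os.close(fd)
```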
Note
OpenBSD older than 5.2 does not close file descriptors with the close-on-exec flag set if fork() is used before exec(), but it works correctly if exec() is called without fork(). Try openbsd_bug.py.
Scope
Applications still have to explicitly close file descriptors after a fork(). The close-on-exec flag only closes file descriptors on exec(), and so only after fork() + exec().
This PEP only changes the close-on-exec flag of file descriptors created by the Python standard library, or by modules using the standard library. Third party modules not using the standard library should be modified to conform to this PEP; the new os.set_cloexec() function can be used, for example.
Note
See Close file descriptors after fork for a possible solution for fork() without exec().
Proposal
Add a new optional cloexec parameter on functions creating file descriptors and different ways to change the default value of this parameter.
Add new functions:
- os.get_cloexec(fd:int) -> bool: get the close-on-exec flag of a file descriptor. Not available on all platforms.
- os.set_cloexec(fd:int, cloexec:bool=True): set or clear the close-on-exec flag on a file descriptor. Not available on all platforms.
- sys.getdefaultcloexec() -> bool: get the current default value of the cloexec parameter
- sys.setdefaultcloexec(cloexec: bool): set the default value of the cloexec parameter
Add a new optional cloexec parameter to:
- asyncore.dispatcher.create_socket()
- io.FileIO
- io.open()
- open()
- os.dup()
- os.dup2()
- os.fdopen()
- os.open()
- os.openpty()
- os.pipe()
- select.devpoll()
- select.epoll()
- select.kqueue()
- socket.socket()
- socket.socket.accept()
- socket.socket.dup()
- socket.socket.fromfd()
- socket.socketpair()
The default value of the cloexec parameter is sys.getdefaultcloexec().
Add a new command line option -e and an environment variable PYTHONCLOEXEC to set the close-on-exec flag by default.
subprocess clears the close-on-exec flag of file descriptors of the pass_fds parameter.
All functions creating file descriptors in the standard library must respect the default value of the cloexec parameter: sys.getdefaultcloexec().
File descriptors 0 (stdin), 1 (stdout) and 2 (stderr) are expected to be inherited, but Python does not handle them differently. When os.dup2() is used to replace standard streams, cloexec=False must be specified explicitly.
Drawbacks of the proposal:
- It is no longer possible to know whether the close-on-exec flag will be set on a newly created file descriptor just by reading the source code.
- If the inheritance of a file descriptor matters, the cloexec parameter must now be specified explicitly; otherwise the library or the application may not work, depending on the default value of the cloexec parameter.
Alternatives
Inheritance enabled by default, default not configurable
Add a new optional parameter cloexec on functions creating file descriptors. The default value of the cloexec parameter is False, and this default cannot be changed. File descriptor inheritance enabled by default is also the default behaviour on POSIX and on Windows. This alternative is the most conservative option.
This option does not solve the issues listed in the Rationale section; it only provides a helper to fix them. All functions creating file descriptors would have to be modified to set cloexec=True in each module used by an application to fix all these issues.
Inheritance enabled by default, default can only be set to True
This alternative is based on the proposal: the only difference is that sys.setdefaultcloexec() does not take any argument, it can only be used to set the default value of the cloexec parameter to True.
Disable inheritance by default
This alternative is based on the proposal: the only difference is that the default value of the cloexec parameter is True (instead of False).
If a file must be inherited by child processes, cloexec=False parameter can be used.
Advantages of setting close-on-exec flag by default:
- There are far more programs that are bitten by FD inheritance upon exec (see Inherited file descriptors issues and Security) than programs relying on it (see Applications using inheritance of file descriptors).
Drawbacks of setting close-on-exec flag by default:
- It violates the principle of least surprise. Developers using the os module may expect that Python respects the POSIX standard and so that close-on-exec flag is not set by default.
- The os module is written as a thin wrapper to system calls (to functions of the C standard library). If atomic flags to set close-on-exec flag are not supported (see Appendix: Operating system support), a single Python function call may call 2 or 3 system calls (see Performances section).
- Extra system calls, if any, may slow down Python: see Performances.
Backward compatibility: only a few programs rely on inheritance of file descriptors, and they usually pass only a few file descriptors, often just one. These programs will fail immediately with an EBADF error, and it will be simple to fix them: add the cloexec=False parameter or use os.set_cloexec(fd, False).
The subprocess module will be changed anyway to clear the close-on-exec flag on file descriptors listed in the pass_fds parameter of the Popen constructor. So it is possible that these programs will not need any fix if they use the subprocess module.
Close file descriptors after fork
This PEP does not fix issues with applications using fork() without exec(). Python needs a generic mechanism to register callbacks which would be called after a fork, see #16500: Add an atfork module [2]. Such a registry could be used to close file descriptors just after a fork().
Drawbacks:
- It does not solve the problem on Windows: fork() does not exist on Windows
- This alternative does not solve the problem for programs using exec() without fork().
- A third party module may call the C function fork() directly, which will not run the "atfork" callbacks.
- All functions creating file descriptors must be changed to register a callback and then unregister their callback when the file is closed. Or a list of all open file descriptors must be maintained.
- The operating system is a better place than Python to automatically close file descriptors. For example, it is not easy to avoid a race condition between closing the file and unregistering the callback that closes the file.
open(): add "e" flag to mode
A new "e" mode would set close-on-exec flag (best-effort).
This alternative only solves the problem for open(); socket.socket() and os.pipe() do not have a mode parameter, for example.
Since version 2.7, the GNU libc supports the "e" flag for fopen(). It uses O_CLOEXEC if available, or falls back to fcntl(fd, F_SETFD, FD_CLOEXEC). With Visual Studio, fopen() accepts an "N" flag which uses O_NOINHERIT.
Bikeshedding on the name of the new parameter
- inherit, inherited: closer to Windows definition
- sensitive
- sterile: "Does not produce offspring."
Applications using inheritance of file descriptors
Most developers don't know that file descriptors are inherited by default. Most programs do not rely on inheritance of file descriptors. For example, subprocess.Popen was changed in Python 3.2 to close all file descriptors greater than 2 in the child process by default. No user has complained about this behavior change yet.
Network servers using fork() may want to pass the client socket to the child process. For example, on UNIX a CGI server passes the client socket through file descriptors 0 (stdin) and 1 (stdout) using dup2().
To access a restricted resource, like creating a socket listening on a TCP port lower than 1024 or reading a file containing sensitive data like passwords, a common practice is: start as the root user, create a file descriptor, create a child process, drop privileges (e.g. change the current user), pass the file descriptor to the child process and exit the parent process.
Security is very important in such a use case: leaking another file descriptor would be a critical security vulnerability (see Security). The root process may not exit but instead monitor the child process, restarting a new child process and passing the same file descriptor if the previous child process crashes.
Example of programs taking file descriptors from the parent process using a command line option:
- gpg: --status-fd <fd>, --logger-fd <fd>, etc.
- openssl: -pass fd:<fd>
- qemu: -add-fd <fd>
- valgrind: --log-fd=<fd>, --input-fd=<fd>, etc.
- xterm: -S <fd>
On Linux, it is possible to use "/dev/fd/<fd>" filename to pass a file descriptor to a program expecting a filename.
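A sketch of this technique using cat as the program expecting a filename (Linux-specific; it relies on subprocess's pass_fds parameter to keep the descriptor open and inheritable in the child):

```python
import os
import subprocess

r, w = os.pipe()
os.write(w, b"hello\n")
os.close(w)
# pass_fds keeps r open in the child despite close_fds=True (the default),
# so the child can read the pipe through the /dev/fd/<fd> filename.
out = subprocess.check_output(["cat", "/dev/fd/%d" % r], pass_fds=[r])
os.close(r)
```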
Performances
Setting close-on-exec flag may require additional system calls for each creation of new file descriptors. The number of additional system calls depends on the method used to set the flag:
- O_NOINHERIT: no additional system call
- O_CLOEXEC: one additional system call, but only at the creation of the first file descriptor, to check if the flag is supported. If the flag is not supported, Python has to fall back to the next method.
- ioctl(fd, FIOCLEX): one additional system call per file descriptor
- fcntl(fd, F_SETFD, flags): two additional system calls per file descriptor, one to get old flags and one to set new flags
On Linux, setting the close-on-exec flag has a low performance overhead. Results of bench_cloexec.py on Linux 3.6:
- close-on-exec flag not set: 7.8 us
- O_CLOEXEC: 1% slower (7.9 us)
- ioctl(): 3% slower (8.0 us)
- fcntl(): 3% slower (8.0 us)
Implementation
os.get_cloexec(fd)
Get the close-on-exec flag of a file descriptor.
Pseudo-code:
if os.name == 'nt':
    def get_cloexec(fd):
        handle = _winapi._get_osfhandle(fd)
        flags = _winapi.GetHandleInformation(handle)
        return not (flags & _winapi.HANDLE_FLAG_INHERIT)
else:
    try:
        import fcntl
    except ImportError:
        pass
    else:
        def get_cloexec(fd):
            flags = fcntl.fcntl(fd, fcntl.F_GETFD)
            return bool(flags & fcntl.FD_CLOEXEC)
os.set_cloexec(fd, cloexec=True)
Set or clear the close-on-exec flag on a file descriptor. The flag is set after the creation of the file descriptor and so it is not atomic.
Pseudo-code:
if os.name == 'nt':
    def set_cloexec(fd, cloexec=True):
        handle = _winapi._get_osfhandle(fd)
        mask = _winapi.HANDLE_FLAG_INHERIT
        if cloexec:
            flags = 0
        else:
            flags = mask
        _winapi.SetHandleInformation(handle, mask, flags)
else:
    fcntl = None
    ioctl = None
    try:
        import ioctl
    except ImportError:
        try:
            import fcntl
        except ImportError:
            pass
    if ioctl is not None and hasattr(ioctl, 'FIOCLEX'):
        def set_cloexec(fd, cloexec=True):
            if cloexec:
                ioctl.ioctl(fd, ioctl.FIOCLEX)
            else:
                ioctl.ioctl(fd, ioctl.FIONCLEX)
    elif fcntl is not None:
        def set_cloexec(fd, cloexec=True):
            flags = fcntl.fcntl(fd, fcntl.F_GETFD)
            if cloexec:
                flags |= fcntl.FD_CLOEXEC
            else:
                flags &= ~fcntl.FD_CLOEXEC
            fcntl.fcntl(fd, fcntl.F_SETFD, flags)
ioctl is preferred over fcntl because it requires only one syscall, instead of two syscalls for fcntl.
Note
fcntl(fd, F_SETFD, flags) only supports one flag (FD_CLOEXEC), so it would be possible to avoid fcntl(fd, F_GETFD). But doing so might drop other flags in the future, so it is safer to keep the two function calls.
Note
The fopen() function of the GNU libc ignores the error if fcntl(fd, F_SETFD, flags) fails.
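In current Python, the two-syscall get/set dance described in the note above can be written with the stdlib fcntl module. A POSIX-only sketch (note that on Python 3.4+, os.open() already sets the flag by default per PEP 446, so this is purely illustrative):

```python
import fcntl
import os

fd = os.open(os.devnull, os.O_RDONLY)
# Syscall 1: read the current descriptor flags.
flags = fcntl.fcntl(fd, fcntl.F_GETFD)
# Syscall 2: write them back with FD_CLOEXEC or'ed in, preserving any
# other flags a future kernel may define.
fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
is_cloexec = bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)
os.close(fd)
print(is_cloexec)   # True
```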
open()
- Windows: open() with O_NOINHERIT flag [atomic]
- open() with O_CLOEXEC flag [atomic]
- open() + os.set_cloexec(fd, True) [best-effort]
os.dup()
- Windows: DuplicateHandle() [atomic]
- fcntl(fd, F_DUPFD_CLOEXEC) [atomic]
- dup() + os.set_cloexec(fd, True) [best-effort]
os.dup2()
- fcntl(fd, F_DUP2FD_CLOEXEC, fd2) [atomic]
- dup3() with O_CLOEXEC flag [atomic]
- dup2() + os.set_cloexec(fd2, True) [best-effort]
os.pipe()
- Windows: CreatePipe() with SECURITY_ATTRIBUTES.bInheritHandle=FALSE, or _pipe() with O_NOINHERIT flag [atomic]
- pipe2() with O_CLOEXEC flag [atomic]
- pipe() + os.set_cloexec(fd, True) [best-effort]
socket.socket()
- Windows: WSASocket() with WSA_FLAG_NO_HANDLE_INHERIT flag [atomic]
- socket() with SOCK_CLOEXEC flag [atomic]
- socket() + os.set_cloexec(fd, True) [best-effort]
socket.socketpair()
- socketpair() with SOCK_CLOEXEC flag [atomic]
- socketpair() + os.set_cloexec(fd, True) [best-effort]
socket.socket.accept()
- accept4() with SOCK_CLOEXEC flag [atomic]
- accept() + os.set_cloexec(fd, True) [best-effort]
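The atomic variant for socket.socket() can be sketched as follows: on Linux 2.6.27+, the SOCK_CLOEXEC bit is or'ed into the socket type, so the flag is set by the very syscall that creates the descriptor and no window exists for a concurrent fork()+exec() in another thread. This is a hedged POSIX-only sketch (on Python 3.4+ the flag is set by default anyway, per PEP 446):

```python
import fcntl
import socket

# 0 means the platform lacks SOCK_CLOEXEC; we silently fall back to
# the plain (best-effort) creation path in that case.
cloexec_bit = getattr(socket, "SOCK_CLOEXEC", 0)
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM | cloexec_bit)
flags = fcntl.fcntl(s.fileno(), fcntl.F_GETFD)
print(bool(flags & fcntl.FD_CLOEXEC))
s.close()
```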
Backward compatibility
There is no backward incompatible change. The default behaviour is unchanged: the close-on-exec flag is not set by default.
Appendix: Operating system support
Windows
Windows has an O_NOINHERIT flag: "Do not inherit in child processes".
For example, it is supported by open() and _pipe().
The flag can be cleared using SetHandleInformation(fd, HANDLE_FLAG_INHERIT, 0).
CreateProcess() has a bInheritHandles parameter: if it is FALSE, the handles are not inherited. If it is TRUE, handles with the HANDLE_FLAG_INHERIT flag set are inherited. subprocess.Popen uses the close_fds option to define bInheritHandles.
ioctl
Functions:
- ioctl(fd, FIOCLEX, 0): set the close-on-exec flag
- ioctl(fd, FIONCLEX, 0): clear the close-on-exec flag
Availability: Linux, Mac OS X, QNX, NetBSD, OpenBSD, FreeBSD.
fcntl
Functions:
- flags = fcntl(fd, F_GETFD); fcntl(fd, F_SETFD, flags | FD_CLOEXEC): set the close-on-exec flag
- flags = fcntl(fd, F_GETFD); fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC): clear the close-on-exec flag
Availability: AIX, Digital UNIX, FreeBSD, HP-UX, IRIX, Linux, Mac OS X, OpenBSD, Solaris, SunOS, Unicos.
Atomic flags
New flags:
- O_CLOEXEC: available on Linux (2.6.23), FreeBSD (8.3), OpenBSD 5.0, Solaris 11, QNX, BeOS, next NetBSD release (6.1?). This flag is part of POSIX.1-2008.
- SOCK_CLOEXEC flag for socket() and socketpair(), available on Linux 2.6.27, OpenBSD 5.2, NetBSD 6.0.
- WSA_FLAG_NO_HANDLE_INHERIT flag for WSASocket(): supported on Windows 7 with SP1, Windows Server 2008 R2 with SP1, and later
- fcntl(): F_DUPFD_CLOEXEC flag, available on Linux 2.6.24, OpenBSD 5.0, FreeBSD 9.1, NetBSD 6.0, Solaris 11. This flag is part of POSIX.1-2008.
- fcntl(): F_DUP2FD_CLOEXEC flag, available on FreeBSD 9.1 and Solaris 11.
- recvmsg(): MSG_CMSG_CLOEXEC, available on Linux 2.6.23, NetBSD 6.0.
On Linux older than 2.6.23, the O_CLOEXEC flag is simply ignored, so we have to check that the flag is supported by calling fcntl(). If it is not, we have to set the flag using ioctl() or fcntl().
On Linux older than 2.6.27, if the SOCK_CLOEXEC flag is set in the socket type, socket() or socketpair() fail and errno is set to EINVAL.
On Windows XP SP3, WSASocket() fails with WSAEPROTOTYPE when the WSA_FLAG_NO_HANDLE_INHERIT flag is used.
New functions:
- dup3(): available on Linux 2.6.27 (and glibc 2.9)
- pipe2(): available on Linux 2.6.27 (and glibc 2.9)
- accept4(): available on Linux 2.6.28 (and glibc 2.10)
If accept4() is called on Linux older than 2.6.28, accept4() returns -1 (fail) and errno is set to ENOSYS.
Links
Links:
- Secure File Descriptor Handling (Ulrich Drepper, 2008)
- win32_support.py of the Tornado project: emulate fcntl(fd, F_SETFD, FD_CLOEXEC) using SetHandleInformation(fd, HANDLE_FLAG_INHERIT, 0)
- LKML: [PATCH] nextfd(2)
Python issues:
- #10115: Support accept4() for atomic setting of flags at socket creation
- #12105: open() does not able to set flags, such as O_CLOEXEC
- #12107: TCP listening sockets created without FD_CLOEXEC flag
- #16500: Add an atfork module
- #16850: Add "e" mode to open(): close-and-exec (O_CLOEXEC) / O_NOINHERIT
- #16860: Use O_CLOEXEC in the tempfile module
- #17036: Implementation of the PEP 433
- #16946: subprocess: _close_open_fd_range_safe() does not set close-on-exec flag on Linux < 2.6.23 if O_CLOEXEC is defined
- #17070: PEP 433: Use the new cloexec to improve security and avoid bugs
Other languages:
- Perl sets the close-on-exec flag on newly created file descriptors if their number is greater than $SYSTEM_FD_MAX ($^F). See the $SYSTEM_FD_MAX documentation. Perl has done this since its creation (the behavior was already present in Perl 1).
- Ruby: Set FD_CLOEXEC for all fds (except 0, 1, 2)
- Ruby: O_CLOEXEC flag missing for Kernel::open: the commit was reverted later
- OCaml: PR#5256: Processes opened using Unix.open_process* inherit all opened file descriptors (including sockets). OCaml has a Unix.set_close_on_exec function.
Footnotes
| [1] | On UNIX since Python 3.2, subprocess.Popen() closes all file descriptors by default: close_fds=True. It closes file descriptors in range 3 inclusive to local_max_fd exclusive, where local_max_fd is fcntl(0, F_MAXFD) on NetBSD, or sysconf(_SC_OPEN_MAX) otherwise. If the error pipe has a descriptor smaller than 3, ValueError is raised. |
pep-0434 IDLE Enhancement Exception for All Branches
| PEP: | 434 |
|---|---|
| Title: | IDLE Enhancement Exception for All Branches |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Todd Rovito <rovitotv at gmail.com>, Terry Reedy <tjreedy at udel.edu> |
| BDFL-Delegate: | Nick Coghlan |
| Status: | Active |
| Type: | Informational |
| Content-Type: | text/x-rst |
| Created: | 16-Feb-2013 |
| Post-History: | 16-Feb-2013 03-Mar-2013 21-Mar-2013 30-Mar-2013 |
| Resolution: | http://mail.python.org/pipermail/python-dev/2013-March/125003.html |
Abstract
Most CPython tracker issues are classified as behavior or enhancement. Most behavior patches are backported to branches for existing versions. Enhancement patches are restricted to the default branch that becomes the next Python version.
This PEP proposes that the restriction on applying enhancements be relaxed for IDLE code, residing in .../Lib/idlelib/. In practice, this would mean that IDLE developers would not have to classify or agree on the classification of a patch but could instead focus on what is best for IDLE users and future IDLE development. It would also mean that IDLE patches would not necessarily have to be split into 'bugfix' changes and enhancement changes.
The PEP would apply to changes in existing features and addition of small features, such as would require a new menu entry, but not necessarily to possible major re-writes such as switching to themed widgets or tabbed windows.
Motivation
This PEP was prompted by controversy on both the tracker and pydev list over adding Cut, Copy, and Paste to right-click context menus (Issue 1207589, opened in 2005 [1]; pydev thread [2]). The features were available as keyboard shortcuts but not on the context menu. It is standard, at least on Windows, that they should be when applicable (a read-only window would only have Copy), so users do not have to shift to the keyboard after selecting text for cutting or copying or a slice point for pasting. The context menu was not documented until 10 days before the new options were added (Issue 10405 [5]).
Normally, behavior is called a bug if it conflicts with documentation judged to be correct. But if there is no doc, what is the standard? If the code is its own documentation, most IDLE issues on the tracker are enhancement issues. If we substitute reasonable user expectation, (which can, of course, be its own subject of disagreement), many more issues are behavior issues.
For context menus, people disagreed on the status of the additions -- bugfix or enhancement. Even people who called it an enhancement disagreed as to whether the patch should be backported. This PEP proposes to make the status disagreement irrelevant by explicitly allowing more liberal backporting than for other stdlib modules.
Python does have many advanced features, yet Python is well known for being an easy computer language for beginners [3]. A major Python philosophy is "batteries included", which is best demonstrated in Python's standard library with many modules that are not typically included with other programming languages [4]. IDLE is an important "battery" in the Python toolbox because it allows a beginner to get started quickly without downloading and configuring a third party IDE. IDLE represents a commitment by the Python community to encourage the use of Python as a teaching language both inside and outside of formal educational settings. The recommended teaching experience is to have a learner start with IDLE. This PEP and the work that it will enable will allow the Python community to make that learner's experience with IDLE awesome by making IDLE a simple tool for beginners to get started with Python.
Rationale
People primarily use IDLE by running the graphical user interface (GUI) application, rather than by directly importing the effectively private (undocumented) implementation modules in idlelib. Whether they use the shell, the editor, or both, we believe they will benefit more from consistency across the latest releases of current Python versions than from consistency within the bugfix releases for one Python version. This is especially true when existing behavior is clearly unsatisfactory.
When people use the standard interpreter, the OS-provided frame works the same for all Python versions. If, for instance, Microsoft were to upgrade the Command Prompt GUI, the improvements would be present regardless of which Python were running within it. Similarly, if one edits Python code with editor X, behaviors such as the right-click context menu and the search-replace box do not depend on the version of Python being edited or even the language being edited.
The benefit for IDLE developers is mixed. On the one hand, testing more versions and possibly having to adjust a patch, especially for 2.7, is more work. (There is, of course, the option of not backporting everything. For issue 12510, some changes to calltips for classes were not included in the 2.7 patch because of issues with old-style classes [6].) On the other hand, bike-shedding can be an energy drain. If the obvious fix for a bug looks like an enhancement, writing a separate bugfix-only patch is more work. And making the code diverge between versions makes future multi-version patches more difficult.
These issues are illustrated by the search-and-replace dialog box. It used to raise an exception for certain user entries [7]. The uncaught exception caused IDLE to exit. At least on Windows, the exit was silent (no visible traceback) and looked like a crash if IDLE was started normally, from an icon.
Was this a bug? IDLE Help (on the current Help submenu) just says "Replace... Open a search-and-replace dialog box", and a box was opened. It is not, in general, a bug for a library method to raise an exception. And it is not, in general, a bug for a library method to ignore an exception raised by functions it calls. So if we were to adopt the 'code = doc' philosophy in the absence of detailed docs, one might say 'No'.
However, IDLE exiting when it does not need to is definitely obnoxious. So four of us agreed that it should be prevented. But there was still the question of what to do instead? Catch the exception? Just not raise the exception? Beep? Display an error message box? Or try to do something useful with the user's entry? Would replacing a 'crash' with useful behavior be an enhancement, limited to future Python releases? Should IDLE developers have to ask that?
Backwards Compatibility
For IDLE, there are three types of users who might be concerned about backward compatibility. First are people who run IDLE as an application. We have already discussed them above.
Second are people who import one of the idlelib modules. As far as we know, this is only done to start the IDLE application, and we do not propose breaking such use. Otherwise, the modules are undocumented and effectively private implementations. If an IDLE module were defined as public, documented, and perhaps moved to the tkinter package, it would then follow the normal rules. (Documenting the private interfaces for the benefit of people working on the IDLE code is a separate issue.)
Third are people who write IDLE extensions. The guaranteed extension interface is given in idlelib/extension.txt. This should be respected at least in existing versions, and not frivolously changed in future versions. But there is a warning that "The extension cannot assume much about this [EditorWindow] argument." This guarantee should rarely be an issue with patches, and the issue is not specific to 'enhancement' versus 'bugfix' patches.
As it happens, after the context menu patch was applied, it came up that extensions that added items to the context menu (rare) would be broken because the patch a) added a new item to standard rmenu_specs and b) expected every rmenu_spec to be lengthened. It is not clear whether this violates the guarantee, but there is a second patch that fixes assumption b). It should be applied when it is clear that the first patch will not have to be reverted.
References
| [1] | IDLE: Right Click Context Menu, Foord, Michael (http://bugs.python.org/issue1207589) |
| [2] | Cut/Copy/Paste items in IDLE right click context menu (http://mail.python.org/pipermail/python-dev/2012-November/122514.html) |
| [3] | Getting Started with Python (http://www.python.org/about/gettingstarted/) |
| [4] | Batteries Included (http://docs.python.org/2/tutorial/stdlib.html#batteries-included) |
| [5] | IDLE breakpoint facility undocumented, Deily, Ned (http://bugs.python.org/issue10405) |
| [6] | IDLE: calltips mishandle raw strings and other examples, Reedy, Terry (http://bugs.python.org/issue12510) |
| [7] | IDLE: replace ending with '' causes crash, Reedy, Terry (http://bugs.python.org/issue13052) |
Copyright
This document has been placed in the public domain.
pep-0435 Adding an Enum type to the Python standard library
| PEP: | 435 |
|---|---|
| Title: | Adding an Enum type to the Python standard library |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Barry Warsaw <barry at python.org>, Eli Bendersky <eliben at gmail.com>, Ethan Furman <ethan at stoneleaf.us> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 2013-02-23 |
| Python-Version: | 3.4 |
| Post-History: | 2013-02-23, 2013-05-02 |
| Resolution: | http://mail.python.org/pipermail/python-dev/2013-May/126112.html |
Contents
Abstract
This PEP proposes adding an enumeration type to the Python standard library.
An enumeration is a set of symbolic names bound to unique, constant values. Within an enumeration, the values can be compared by identity, and the enumeration itself can be iterated over.
Status of discussions
The idea of adding an enum type to Python is not new - PEP 354 [2] is a previous attempt that was rejected in 2005. Recently a new set of discussions was initiated [3] on the python-ideas mailing list. Many new ideas were proposed in several threads; after a lengthy discussion Guido proposed adding flufl.enum to the standard library [4]. During the PyCon 2013 language summit the issue was discussed further. It became clear that many developers want to see an enum that subclasses int, which can allow us to replace many integer constants in the standard library by enums with friendly string representations, without ceding backwards compatibility. An additional discussion among several interested core developers led to the proposal of having IntEnum as a special case of Enum.
The key dividing issue between Enum and IntEnum is whether comparing to integers is semantically meaningful. For most uses of enumerations, it's a feature to reject comparison to integers; enums that compare to integers lead, through transitivity, to comparisons between enums of unrelated types, which isn't desirable in most cases. For some uses, however, greater interoperability with integers is desired. For instance, this is the case for replacing existing standard library constants (such as socket.AF_INET) with enumerations.
Further discussion in late April 2013 led to the conclusion that enumeration members should belong to the type of their enum: type(Color.red) == Color. Guido has pronounced a decision on this issue [5], as well as on the related issue of not allowing subclassing of enums [6] unless they define no enumeration members [7].
The PEP was accepted by Guido on May 10th, 2013 [1].
Motivation
[Based partly on the Motivation stated in PEP 354]
The properties of an enumeration are useful for defining an immutable, related set of constant values that may or may not have a semantic meaning. Classic examples are days of the week (Sunday through Saturday) and school assessment grades ('A' through 'D', and 'F'). Other examples include error status values and states within a defined process.
It is possible to simply define a sequence of values of some other basic type, such as int or str, to represent discrete arbitrary values. However, an enumeration ensures that such values are distinct from any others including, importantly, values within other enumerations, and that operations without meaning ("Wednesday times two") are not defined for these values. It also provides a convenient printable representation of enum values without requiring tedious repetition while defining them (i.e. no GREEN = 'green').
Module and type name
We propose to add a module named enum to the standard library. The main type exposed by this module is Enum. Hence, to import the Enum type user code will run:
>>> from enum import Enum
Proposed semantics for the new enumeration type
Creating an Enum
Enumerations are created using the class syntax, which makes them easy to read and write. An alternative creation method is described in Functional API. To define an enumeration, subclass Enum as follows:
>>> from enum import Enum
>>> class Color(Enum):
...     red = 1
...     green = 2
...     blue = 3
A note on nomenclature: we call Color an enumeration (or enum) and Color.red, Color.green are enumeration members (or enum members). Enumeration members also have values (the value of Color.red is 1, etc.)
Enumeration members have human readable string representations:
>>> print(Color.red)
Color.red
...while their repr has more information:
>>> print(repr(Color.red))
<Color.red: 1>
The type of an enumeration member is the enumeration it belongs to:
>>> type(Color.red)
<Enum 'Color'>
>>> isinstance(Color.green, Color)
True
Enums also have a property that contains just their item name:
>>> print(Color.red.name)
red
Enumerations support iteration, in definition order:
>>> class Shake(Enum):
...     vanilla = 7
...     chocolate = 4
...     cookies = 9
...     mint = 3
...
>>> for shake in Shake:
...     print(shake)
...
Shake.vanilla
Shake.chocolate
Shake.cookies
Shake.mint
Enumeration members are hashable, so they can be used in dictionaries and sets:
>>> apples = {}
>>> apples[Color.red] = 'red delicious'
>>> apples[Color.green] = 'granny smith'
>>> apples
{<Color.red: 1>: 'red delicious', <Color.green: 2>: 'granny smith'}
Programmatic access to enumeration members
Sometimes it's useful to access members in enumerations programmatically (i.e. situations where Color.red won't do because the exact color is not known at program-writing time). Enum allows such access:
>>> Color(1)
<Color.red: 1>
>>> Color(3)
<Color.blue: 3>
If you want to access enum members by name, use item access:
>>> Color['red']
<Color.red: 1>
>>> Color['green']
<Color.green: 2>
Duplicating enum members and values
Having two enum members with the same name is invalid:
>>> class Shape(Enum):
...     square = 2
...     square = 3
...
Traceback (most recent call last):
...
TypeError: Attempted to reuse key: square
However, two enum members are allowed to have the same value. Given two members A and B with the same value (and A defined first), B is an alias to A. By-value lookup of the value of A and B will return A. By-name lookup of B will also return A:
>>> class Shape(Enum):
...     square = 2
...     diamond = 1
...     circle = 3
...     alias_for_square = 2
...
>>> Shape.square
<Shape.square: 2>
>>> Shape.alias_for_square
<Shape.square: 2>
>>> Shape(2)
<Shape.square: 2>
Iterating over the members of an enum does not provide the aliases:
>>> list(Shape)
[<Shape.square: 2>, <Shape.diamond: 1>, <Shape.circle: 3>]
The special attribute __members__ is an ordered dictionary mapping names to members. It includes all names defined in the enumeration, including the aliases:
>>> for name, member in Shape.__members__.items():
... name, member
...
('square', <Shape.square: 2>)
('diamond', <Shape.diamond: 1>)
('circle', <Shape.circle: 3>)
('alias_for_square', <Shape.square: 2>)
The __members__ attribute can be used for detailed programmatic access to the enumeration members. For example, finding all the aliases:
>>> [name for name, member in Shape.__members__.items() if member.name != name]
['alias_for_square']
Comparisons
Enumeration members are compared by identity:
>>> Color.red is Color.red
True
>>> Color.red is Color.blue
False
>>> Color.red is not Color.blue
True
Ordered comparisons between enumeration values are not supported. Enums are not integers (but see IntEnum below):
>>> Color.red < Color.blue
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unorderable types: Color() < Color()
Equality comparisons are defined though:
>>> Color.blue == Color.red
False
>>> Color.blue != Color.red
True
>>> Color.blue == Color.blue
True
Comparisons against non-enumeration values will always compare not equal (again, IntEnum was explicitly designed to behave differently, see below):
>>> Color.blue == 2
False
Allowed members and attributes of enumerations
The examples above use integers for enumeration values. Using integers is short and handy (and provided by default by the Functional API), but not strictly enforced. In the vast majority of use-cases, one doesn't care what the actual value of an enumeration is. But if the value is important, enumerations can have arbitrary values.
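As a sketch of the "arbitrary values" point, here is a hypothetical enumeration whose values carry meaning: each member holds an (abbreviation, full name) tuple rather than a bare integer:

```python
from enum import Enum

# Values need not be integers; any object works.
class Weekday(Enum):
    monday = ('Mon', 'Monday')
    tuesday = ('Tue', 'Tuesday')

print(Weekday.monday.value)    # ('Mon', 'Monday')
print(Weekday.monday.name)     # monday
```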
Enumerations are Python classes, and can have methods and special methods as usual. If we have this enumeration:
class Mood(Enum):
    funky = 1
    happy = 3

    def describe(self):
        # self is the member here
        return self.name, self.value

    def __str__(self):
        return 'my custom str! {0}'.format(self.value)

    @classmethod
    def favorite_mood(cls):
        # cls here is the enumeration
        return cls.happy
Then:
>>> Mood.favorite_mood()
<Mood.happy: 3>
>>> Mood.happy.describe()
('happy', 3)
>>> str(Mood.funky)
'my custom str! 1'
The rules for what is allowed are as follows: all attributes defined within an enumeration will become members of this enumeration, with the exception of __dunder__ names and descriptors [9]; methods are descriptors too.
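A small illustration of that rule, with a hypothetical Planet enum: plain attributes become members, while methods (which are descriptors) do not:

```python
from enum import Enum

class Planet(Enum):
    mercury = 1
    venus = 2

    def order(self):       # a descriptor, hence not a member
        return self.value

print(len(list(Planet)))   # 2, not 3
print(Planet.venus.order())  # 2
```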
Restricted subclassing of enumerations
Subclassing an enumeration is allowed only if the enumeration does not define any members. So this is forbidden:
>>> class MoreColor(Color):
...     pink = 17
...
TypeError: Cannot extend enumerations
But this is allowed:
>>> class Foo(Enum):
...     def some_behavior(self):
...         pass
...
>>> class Bar(Foo):
...     happy = 1
...     sad = 2
...
The rationale for this decision was given by Guido in [6]. Allowing subclassing of enums that define members would lead to a violation of some important invariants of types and instances. On the other hand, it makes sense to allow sharing some common behavior between a group of enumerations, and subclassing empty enumerations is also used to implement IntEnum.
IntEnum
A variation of Enum is proposed which is also a subclass of int. Members of an IntEnum can be compared to integers; by extension, integer enumerations of different types can also be compared to each other:
>>> from enum import IntEnum
>>> class Shape(IntEnum):
...     circle = 1
...     square = 2
...
>>> class Request(IntEnum):
...     post = 1
...     get = 2
...
>>> Shape == 1
False
>>> Shape.circle == 1
True
>>> Shape.circle == Request.post
True
However they still can't be compared to Enum:
>>> class Shape(IntEnum):
...     circle = 1
...     square = 2
...
>>> class Color(Enum):
...     red = 1
...     green = 2
...
>>> Shape.circle == Color.red
False
IntEnum values behave like integers in other ways you'd expect:
>>> int(Shape.circle)
1
>>> ['a', 'b', 'c'][Shape.circle]
'b'
>>> [i for i in range(Shape.square)]
[0, 1]
For the vast majority of code, Enum is strongly recommended, since IntEnum breaks some semantic promises of an enumeration (by being comparable to integers, and thus by transitivity to other unrelated enumerations). It should be used only in special cases where there's no other choice; for example, when integer constants are replaced with enumerations and backwards compatibility is required with code that still expects integers.
Other derived enumerations
IntEnum will be part of the enum module. However, it would be very simple to implement independently:
class IntEnum(int, Enum):
    pass
This demonstrates how similar derived enumerations can be defined, for example a StrEnum that mixes in str instead of int.
Some rules:
- When subclassing Enum, mix-in types must appear before Enum itself in the sequence of bases, as in the IntEnum example above.
- While Enum can have members of any type, once you mix in an additional type, all the members must have values of that type, e.g. int above. This restriction does not apply to mix-ins which only add methods and don't specify another data type such as int or str.
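The rules above can be sketched with a hand-rolled StrEnum, the str counterpart the text mentions (Python 3.11 later added one to the enum module itself, but here we define our own):

```python
from enum import Enum

# Mix-in type (str) must come before Enum in the bases.
class StrEnum(str, Enum):
    pass

# Every member value must then be a str.
class Color(StrEnum):
    red = 'red'
    green = 'green'

print(Color.red == 'red')    # True: members compare like strings
print(Color.red.upper())     # RED: str methods work on members
```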
Pickling
Enumerations can be pickled and unpickled:
>>> from enum.tests.fruit import Fruit
>>> from pickle import dumps, loads
>>> Fruit.tomato is loads(dumps(Fruit.tomato))
True
The usual restrictions for pickling apply: picklable enums must be defined in the top level of a module, since unpickling requires them to be importable from that module.
Functional API
The Enum class is callable, providing the following functional API:
>>> Animal = Enum('Animal', 'ant bee cat dog')
>>> Animal
<Enum 'Animal'>
>>> Animal.ant
<Animal.ant: 1>
>>> Animal.ant.value
1
>>> list(Animal)
[<Animal.ant: 1>, <Animal.bee: 2>, <Animal.cat: 3>, <Animal.dog: 4>]
The semantics of this API resemble namedtuple. The first argument of the call to Enum is the name of the enumeration. Pickling enums created with the functional API will work on CPython and PyPy, but for IronPython and Jython you may need to specify the module name explicitly as follows:
>>> Animals = Enum('Animals', 'ant bee cat dog', module=__name__)
The second argument is the source of enumeration member names. It can be a whitespace-separated string of names, a sequence of names, a sequence of 2-tuples with key/value pairs, or a mapping (e.g. dictionary) of names to values. The last two options enable assigning arbitrary values to enumerations; the others auto-assign increasing integers starting with 1. A new class derived from Enum is returned. In other words, the above assignment to Animal is equivalent to:
>>> class Animals(Enum):
...     ant = 1
...     bee = 2
...     cat = 3
...     dog = 4
The reason for defaulting to 1 as the starting number and not 0 is that 0 is False in a boolean sense, but enum members all evaluate to True.
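A quick check of that claim: members auto-numbered by the functional API start at 1, and every member is truthy, so `if member:` never silently skips the first one:

```python
from enum import Enum

Animal = Enum('Animal', 'ant bee cat dog')
print(Animal.ant.value)                        # 1, not 0
print(all(bool(member) for member in Animal))  # True
```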
Proposed variations
Some variations were proposed during the discussions in the mailing list. Here are some of the more popular ones.
flufl.enum
flufl.enum was the reference implementation upon which this PEP was originally based. Eventually, it was decided against the inclusion of flufl.enum because its design separated enumeration members from enumerations, so the former are not instances of the latter. Its design also explicitly permits subclassing enumerations for extending them with more members (due to the member/enum separation, the type invariants are not violated in flufl.enum with such a scheme).
Not having to specify values for enums
Michael Foord proposed (and Tim Delaney provided a proof-of-concept implementation) to use metaclass magic that makes this possible:
class Color(Enum):
    red, green, blue
The values are actually assigned only when first looked up.
Pros: cleaner syntax that requires less typing for a very common task (just listing enumeration names without caring about the values).
Cons: involves much magic in the implementation, which makes even the definition of such enums baffling when first seen. Besides, explicit is better than implicit.
Using special names or forms to auto-assign enum values
A different approach to avoid specifying enum values is to use a special name or form to auto-assign them. For example:
class Color(Enum):
    red = None    # auto-assigned to 0
    green = None  # auto-assigned to 1
    blue = None   # auto-assigned to 2
More flexibly:
class Color(Enum):
    red = 7
    green = None   # auto-assigned to 8
    blue = 19
    purple = None  # auto-assigned to 20
Some variations on this theme:
- A special name auto imported from the enum package.
- Georg Brandl proposed ellipsis (...) instead of None to achieve the same effect.
Pros: no need to manually enter values. Makes it easier to change the enum and extend it, especially for large enumerations.
Cons: actually longer to type in many simple cases. The argument of explicit vs. implicit applies here as well.
Use-cases in the standard library
The Python standard library has many places where the usage of enums would be beneficial to replace other idioms currently used to represent them. Such usages can be divided into two categories: user-code facing constants, and internal constants.
User-code facing constants like os.SEEK_*, socket module constants, decimal rounding modes and HTML error codes could require backwards compatibility since user code may expect integers. IntEnum as described above provides the required semantics; being a subclass of int, it does not affect user code that expects integers, while on the other hand allowing printable representations for enumeration values:
>>> import socket
>>> family = socket.AF_INET
>>> family == 2
True
>>> print(family)
SocketFamily.AF_INET
Internal constants are not seen by user code but are employed internally by stdlib modules. These can be implemented with Enum. Some examples uncovered by a very partial skim through the stdlib: binhex, imaplib, http/client, urllib/robotparser, idlelib, concurrent.futures, turtledemo.
In addition, looking at the code of the Twisted library, there are many use cases for replacing internal state constants with enums. The same can be said about a lot of networking code (especially implementation of protocols) and can be seen in test protocols written with the Tulip library as well.
Acknowledgments
This PEP initially proposed including Barry Warsaw's flufl.enum package [8] in the stdlib, and is inspired in large part by it. Ben Finney is the author of the earlier enumeration PEP 354.
References
| [1] | http://mail.python.org/pipermail/python-dev/2013-May/126112.html |
| [2] | http://www.python.org/dev/peps/pep-0354/ |
| [3] | http://mail.python.org/pipermail/python-ideas/2013-January/019003.html |
| [4] | http://mail.python.org/pipermail/python-ideas/2013-February/019373.html |
| [5] | To make enums behave similarly to Python classes like bool, and behave in a more intuitive way. It would be surprising if the type of Color.red would not be Color. (Discussion in http://mail.python.org/pipermail/python-dev/2013-April/125687.html) |
| [6] | (1, 2, 3) Subclassing enums and adding new members creates an unresolvable situation; on one hand MoreColor.red and Color.red should not be the same object, and on the other isinstance checks become confusing if they are not. The discussion also links to Stack Overflow discussions that make additional arguments. (http://mail.python.org/pipermail/python-dev/2013-April/125716.html) |
| [7] | It may be useful to have a class defining some behavior (methods, with no actual enumeration members) mixed into an enum, and this would not create the problem discussed in [6]. (Discussion in http://mail.python.org/pipermail/python-dev/2013-May/125859.html) |
| [8] | http://pythonhosted.org/flufl.enum/ |
| [9] | http://docs.python.org/3/howto/descriptor.html |
Copyright
This document has been placed in the public domain.
pep-0436 The Argument Clinic DSL
| PEP: | 436 |
|---|---|
| Title: | The Argument Clinic DSL |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Larry Hastings <larry at hastings.org> |
| Discussions-To: | Python-Dev <python-dev at python.org> |
| Status: | Draft |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 22-Feb-2013 |
Contents
Abstract
This document proposes "Argument Clinic", a DSL to facilitate argument processing for built-in functions in the implementation of CPython.
Rationale and Goals
The primary implementation of Python, "CPython", is written in a mixture of Python and C. One implementation detail of CPython is what are called "built-in" functions -- functions available to Python programs but written in C. When a Python program calls a built-in function and passes in arguments, those arguments must be translated from Python values into C values. This process is called "parsing arguments".
As of CPython 3.3, builtin functions nearly always parse their arguments with one of two functions: the original PyArg_ParseTuple(), [1] and the more modern PyArg_ParseTupleAndKeywords(). [2] The former only handles positional parameters; the latter also accommodates keyword and keyword-only parameters, and is preferred for new code.
With either function, the caller specifies the translation for parsing arguments in a "format string": [3] each parameter corresponds to a "format unit", a short character sequence telling the parsing function what Python types to accept and how to translate them into the appropriate C value for that parameter.
PyArg_ParseTuple() was reasonable when it was first conceived. There were only a dozen or so of these "format units"; each one was distinct, and easy to understand and remember. But over the years the PyArg_Parse interface has been extended in numerous ways. The modern API is complex, to the point that it is somewhat painful to use. Consider:
- There are now forty different "format units"; a few are even three characters long. This makes it difficult for the programmer to understand what the format string says--or even perhaps to parse it--without constantly cross-indexing it with the documentation.
- There are also six meta-format units that may be buried in the format string. (They are: "()|$:;".)
- The more format units are added, the less likely it is the implementer can pick an easy-to-use mnemonic for the format unit, because the character of choice is probably already in use. In other words, the more format units we have, the more obtuse the format units become.
- Several format units are nearly identical to others, having only subtle differences. This makes understanding the exact semantics of the format string even harder, and can make it difficult to figure out exactly which format unit you want.
- The docstring is specified as a static C string, making it mildly bothersome to read and edit since it must obey C string quoting rules.
- When adding a new parameter to a function using PyArg_ParseTupleAndKeywords(), it's necessary to touch six different places in the code: [4]
- Declaring the variable to store the argument.
- Passing in a pointer to that variable in the correct spot in PyArg_ParseTupleAndKeywords(), also passing in any "length" or "converter" arguments in the correct order.
- Adding the name of the argument in the correct spot of the "keywords" array passed in to PyArg_ParseTupleAndKeywords().
- Adding the format unit to the correct spot in the format string.
- Adding the parameter to the prototype in the docstring.
- Documenting the parameter in the docstring.
- There is currently no mechanism for builtin functions to provide their "signature" information (see inspect.getfullargspec and inspect.Signature). Adding this information using a mechanism similar to the existing PyArg_Parse functions would require repeating ourselves yet again.
The goal of Argument Clinic is to replace this API with a mechanism inheriting none of these downsides:
- You need to specify each parameter only once.
- All information about a parameter is kept together in one place.
- For each parameter, you specify a conversion function; Argument Clinic handles the translation from Python value into C value for you.
- Argument Clinic also allows for fine-tuning of argument processing behavior with parameterized conversion functions.
- Docstrings are written in plain text. Function docstrings are required; per-parameter docstrings are encouraged.
- From this, Argument Clinic generates for you all the mundane, repetitious code and data structures CPython needs internally. Once you've specified the interface, the next step is simply to write your implementation using native C types. Every detail of argument parsing is handled for you.
Argument Clinic is implemented as a preprocessor. It draws inspiration for its workflow directly from [Cog] by Ned Batchelder. To use Clinic, add a block comment to your C source code beginning and ending with special text strings, then run Clinic on the file. Clinic will find the block comment, process the contents, and write the output back into your C source file directly after the comment. The intent is that Clinic's output becomes part of your source code; it's checked in to revision control, and distributed with source packages. This means that Python will still ship ready-to-build. It does complicate development slightly; in order to add a new function, or modify the arguments or documentation of an existing function using Clinic, you'll need a working Python 3 interpreter.
Future goals of Argument Clinic include:
- providing signature information for builtins,
- enabling alternative implementations of Python to create automated library compatibility tests, and
- speeding up argument parsing with improvements to the generated code.
DSL Syntax Summary
The Argument Clinic DSL is specified as a comment embedded in a C file, as follows. The "Example" column on the right shows you sample input to the Argument Clinic DSL, and the "Section" column on the left specifies what each line represents in turn.
Argument Clinic's DSL syntax mirrors the Python def statement, lending it some familiarity to Python core developers.
+-----------------------+-----------------------------------------------------+
| Section               | Example                                             |
+-----------------------+-----------------------------------------------------+
| Clinic DSL start      | /*[clinic]                                          |
| Module declaration    | module module_name                                  |
| Class declaration     | class module_name.class_name                        |
| Function declaration  | module_name.function_name -> return_annotation      |
| Parameter declaration | name : converter(param=value)                       |
| Parameter docstring   | Lorem ipsum dolor sit amet, consectetur             |
|                       | adipisicing elit, sed do eiusmod tempor             |
| Function docstring    | Lorem ipsum dolor sit amet, consectetur adipisicing |
|                       | elit, sed do eiusmod tempor incididunt ut labore et |
| Clinic DSL end        | [clinic]*/                                          |
| Clinic output         | ...                                                 |
| Clinic output end     | /*[clinic end output:<checksum>]*/                  |
+-----------------------+-----------------------------------------------------+
To give some flavor of the proposed DSL syntax, here are some sample Clinic code blocks. This first block reflects the normally preferred style, including blank lines between parameters and per-argument docstrings. It also includes a user-defined converter (path_t) created locally:
/*[clinic]
os.stat as os_stat_fn -> stat result

    path: path_t(allow_fd=1)
        Path to be examined; can be string, bytes, or
        open-file-descriptor int.

    *

    dir_fd: OS_STAT_DIR_FD_CONVERTER = DEFAULT_DIR_FD
        If not None, it should be a file descriptor open to a directory,
        and path should be a relative string; path will then be relative to
        that directory.

    follow_symlinks: bool = True
        If False, and the last element of the path is a symbolic link,
        stat will examine the symbolic link itself instead of the file
        the link points to.

Perform a stat system call on the given path.

{parameters}

dir_fd and follow_symlinks may not be implemented
on your platform.  If they are unavailable, using them will raise a
NotImplementedError.

It's an error to use dir_fd or follow_symlinks when specifying path as
an open file descriptor.

[clinic]*/
This second example shows a minimal Clinic code block, omitting all parameter docstrings and non-significant blank lines:
/*[clinic]
os.access
    path: path
    mode: int
    *
    dir_fd: OS_ACCESS_DIR_FD_CONVERTER = 1
    effective_ids: bool = False
    follow_symlinks: bool = True

Use the real uid/gid to test for access to a path.
Returns True if granted, False otherwise.

{parameters}

dir_fd, effective_ids, and follow_symlinks may not be implemented
on your platform.  If they are unavailable, using them will raise a
NotImplementedError.

Note that most operations will use the effective uid/gid, therefore this
routine can be used in a suid/sgid environment to test if the invoking user
has the specified access to the path.

[clinic]*/
This final example shows a Clinic code block handling groups of optional parameters, including parameters on the left:
/*[clinic]
curses.window.addch

    [
    y: int
        Y-coordinate.

    x: int
        X-coordinate.
    ]

    ch: char
        Character to add.

    [
    attr: long
        Attributes for the character.
    ]

    /

Paint character ch at (y, x) with attributes attr,
overwriting any character previously painted at that location.
By default, the character position and attributes are the
current settings for the window object.

[clinic]*/
General Behavior Of the Argument Clinic DSL
All lines support # as a line comment delimiter except docstrings. Blank lines are always ignored.
As in Python itself, leading whitespace is significant in the Argument Clinic DSL. The first line of the "function" section is the function declaration. Indented lines below the function declaration declare parameters, one per line; lines below those that are indented even further are per-parameter docstrings. Finally, the first line dedented back to column 0 ends the parameter declarations and starts the function docstring.
Parameter docstrings are optional; function docstrings are not. Functions that specify no arguments may simply specify the function declaration followed by the docstring.
Module and Class Declarations
When a C file implements a module or class, this should be declared to Clinic. The syntax is simple:
module module_name
or
class module_name.class_name
(Note that these are not actually special syntax; they are implemented as Directives.)
The module name or class name should always be the full dotted path from the top-level module. Nested modules and classes are supported.
Function Declaration
The full form of the function declaration is as follows:
dotted.name [ as legal_c_id ] [ -> return_annotation ]
The dotted name should be the full name of the function, starting with the highest-level package (e.g. "os.stat" or "curses.window.addch").
The "as legal_c_id" syntax is optional. Argument Clinic uses the name of the function to create the names of the generated C functions. In some circumstances, the generated name may collide with other global names in the C program's namespace. The "as legal_c_id" syntax allows you to override the generated name with your own; substitute "legal_c_id" with any legal C identifier. If skipped, the "as" keyword must also be omitted.
The return annotation is also optional. If skipped, the arrow ("->") must also be omitted. If specified, the value for the return annotation must be compatible with ast.literal_eval, and it is interpreted as a return converter.
Parameter Declaration
The full form of the parameter declaration line is as follows:
name: converter [ (parameter=value [, parameter2=value2]) ] [ = default]
The "name" must be a legal C identifier. Whitespace is permitted between the name and the colon (though this is not the preferred style). Whitespace is permitted (and encouraged) between the colon and the converter.
The "converter" is the name of one of the "converter functions" registered with Argument Clinic. Clinic will ship with a number of built-in converters; new converters can also be added dynamically. In choosing a converter, you are automatically constraining what Python types are permitted on the input, and specifying what type the output variable (or variables) will be. Although many of the converters will resemble the names of C types or perhaps Python types, the name of a converter may be any legal Python identifier.
If the converter is followed by parentheses, these parentheses enclose parameters to the conversion function. The syntax mirrors providing arguments in a Python function call: the parameters must always be named, as if they were "keyword-only parameters", and the values provided for them will syntactically resemble Python literal values. These parameters are always optional, permitting all conversion functions to be called without any parameters. In this case, you may also omit the parentheses entirely; this is always equivalent to specifying empty parentheses. The values supplied for these parameters must be compatible with ast.literal_eval.
The "default" is a Python literal value. Default values are optional; if not specified you must omit the equals sign too. Parameters which don't have a default are implicitly required. The default value is dynamically assigned, "live" in the generated C code, and although it's specified as a Python value, it's translated into a native C value in the generated C code. Few default values are permitted, owing to this manual translation step.
If this were a Python function declaration, a parameter declaration would be delimited by either a trailing comma or a closing parenthesis. However, Argument Clinic uses neither; parameter declarations are delimited by a newline. A trailing comma or right parenthesis is not permitted.
The first parameter declaration establishes the indent for all parameter declarations in a particular Clinic code block. All subsequent parameters must be indented to the same level.
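To make the declaration grammar above concrete, here is a rough sketch (hypothetical, not Clinic's actual parser) of splitting one parameter declaration line into its name, converter, converter parameters, and default, using ast.literal_eval for the default as the PEP requires:

```python
import ast
import re

# name: converter [ (parameter=value, ...) ] [ = default ]
# The converter may also be a legacy format unit in double quotes.
LINE = re.compile(
    r'^(?P<name>\w+)\s*:\s*(?P<converter>"[^"]+"|\w+)'
    r'(?:\((?P<params>[^)]*)\))?'
    r'(?:\s*=\s*(?P<default>.+))?$'
)

def parse_param(line):
    m = LINE.match(line.strip())
    if m is None:
        raise ValueError("not a parameter declaration: %r" % line)
    d = m.groupdict()
    if d["default"] is not None:
        # Defaults must be literal_eval-compatible Python values.
        d["default"] = ast.literal_eval(d["default"])
    return d

p = parse_param("follow_symlinks: bool = True")
print(p["name"], p["converter"], p["default"])  # follow_symlinks bool True
```

Note that this sketch only handles literal defaults; a line like `dir_fd: ... = DEFAULT_DIR_FD` (a C-side symbolic default) would need separate handling.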
Legacy Converters
For convenience's sake in converting existing code to Argument Clinic, Clinic provides a set of legacy converters that match PyArg_ParseTuple format units. They are specified as a C string containing the format unit. For example, to specify a parameter "foo" as taking a Python "int" and emitting a C int, you could specify:
foo : "i"
(To more closely resemble a C string, these must always use double quotes.)
Although these resemble PyArg_ParseTuple format units, no guarantee is made that the implementation will call a PyArg_Parse function for parsing.
This syntax does not support parameters. Therefore it doesn't support any of the format units that require input parameters ("O!", "O&", "es", "es#", "et", "et#"). Parameters requiring one of these conversions cannot use the legacy syntax. (You may still, however, supply a default value.)
Parameter Docstrings
All lines that appear below and are indented further than a parameter declaration are the docstring for that parameter. All such lines are "dedented" until the first line is flush left.
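This dedenting behaves much like the stdlib's textwrap.dedent, which strips the longest common leading whitespace from a block (shown here purely as an analogy; the PEP does not say Clinic uses textwrap):

```python
import textwrap

# Two per-parameter docstring lines as they might appear in a Clinic
# block, both indented eight spaces past column 0.
raw = (
    "        If False, and the last element of the path\n"
    "        is a symbolic link, stat examines the link itself.\n"
)
print(textwrap.dedent(raw))  # both lines now start flush left
```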
Special Syntax For Parameter Lines
There are four special symbols that may be used in the parameter section. Each of these must appear on a line by itself, indented to the same level as parameter declarations. The four symbols are:
- *
- Establishes that all subsequent parameters are keyword-only.
- [
- Establishes the start of an optional "group" of parameters. Note that "groups" may nest inside other "groups". See Functions With Positional-Only Parameters below. Note that currently [ is only legal for use in functions where all parameters are marked positional-only, see / below.
- ]
- Ends an optional "group" of parameters.
- /
- Establishes that all the preceding arguments are positional-only. For now, Argument Clinic does not support functions with both positional-only and non-positional-only arguments. Therefore: if / is specified for a function, it must currently always be after the last parameter. Also, Argument Clinic does not currently support default values for positional-only parameters.
(The semantics of / follow a syntax for positional-only parameters in Python once proposed by Guido. [5] )
Function Docstring
The first line with no leading whitespace after the function declaration is the first line of the function docstring. All subsequent lines of the Clinic block are considered part of the docstring, and their leading whitespace is preserved.
If the string {parameters} appears on a line by itself inside the function docstring, Argument Clinic will insert a list of all parameters that have docstrings, each such parameter followed by its docstring. The name of the parameter is on a line by itself; the docstring starts on a subsequent line, and all lines of the docstring are indented by two spaces. (Parameters with no per-parameter docstring are suppressed.) The entire list is indented by the leading whitespace that appeared before the {parameters} token.
If the string {parameters} doesn't appear in the docstring, Argument Clinic will append one to the end of the docstring, inserting a blank line above it if the docstring does not end with a blank line, and with the parameter list at column 0.
Converters
Argument Clinic contains a pre-initialized registry of converter functions. Example converter functions:
- int
- Accepts a Python object implementing __int__; emits a C int.
- byte
- Accepts a Python int; emits an unsigned char. The integer must be in the range [0, 256).
- str
- Accepts a Python str object; emits a C char *. Automatically encodes the string using the ascii codec.
- PyObject
- Accepts any object; emits a C PyObject * without any conversion.
All converters accept the following parameters:
- doc_default
- The Python value to use in place of the parameter's actual default in Python contexts. In other words: when specified, this value will be used for the parameter's default in the docstring, and in the Signature. (TBD alternative semantics: If the string is a valid Python expression which can be rendered into a Python value using eval(), then the result of eval() on it will be used as the default in the Signature.) Ignored if there is no default.
- required
- Normally any parameter that has a default value is automatically optional. A parameter that has "required" set will be considered required (non-optional) even if it has a default value. The generated documentation will also not show any default value.
Additionally, converters may accept one or more of these optional parameters, on an individual basis:
- annotation
- Explicitly specifies the per-parameter annotation for this parameter. Normally it's the responsibility of the conversion function to generate the annotation (if any).
- bitwise
- For converters that accept unsigned integers. If the Python integer passed in is signed, copy the bits directly even if it is negative.
- encoding
- For converters that accept str. Encoding to use when encoding a Unicode string to a char *.
- immutable
- Only accept immutable values.
- length
- For converters that accept iterable types. Requests that the converter also emit the length of the iterable, passed in to the _impl function in a Py_ssize_t variable; its name will be this parameter's name appended with "_length".
- nullable
- This converter normally does not accept None, but in this case it should. If None is supplied on the Python side, the equivalent C argument will be NULL. (The _impl argument emitted by this converter will presumably be a pointer type.)
- types
- A list of strings representing acceptable Python types for this object. There are also four strings which represent Python protocols:
- "buffer"
- "mapping"
- "number"
- "sequence"
- zeroes
- For converters that accept string types. The converted value should be allowed to have embedded zeroes.
Return Converters
A return converter conceptually performs the inverse operation of a converter: it converts a native C value into its equivalent Python value.
Directives
Argument Clinic also permits "directives" in Clinic code blocks. Directives are similar to pragmas in C; they are statements that modify Argument Clinic's behavior.
The format of a directive is as follows:
directive_name [argument [second_argument [ ... ]]]
Directives only take positional arguments.
A Clinic code block must contain either one or more directives, or a function declaration. It may contain both, in which case all directives must come before the function declaration.
Internally, directives map directly to Python callables. The directive's arguments are passed to the callable as positional arguments, each of type str.
Example possible directives include the production, suppression, or redirection of Clinic output. Also, the "module" and "class" keywords are implemented as directives in the prototype.
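A minimal sketch of that directive-to-callable mapping (names here are illustrative; this is not the prototype's actual registry code):

```python
# Registry mapping directive names to Python callables.
registry = {}

def directive(fn):
    """Register a callable under its own name as a directive."""
    registry[fn.__name__] = fn
    return fn

@directive
def module(name):
    # A "module" directive would record the enclosing module's name.
    return "module " + name

def run_directive(line):
    # Directives take only positional arguments, all passed as str.
    name, *args = line.split()
    return registry[name](*args)

print(run_directive("module os"))  # module os
```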
Python Code
Argument Clinic also permits embedding Python code inside C files, which is executed in-place when Argument Clinic processes the file. Embedded code looks like this:
/*[python]
# this is python code!
print("/" + "* Hello world! *" + "/")
[python]*/
/* Hello world! */
/*[python end:da39a3ee5e6b4b0d3255bfef95601890afd80709]*/
The "/* Hello world! */" line above was generated by running the Python code in the preceding comment.
Any Python code is valid. Python code sections in Argument Clinic can also be used to directly interact with Clinic; see Argument Clinic Programmatic Interfaces.
Output
Argument Clinic writes its output inline in the C file, immediately after the section of Clinic code. For "python" sections, the output is everything printed using builtins.print. For "clinic" sections, the output is valid C code, including:
- a #define providing the correct methoddef structure for the function
- a prototype for the "impl" function -- this is what you'll write to implement this function
- a function that handles all argument processing, which calls your "impl" function
- the definition line of the "impl" function
- and a comment indicating the end of output.
The intention is that you write the body of your impl function immediately after the output -- as in, you write a left-curly-brace immediately after the end-of-output comment and implement the builtin in the body there. (It's a bit strange at first, but oddly convenient.)
Argument Clinic will define the parameters of the impl function for you. The function will take the "self" parameter passed in originally, all the parameters you define, and possibly some extra generated parameters ("length" parameters; also "group" parameters, see next section).
Argument Clinic also writes a checksum for the output section. This is a valuable safety feature: if you modify the output by hand, Clinic will notice that the checksum doesn't match, and will refuse to overwrite the file. (You can force Clinic to overwrite with the "-f" command-line argument; Clinic will also ignore the checksums when using the "-o" command-line argument.)
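Checksum verification of this kind is straightforward to sketch in Python. (The PEP does not specify the hash; SHA-1 is assumed here because the sample checksum shown earlier, da39a3ee5e6b4b0d3255bfef95601890afd80709, is the SHA-1 digest of empty output.)

```python
import hashlib

def output_is_unmodified(output: str, expected_hex: str) -> bool:
    """Recompute the digest of a generated section and compare it
    against the checksum recorded in the end-of-output comment."""
    return hashlib.sha1(output.encode("utf-8")).hexdigest() == expected_hex

# The empty-output case from the sample Python-code block:
print(output_is_unmodified("", "da39a3ee5e6b4b0d3255bfef95601890afd80709"))
# Any hand edit changes the digest:
print(output_is_unmodified("x", "da39a3ee5e6b4b0d3255bfef95601890afd80709"))
```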
Finally, Argument Clinic can also emit the boilerplate definition of the PyMethodDef array for the defined classes and modules.
Functions With Positional-Only Parameters
A significant fraction of Python builtins implemented in C use the older positional-only API for processing arguments (PyArg_ParseTuple()). In some instances, these builtins parse their arguments differently based on how many arguments were passed in. This can provide some bewildering flexibility: there may be groups of optional parameters, which must either all be specified or none specified. And occasionally these groups are on the left! (A representative example: curses.window.addch().)
Argument Clinic supports these legacy use-cases by allowing you to specify parameters in groups. Each optional group of parameters is marked with square brackets. Note that these groups are permitted on the right or left of any required parameters!
The impl function generated by Clinic will add an extra parameter for every group, "int group_{left|right}_<x>", where x is a monotonically increasing number assigned to each group as it builds away from the required arguments. This argument will be nonzero if the group was specified on this call, and zero if it was not.
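For the curses.window.addch declaration shown earlier ([y, x] ch [attr]), the argument-count-to-groups mapping is unambiguous, and the group flags the impl function would receive can be sketched as follows (a hypothetical illustration of the semantics, not Clinic's generated code):

```python
# Required arity is 1 (ch); the left group [y, x] adds two arguments,
# the right group [attr] adds one, so each argument count selects
# exactly one combination of groups.
def resolve_groups(nargs):
    mapping = {
        1: dict(group_left_1=0, group_right_1=0),  # ch
        2: dict(group_left_1=0, group_right_1=1),  # ch, attr
        3: dict(group_left_1=1, group_right_1=0),  # y, x, ch
        4: dict(group_left_1=1, group_right_1=1),  # y, x, ch, attr
    }
    try:
        return mapping[nargs]
    except KeyError:
        raise TypeError("addch requires 1 to 4 arguments") from None

print(resolve_groups(3))  # {'group_left_1': 1, 'group_right_1': 0}
```

An ambiguous set of groups would be one where two different combinations yield the same total count, which is exactly the situation Clinic aborts on.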
Note that when operating in this mode, you cannot specify default arguments.
Also, note that it's possible to specify a set of groups to a function such that there are several valid mappings from the number of arguments to a valid set of groups. If this happens, Clinic will abort with an error message. This should not be a problem, as positional-only operation is only intended for legacy use cases, and all the legacy functions using this quirky behavior have unambiguous mappings.
Current Status
As of this writing, there is a working prototype implementation of Argument Clinic available online (though the syntax may be out of date as you read this). [6] The prototype generates code using the existing PyArg_Parse APIs. It supports translating to all current format units except the mysterious "w*". Sample functions using Argument Clinic exercise all major features, including positional-only argument parsing.
Argument Clinic Programmatic Interfaces
The prototype also currently provides an experimental extension mechanism, allowing adding support for new types on-the-fly. See Modules/posixmodule.c in the prototype for an example of its use.
In the future, Argument Clinic is expected to be automatable enough to allow querying, modification, or outright new construction of function declarations through Python code. It may even permit dynamically adding your own custom DSL!
Notes / TBD
The API for supplying inspect.Signature metadata for builtins is currently under discussion. Argument Clinic will add support for the prototype when it becomes viable.
Nick Coghlan suggests that we a) only support at most one left-optional group per function, and b) in the face of ambiguity, prefer the left group over the right group. This would solve all our existing use cases including range().
Optimally we'd want Argument Clinic run automatically as part of the normal Python build process. But this presents a bootstrapping problem; if you don't have a system Python 3, you need a Python 3 executable to build Python 3. I'm sure this is a solvable problem, but I don't know what the best solution might be. (Supporting this will also require a parallel solution for Windows.)
On a related note: inspect.Signature has no way of representing blocks of arguments, like the left-optional block of y and x for curses.window.addch. How far are we going to go in supporting this admittedly aberrant parameter paradigm?
During the PyCon US 2013 Language Summit, there was discussion of having Argument Clinic also generate the actual documentation (in ReST, processed by Sphinx) for the function. The logistics of this are TBD, but it would require that the docstrings be written in ReST, and require that Python ship a ReST -> ascii converter. It would be best to come to a decision about this before we begin any large-scale conversion of the CPython source tree to using Clinic.
Guido proposed having the "function docstring" be hand-written inline, in the middle of the output, something like this:
/*[clinic]
... prototype and parameters (including parameter docstrings) go here
[clinic]*/
... some output ...
/*[clinic docstring start]*/
... hand-edited function docstring goes here   <-- you edit this by hand!
/*[clinic docstring end]*/
... more output
/*[clinic output end]*/
I tried it this way and don't like it -- I think it's clumsy. I prefer that everything you write goes in one place, rather than having an island of hand-edited stuff in the middle of the DSL output.
Argument Clinic does not support automatic tuple unpacking (the "(OOO)" style format string for PyArg_ParseTuple().)
Argument Clinic removes some dynamism / flexibility. With PyArg_ParseTuple() one could theoretically pass in different encodings at runtime for the "es"/"et" format units. AFAICT CPython doesn't do this itself, however it's possible external users might do this. (Trivia: there are no uses of "es" exercised by regrtest, and all the uses of "et" exercised are in socketmodule.c, except for one in _ssl.c. They're all static, specifying the encoding "idna".)
Acknowledgements
The PEP author wishes to thank Ned Batchelder for permission to shamelessly rip off his clever design for Cog--"my favorite tool that I've never gotten to use". Thanks also to everyone who provided feedback on the [bugtracker issue] and on python-dev. Special thanks to Nick Coghlan and Guido van Rossum for a rousing two-hour in-person deep dive on the topic at PyCon US 2013.
References
| [Cog] | Cog: http://nedbatchelder.com/code/cog/ |
| [1] | PyArg_ParseTuple(): http://docs.python.org/3/c-api/arg.html#PyArg_ParseTuple |
| [2] | PyArg_ParseTupleAndKeywords(): http://docs.python.org/3/c-api/arg.html#PyArg_ParseTupleAndKeywords |
| [3] | PyArg_ format units: http://docs.python.org/3/c-api/arg.html#strings-and-buffers |
| [4] | Keyword parameters for extension functions: http://docs.python.org/3/extending/extending.html#keyword-parameters-for-extension-functions |
| [5] | Guido van Rossum, posting to python-ideas, March 2012: http://mail.python.org/pipermail/python-ideas/2012-March/014364.html and http://mail.python.org/pipermail/python-ideas/2012-March/014378.html and http://mail.python.org/pipermail/python-ideas/2012-March/014417.html |
| [6] | Argument Clinic prototype: https://bitbucket.org/larry/python-clinic/ |
Copyright
This document has been placed in the public domain.
pep-0437 A DSL for specifying signatures, annotations and argument converters
| PEP: | 437 |
|---|---|
| Title: | A DSL for specifying signatures, annotations and argument converters |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Stefan Krah <skrah at bytereef.org> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 2013-03-11 |
| Python-Version: | 3.4 |
| Post-History: | |
| Resolution: | http://mail.python.org/pipermail/python-dev/2013-May/126117.html |
Contents
Abstract
The Python C-API currently has no mechanism for specifying and auto-generating function signatures, annotations or custom argument converters.
There are several possible approaches to the problem. Cython uses cdef definitions in .pyx files to generate the required information. However, CPython's C-API functions often require additional initialization and cleanup snippets that would be hard to specify in a cdef.
PEP 436 proposes a domain specific language (DSL) enclosed in C comments that largely resembles a per-parameter configuration file. A preprocessor reads the comment and emits an argument parsing function, docstrings and a header for the function that utilizes the results of the parsing step.
The latter function is subsequently referred to as the implementation function.
Rejection Notice
This PEP was rejected by Guido van Rossum at PyCon US 2013. However, several of the specific issues raised by this PEP were taken into account when designing the second iteration of the PEP 436 DSL [3].
Rationale
Opinions differ regarding the suitability of the PEP 436 DSL in the context of a C file. This PEP proposes an alternative DSL. The specific issues with PEP 436 that spurred the counter proposal will be explained in the final section of this PEP.
Scope
The PEP focuses exclusively on the DSL. Topics like the output locations of docstrings or the generated code are outside the scope of this PEP.
It is however vital that the DSL is suitable for generating custom argument parsers, a feature that is already implemented in Cython. Therefore, one of the goals of this PEP is to keep the DSL close to existing solutions, thus facilitating a possible inclusion of the relevant parts of Cython into the CPython source tree.
DSL overview
Type safety and annotations
A conversion from a Python to a C value is fully defined by the type of the converter function. The PyArg_Parse* family of functions accepts custom converters in addition to the well-known default converters "i", "f", etc.
This PEP views the default converters as abstract functions, regardless of how they are actually implemented.
Include/converters.h
Converter functions must be forward-declared. All converter functions shall be entered into the file Include/converters.h. The file is read by the preprocessor prior to translating .c files. This is an excerpt:
/*[converter]

##### Default converters #####

"s":   str -> const char *res;
"s*":  [str, bytes, bytearray, rw_buffer] -> Py_buffer &res;
[...]
"es#": str -> (const char *res_encoding, char **res, Py_ssize_t *res_length);
[...]

##### Custom converters #####

path_converter:           [str, bytes, int] -> path_t &res;
OS_STAT_DIR_FD_CONVERTER: [int, None]       -> int res;

[converter_end]*/
Converters are specified by their name, Python input type(s) and C output type(s). Default converters must have quoted names, custom converters must have regular names. A Python type is given by its name. If a function accepts multiple Python types, the set is written in list form.
Since the default converters may have multiple implicit return values, the C output type(s) are written according to the following convention:
The main return value must be named res. This is a placeholder for the actual variable name given later in the DSL. Additional implicit return values must be prefixed by res_.
By default the variables are passed by value to the implementation function. If the address should be passed instead, res must be prefixed with an ampersand.
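Read mechanically, the convention is straightforward to apply. The following Python sketch (hypothetical, not part of the reference implementation) splits a converter declaration into its name, accepted Python types, and C output:

```python
def parse_converter(decl):
    """Split a converter declaration such as
    'path_converter: [str, bytes, int] -> path_t &res;'
    into (name, python_types, c_output)."""
    name, rest = decl.split(":", 1)
    py_part, c_part = rest.split("->", 1)
    py_part = py_part.strip()
    # A set of accepted Python types is written in list form;
    # a single type may be written without brackets.
    if py_part.startswith("["):
        py_types = [t.strip() for t in py_part[1:-1].split(",")]
    else:
        py_types = [py_part]
    return name.strip(), py_types, c_part.strip().rstrip(";")

name, py_types, c_out = parse_converter(
    "path_converter: [str, bytes, int] -> path_t &res;")
# name == "path_converter", py_types == ["str", "bytes", "int"],
# c_out == "path_t &res" (the leading "&" signals pass-by-address)
```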
Additional declarations may be placed into .c files. Duplicate declarations are allowed as long as the function types are identical.
It is encouraged to declare custom converter types a second time right above the converter function definition. The preprocessor will then catch any mismatch between the declarations.
In order to keep the converter complexity manageable, PY_SSIZE_T_CLEAN will be deprecated and Py_ssize_t will be assumed for all length arguments.
TBD: Make a list of fantasy types like rw_buffer.
Function specifications
Keyword arguments
This example contains the definition of os.stat. The individual sections will be explained in detail. Grammatically, the whole define block consists of a function specification and an output section. The function specification in turn consists of a declaration section, an optional C-declaration section and an optional cleanup code section. Sections within the function specification are separated in yacc style by '%%':
/*[define posix_stat]
def os.stat(path: path_converter, *, dir_fd: OS_STAT_DIR_FD_CONVERTER = None,
follow_symlinks: "p" = True) -> os.stat_result: pass
%%
path_t path = PATH_T_INITIALIZE("stat", 0, 1);
int dir_fd = DEFAULT_DIR_FD;
int follow_symlinks = 1;
%%
path_cleanup(&path);
[define_end]*/
<literal C output>
/*[define_output_end]*/
Define block
The function specification block starts with a /*[define token, followed by an optional C function name, followed by a right bracket. If the C function name is not given, it is generated from the declaration name. In the example, omitting the name posix_stat would result in a C function name of os_stat.
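The naming rule can be sketched as follows; the helper is hypothetical, and the assumed flattening rule (dots become underscores) is inferred from the os.stat example above:

```python
def c_function_name(declaration_name, explicit_name=None):
    """Derive the C function name for a define block. The name
    given after /*[define, if any, takes precedence; otherwise
    the dotted declaration name is flattened."""
    if explicit_name is not None:
        return explicit_name
    return declaration_name.replace(".", "_")

# c_function_name("os.stat", "posix_stat") -> "posix_stat"
# c_function_name("os.stat") -> "os_stat"
```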
Declaration
The required declaration is (almost) a valid Python function definition. The 'def' keyword and the function body are redundant, but the author of this PEP finds the definition more readable if they are present.
The function name may be a path instead of a plain identifier. Each argument is annotated with the name of the converter function that will be applied to it.
Default values are given in the usual Python manner and may be any valid Python expression.
The return value may be any Python expression. Usually it will be the name of an object, but alternative return values could be specified in list form.
C-declarations
This optional section contains C variable declarations. Since the converter functions have been declared beforehand, the preprocessor can type-check the declarations.
Cleanup
The optional cleanup section contains literal C code that will be inserted unmodified after the implementation function.
Output
The output section contains the code emitted by the preprocessor.
Positional-only arguments
Functions that do not take keyword arguments are indicated by the presence of the slash special parameter:
/*[define stat_float_times]
def os.stat_float_times(/, newval: "i") -> os.stat_result: pass
%%
int newval = -1;
[define_end]*/
The preprocessor translates this definition to a PyArg_ParseTuple() call. All arguments to the right of the slash are optional arguments.
Left and right optional arguments
Some legacy functions contain optional argument groups both to the left and right of a central parameter. It is debatable whether a new tool should support such functions. For completeness' sake, this is the proposed syntax:
/*[define]
def curses.window.addch(y: "i", x: "i", ch: "O", attr: "l") -> None: pass
where groups = [[ch], [ch, attr], [y, x, ch], [y, x, ch, attr]]
[define_end]*/
Here ch is the central parameter, attr can optionally be added on the right, and the group [y, x] can optionally be added on the left.
Essentially the rule is that all ordered combinations of the central parameter and the optional groups must be possible such that no two combinations have the same length.
This is concisely expressed by putting the central parameter first in the list and then adding the optional argument groups to the left and right.
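The distinct-length rule can be checked mechanically. A minimal sketch, using the addch groups from the example above (`valid_groups` is a hypothetical helper, not part of the reference implementation):

```python
def valid_groups(groups):
    """Check the stated rule: no two of the ordered parameter
    combinations may have the same length, since at call time
    the number of positional arguments passed must select
    exactly one group."""
    lengths = [len(g) for g in groups]
    return len(lengths) == len(set(lengths))

# The groups from the curses.window.addch example:
addch_groups = [["ch"], ["ch", "attr"],
                ["y", "x", "ch"], ["y", "x", "ch", "attr"]]
# valid_groups(addch_groups) -> True (lengths 1, 2, 3, 4 are distinct)
```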
Flexibility in formatting
If the above os.stat example is considered too compact, it can easily be formatted this way:
/*[define posix_stat]
def os.stat(path: path_converter,
*,
dir_fd: OS_STAT_DIR_FD_CONVERTER = None,
follow_symlinks: "p" = True)
-> os.stat_result: pass
%%
path_t path = PATH_T_INITIALIZE("stat", 0, 1);
int dir_fd = DEFAULT_DIR_FD;
int follow_symlinks = 1;
%%
path_cleanup(&path);
[define_end]*/
<literal C output>
/*[define_output_end]*/
Benefits of a compact notation
The advantages of a concise notation are especially obvious when a large number of parameters is involved. The argument parsing part of _posixsubprocess.fork_exec is fully specified by this definition:
/*[define subprocess_fork_exec]
def _posixsubprocess.fork_exec(
process_args: "O", executable_list: "O",
close_fds: "p", py_fds_to_keep: "O",
cwd_obj: "O", env_list: "O",
p2cread: "i", p2cwrite: "i", c2pread: "i", c2pwrite: "i",
errread: "i", errwrite: "i", errpipe_read: "i", errpipe_write: "i",
restore_signals: "i", call_setsid: "i", preexec_fn: "i", /) -> int: pass
[define_end]*/
Note that the preprocess tool currently emits a redundant C-declaration section for this example, so the output is longer than necessary.
Easy validation of the definition
How can an inexperienced user validate a definition like os.stat? Simply by changing os.stat to os_stat, defining missing converters and pasting the definition into the Python interactive interpreter!
In fact, a converters.py module could be auto-generated from converters.h.
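The trick works today. Assuming stand-in converters and result type (hypothetical here; in the envisioned setup they would come from the auto-generated converters.py), the os.stat declaration can be pasted almost verbatim:

```python
# Stand-ins for the converters declared in Include/converters.h.
def path_converter(obj): return obj
def OS_STAT_DIR_FD_CONVERTER(obj): return obj

class os_stat_result:  # stand-in for os.stat_result
    pass

# The declaration from the define block, with os.stat renamed to
# os_stat -- the interpreter now syntax-checks it for free.
def os_stat(path: path_converter, *,
            dir_fd: OS_STAT_DIR_FD_CONVERTER = None,
            follow_symlinks: "p" = True) -> os_stat_result: pass

# The annotations are then available for inspection, e.g.
# os_stat.__annotations__["path"] is path_converter
```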
Reference implementation
A reference implementation is available at issue 16612 [1]. Since this PEP was written under time constraints and the author is unfamiliar with the PLY toolchain, the software is written in Standard ML and utilizes the ml-yacc/ml-lex toolchain.
The grammar is conflict-free and available in ml-yacc readable BNF form.
Two tools are available:
- printsemant reads a converter header and a .c file and dumps the semantically checked parse tree to stdout.
- preprocess reads a converter header and a .c file and dumps the preprocessed .c file to stdout.
Known deficiencies:
- The Python 'test' expression is not semantically checked. The syntax however is checked since it is part of the grammar.
- The lexer does not handle triple quoted strings.
- C declarations are parsed in a primitive way. The final implementation should utilize 'declarator' and 'init-declarator' from the C grammar.
- The preprocess tool does not emit code for the left-and-right optional arguments case. The printsemant tool can deal with this case.
- Since the preprocess tool generates the output from the parse tree, the original indentation of the define block is lost.
Grammar
TBD: The grammar exists in ml-yacc readable form, but should probably be included here in EBNF notation.
Comparison with PEP 436
The author of this PEP has the following concerns about the DSL proposed in PEP 436:
The whitespace-sensitive, configuration-file-like syntax looks out of place in a C file.
The structure of the function definition gets lost in the per-parameter specifications. Keywords like positional-only, required and keyword-only are scattered across too many different places.
By contrast, in the alternative DSL the structure of the function definition can be understood at a single glance.
The PEP 436 DSL has 14 documented flags and at least one undocumented (allow_fd) flag. Figuring out which of the 2**15 possible combinations are valid places an unnecessary burden on the user.
Experience with the PEP-3118 buffer flags has shown that sorting out (and exhaustively testing!) valid combinations is an extremely tedious task. The PEP-3118 flags are still not well understood by many people.
By contrast, the alternative DSL has a central file Include/converters.h that can be quickly searched for the desired converter. Many of the converters are already known, perhaps even memorized by people (due to frequent use).
The PEP 436 DSL allows too much freedom. Types can apparently be omitted, the preprocessor accepts (and ignores) unknown keywords, and sometimes adding whitespace after a docstring results in an assertion error.
The alternative DSL on the other hand allows no such freedoms. Omitting converter or return value annotations is plainly a syntax error. The LALR(1) grammar is unambiguous and specified for the complete translation unit.
Copyright
This document is licensed under the Open Publication License [2].
References and Footnotes
| [1] | http://bugs.python.org/issue16612 |
| [2] | http://www.opencontent.org/openpub/ |
| [3] | http://hg.python.org/peps/rev/a2fa10b2424b |
pep-0438 Transitioning to release-file hosting on PyPI
| PEP: | 438 |
|---|---|
| Title: | Transitioning to release-file hosting on PyPI |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Holger Krekel <holger at merlinux.eu>, Carl Meyer <carl at oddbird.net> |
| BDFL-Delegate: | Richard Jones <richard@python.org> |
| Discussions-To: | distutils-sig at python.org |
| Status: | Accepted |
| Type: | Process |
| Content-Type: | text/x-rst |
| Created: | 15-Mar-2013 |
| Post-History: | 19-May-2013 |
| Resolution: | http://mail.python.org/pipermail/distutils-sig/2013-May/020773.html |
Contents
Abstract
This PEP proposes a backward-compatible two-phase transition process to make installing from the pypi.python.org (PyPI) package index faster, simpler and more robust. To ease the transition and minimize client-side friction, no changes to distutils or existing installation tools are required in order to benefit from the first transition phase, which will result in faster, more reliable installs for most existing packages.
The first transition phase implements easy and explicit means for a package maintainer to control which release file links are served to present-day installation tools. The first phase also includes the implementation of analysis tools for present-day packages, to support communication with package maintainers and the automated setting of default modes for controlling release file links. The first phase also will default newly-registered projects on PyPI to only serve links to release files which were uploaded to PyPI.
The second transition phase concerns end-user installation tools, which shall default to only install release files that are hosted on PyPI and tell the user if external release files exist, offering a choice to automatically use those external files. External release files shall in the future be registered together with a checksum hash so that installation tools can verify the integrity of the eventual download (PyPI-hosted release files always carry such a checksum).
Alternative PyPI server implementations should implement the new simple index serving behaviour of transition phase 1 to avoid installation tools treating their release links as external ones in phase 2.
Rationale
History and motivations for external hosting
When PyPI went online, it offered release registration but had no facility to host release files itself. When hosting was added, no automated downloading tool existed yet. When Phillip Eby implemented automated downloading (through setuptools), he made the choice to allow people to use download hosts of their choice. The finding of externally-hosted packages was implemented as follows:
- The PyPI simple/ index for a package contains all links found by scraping them from that package's long_description metadata for any release. Links in the "Download-URL" and "Home-page" metadata fields are given rel=download and rel=homepage attributes, respectively.
- Any of these links whose target is a file whose name appears to be in the form of an installable source or binary distribution, with name in the form "packagename-version.ARCHIVEEXT", is considered a potential installation candidate by installation tools.
- Similarly, any links suffixed with an "#egg=packagename-version" fragment are considered an installation candidate.
- Additionally, the rel=homepage and rel=download links are crawled by installation tools and, if HTML, are themselves scraped for release-file links in the above formats.
See the easy_install documentation for a complete description of this behavior. [1]
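The scraping heuristic can be sketched roughly as follows; the archive-extension list and function name are illustrative, not easy_install's actual implementation:

```python
# Illustrative subset of extensions that look like distributions.
ARCHIVE_EXTS = (".tar.gz", ".tar.bz2", ".zip", ".egg", ".exe")

def is_candidate(url, package):
    """Rough sketch of the heuristic described above: a link is an
    installation candidate if it names a file in the form
    'packagename-version.ARCHIVEEXT' or carries an #egg= fragment."""
    if ("#egg=" + package + "-") in url:
        return True
    filename = url.rsplit("/", 1)[-1]
    return (filename.startswith(package + "-")
            and filename.endswith(ARCHIVE_EXTS))

# is_candidate("http://example.com/foo-1.0.tar.gz", "foo") -> True
# is_candidate("http://example.com/about.html", "foo") -> False
```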
Today, most packages indexed on PyPI host their release files on PyPI. Out of 29,117 total projects on PyPI, only 2,581 (less than 10%) include any links to installable files that are available only off-PyPI. [2]
There are many reasons [3] why people have chosen external hosting. To cite just a few:
- release processes and scripts have been developed already and upload to external sites
- it takes too long to upload large files from some places in the world
- export restrictions e.g. for crypto-related software
- company policies which require offering open source packages through own sites
- problems with integrating uploading to PyPI into one's release process (because of release policies)
- desiring download statistics different from those maintained by PyPI
- perceived bad reliability of PyPI
- not aware that PyPI offers file-hosting
Irrespective of the present-day validity of these reasons, there is clearly a history behind why people chose to host files externally; for some time it was the only option. This PEP takes the position that some valid reasons for external hosting remain even today.
Problem
Today, Python package installers (pip, easy_install, buildout, and others) often need to query many non-PyPI URLs even if there are no externally hosted files. Apart from querying pypi.python.org's simple index pages, an installer also crawls every homepage and download page ever specified with any release of a package. The need for installers to crawl external sites slows down installation and makes for a brittle and unreliable installation process. Those sites and packages also don't take part in the PEP 381 mirroring infrastructure, further decreasing reliability and speed of automated installation processes around the world.
Most packages are hosted directly on pypi.python.org [2]. Even for these packages, installers still crawl their homepage and download-url, if specified. Many package uploaders are not aware that specifying the "homepage" or "download-url" in their package metadata will needlessly slow down the installation process for all users.
Relying on third party sites also opens up more attack vectors for injecting malicious packages into sites using automated installs. A simple attack might just involve getting hold of an old now-unused homepage domain and placing malicious packages there. Moreover, performing a Man-in-The-Middle (MITM) attack between an installation site and any of the download sites can inject malicious packages on the installation site. As many homepages and download locations are using HTTP and not HTTPS, such attacks are not hard to launch. Such MITM attacks can easily happen even for packages which never intended to host files externally as their homepages are contacted by installers anyway.
There is currently no way for package maintainers to avoid external-link crawling, other than removing all homepage/download url metadata for all historic releases. While a script [4] has been written to perform this action, it is not a good general solution because it removes useful metadata from PyPI releases.
Even if the sites referenced by "Homepage" and "Download-URL" links were not scraped for further links, there is no obvious way under the current system for a package owner to link to an installable file from a long_description metadata field (which is shown as package documentation on /pypi/PKG) without installation tools automatically considering that file a candidate for installation. Conversely, there is no way to explicitly register multiple external release files without putting them in metadata fields.
Goals
These are the goals to be achieved by implementation of this PEP:
- Package owners should be able to explicitly control which files are presented by PyPI to installer tools as installation candidates. Installation should not be slowed and made less reliable by extensive and unnecessary crawling of links that package owners did not explicitly nominate as installation files.
- It should remain possible for package owners to choose to host their release files on their own hosting, external to PyPI. It should be easy for a user to request the installation of such releases using automated installer tools, especially if the external release files were registered together with a checksum hash.
- Automated installer tools should not install externally-hosted packages by default, but require explicit authorization to do so by the user. When tools refuse to install such a package by default, they should tell the user exactly which external link(s) the installer needs to follow, and what option(s) the user can provide to authorize the tool to follow those links. PyPI should provide all necessary metadata for installer tools to implement this easily and within a single request/reply interaction.
- Migration from the status quo to the above points should be gradual and minimize breakage. This includes tooling that makes it easy for package owners with an existing release process that uploads to non-PyPI hosting to also upload those release files to PyPI.
Solution / two transition phases
The first transition phase introduces a "hosting-mode" field for each project on PyPI, allowing package owners explicit control of which release file links are served to present-day installation tools in the machine-readable simple/ index. After individual early adopters have successfully exercised the hosting-mode controls, the first phase will set a default hosting mode for existing packages, based on automated analysis. Maintainers will be notified one month ahead of any such automated change. At completion of the first transition phase, all present-day existing release and installation processes and tools are expected to continue working. Any remaining errors or problems are expected to only relate to installation of individual packages and can be easily corrected by package maintainers or PyPI admins if maintainers are not reachable.
Also in the first phase, each link served in the simple/ index will be explicitly marked as rel="internal" if it is hosted by the index itself (even if on a separate domain, which may be the case if the index uses a CDN for file-serving). Any link not so marked will be considered an external link.
In the second transition phase, PyPI client installation tools shall be updated to default to only install rel="internal" packages unless a user specifies option(s) to permit installing from external links. See second transition phase for details on how installers should behave.
Maintainers of packages which currently host release files on non-PyPI sites shall receive instructions and tools to ease "re-hosting" of their historic and future package release files. This re-hosting tool MUST be available before automated hosting-mode changes are announced to package maintainers.
Implementation
Hosting modes
The foundation of the first transition phase is the introduction of three "modes" of PyPI hosting for a package, affecting which links are generated for the simple/ index. These modes are implemented without requiring changes to installation tools via changes to the algorithm for generating the machine-readable simple/ index.
The modes are:
- pypi-scrape-crawl: no change from the current situation of generating machine-readable links for installation tools, as outlined in the history.
- pypi-scrape: for a package in this mode, links to be added to the simple/ index are still scraped from package metadata. However, the "Home-page" and "Download-url" links are given rel=ext-homepage and rel=ext-download attributes instead of rel=homepage and rel=download. The effect of this (with no change in installation tools necessary) is that these links will not be followed and scraped for further candidate links by present-day installation tools: only installable files directly hosted from PyPI or linked directly from PyPI metadata will be considered for installation. Installation tools MAY evolve to offer an option to use the new rel-attribution to crawl external pages but MUST NOT default to it.
- pypi-explicit: for a package in this mode, only links to release files uploaded to PyPI, and external links to release files explicitly nominated by the package owner, will be added to the simple/ index. PyPI will provide a new interface for package owners to supply external release-file URLs. These URLs MUST include a URL fragment in the form "#hashtype=hashvalue" specifying a hash of the externally-linked file which installer tools MUST use to validate that they have downloaded the intended file.
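The hash-fragment check an installer would apply to such an explicitly registered URL can be sketched as follows (function name and example URL are illustrative):

```python
import hashlib

def verify_external_link(url, data):
    """Sketch of the '#hashtype=hashvalue' check: recompute the
    named hash over the downloaded bytes and compare."""
    if "#" not in url:
        raise ValueError("no hash fragment: link is not verifiable")
    hashtype, _, expected = url.rpartition("#")[2].partition("=")
    actual = hashlib.new(hashtype, data).hexdigest()
    return actual == expected

payload = b"example release file"
digest = hashlib.md5(payload).hexdigest()
url = "http://host.example/pkg-1.0.tar.gz#md5=" + digest
# verify_external_link(url, payload) -> True; any tampering with
# the downloaded bytes makes the check fail.
```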
Thus the hope is that eventually all projects on PyPI can be migrated to the pypi-explicit mode, while preserving the ability to install release files hosted externally via installer tools. Deprecation of hosting modes to eventually only allow the pypi-explicit mode is NOT REGULATED by this PEP but is expected to become feasible some time after successful implementation of the transition phases described in this PEP. It is expected that deprecation requires a new process to deal with abandoned packages because of unreachable maintainers for still popular packages.
First transition phase (PyPI)
The proposed solution consists of multiple implementation and communication steps:
- Implement in PyPI the three modes described above, with an interface for package owners to select the mode for each package and register explicit external file URLs.
- For packages in all modes, label links in the simple/ index to index-hosted files with rel="internal", to make it easier for client tools to distinguish these links in the second phase.
- Add an HTML tag <meta name="api-version" value="2"> to all simple/ index pages, to allow clients to distinguish between indexes providing the rel="internal" metadata and older ones that do not.
- Default all newly-registered packages to pypi-explicit mode (package owners can still switch to the other modes as desired).
- Determine (via automated analysis [2]) which packages have all installable files available on PyPI itself (group A), which have all installable files on PyPI or linked directly from PyPI metadata (group B), and which have installable versions available that are linked only from external homepage/download HTML pages (group C).
- Send mail to maintainers of projects in group A that their project will be automatically configured to pypi-explicit mode in one month, and similarly to maintainers of projects in group B that their project will be automatically configured to pypi-scrape mode. Inform them that this change is not expected to affect installability of their project at all, but will result in faster and safer installs for their users. Encourage them to set this mode themselves sooner to benefit their users.
- Send mail to maintainers of packages in group C that their package hosting mode is pypi-scrape-crawl, list the URLs which currently are crawled, and suggest that they either re-host their packages directly on PyPI and switch to pypi-explicit, or at least provide direct links to release files in PyPI metadata and switch to pypi-scrape. Provide instructions and tools to help with these transitions.
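On the client side, the phase-1 index additions are easy to consume. A minimal sketch of a reader that records the advertised api-version and separates internal from external links (the class name is illustrative, not an actual installer API):

```python
from html.parser import HTMLParser

class SimpleIndexParser(HTMLParser):
    """Collect the api-version meta tag and split anchor links
    into internal (rel="internal") and external buckets."""
    def __init__(self):
        super().__init__()
        self.api_version = None
        self.internal, self.external = [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "api-version":
            self.api_version = attrs.get("value")
        elif tag == "a" and "href" in attrs:
            bucket = (self.internal if attrs.get("rel") == "internal"
                      else self.external)
            bucket.append(attrs["href"])

page = ('<meta name="api-version" value="2">'
        '<a rel="internal" href="/packages/foo-1.0.tar.gz">foo</a>'
        '<a href="http://elsewhere.example/foo-1.1.zip">foo</a>')
parser = SimpleIndexParser()
parser.feed(page)
# parser.api_version == "2"; one internal and one external link
```

A missing api-version tells the client it is talking to a pre-PEP index and should fall back to the old behaviour.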
Second transition phase (installer tools)
For the second transition phase, maintainers of installation tools are asked to release two updates.
The first update shall provide clear warnings if externally-hosted release files (that is, files whose link does not include rel="internal") are selected for download, state exactly for which projects and URLs this happens, and warn that in future versions externally-hosted downloads will be disabled by default.
The second update should change the default mode to allow only installation of rel="internal" package files, and allow installation of externally-hosted packages only when the user supplies an option.
The installer should distinguish between verifiable and non-verifiable external links. A verifiable external link is a direct link to an installable file from the PyPI simple/ index that includes a hash in the URL fragment ("#hashtype=hashvalue") which can be used to verify the integrity of the downloaded file. A non-verifiable external link is any link (other than those explicitly supplied by the user of an installer tool) without a hash, scraped from external HTML, or injected into the search via some other non-PyPI source (e.g. setuptools' dependency_links feature).
Installers should provide a blanket option to allow installing any verifiable external link. Non-verifiable external links should only be installed if the user-provided option specifies exactly which external domains can be used or for which specific package names external links can be used.
When download of an externally-hosted package is disallowed by the default configuration, the user should be notified, with instructions for how to make the install succeed and warnings about the implication (that a file will be downloaded from a site that is not part of the package index). The warning given for non-verifiable links should clearly state that the installer cannot verify the integrity of the downloaded file. The warning given for verifiable external links should simply note that the file will be downloaded from an external URL, but that the file integrity can be verified by checksum.
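The authorization policy described above might look roughly like this; all parameter and function names are illustrative, not an actual installer API:

```python
from urllib.parse import urlsplit

def may_install(url, rel=None, allow_verifiable=False,
                allowed_hosts=(), package=None, allowed_packages=()):
    """Sketch of the decision rule: internal links always pass;
    verifiable external links pass under a blanket option;
    non-verifiable links need the host or package whitelisted."""
    if rel == "internal":
        return True
    # A verifiable link carries a "#hashtype=hashvalue" fragment.
    if allow_verifiable and "=" in urlsplit(url).fragment:
        return True
    return (urlsplit(url).netloc in allowed_hosts
            or package in allowed_packages)

# may_install("/packages/foo-1.0.tar.gz", rel="internal") -> True
# may_install("http://x.example/foo-1.0.tar.gz#md5=abc",
#             allow_verifiable=True) -> True
# may_install("http://x.example/foo-1.0.tar.gz") -> False
```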
Alternative PyPI-compatible index implementations should upgrade to begin providing the rel="internal" metadata and the <meta name="api-version" value="2"> tag as soon as possible. For alternative indexes which do not yet provide the meta tag in their simple/ pages, installation tools should provide backwards-compatible fallback behavior (treat links as internal as in pre-PEP times and provide a warning).
API For Submitting External Distribution URLs
New distribution URLs may be submitted by performing a HTTP POST to the URL:
https://pypi.python.org/pypi
With the following form-encoded data:
| Name | Value |
|---|---|
| :action | The string "urls" |
| name | The package name as a string |
| version | The release version as a string |
| new-url | The new URL to store |
| submit_new_url | The string "yes" |
The POST must be accompanied by an HTTP Basic Auth header encoding the username and password of the user authorized to maintain the package on PyPI.
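Assembling such a request is straightforward with the standard library. This sketch builds (but does not send) the POST; the package data and credentials are placeholders:

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request

def build_url_submission(name, version, new_url, username, password):
    """Construct the form-encoded POST described above, with an
    HTTP Basic Auth header. Sending it is left to the caller."""
    data = urlencode({
        ":action": "urls",
        "name": name,
        "version": version,
        "new-url": new_url,
        "submit_new_url": "yes",
    }).encode("ascii")
    creds = base64.b64encode(
        ("%s:%s" % (username, password)).encode("utf-8")).decode("ascii")
    return Request("https://pypi.python.org/pypi", data=data,
                   headers={"Authorization": "Basic " + creds})

req = build_url_submission(
    "example-pkg", "1.0",
    "http://host.example/example-pkg-1.0.tar.gz#md5=abc",
    "user", "secret")
# req.get_method() == "POST" (a request with a body defaults to POST)
```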
The HTTP response to this request will be one of:
| Code | Meaning | URL submission implications |
|---|---|---|
| 200 | OK | Everything worked just fine |
| 400 | Bad request | Data provided for submission was malformed |
| 401 | Unauthorised | The username or password supplied were incorrect |
| 403 | Forbidden | User does not have permission to update the package information (not Owner or Maintainer) |
References
| [1] | Phillip Eby, easy_install 'Package Index "API"' documentation, http://peak.telecommunity.com/DevCenter/EasyInstall#package-index-api |
| [2] | (1, 2, 3) Donald Stufft, automated analysis of PyPI project links, https://github.com/dstufft/pypi.linkcheck |
| [3] | Marc-Andre Lemburg, reasons for external hosting, http://mail.python.org/pipermail/catalog-sig/2013-March/005626.html |
| [4] | Holger Krekel, script to remove homepage/download metadata for all releases http://mail.python.org/pipermail/catalog-sig/2013-February/005423.html |
Acknowledgments
Phillip Eby for precise information and the basic ideas to implement the transition via server-side changes only.
Donald Stufft for pushing away from external hosting and offering to implement both a Pull Request for the necessary PyPI changes and the analysis tool to drive the transition phase 1.
Marc-Andre Lemburg, Nick Coghlan and catalog-sig in general for thinking through issues regarding getting rid of "external hosting".
Copyright
This document has been placed in the public domain.
pep-0439 Inclusion of implicit pip bootstrap in Python installation
| PEP: | 439 |
|---|---|
| Title: | Inclusion of implicit pip bootstrap in Python installation |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Richard Jones <richard at python.org> |
| BDFL-Delegate: | Nick Coghlan <ncoghlan@gmail.com> |
| Discussions-To: | <distutils-sig at python.org> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 18-Mar-2013 |
| Python-Version: | 3.4 |
| Post-History: | 19-Mar-2013 |
| Resolution: | http://mail.python.org/pipermail/distutils-sig/2013-August/022527.html |
Contents
Abstract
This PEP proposes the inclusion of a pip bootstrap executable in the Python installation to simplify the use of 3rd-party modules by Python users.
This PEP does not propose to include the pip implementation in the Python standard library. Nor does it propose to implement any package management or installation mechanisms beyond those provided by PEP 427 ("The Wheel Binary Package Format 1.0") and TODO distlib PEP.
PEP Rejection
This PEP has been rejected in favour of a more explicit mechanism that should achieve the same end result in a more reliable fashion. The more explicit bootstrapping mechanism is described in PEP 453.
Rationale
Currently the user story for installing 3rd-party Python modules is not as simple as it could be. It requires that all 3rd-party modules inform the user of how to install the installer, typically via a link to the installer. That link may be out of date or the steps required to perform the install of the installer may be enough of a roadblock to prevent the user from further progress.
Large Python projects which emphasise a low barrier to entry have shied away from depending on third party packages because of the introduction of this potential stumbling block for new users.
With the inclusion of the package installer command in the standard Python installation the barrier to installing additional software is considerably reduced. It is hoped that this will therefore increase the likelihood that Python projects will reuse third party software.
The Python community also has an issue of complexity around the current bootstrap procedure for pip and setuptools. They all have their own bootstrap download file with slightly different usages and even refer to each other in some cases. Having a single bootstrap which is common amongst them all, with a simple usage, would be far preferable.
It is also hoped that this reduces the number of proposals to include more and more software in the Python standard library, and therefore that more popular Python software is more easily upgradeable beyond requiring Python installation upgrades.
Proposal
The bootstrap will install the pip implementation and setuptools by downloading their installation files from PyPI.
This proposal affects two components of packaging: the pip bootstrap and, thanks to easier package installation, modifications to publishing packages.
The core of this proposal is that the user experience of using pip should not require the user to install pip.
The pip bootstrap
The Python installation includes an executable called "pip3" (see PEP 394 for naming rationale etc.) that attempts to import pip machinery. If it can, the pip command proceeds as normal. If it cannot, it will bootstrap pip by downloading the pip implementation and setuptools wheel files. Hereafter the installation of the "pip implementation" will imply installation of setuptools and virtualenv. Once installed, the pip command proceeds as normal. Once the bootstrap process is complete the "pip3" command is no longer the bootstrap but rather the full pip command.
A bootstrap is used in place of the full pip code so that pip does not have to be bundled with Python and can be upgraded outside of the regular Python upgrade timeframe and processes.
To avoid issues with sudo we will have the bootstrap default to installing the pip implementation to the per-user site-packages directory defined in PEP 370 and implemented in Python 2.6/3.0. Since we avoid installing to the system Python we also avoid conflicting with any other packaging system (on Linux systems, for example). If the user is inside a virtual environment [1] then the pip implementation will be installed into that virtual environment.
The bootstrap process will proceed as follows:
- The user system has Python (3.4+) installed. In the "scripts" directory of the Python installation there is the bootstrap script called "pip3".
- The user will invoke a pip command, typically "pip3 install <package>", for example "pip3 install Django".
- The bootstrap script will attempt to import the pip implementation. If this succeeds, the pip command is processed normally. Stop.
- On failing to import the pip implementation, the bootstrap notifies the user that it needs to "install pip". It will ask the user whether it should install pip into the system-wide site-packages or as a user-only package. This choice will also be present as a command-line option to pip so non-interactive use is possible.
- The bootstrap will then contact PyPI to obtain the latest download wheel file (see PEP 427.)
- Upon downloading the file it is installed using "python setup.py install".
- The pip tool may now import the pip implementation and continues to process the requested user command normally.
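The decision flow in the steps above can be sketched as follows. The three callables are hypothetical stand-ins for importing pip, downloading and installing the wheel files, and running the real pip command; they are not part of the proposal itself.

```python
def pip3(args, *, pip_is_importable, install_pip, run_pip):
    """Bootstrap pip if needed, then hand off the user's command."""
    bootstrapped = False
    if not pip_is_importable():
        # Steps 4-6: notify/ask the user, download the wheel files,
        # and install the pip implementation.
        install_pip()
        bootstrapped = True
    # Steps 3 and 7: the pip implementation processes the command normally.
    return run_pip(args), bootstrapped
```

On the second and subsequent invocations the import succeeds, so the bootstrap branch is skipped and "pip3" behaves as the full pip command.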
Users may be running in an environment which cannot access the public Internet and are relying solely on a local package repository. They would use the "-i" (Base URL of Python Package Index) argument to the "pip3 install" command. This simply overrides the default index URL pointing to PyPI.
Some users may have no Internet access suitable for fetching the pip implementation file. These users can manually download and install the setuptools and pip tar files. Adding specific support for this use-case is unnecessary.
The download of the pip implementation install file will be performed securely. The transport from pypi.python.org will be done over HTTPS with the CA certificate check performed. This facility will be present in Python 3.4+ using Operating System certificates (see PEP XXXX).
Beyond those arguments controlling index location and download options, the "pip3" bootstrap command may support further standard pip options for verbosity, quietness and logging.
The "pip3" command will support two new command-line options that are used in the bootstrapping, and otherwise ignored. They control where the pip implementation is installed:
| --bootstrap | Install to the user's packages directory. The name of this option is chosen to promote it as the preferred installation option. |
| --bootstrap-to-system | Install to the system site-packages directory. |
These command-line options will also need to be implemented, but otherwise ignored, in the pip implementation.
Consideration should be given to defaulting pip to install packages to the user's packages directory if pip is installed in that location.
The "--no-install" option to the "pip3" command will not affect the bootstrapping process.
Modifications to publishing packages
An additional new Python package is proposed, "pypublish", which will be a tool for publishing packages to PyPI. It would replace the current "python setup.py register" and "python setup.py upload" distutils commands. Again because of the measured Python release cycle and extensive existing Python installations these commands are difficult to bugfix and extend. Additionally it is desired that the "register" and "upload" commands be able to be performed over HTTPS with certificate validation. Since shipping CA certificate keychains with Python is not really feasible (updating the keychain is quite difficult to manage) it is desirable that those commands, and the accompanying keychain, be made installable and upgradeable outside of Python itself.
The existing distutils mechanisms for package registration and upload would remain, though with a deprecation warning.
Implementation
The changes to pip required by this PEP are being tracked in that project's issue tracker [2]. Most notably, the addition of --bootstrap and --bootstrap-to-system to the pip command-line.
It would be preferable that the pip and setuptools projects distribute a wheel format download.
The required code for this implementation is the "pip3" command described above. The additional pypublish can be developed outside of the scope of this PEP's work.
Finally, it would be desirable that "pip3" be ported to Python 2.6+ to allow the single command to replace existing pip, setuptools and virtualenv (which would be added to the bootstrap) bootstrap scripts. Having that bootstrap included in a future Python 2.7 release would also be highly desirable.
Risks
The key that is used to sign the pip implementation download might be compromised and this PEP currently proposes no mechanism for key revocation.
There is a Perl package installer also named "pip". It is quite rare and not commonly used. The Fedora variant of Linux has historically named Python's "pip" as "python-pip" and Perl's "pip" as "perl-pip". This policy has been altered [3] so that future and upgraded Fedora installations will use the name "pip" for Python's "pip". Existing (non-upgraded) installations will still have the old name for the Python "pip", though the potential for confusion is now much reduced.
References
| [1] | PEP 405, Python Virtual Environments http://www.python.org/dev/peps/pep-0405/ |
| [2] | pip issue tracking work needed for this PEP https://github.com/pypa/pip/issues/863 |
| [3] | Fedora's python-pip package does not provide /usr/bin/pip https://bugzilla.redhat.com/show_bug.cgi?id=958377 |
Acknowledgments
Nick Coghlan for his thoughts on the proposal and dealing with the Red Hat issue.
Jannis Leidel and Carl Meyer for their thoughts. Marcus Smith for feedback.
Marcela Mašláňová for resolving the Fedora issue.
Copyright
This document has been placed in the public domain.
pep-0440 Version Identification and Dependency Specification
| PEP: | 440 |
|---|---|
| Title: | Version Identification and Dependency Specification |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Nick Coghlan <ncoghlan at gmail.com>, Donald Stufft <donald at stufft.io> |
| BDFL-Delegate: | Nick Coghlan <ncoghlan@gmail.com> |
| Discussions-To: | Distutils SIG <distutils-sig at python.org> |
| Status: | Accepted |
| Type: | Informational |
| Content-Type: | text/x-rst |
| Created: | 18 Mar 2013 |
| Post-History: | 30 Mar 2013, 27 May 2013, 20 Jun 2013, 21 Dec 2013, 28 Jan 2014, 08 Aug 2014, 22 Aug 2014 |
| Replaces: | 386 |
| Resolution: | https://mail.python.org/pipermail/distutils-sig/2014-August/024673.html |
Contents
- Abstract
- Definitions
- Version scheme
- Public version identifiers
- Local version identifiers
- Final releases
- Pre-releases
- Post-releases
- Developmental releases
- Version epochs
- Normalization
- Case sensitivity
- Integer Normalization
- Pre-release separators
- Pre-release spelling
- Implicit pre-release number
- Post release separators
- Post release spelling
- Implicit post release number
- Implicit post releases
- Development release separators
- Implicit development release number
- Local version segments
- Preceding v character
- Leading and Trailing Whitespace
- Examples of compliant version schemes
- Summary of permitted suffixes and relative ordering
- Version ordering across different metadata versions
- Compatibility with other version schemes
- Version specifiers
- Direct references
- Updating the versioning specification
- Summary of differences from pkg_resources.parse_version
- Summary of differences from PEP 386
- Changing the version scheme
- A more opinionated description of the versioning scheme
- Describing version specifiers alongside the versioning scheme
- Changing the interpretation of version specifiers
- Support for date based version identifiers
- Adding version epochs
- Adding direct references
- Adding arbitrary equality
- Adding local version identifiers
- Providing explicit version normalization rules
- Allowing Underscore in Normalization
- Summary of changes to PEP 440
- References
- Appendix A
- Copyright
Abstract
This PEP describes a scheme for identifying versions of Python software distributions, and declaring dependencies on particular versions.
This document addresses several limitations of the previous attempt at a standardized approach to versioning, as described in PEP 345 and PEP 386.
Definitions
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
The following terms are to be interpreted as described in PEP 426:
- "Distributions"
- "Releases"
- "Build tools"
- "Index servers"
- "Publication tools"
- "Installation tools"
- "Automated tools"
- "Projects"
Version scheme
Distributions are identified by a public version identifier which supports all defined version comparison operations.
The version scheme is used both to describe the distribution version provided by a particular distribution archive, as well as to place constraints on the version of dependencies needed in order to build or run the software.
Public version identifiers
The canonical public version identifiers MUST comply with the following scheme:
[N!]N(.N)*[{a|b|rc}N][.postN][.devN]
Public version identifiers MUST NOT include leading or trailing whitespace.
Public version identifiers MUST be unique within a given distribution.
Installation tools SHOULD ignore any public versions which do not comply with this scheme but MUST also include the normalizations specified below. Installation tools MAY warn the user when non-compliant or ambiguous versions are detected.
Public version identifiers are separated into up to five segments:
- Epoch segment: N!
- Release segment: N(.N)*
- Pre-release segment: {a|b|rc}N
- Post-release segment: .postN
- Development release segment: .devN
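The canonical scheme above can be expressed as a regular expression. This is a sketch for the normalized form only; the alternative spellings handled later under "Normalization" are deliberately not accepted here, and the pattern name is illustrative.

```python
import re

# One group per segment of the canonical public version identifier.
CANONICAL_PUBLIC_VERSION = re.compile(
    r"^([1-9][0-9]*!)?"               # epoch segment: N!
    r"(0|[1-9][0-9]*)"                # release segment: N(.N)*
    r"(\.(0|[1-9][0-9]*))*"
    r"((a|b|rc)(0|[1-9][0-9]*))?"     # pre-release segment
    r"(\.post(0|[1-9][0-9]*))?"       # post-release segment
    r"(\.dev(0|[1-9][0-9]*))?$"       # development release segment
)
```

Note that the numeric components reject leading zeros, matching the integer normalization rule described later (e.g. 09000 normalizes to 9000 before this pattern applies).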
Any given release will be a "final release", "pre-release", "post-release" or "developmental release" as defined in the following sections.
All numeric components MUST be non-negative integers.
All numeric components MUST be interpreted and ordered according to their numeric value, not as text strings.
All numeric components MAY be zero. Except as described below for the release segment, a numeric component of zero has no special significance aside from always being the lowest possible value in the version ordering.
Note
Some hard to read version identifiers are permitted by this scheme in order to better accommodate the wide range of versioning practices across existing public and private Python projects.
Accordingly, some of the versioning practices which are technically permitted by the PEP are strongly discouraged for new projects. Where this is the case, the relevant details are noted in the following sections.
Local version identifiers
Local version identifiers MUST comply with the following scheme:
<public version identifier>[+<local version label>]
They consist of a normal public version identifier (as defined in the previous section), along with an arbitrary "local version label", separated from the public version identifier by a plus. Local version labels have no specific semantics assigned, but some syntactic restrictions are imposed.
Local version identifiers are used to denote fully API (and, if applicable, ABI) compatible patched versions of upstream projects. For example, these may be created by application developers and system integrators by applying specific backported bug fixes when upgrading to a new upstream release would be too disruptive to the application or other integrated system (such as a Linux distribution).
The inclusion of the local version label makes it possible to differentiate upstream releases from potentially altered rebuilds by downstream integrators. The use of a local version identifier does not affect the kind of a release but, when applied to a source distribution, does indicate that it may not contain the exact same code as the corresponding upstream release.
To ensure local version identifiers can be readily incorporated as part of filenames and URLs, and to avoid formatting inconsistencies in hexadecimal hash representations, local version labels MUST be limited to the following set of permitted characters:
- ASCII letters ([a-zA-Z])
- ASCII digits ([0-9])
- periods (.)
Local version labels MUST start and end with an ASCII letter or digit.
Comparison and ordering of local versions considers each segment of the local version (divided by a .) separately. If a segment consists entirely of ASCII digits then that section should be considered an integer for comparison purposes, and if a segment contains any ASCII letters then that segment is compared lexicographically with case insensitivity. When comparing a numeric and lexicographic segment, the numeric section always compares as greater than the lexicographic segment. Additionally a local version with a greater number of segments will always compare as greater than a local version with fewer segments, as long as the shorter local version's segments match the beginning of the longer local version's segments exactly.
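These ordering rules can be sketched as a sort key for the local version label alone (a hypothetical helper, not part of the specification): numeric segments compare as integers, alphanumeric segments compare case-insensitively as text, and numeric segments always outrank alphanumeric ones.

```python
def local_label_key(label):
    """Sort key for a local version label such as "abc.5" or "ubuntu.1"."""
    key = []
    for segment in label.lower().split("."):
        if segment.isdigit():
            key.append((1, int(segment), ""))   # numeric outranks text
        else:
            key.append((0, 0, segment))
    return key
```

List comparison also gives the prefix rule for free: a label whose segments match the start of a longer label compares as smaller.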
An "upstream project" is a project that defines its own public versions. A "downstream project" is one which tracks and redistributes an upstream project, potentially backporting security and bug fixes from later versions of the upstream project.
Local version identifiers SHOULD NOT be used when publishing upstream projects to a public index server, but MAY be used to identify private builds created directly from the project source. Local version identifiers SHOULD be used by downstream projects when releasing a version that is API compatible with the version of the upstream project identified by the public version identifier, but contains additional changes (such as bug fixes). As the Python Package Index is intended solely for indexing and hosting upstream projects, it MUST NOT allow the use of local version identifiers.
Source distributions using a local version identifier SHOULD provide the python.integrator extension metadata (as defined in PEP 459).
Final releases
A version identifier that consists solely of a release segment and optionally an epoch identifier is termed a "final release".
The release segment consists of one or more non-negative integer values, separated by dots:
N(.N)*
Final releases within a project MUST be numbered in a consistently increasing fashion, otherwise automated tools will not be able to upgrade them correctly.
Comparison and ordering of release segments considers the numeric value of each component of the release segment in turn. When comparing release segments with different numbers of components, the shorter segment is padded out with additional zeros as necessary.
While any number of additional components after the first are permitted under this scheme, the most common variants are to use two components ("major.minor") or three components ("major.minor.micro").
For example:
0.9 0.9.1 0.9.2 ... 0.9.10 0.9.11 1.0 1.0.1 1.1 2.0 2.0.1 ...
A release series is any set of final release numbers that start with a common prefix. For example, 3.3.1, 3.3.5 and 3.3.9.45 are all part of the 3.3 release series.
Note
X.Y and X.Y.0 are not considered distinct release numbers, as the release segment comparison rules implicitly expand the two component form to X.Y.0 when comparing it to any release segment that includes three components.
Date based release segments are also permitted. An example of a date based release scheme using the year and month of the release:
2012.04 2012.07 2012.10 2013.01 2013.06 ...
Pre-releases
Some projects use an "alpha, beta, release candidate" pre-release cycle to support testing by their users prior to a final release.
If used as part of a project's development cycle, these pre-releases are indicated by including a pre-release segment in the version identifier:
X.YaN   # Alpha release
X.YbN   # Beta release
X.YrcN  # Release Candidate
X.Y     # Final release
A version identifier that consists solely of a release segment and a pre-release segment is termed a "pre-release".
The pre-release segment consists of an alphabetical identifier for the pre-release phase, along with a non-negative integer value. Pre-releases for a given release are ordered first by phase (alpha, beta, release candidate) and then by the numerical component within that phase.
Installation tools MAY accept both c and rc releases for a common release segment in order to handle some existing legacy distributions.
Installation tools SHOULD interpret c versions as being equivalent to rc versions (that is, c1 indicates the same version as rc1).
Build tools, publication tools and index servers SHOULD disallow the creation of both rc and c releases for a common release segment.
Post-releases
Some projects use post-releases to address minor errors in a final release that do not affect the distributed software (for example, correcting an error in the release notes).
If used as part of a project's development cycle, these post-releases are indicated by including a post-release segment in the version identifier:
X.Y.postN # Post-release
A version identifier that includes a post-release segment without a developmental release segment is termed a "post-release".
The post-release segment consists of the string .post, followed by a non-negative integer value. Post-releases are ordered by their numerical component, immediately following the corresponding release, and ahead of any subsequent release.
Note
The use of post-releases to publish maintenance releases containing actual bug fixes is strongly discouraged. In general, it is better to use a longer release number and increment the final component for each maintenance release.
Post-releases are also permitted for pre-releases:
X.YaN.postM   # Post-release of an alpha release
X.YbN.postM   # Post-release of a beta release
X.YrcN.postM  # Post-release of a release candidate
Note
Creating post-releases of pre-releases is strongly discouraged, as it makes the version identifier difficult to parse for human readers. In general, it is substantially clearer to simply create a new pre-release by incrementing the numeric component.
Developmental releases
Some projects make regular developmental releases, and system packagers (especially for Linux distributions) may wish to create early releases directly from source control which do not conflict with later project releases.
If used as part of a project's development cycle, these developmental releases are indicated by including a developmental release segment in the version identifier:
X.Y.devN # Developmental release
A version identifier that includes a developmental release segment is termed a "developmental release".
The developmental release segment consists of the string .dev, followed by a non-negative integer value. Developmental releases are ordered by their numerical component, immediately before the corresponding release (and before any pre-releases with the same release segment), and following any previous release (including any post-releases).
Developmental releases are also permitted for pre-releases and post-releases:
X.YaN.devM      # Developmental release of an alpha release
X.YbN.devM      # Developmental release of a beta release
X.YrcN.devM     # Developmental release of a release candidate
X.Y.postN.devM  # Developmental release of a post-release
Note
While they may be useful for continuous integration purposes, publishing developmental releases of pre-releases to general purpose public index servers is strongly discouraged, as it makes the version identifier difficult to parse for human readers. If such a release needs to be published, it is substantially clearer to instead create a new pre-release by incrementing the numeric component.
Developmental releases of post-releases are also strongly discouraged, but they may be appropriate for projects which use the post-release notation for full maintenance releases which may include code changes.
Version epochs
If included in a version identifier, the epoch appears before all other components, separated from the release segment by an exclamation mark:
E!X.Y # Version identifier with epoch
If no explicit epoch is given, the implicit epoch is 0.
Most version identifiers will not include an epoch, as an explicit epoch is only needed if a project changes the way it handles version numbering in a way that means the normal version ordering rules will give the wrong answer. For example, if a project is using date based versions like 2014.04 and would like to switch to semantic versions like 1.0, then the new releases would be identified as older than the date based releases when using the normal sorting scheme:
1.0 1.1 2.0 2013.10 2014.04
However, by specifying an explicit epoch, the sort order can be changed appropriately, as all versions from a later epoch are sorted after versions from an earlier epoch:
2013.10 2014.04 1!1.0 1!1.1 1!2.0
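The epoch behaviour above can be demonstrated with a minimal sort key. This sketch handles only an optional "E!" prefix plus a dotted release segment; pre-, post- and development-release suffixes are deliberately ignored for brevity.

```python
def epoch_key(version):
    """Sort key honouring the epoch segment, defaulting to epoch 0."""
    epoch, sep, release = version.partition("!")
    if not sep:                 # no "!" present: implicit epoch is 0
        epoch, release = "0", version
    return (int(epoch), tuple(int(part) for part in release.split(".")))

versions = ["1!1.0", "2013.10", "1!2.0", "2014.04", "1!1.1"]
assert sorted(versions, key=epoch_key) == [
    "2013.10", "2014.04", "1!1.0", "1!1.1", "1!2.0"]
```

Every version in a later epoch sorts after every version in an earlier epoch, regardless of the release segments involved.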
Normalization
In order to maintain better compatibility with existing versions there are a number of "alternative" syntaxes that MUST be taken into account when parsing versions. These syntaxes MUST be considered when parsing a version, however they should be "normalized" to the standard syntax defined above.
Case sensitivity
All ASCII letters should be interpreted case insensitively within a version and the normal form is lowercase. This allows versions such as 1.1RC1 which would be normalized to 1.1rc1.
Integer Normalization
All integers are interpreted via the int() built in and normalize to the string form of the output. This means that an integer version of 00 would normalize to 0 while 09000 would normalize to 9000. This does not hold true for integers inside of an alphanumeric segment of a local version such as 1.0+foo0100 which is already in its normalized form.
Pre-release separators
Pre-releases should allow a ., -, or _ separator between the release segment and the pre-release segment. The normal form for this is without a separator. This allows versions such as 1.1.a1 or 1.1-a1 which would be normalized to 1.1a1. It should also allow a separator to be used between the pre-release signifier and the numeral. This allows versions such as 1.0a.1 which would be normalized to 1.0a1.
Pre-release spelling
Pre-releases allow the additional spellings of alpha, beta, c, pre, and preview for a, b, rc, rc, and rc respectively. This allows versions such as 1.1alpha1, 1.1beta2, or 1.1c3 which normalize to 1.1a1, 1.1b2, and 1.1rc3. In every case the additional spellings should be considered equivalent to their normal forms.
Implicit pre-release number
Pre releases allow omitting the numeral in which case it is implicitly assumed to be 0. The normal form for this is to include the 0 explicitly. This allows versions such as 1.2a which is normalized to 1.2a0.
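The pre-release separator, spelling, and implicit-number rules can be combined into a small sketch. This hypothetical helper handles only those three rules, expects an already-lowercased public version, and leaves local versions and the remaining normalizations in this section untouched.

```python
import re

# Optional separator, a pre-release spelling, an optional second
# separator, and an optional numeral, anchored to the end of the string.
PRE_RELEASE = re.compile(r"[._-]?(alpha|beta|preview|pre|rc|a|b|c)[._-]?([0-9]*)$")
SPELLINGS = {"alpha": "a", "beta": "b", "c": "rc", "pre": "rc", "preview": "rc"}

def normalize_pre_release(version):
    """Normalize only the pre-release portion of a lowercased public version."""
    if "+" in version:            # local version labels are out of scope here
        return version
    match = PRE_RELEASE.search(version)
    if match is None:
        return version
    phase = SPELLINGS.get(match.group(1), match.group(1))
    number = match.group(2) or "0"    # implicit pre-release number is 0
    return version[:match.start()] + phase + number
```

For example, 1.1.a1, 1.1alpha1 and 1.2a normalize to 1.1a1, 1.1a1 and 1.2a0 respectively under these rules.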
Post release separators
Post releases allow a ., -, or _ separator as well as omitting the separator altogether. The normal form of this is with the . separator. This allows versions such as 1.2-post2 or 1.2post2 which normalize to 1.2.post2. Like the pre-release separator, this also allows an optional separator between the post release signifier and the numeral. This allows versions like 1.2.post-2 which would normalize to 1.2.post2.
Post release spelling
Post-releases allow the additional spellings of rev and r. This allows versions such as 1.0-r4 which normalizes to 1.0.post4. As with the pre-releases the additional spellings should be considered equivalent to their normal forms.
Implicit post release number
Post releases allow omitting the numeral in which case it is implicitly assumed to be 0. The normal form for this is to include the 0 explicitly. This allows versions such as 1.2.post which is normalized to 1.2.post0.
Implicit post releases
Post releases allow omitting the post signifier altogether. When using this form the separator MUST be - and no other form is allowed. This allows versions such as 1.0-1 to be normalized to 1.0.post1. This particular normalization MUST NOT be used in conjunction with the implicit post release number rule. In other words 1.0- is not a valid version and it does not normalize to 1.0.post0.
Development release separators
Development releases allow a ., -, or _ separator as well as omitting the separator altogether. The normal form of this is with the . separator. This allows versions such as 1.2-dev2 or 1.2dev2 which normalize to 1.2.dev2.
Implicit development release number
Development releases allow omitting the numeral in which case it is implicitly assumed to be 0. The normal form for this is to include the 0 explicitly. This allows versions such as 1.2.dev which is normalized to 1.2.dev0.
Local version segments
With a local version, in addition to the use of . as a separator of segments, the use of - and _ is also acceptable. The normal form is using the . character. This allows versions such as 1.0+ubuntu-1 to be normalized to 1.0+ubuntu.1.
Preceding v character
In order to support the common version notation of v1.0 versions may be preceded by a single literal v character. This character MUST be ignored for all purposes and should be omitted from all normalized forms of the version. The same version with and without the v is considered equivalent.
Leading and Trailing Whitespace
Leading and trailing whitespace must be silently ignored and removed from all normalized forms of a version. This includes " ", \t, \n, \r, \f, and \v. This allows accidental whitespace to be handled sensibly, such as a version like 1.0\n which normalizes to 1.0.
Examples of compliant version schemes
The standard version scheme is designed to encompass a wide range of identification practices across public and private Python projects. In practice, a single project attempting to use the full flexibility offered by the scheme would create a situation where human users had difficulty figuring out the relative order of versions, even though the rules above ensure all compliant tools will order them consistently.
The following examples illustrate a small selection of the different approaches projects may choose to identify their releases, while still ensuring that the "latest release" and the "latest stable release" can be easily determined, both by human users and automated tools.
Simple "major.minor" versioning:
0.1 0.2 0.3 1.0 1.1 ...
Simple "major.minor.micro" versioning:
1.1.0 1.1.1 1.1.2 1.2.0 ...
"major.minor" versioning with alpha, beta and candidate pre-releases:
0.9 1.0a1 1.0a2 1.0b1 1.0rc1 1.0 1.1a1 ...
"major.minor" versioning with developmental releases, release candidates and post-releases for minor corrections:
0.9 1.0.dev1 1.0.dev2 1.0.dev3 1.0.dev4 1.0c1 1.0c2 1.0 1.0.post1 1.1.dev1 ...
Date based releases, using an incrementing serial within each year, skipping zero:
2012.1 2012.2 2012.3 ... 2012.15 2013.1 2013.2 ...
Summary of permitted suffixes and relative ordering
Note
This section is intended primarily for authors of tools that automatically process distribution metadata, rather than developers of Python distributions deciding on a versioning scheme.
The epoch segment of version identifiers MUST be sorted according to the numeric value of the given epoch. If no epoch segment is present, the implicit numeric value is 0.
The release segment of version identifiers MUST be sorted in the same order as Python's tuple sorting when the normalized release segment is parsed as follows:
tuple(map(int, release_segment.split(".")))
All release segments involved in the comparison MUST be converted to a consistent length by padding shorter segments with zeros as needed.
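The padding rule can be sketched as a small extension of the tuple conversion above (a hypothetical helper, not part of the specification):

```python
def padded_release_key(segment, width):
    """Release-segment sort key, zero-padded to a common length."""
    parts = [int(part) for part in segment.split(".")]
    return tuple(parts + [0] * (width - len(parts)))
```

With a shared width, "1.0" and "1.0.0" compare as equal, while "1.0" still sorts before "1.0.1".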
Within a numeric release (1.0, 2.7.3), the following suffixes are permitted and MUST be ordered as shown:
.devN, aN, bN, rcN, <no suffix>, .postN
Note that c is considered to be semantically equivalent to rc and must be sorted as if it were rc. Tools MAY reject the case of having the same N for both a c and a rc in the same release segment as ambiguous and remain in compliance with the PEP.
Within an alpha (1.0a1), beta (1.0b1), or release candidate (1.0rc1, 1.0c1), the following suffixes are permitted and MUST be ordered as shown:
.devN, <no suffix>, .postN
Within a post-release (1.0.post1), the following suffixes are permitted and MUST be ordered as shown:
.devN, <no suffix>
Note that devN and postN MUST always be preceded by a dot, even when used immediately following a numeric version (e.g. 1.0.dev456, 1.0.post1).
Within a pre-release, post-release or development release segment with a shared prefix, ordering MUST be by the value of the numeric component.
The following example covers many of the possible combinations:
1.0.dev456 1.0a1 1.0a2.dev456 1.0a12.dev456 1.0a12 1.0b1.dev456 1.0b2 1.0b2.post345.dev456 1.0b2.post345 1.0rc1.dev456 1.0rc1 1.0 1.0+abc.5 1.0+abc.7 1.0+5 1.0.post456.dev34 1.0.post456 1.1.dev1
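The suffix ordering rules can be sketched as a comparison key. This is a simplified illustration, not a full parser: it omits epochs, local version labels (so the 1.0+abc.5 style entries above are excluded), and cross-length zero padding:

```python
import re

# Rank of each pre-release suffix type; c is equivalent to rc.
PRE_RANK = {"a": 1, "b": 2, "c": 3, "rc": 3}

def simple_key(version):
    """Ordering key covering .devN, aN/bN/rcN, <no suffix>, and .postN."""
    m = re.match(
        r"^(?P<release>\d+(?:\.\d+)*)"
        r"(?:(?P<pre_l>a|b|c|rc)(?P<pre_n>\d+))?"
        r"(?:\.post(?P<post_n>\d+))?"
        r"(?:\.dev(?P<dev_n>\d+))?$",
        version,
    )
    release = tuple(int(p) for p in m.group("release").split("."))
    if m.group("pre_l"):
        pre = (PRE_RANK[m.group("pre_l")], int(m.group("pre_n")))
    elif m.group("post_n") is None and m.group("dev_n") is not None:
        pre = (0, 0)   # a bare .devN (e.g. 1.0.dev456) sorts before 1.0a1
    else:
        pre = (4, 0)   # no pre-release suffix: after rc, before .postN
    post = (1, int(m.group("post_n"))) if m.group("post_n") else (0, 0)
    dev = (0, int(m.group("dev_n"))) if m.group("dev_n") else (1, 0)
    return (release, pre, post, dev)
```

Sorting the public versions from the example above with this key reproduces the order shown.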
Version ordering across different metadata versions
Metadata v1.0 (PEP 241) and metadata v1.1 (PEP 314) do not specify a standard version identification or ordering scheme. However metadata v1.2 (PEP 345) does specify a scheme which is defined in PEP 386.
Because of the nature of the simple installer API, an installer cannot know which metadata version a particular distribution used. Installers also need the ability to build a reasonably prioritized list that includes all, or as many as possible, of a project's versions in order to determine which versions should be installed. These requirements necessitate standardizing on a single parsing mechanism for all versions of a project.
Due to the above, this PEP MUST be used for all versions of metadata and supersedes PEP 386 even for metadata v1.2. Tools SHOULD ignore any versions which cannot be parsed by the rules in this PEP, but MAY fall back to implementation defined version parsing and ordering schemes if no versions complying with this PEP are available.
Distribution users may wish to explicitly remove non-compliant versions from any private package indexes they control.
Compatibility with other version schemes
Some projects may choose to use a version scheme which requires translation in order to comply with the public version scheme defined in this PEP. In such cases, the project specific version can be stored in the metadata while the translated public version is published in the version field.
This allows automated distribution tools to provide consistently correct ordering of published releases, while still allowing developers to use the internal versioning scheme they prefer for their projects.
Semantic versioning
Semantic versioning [10] is a popular version identification scheme that is more prescriptive than this PEP regarding the significance of different elements of a release number. Even if a project chooses not to abide by the details of semantic versioning, the scheme is worth understanding as it covers many of the issues that can arise when depending on other distributions, and when publishing a distribution that others rely on.
The "Major.Minor.Patch" (described in this PEP as "major.minor.micro") aspects of semantic versioning (clauses 1-9 in the 2.0.0-rc-1 specification) are fully compatible with the version scheme defined in this PEP, and abiding by these aspects is encouraged.
Semantic versions containing a hyphen (pre-releases - clause 10) or a plus sign (builds - clause 11) are not compatible with this PEP and are not permitted in the public version field.
One possible mechanism to translate such semantic versioning based source labels to compatible public versions is to use the .devN suffix to specify the appropriate version order.
Specific build information may also be included in local version labels.
DVCS based version labels
Many build tools integrate with distributed version control systems like Git and Mercurial in order to add an identifying hash to the version identifier. As hashes cannot be ordered reliably such versions are not permitted in the public version field.
As with semantic versioning, the public .devN suffix may be used to uniquely identify such releases for publication, while the original DVCS based label can be stored in the project metadata.
Identifying hash information may also be included in local version labels.
Olson database versioning
The pytz project inherits its versioning scheme from the corresponding Olson timezone database versioning scheme: the year followed by a lowercase character indicating the version of the database within that year.
This can be translated to a compliant public version identifier as <year>.<serial>, where the serial starts at zero or one (for the '<year>a' release) and is incremented with each subsequent database update within the year.
As with other translated version identifiers, the corresponding Olson database version could be recorded in the project metadata.
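The translation described above amounts to mapping the trailing letter to a serial number. A minimal sketch, using a serial that starts at one (the section notes that starting at zero is equally acceptable):

```python
def olson_to_public(olson_version):
    """Translate an Olson-style label like '2013h' to '<year>.<serial>'."""
    year, letter = olson_version[:4], olson_version[4:]
    # 'a' -> 1, 'b' -> 2, ... within the given year.
    return "{}.{}".format(year, ord(letter) - ord("a") + 1)
```

For example, '2013a' becomes '2013.1' and '2013h' becomes '2013.8'.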
Version specifiers
A version specifier consists of a series of version clauses, separated by commas. For example:
~= 0.9, >= 1.0, != 1.3.4.*, < 2.0
The comparison operator determines the kind of version clause:
- ~=: Compatible release clause
- ==: Version matching clause
- !=: Version exclusion clause
- <=, >=: Inclusive ordered comparison clause
- <, >: Exclusive ordered comparison clause
- ===: Arbitrary equality clause.
The comma (",") is equivalent to a logical and operator: a candidate version must match all given version clauses in order to match the specifier as a whole.
Whitespace between a conditional operator and the following version identifier is optional, as is the whitespace around the commas.
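The "comma means logical and" rule can be illustrated with a toy evaluator. This sketch handles only ordered comparisons and exact equality on plain release segments of the same length; ~=, wildcard matching, suffixes, and zero padding are out of scope here:

```python
import operator

# Two-character operators listed first so ">=" is tried before ">".
OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq,
       "!=": operator.ne, ">": operator.gt, "<": operator.lt}

def release_tuple(version):
    return tuple(map(int, version.split(".")))

def matches(candidate, specifier):
    """A candidate matches only if every comma-separated clause holds."""
    for clause in specifier.split(","):
        clause = clause.strip()
        op = next(o for o in OPS if clause.startswith(o))
        required = clause[len(op):].strip()
        if not OPS[op](release_tuple(candidate), release_tuple(required)):
            return False
    return True
```

For example, "1.5" matches ">= 1.0, < 2.0" (both clauses hold), while "2.1" does not (the second clause fails).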
When multiple candidate versions match a version specifier, the preferred version SHOULD be the latest version as determined by the consistent ordering defined by the standard Version scheme. Whether or not pre-releases are considered as candidate versions SHOULD be handled as described in Handling of pre-releases.
Except where specifically noted below, local version identifiers MUST NOT be permitted in version specifiers, and local version labels MUST be ignored entirely when checking if candidate versions match a given version specifier.
Compatible release
A compatible release clause consists of the compatible release operator ~= and a version identifier. It matches any candidate version that is expected to be compatible with the specified version.
The specified version identifier must be in the standard format described in Version scheme. Local version identifiers are NOT permitted in this version specifier.
For a given release identifier V.N, the compatible release clause is approximately equivalent to the pair of comparison clauses:
>= V.N, == V.*
This operator MUST NOT be used with a single segment version number such as ~=1.
For example, the following groups of version clauses are equivalent:
~= 2.2 is equivalent to >= 2.2, == 2.*
~= 1.4.5 is equivalent to >= 1.4.5, == 1.4.*
If a pre-release, post-release or developmental release is named in a compatible release clause as V.N.suffix, then the suffix is ignored when determining the required prefix match:
~= 2.2.post3 is equivalent to >= 2.2.post3, == 2.*
~= 1.4.5a4 is equivalent to >= 1.4.5a4, == 1.4.*
The padding rules for release segment comparisons mean that the assumed degree of forward compatibility in a compatible release clause can be controlled by appending additional zeros to the version specifier:
~= 2.2.0 is equivalent to >= 2.2.0, == 2.2.*
~= 1.4.5.0 is equivalent to >= 1.4.5.0, == 1.4.5.*
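The expansion of a compatible release clause into its equivalent clause pair can be sketched as follows (this sketch assumes the final segment carries no a/b/rc/post/dev suffix; the full rule strips such a suffix before forming the prefix):

```python
def expand_compatible(clause):
    """Expand a "~= V.N" clause into its approximately equivalent pair."""
    version = clause.removeprefix("~=").strip()
    parts = version.split(".")
    if len(parts) < 2:
        # ~= MUST NOT be used with a single segment version such as ~=1.
        raise ValueError("compatible release needs at least two segments")
    return ">= {}, == {}.*".format(version, ".".join(parts[:-1]))
```

For example, expand_compatible("~= 2.2") yields ">= 2.2, == 2.*".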
Version matching
A version matching clause includes the version matching operator == and a version identifier.
The specified version identifier must be in the standard format described in Version scheme, but a trailing .* is permitted on public version identifiers as described below.
By default, the version matching operator is based on a strict equality comparison: the specified version must be exactly the same as the requested version. The only substitution performed is the zero padding of the release segment to ensure the release segments are compared with the same length.
Whether or not strict version matching is appropriate depends on the specific use case for the version specifier. Automated tools SHOULD at least issue warnings and MAY reject them entirely when strict version matches are used inappropriately.
Prefix matching may be requested instead of strict comparison, by appending a trailing .* to the version identifier in the version matching clause. This means that additional trailing segments will be ignored when determining whether or not a version identifier matches the clause. If the specified version includes only a release segment, then trailing components (or the lack thereof) in the release segment are also ignored.
For example, given the version 1.1.post1, the following clauses would match or not as shown:
== 1.1        # Not equal, so 1.1.post1 does not match clause
== 1.1.post1  # Equal, so 1.1.post1 matches clause
== 1.1.*      # Same prefix, so 1.1.post1 matches clause
For purposes of prefix matching, the pre-release segment is considered to have an implied preceding ., so given the version 1.1a1, the following clauses would match or not as shown:
== 1.1    # Not equal, so 1.1a1 does not match clause
== 1.1a1  # Equal, so 1.1a1 matches clause
== 1.1.*  # Same prefix, so 1.1a1 matches clause
An exact match is also considered a prefix match (this interpretation is implied by the usual zero padding rules for the release segment of version identifiers). Given the version 1.1, the following clauses would match or not as shown:
== 1.1        # Equal, so 1.1 matches clause
== 1.1.0      # Zero padding expands 1.1 to 1.1.0, so it matches clause
== 1.1.dev1   # Not equal (dev-release), so 1.1 does not match clause
== 1.1a1      # Not equal (pre-release), so 1.1 does not match clause
== 1.1.post1  # Not equal (post-release), so 1.1 does not match clause
== 1.1.*      # Same prefix, so 1.1 matches clause
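The role of zero padding in prefix matching can be sketched on release segments alone (suffix handling, including the implied '.' before pre-release segments, is omitted from this sketch):

```python
def prefix_match(candidate, pattern):
    """Check an "== X.*" prefix match on plain release segments."""
    prefix = tuple(map(int, pattern.removesuffix(".*").split(".")))
    release = tuple(map(int, candidate.split(".")))
    # Zero padding is what makes an exact match also a prefix match.
    release += (0,) * (len(prefix) - len(release))
    return release[:len(prefix)] == prefix
```

For example, "1.1" matches "1.1.0.*" because zero padding expands it to 1.1.0 before the prefix comparison.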
It is invalid to have a prefix match containing a development or local release such as 1.0.dev1.* or 1.0+foo1.*. If present, the development release segment is always the final segment in the public version, and the local version is ignored for comparison purposes, so using either in a prefix match wouldn't make any sense.
The use of == (without at least the wildcard suffix) when defining dependencies for published distributions is strongly discouraged as it greatly complicates the deployment of security fixes. The strict version comparison operator is intended primarily for use when defining dependencies for repeatable deployments of applications while using a shared distribution index.
If the specified version identifier is a public version identifier (no local version label), then the local version label of any candidate versions MUST be ignored when matching versions.
If the specified version identifier is a local version identifier, then the local version labels of candidate versions MUST be considered when matching versions, with the public version identifier being matched as described above, and the local version label being checked for equivalence using a strict string equality comparison.
Version exclusion
A version exclusion clause includes the version exclusion operator != and a version identifier.
The allowed version identifiers and comparison semantics are the same as those of the Version matching operator, except that the sense of any match is inverted.
For example, given the version 1.1.post1, the following clauses would match or not as shown:
!= 1.1        # Not equal, so 1.1.post1 matches clause
!= 1.1.post1  # Equal, so 1.1.post1 does not match clause
!= 1.1.*      # Same prefix, so 1.1.post1 does not match clause
Inclusive ordered comparison
An inclusive ordered comparison clause includes a comparison operator and a version identifier, and will match any version where the comparison is correct based on the relative position of the candidate version and the specified version given the consistent ordering defined by the standard Version scheme.
The inclusive ordered comparison operators are <= and >=.
As with version matching, the release segment is zero padded as necessary to ensure the release segments are compared with the same length.
Local version identifiers are NOT permitted in this version specifier.
Exclusive ordered comparison
The exclusive ordered comparisons > and < are similar to the inclusive ordered comparisons in that they rely on the relative position of the candidate version and the specified version given the consistent ordering defined by the standard Version scheme. However, they specifically exclude pre-releases, post-releases, and local versions of the specified version.
The exclusive ordered comparison >V MUST NOT allow a post-release of the given version unless V itself is a post release. You may mandate that releases are later than a particular post release, including additional post releases, by using >V.postN. For example, >1.7 will allow 1.7.1 but not 1.7.0.post1 and >1.7.post2 will allow 1.7.1 and 1.7.0.post3 but not 1.7.0.
The exclusive ordered comparison >V MUST NOT match a local version of the specified version.
The exclusive ordered comparison <V MUST NOT allow a pre-release of the specified version unless the specified version is itself a pre-release. Allowing pre-releases that are earlier than, but not equal to a specific pre-release may be accomplished by using <V.rc1 or similar.
As with version matching, the release segment is zero padded as necessary to ensure the release segments are compared with the same length.
Local version identifiers are NOT permitted in this version specifier.
Arbitrary equality
Arbitrary equality comparisons are simple string equality operations which do not take into account any of the semantic information such as zero padding or local versions. This operator also does not support prefix matching as the == operator does.
The primary use case for arbitrary equality is to allow for specifying a version which cannot otherwise be represented by this PEP. This operator is special and acts as an escape hatch to allow someone using a tool which implements this PEP to still install a legacy version which is otherwise incompatible with this PEP.
An example would be ===foobar which would match a version of foobar.
This operator may also be used to explicitly require an unpatched version of a project such as ===1.0 which would not match for a version 1.0+downstream1.
Use of this operator is heavily discouraged and tooling MAY display a warning when it is used.
Handling of pre-releases
Pre-releases of any kind, including developmental releases, are implicitly excluded from all version specifiers, unless they are already present on the system, explicitly requested by the user, or if the only available version that satisfies the version specifier is a pre-release.
By default, dependency resolution tools SHOULD:
- accept already installed pre-releases for all version specifiers
- accept remotely available pre-releases for version specifiers where there is no final or post release that satisfies the version specifier
- exclude all other pre-releases from consideration
Dependency resolution tools MAY issue a warning if a pre-release is needed to satisfy a version specifier.
Dependency resolution tools SHOULD also allow users to request the following alternative behaviours:
- accepting pre-releases for all version specifiers
- excluding pre-releases for all version specifiers (reporting an error or warning if a pre-release is already installed locally, or if a pre-release is the only way to satisfy a particular specifier)
Dependency resolution tools MAY also allow the above behaviour to be controlled on a per-distribution basis.
Post-releases and final releases receive no special treatment in version specifiers - they are always included unless explicitly excluded.
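The default behaviour described above can be sketched as a candidate-filtering step. The `satisfies` and `is_prerelease` helpers are hypothetical stand-ins for a tool's own specifier matching and version classification (developmental releases count as pre-releases here):

```python
def default_candidates(available, installed, satisfies, is_prerelease):
    """Filter candidate versions per the default pre-release handling."""
    matching = [v for v in available if satisfies(v)]
    finals = [v for v in matching if not is_prerelease(v)]
    # Already-installed pre-releases remain acceptable for all specifiers.
    installed_pre = [v for v in matching if is_prerelease(v) and v in installed]
    if finals or installed_pre:
        return finals + installed_pre
    # Only pre-releases can satisfy the specifier; accept them (tools MAY warn).
    return matching
```

With "1.0" and "1.1a1" both available and nothing installed, only "1.0" survives; if "1.1a1" is the sole match, it is accepted.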
Examples
- ~=3.1: version 3.1 or later, but not version 4.0 or later.
- ~=3.1.2: version 3.1.2 or later, but not version 3.2.0 or later.
- ~=3.1a1: version 3.1a1 or later, but not version 4.0 or later.
- == 3.1: specifically version 3.1 (or 3.1.0), excludes all pre-releases, post releases, developmental releases and any 3.1.x maintenance releases.
- == 3.1.*: any version that starts with 3.1. Equivalent to the ~=3.1.0 compatible release clause.
- ~=3.1.0, != 3.1.3: version 3.1.0 or later, but not version 3.1.3 and not version 3.2.0 or later.
Direct references
Some automated tools may permit the use of a direct reference as an alternative to a normal version specifier. A direct reference consists of the specifier @ and an explicit URL.
Whether or not direct references are appropriate depends on the specific use case for the version specifier. Automated tools SHOULD at least issue warnings and MAY reject them entirely when direct references are used inappropriately.
Public index servers SHOULD NOT allow the use of direct references in uploaded distributions. Direct references are intended as a tool for software integrators rather than publishers.
Depending on the use case, some appropriate targets for a direct URL reference may be a valid source_url entry (see PEP 426), an sdist, or a wheel binary archive. The exact URLs and targets supported will be tool dependent.
For example, a local source archive may be referenced directly:
pip @ file:///localbuilds/pip-1.3.1.zip
Alternatively, a prebuilt archive may also be referenced:
pip @ file:///localbuilds/pip-1.3.1-py33-none-any.whl
All direct references that do not refer to a local file URL SHOULD specify a secure transport mechanism (such as https) AND include an expected hash value in the URL for verification purposes. If a direct reference is specified without any hash information, with hash information that the tool doesn't understand, or with a selected hash algorithm that the tool considers too weak to trust, automated tools SHOULD at least emit a warning and MAY refuse to rely on the URL. If such a direct reference also uses an insecure transport, automated tools SHOULD NOT rely on the URL.
It is RECOMMENDED that only hashes which are unconditionally provided by the latest version of the standard library's hashlib module be used for source archive hashes. At time of writing, that list consists of 'md5', 'sha1', 'sha224', 'sha256', 'sha384', and 'sha512'.
For source archive and wheel references, an expected hash value may be specified by including a <hash-algorithm>=<expected-hash> entry as part of the URL fragment.
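Building such a URL is straightforward with the standard library's hashlib module; the URL below is purely illustrative:

```python
import hashlib

def with_hash_fragment(url, archive_bytes):
    """Append a <hash-algorithm>=<expected-hash> entry as the URL fragment.

    sha256 is one of the unconditionally available hashlib algorithms.
    """
    return "{}#sha256={}".format(url, hashlib.sha256(archive_bytes).hexdigest())
```

A tool verifying the download would recompute the digest of the fetched bytes and compare it against the fragment.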
For version control references, the VCS+protocol scheme SHOULD be used to identify both the version control system and the secure transport, and a version control system with hash based commit identifiers SHOULD be used. Automated tools MAY omit warnings about missing hashes for version control systems that do not provide hash based commit identifiers.
To handle version control systems that do not support including commit or tag references directly in the URL, that information may be appended to the end of the URL using the @<commit-hash> or the @<tag>#<commit-hash> notation.
Note
This isn't quite the same as the existing VCS reference notation supported by pip. Firstly, the distribution name is moved in front rather than embedded as part of the URL. Secondly, the commit hash is included even when retrieving based on a tag, in order to meet the requirement above that every link should include a hash to make things harder to forge (creating a malicious repo with a particular tag is easy, creating one with a specific hash, less so).
Remote URL examples:
pip @ https://github.com/pypa/pip/archive/1.3.1.zip#sha1=da9234ee9982d4bbb3c72346a6de940a148ea686
pip @ git+https://github.com/pypa/pip.git@7921be1537eac1e97bc40179a57f0349c2aee67d
pip @ git+https://github.com/pypa/pip.git@1.3.1#7921be1537eac1e97bc40179a57f0349c2aee67d
File URLs
File URLs take the form of file://<host>/<path>. If the <host> is omitted it is assumed to be localhost, and even when <host> is omitted the third slash MUST still be present. The <path> defines the file path on the filesystem that is to be accessed.
On the various *nix operating systems the only allowed values for <host> are an omitted host, localhost, or another FQDN that the current machine believes matches its own hostname. In other words, on *nix the file:// scheme can only be used to access paths on the local machine.
On Windows the file format should include the drive letter, if applicable, as part of the <path> (e.g. file:///c:/path/to/a/file). Unlike *nix, on Windows the <host> parameter may be used to specify a file residing on a network share. In other words, translating \\machine\volume\file to a file:// URL yields file://machine/volume/file. For more information on file:// URLs on Windows see MSDN [4].
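The standard library's pathlib module produces file:// URLs following these conventions, which makes it a convenient way to check the rules above:

```python
from pathlib import PurePosixPath, PureWindowsPath

# An omitted host implies localhost; note the required third slash.
posix_url = PurePosixPath("/localbuilds/pip-1.3.1.zip").as_uri()
# file:///localbuilds/pip-1.3.1.zip

# On Windows the drive letter becomes part of <path>.
windows_url = PureWindowsPath(r"c:\path\to\a\file").as_uri()
# file:///c:/path/to/a/file
```

The Pure* path classes are usable on any platform, so the Windows form can be generated from *nix and vice versa.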
Updating the versioning specification
The versioning specification may be updated with clarifications without requiring a new PEP or a change to the metadata version.
Any technical changes that impact the version identification and comparison syntax and semantics would require an updated versioning scheme to be defined in a new PEP.
Summary of differences from pkg_resources.parse_version
- Local versions sort differently: this PEP requires that they sort as greater than the same version without a local version, whereas pkg_resources.parse_version considers the local version label to be a pre-release marker.
- This PEP purposely restricts the syntax which constitutes a valid version while pkg_resources.parse_version attempts to provide some meaning from any arbitrary string.
- pkg_resources.parse_version allows arbitrarily deeply nested version signifiers like 1.0.dev1.post1.dev5. This PEP however allows only a single use of each type and they must exist in a certain order.
Summary of differences from PEP 386
- Moved the description of version specifiers into the versioning PEP
- Added the "direct reference" concept as a standard notation for direct references to resources (rather than each tool needing to invent its own)
- Added the "local version identifier" and "local version label" concepts to allow system integrators to indicate patched builds in a way that is supported by the upstream tools, as well as to allow the incorporation of build tags into the versioning of binary distributions.
- Added the "compatible release" clause
- Added the trailing wildcard syntax for prefix based version matching and exclusion
- Changed the top level sort position of the .devN suffix
- Allowed single value version numbers
- Explicit exclusion of leading or trailing whitespace
- Explicit support for date based versions
- Explicit normalisation rules to improve compatibility with existing version metadata on PyPI where it doesn't introduce ambiguity
- Implicitly exclude pre-releases unless they're already present or needed to satisfy a dependency
- Treat post releases the same way as unqualified releases
- Discuss ordering and dependencies across metadata versions
- Switch from preferring c to rc.
The rationale for major changes is given in the following sections.
Changing the version scheme
One key change in the version scheme in this PEP relative to that in PEP 386 is to sort top level developmental releases like X.Y.devN ahead of alpha releases like X.Ya1. This is a far more logical sort order, as projects already using both development releases and alphas/betas/release candidates do not want their developmental releases sorted in between their release candidates and their final releases. There is no rationale for using dev releases in that position rather than merely creating additional release candidates.
The updated sort order also means the sorting of dev versions is now consistent between the metadata standard and the pre-existing behaviour of pkg_resources (and hence the behaviour of current installation tools).
Making this change should make it easier for affected existing projects to migrate to the latest version of the metadata standard.
Another change to the version scheme is to allow single number versions, similar to those used by non-Python projects like Mozilla Firefox, Google Chrome and the Fedora Linux distribution. This is actually expected to be more useful for version specifiers, but it is easier to allow it for both version specifiers and release numbers, rather than splitting the two definitions.
The exclusion of leading and trailing whitespace was made explicit after a couple of projects with version identifiers differing only in a trailing \n character were found on PyPI.
Various other normalisation rules were also added as described in the separate section on version normalisation below.
Appendix A shows detailed results of an analysis of PyPI distribution version information, as collected on 8th August, 2014. This analysis compares the behavior of the explicitly ordered version scheme defined in this PEP with the de facto standard defined by the behavior of setuptools. These metrics are useful, as the intent of this PEP is to follow existing setuptools behavior as closely as is feasible, while still throwing exceptions for unorderable versions (rather than trying to guess an appropriate order as setuptools does).
A more opinionated description of the versioning scheme
As in PEP 386, the primary focus is on codifying existing practices to make them more amenable to automation, rather than demanding that existing projects make non-trivial changes to their workflow. However, the standard scheme allows significantly more flexibility than is needed for the vast majority of simple Python packages (which often don't even need maintenance releases - many users are happy with needing to upgrade to a new feature release to get bug fixes).
For the benefit of novice developers, and for experienced developers wishing to better understand the various use cases, the specification now goes into much greater detail on the components of the defined version scheme, including examples of how each component may be used in practice.
The PEP also explicitly guides developers in the direction of semantic versioning (without requiring it), and discourages the use of several aspects of the full versioning scheme that have largely been included in order to cover esoteric corner cases in the practices of existing projects and in repackaging software for Linux distributions.
Describing version specifiers alongside the versioning scheme
The main reason to even have a standardised version scheme in the first place is to make it easier to do reliable automated dependency analysis. It makes more sense to describe the primary use case for version identifiers alongside their definition.
Changing the interpretation of version specifiers
The previous interpretation of version specifiers made it very easy to accidentally download a pre-release version of a dependency. This in turn made it difficult for developers to publish pre-release versions of software to the Python Package Index, as even marking the package as hidden wasn't enough to keep automated tools from downloading it, and also made it harder for users to obtain the test release manually through the main PyPI web interface.
The previous interpretation also excluded post-releases from some version specifiers for no adequately justified reason.
The updated interpretation is intended to make it difficult to accidentally accept a pre-release version as satisfying a dependency, while still allowing pre-release versions to be retrieved automatically when that's the only way to satisfy a dependency.
The "some forward compatibility assumed" version constraint is derived from the Ruby community's "pessimistic version constraint" operator [2] to allow projects to take a cautious approach to forward compatibility promises, while still easily setting a minimum required version for their dependencies. The spelling of the compatible release clause (~=) is inspired by the Ruby (~>) and PHP (~) equivalents.
Further improvements are also planned to the handling of parallel installation of multiple versions of the same library, but these will depend on updates to the installation database definition along with improved tools for dynamic path manipulation.
The trailing wildcard syntax to request prefix based version matching was added to make it possible to sensibly define compatible release clauses.
Support for date based version identifiers
Excluding date based versions caused significant problems in migrating pytz to the new metadata standards. It also caused concerns for the OpenStack developers, as they use a date based versioning scheme and would like to be able to migrate to the new metadata standards without changing it.
Adding version epochs
Version epochs are added for the same reason they are part of other versioning schemes, such as those of the Fedora and Debian Linux distributions: to allow projects to gracefully change their approach to numbering releases, without having a new release appear to have a lower version number than previous releases and without having to change the name of the project.
In particular, supporting version epochs allows a project that was previously using date based versioning to switch to semantic versioning by specifying a new version epoch.
The ! character was chosen to delimit an epoch version rather than the : character, which is commonly used in other systems, due to the fact that : is not a valid character in a Windows directory name.
Adding direct references
Direct references are added as an "escape clause" to handle messy real world situations that don't map neatly to the standard distribution model. This includes dependencies on unpublished software for internal use, as well as handling the more complex compatibility issues that may arise when wrapping third party libraries as C extensions (this is of especial concern to the scientific community).
Index servers are deliberately given a lot of freedom to disallow direct references, since they're intended primarily as a tool for integrators rather than publishers. PyPI in particular is currently going through the process of eliminating dependencies on external references, as unreliable external services have the effect of slowing down installation operations, as well as reducing PyPI's own apparent reliability.
Adding arbitrary equality
Arbitrary equality is added as an "escape clause" to handle the case where someone needs to install a project which uses a non compliant version. Although this PEP is able to attain ~97% compatibility with the versions that are already on PyPI there are still ~3% of versions which cannot be parsed. This operator gives a simple and effective way to still depend on them without having to "guess" at the semantics of what they mean (which would be required if anything other than strict string based equality was supported).
Adding local version identifiers
It's a fact of life that downstream integrators often need to backport upstream bug fixes to older versions. It's one of the services that gets Linux distro vendors paid, and application developers may also apply patches they need to bundled dependencies.
Historically, this practice has been invisible to cross-platform language specific distribution tools - the reported "version" in the upstream metadata is the same as for the unmodified code. This inaccuracy can then cause problems when attempting to work with a mixture of integrator provided code and unmodified upstream code, or even just attempting to identify exactly which version of the software is installed.
The introduction of local version identifiers and "local version labels" into the versioning scheme, with the corresponding python.integrator metadata extension allows this kind of activity to be represented accurately, which should improve interoperability between the upstream tools and various integrated platforms.
The exact scheme chosen is largely modeled on the existing behavior of pkg_resources.parse_version and pkg_resources.parse_requirements, with the main distinction being that where pkg_resources currently always takes the suffix into account when comparing versions for exact matches, the PEP requires that the local version label of the candidate version be ignored when no local version label is present in the version specifier clause. Furthermore, the PEP does not attempt to impose any structure on the local version labels (aside from limiting the set of permitted characters and defining their ordering).
This change is designed to ensure that an integrator provided version like pip 1.5+1 or pip 1.5+1.git.abc123de will still satisfy a version specifier like pip>=1.5.
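To illustrate the matching rule, here is a minimal sketch (the helper names are invented here; this is not the reference implementation): when the specifier carries no local version label, the candidate's label is ignored.

```python
def split_local(version):
    """Split '1.5+1.git.abc123de' into ('1.5', '1.git.abc123de')."""
    public, _, local = version.partition("+")
    return public, local or None

def matches_exact(candidate, specified):
    cand_public, cand_local = split_local(candidate)
    spec_public, spec_local = split_local(specified)
    if spec_local is None:
        # No local label in the specifier: match on the public version only.
        return cand_public == spec_public
    # Otherwise the full identifier, including the label, must match.
    return (cand_public, cand_local) == (spec_public, spec_local)

print(matches_exact("1.5+1.git.abc123de", "1.5"))   # True
```

The same principle, applied to ordered comparisons rather than exact equality, is what makes pip 1.5+1 satisfy pip>=1.5.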
The plus was chosen primarily for readability of local version identifiers. It was chosen instead of the hyphen to prevent pkg_resources.parse_version from parsing it as a prerelease, which is important for enabling a successful migration to the new, more structured, versioning scheme. The plus was chosen instead of a tilde because of the significance of the tilde in Debian's version ordering algorithm.
Providing explicit version normalization rules
Historically, the de facto standard for parsing versions in Python has been the pkg_resources.parse_version function from the setuptools project. This does not attempt to reject any version and instead tries to make something meaningful, with varying levels of success, out of whatever it is given. It has a few simple rules but otherwise relies largely on string comparison.
The normalization rules provided in this PEP exist primarily either to increase the compatibility with pkg_resources.parse_version, particularly in documented use cases such as rev, r, pre, etc., or to do something more reasonable with versions that already exist on PyPI.
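A toy sketch of a few of these normalizations (the helper function and the exact regular expressions are invented for illustration; the rules defined by this PEP cover many more cases):

```python
import re

def normalize(version):
    """Illustrative only: a handful of the normalizations discussed."""
    v = version.strip().lower()
    if v.startswith("v"):                  # 'v1.0'    -> '1.0'
        v = v[1:]
    v = v.replace("_", "-")                # '1.0_rc1' -> '1.0-rc1'
    # Map pre-release spelling variants onto the canonical 'rc' segment.
    v = re.sub(r"[-.]?(preview|pre|rc|c)\.?(\d*)", r"rc\2", v)
    # Map 'rev'/'r' variants onto the canonical '.post' segment.
    v = re.sub(r"[-.]?(rev|r)\.?(\d+)", r".post\2", v)
    return v

print(normalize("1.0-pre1"))   # 1.0rc1
```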
All possible normalization rules were weighed against whether or not they were likely to cause any ambiguity (e.g. while someone might devise a scheme where v1.0 and 1.0 are considered distinct releases, the likelihood of anyone actually doing that, much less on any scale that is noticeable, is fairly low). They were also weighed against how pkg_resources.parse_version treated a particular version string, especially with regards to how it was sorted. Finally, each rule was weighed against the kinds of additional versions it allowed, how "ugly" those versions looked, how hard they were to parse (both mentally and mechanically) and how much additional compatibility it would bring.
The breadth of possible normalizations was kept to things that could easily be implemented as part of the parsing of the version and not pre-parsing transformations applied to the versions. This was done to limit the side effects of each transformation, as simple search and replace style transforms increase the likelihood of ambiguous or "junk" versions.
For an extended discussion on the various types of normalizations that were considered, please see the proof of concept for PEP 440 within pip [5].
Allowing Underscore in Normalization
There are not a lot of projects on PyPI which utilize a _ in the version string. However this PEP allows its use anywhere that - is acceptable. The reason for this is that the Wheel normalization scheme specifies that - gets normalized to a _ to enable easier parsing of the filename.
Summary of changes to PEP 440
The following changes were made to this PEP based on feedback received after the initial reference implementation was released in setuptools 8.0 and pip 6.0:
- The exclusive ordered comparisons were updated to no longer imply a !=V.* which was deemed to be surprising behavior which was too hard to accurately describe. Instead the exclusive ordered comparisons will simply disallow matching pre-releases, post-releases, and local versions of the specified version (unless the specified version is itself a pre-release, post-release or local version). For an extended discussion see the threads on distutils-sig [6] [7].
- The normalized form for release candidates was updated from 'c' to 'rc'. This change was based on user feedback received when setuptools 8.0 started applying normalization to the release metadata generated when preparing packages for publication on PyPI [8].
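The first change above can be sketched as follows (a deliberately simplified model with invented helper names; pre-release handling and most of the real version grammar are omitted):

```python
import re

def release_of(version):
    """Extract the numeric release segment, ignoring any suffix."""
    m = re.match(r"(\d+(?:\.\d+)*)", version)
    rel = tuple(int(p) for p in m.group(1).split("."))
    while len(rel) > 1 and rel[-1] == 0:   # compare pad-insensitively
        rel = rel[:-1]
    return rel

def satisfies_gt(candidate, spec):
    cand, base = release_of(candidate), release_of(spec)
    if cand == base:
        # Same release: post-releases and local versions of the
        # specified version no longer satisfy ">".
        return False
    return cand > base

print(satisfies_gt("1.7.post1", "1.7"))   # False
print(satisfies_gt("1.7.1", "1.7"))       # True
```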
References
The initial attempt at a standardised version scheme, along with the justifications for needing such a standard, can be found in PEP 386.
| [1] | Reference Implementation of PEP 440 Versions and Specifiers https://github.com/pypa/packaging/pull/1 |
| [2] | Version compatibility analysis script: https://github.com/pypa/packaging/blob/master/tasks/check.py |
| [3] | Pessimistic version constraint http://guides.rubygems.org/patterns/ |
| [4] | File URIs in Windows http://blogs.msdn.com/b/ie/archive/2006/12/06/file-uris-in-windows.aspx |
| [5] | Proof of Concept: PEP 440 within pip https://github.com/pypa/pip/pull/1894 |
| [6] | PEP440: foo-X.Y.Z does not satisfy "foo>X.Y" https://mail.python.org/pipermail/distutils-sig/2014-December/025451.html |
| [7] | PEP440: >1.7 vs >=1.7 https://mail.python.org/pipermail/distutils-sig/2014-December/025507.html |
| [8] | Amend PEP 440 with Wider Feedback on Release Candidates https://mail.python.org/pipermail/distutils-sig/2014-December/025409.html |
| [9] | Changing the status of PEP 440 to Provisional https://mail.python.org/pipermail/distutils-sig/2014-December/025412.html |
| [10] | Semantic Versioning http://semver.org/ |
Appendix A
Metadata v2.0 guidelines versus setuptools:
$ invoke check.pep440
Total Version Compatibility:              245806/250521 (98.12%)
Total Sorting Compatibility (Unfiltered): 45441/47114 (96.45%)
Total Sorting Compatibility (Filtered):   47057/47114 (99.88%)
Projects with No Compatible Versions:     498/47114 (1.06%)
Projects with Differing Latest Version:   688/47114 (1.46%)
Copyright
This document has been placed in the public domain.
pep-0441 Improving Python ZIP Application Support
| PEP: | 441 |
|---|---|
| Title: | Improving Python ZIP Application Support |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Daniel Holth <dholth at gmail.com>, Paul Moore <p.f.moore at gmail.com> |
| Discussions-To: | https://mail.python.org/pipermail/python-dev/2015-February/138277.html |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 30-Mar-2013 |
| Post-History: | 30-Mar-2013, 01-Apr-2013, 16-Feb-2015 |
| Resolution: | https://mail.python.org/pipermail/python-dev/2015-February/138578.html |
Contents
Improving Python ZIP Application Support
Python has had the ability to execute directories or ZIP-format archives as scripts since version 2.6 [1]. When invoked with a zip file or directory as its first argument, the interpreter adds that directory to sys.path and executes the __main__ module. These archives provide a great way to publish software that needs to be distributed as a single file script but is complex enough to need to be written as a collection of modules.
This feature is not as popular as it should be, mainly because it was not promoted as part of Python 2.6 [2] and so remains relatively unknown, but also because the Windows installer does not register a file extension (other than .py) for this format of file, to associate with the launcher.
This PEP proposes to fix these problems by re-publicising the feature, defining the .pyz and .pyzw extensions as "Python ZIP Applications" and "Windowed Python ZIP Applications", and providing some simple tooling to manage the format.
A New Python ZIP Application Extension
The terminology "Python Zip Application" will be the formal term used for a zip-format archive that contains Python code in a form that can be directly executed by Python (specifically, it must have a __main__.py file in the root directory of the archive). The extension .pyz will be formally associated with such files.
The Python 3.5 installer will associate .pyz and .pyzw "Python Zip Applications" with the platform launcher so they can be executed. A .pyz archive is a console application and a .pyzw archive is a windowed application, indicating whether the console should appear when running the app.
On Unix, it would be ideal if the .pyz extension and the name "Python Zip Application" were registered (in the mime types database?). However, such an association is out of scope for this PEP.
Python Zip applications can be prefixed with a #! line pointing to the correct Python interpreter and an optional explanation:
#!/usr/bin/env python3
# Python application packed with zipapp module
(binary contents of archive)
On Unix, this allows the OS to run the file with the correct interpreter, via the standard "shebang" support. On Windows, the Python launcher implements shebang support.
However, it is always possible to execute a .pyz application by supplying the filename to the Python interpreter directly.
As background, ZIP archives are defined with a footer containing relative offsets from the end of the file. They remain valid when concatenated to the end of any other file. This feature is completely standard and is how self-extracting ZIP archives and the bdist_wininst installer format work.
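This property is easy to verify with the standard zipfile module, which locates an archive via its end-of-file records and therefore tolerates prepended data:

```python
import io
import zipfile

# Build a small ZIP archive in memory...
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("__main__.py", "print('hello')\n")

# ...prepend a shebang line, as a Python Zip Application would...
prefixed = io.BytesIO(b"#!/usr/bin/env python3\n" + buf.getvalue())

# ...and the result is still a perfectly valid ZIP archive.
with zipfile.ZipFile(prefixed) as zf:
    print(zf.read("__main__.py"))   # b"print('hello')\n"
```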
Minimal Tooling: The zipapp Module
This PEP also proposes including a module for working with these archives. The module will contain functions for working with Python zip application archives, and a command line interface (via python -m zipapp) for their creation and manipulation.
More complete tools for managing Python Zip Applications are encouraged as third-party applications on PyPI. Currently, pyzzer [5] and pex [6] are two such tools.
Module Interface
The zipapp module will provide the following functions:
create_archive(source, target=None, interpreter=None, main=None)
Create an application archive from source. The source can be any of the following:
- The name of a directory, in which case a new application archive will be created from the content of that directory.
- The name of an existing application archive file, in which case the file is copied to the target. The file name should include the .pyz or .pyzw extension, if required.
- A file object open for reading in bytes mode. The content of the file should be an application archive, and the file object is assumed to be positioned at the start of the archive.
The target argument determines where the resulting archive will be written:
- If it is the name of a file, the archive will be written to that file.
- If it is an open file object, the archive will be written to that file object, which must be open for writing in bytes mode.
- If the target is omitted (or None), the source must be a directory and the target will be a file with the same name as the source, with a .pyz extension added.
The interpreter argument specifies the name of the Python interpreter with which the archive will be executed. It is written as a "shebang" line at the start of the archive. On Unix, this will be interpreted by the OS, and on Windows it will be handled by the Python launcher. Omitting the interpreter results in no shebang line being written. If an interpreter is specified, and the target is a filename, the executable bit of the target file will be set.
The main argument specifies the name of a callable which will be used as the main program for the archive. It can only be specified if the source is a directory, and the source does not already contain a __main__.py file. The main argument should take the form "pkg.module:callable" and the archive will be run by importing "pkg.module" and executing the given callable with no arguments. It is an error to omit main if the source is a directory and does not contain a __main__.py file, as otherwise the resulting archive would not be executable.
If a file object is specified for source or target, it is the caller's responsibility to close it after calling create_archive.
When copying an existing archive, file objects supplied only need read and readline, or write methods. When creating an archive from a directory, if the target is a file object it will be passed to the zipfile.ZipFile class, and must supply the methods needed by that class.
get_interpreter(archive)
Returns the interpreter specified in the shebang line of the archive. If there is no shebang, the function returns None. The archive argument can be a filename or a file-like object open for reading in bytes mode.
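A short usage sketch of the interface described above (the directory layout and file contents are invented for illustration):

```python
import pathlib
import tempfile
import zipapp

with tempfile.TemporaryDirectory() as tmp:
    # A minimal application: a directory containing __main__.py.
    src = pathlib.Path(tmp, "myapp")
    src.mkdir()
    src.joinpath("__main__.py").write_text("print('running')\n")

    # Pack it into an archive with an explicit shebang line.
    target = pathlib.Path(tmp, "myapp.pyz")
    zipapp.create_archive(src, target, interpreter="/usr/bin/env python3")

    # The shebang can be read back with get_interpreter().
    interp = zipapp.get_interpreter(target)
    print(interp)   # /usr/bin/env python3
```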
Command Line Usage
The zipapp module can be run with the python -m flag. The command line interface is as follows:
python -m zipapp directory [options]
Create an archive from the given directory. An archive will
be created from the contents of that directory. The archive
will have the same name as the source directory with a .pyz
extension.
The following options can be specified:
-o archive / --output archive
The destination archive will have the specified name. The
given name will be used as written, so should include the
".pyz" or ".pyzw" extension.
-p interpreter / --python interpreter
The given interpreter will be written to the shebang line
of the archive. If this option is not given, the archive
will have no shebang line.
-m pkg.mod:fn / --main pkg.mod:fn
The source directory must not have a __main__.py file. The
archiver will write a __main__.py file into the target
which calls fn from the module pkg.mod.
The behaviour of the command line interface matches that of zipapp.create_archive().
In addition, it is possible to use the command line interface to work with an existing archive:
python -m zipapp app.pyz --show
Displays the shebang line of an archive. Output is of the
form
Interpreter: /usr/bin/env
or
Interpreter: <none>
and is intended for diagnostic use, not for scripts.
python -m zipapp app.pyz -o newapp.pyz [-p interpreter]
Copy app.pyz to newapp.pyz, modifying the shebang line based
on the -p option (as for creating an archive, no -p option
means remove the shebang line). Specifying a destination is
mandatory.
In-place modification of an archive is *not* supported, as the
risk of damaging archives is too great for a simple tool.
As noted, the archives are standard zip files, and so can be unpacked using any standard ZIP utility or Python's zipfile module. For this reason, no interfaces to list the contents of an archive, or unpack them, are provided or needed.
FAQ
- Are you sure a standard ZIP utility can handle #! at the beginning?
- Absolutely. The zipfile specification allows for arbitrary data to be prepended to a zipfile. This feature is commonly used by "self-extracting zip" programs. If your archive program can't handle this, it is a bug in your archive program.
- Isn't zipapp just a very thin wrapper over the zipfile module?
- Yes. If you prefer to build your own Python zip application archives using other tools, they will work just as well. The zipapp module is a convenience, nothing more.
- Why not just use a .zip or .py extension?
- Users expect a .zip file to be opened with an archive tool, and expect a .py file to contain readable text. Both would be confusing for this use case.
- How does this compete with existing package formats?
- The sdist, bdist and wheel formats are designed for packaging of modules to be installed into an existing Python installation. They are not intended to be used without installing. The executable zip format is specifically designed for standalone use, without needing to be installed. Such archives are in effect a multi-file version of a standalone Python script.
Rejected Proposals
Convenience Values for Shebang Lines
Is it worth having "convenience" forms for any of the common interpreter values? For example, -p 3 meaning the same as -p "/usr/bin/env python3". It would save a lot of typing for the common cases, as well as giving cross-platform options for people who don't want or need to understand the intricacies of shebang handling on "other" platforms.
Downsides are that it's not obvious how to translate the abbreviations. For example, should "3" mean "/usr/bin/env python3", "/usr/bin/python3", "python3", or something else? Also, there is no obvious short form for the key case of "/usr/bin/env python" (any available version of Python), which could easily result in scripts being written with overly-restrictive shebang lines.
Overall, it seems there are more problems than benefits, and as a result this idea has been dropped from consideration.
Registering .pyz as a Media Type
It was suggested [3] that the .pyz extension should be registered in the Unix database of extensions. While it makes sense to do this as an equivalent of the Windows installer registering the extension, the .py extension is not listed in the media types database [4]. It doesn't seem reasonable to register .pyz without .py, so this idea has been omitted from this PEP. An interested party could arrange for both .py and .pyz to be registered at a future date.
Default Interpreter
The initial draft of this PEP proposed using /usr/bin/env python as the default interpreter. Unix users have problems with this behaviour, as the default for the python command on many distributions is Python 2, and it is felt that this PEP should prefer Python 3 by default. However, using a command of python3 can result in unexpected behaviour for Windows users, where the default behaviour of the launcher for the command python is commonly customised by users, but the behaviour of python3 may not be modified to match.
As a result, the principle "in the face of ambiguity, refuse to guess" has been invoked, and archives have no shebang line unless explicitly requested. On Windows, the archives will still be run (with the default Python) by the launcher, and on Unix, the archives can be run by explicitly invoking the desired Python interpreter.
Command Line Tool to Manage Shebang Lines
It is conceivable that users would want to modify the shebang line for an existing archive, or even just display the current shebang line. This is tricky to do with existing tools (zip programs typically ignore prepended data totally, and text editors can have trouble editing files containing binary data).
The zipapp module provides functions to handle the shebang line, but does not include a command line interface to that functionality. This is because it is not clear how to provide one without the resulting interface being over-complex and potentially confusing. Changing the shebang line is expected to be an uncommon requirement.
Reference Implementation
A reference implementation is at http://bugs.python.org/issue23491.
References
| [1] | Allow interpreter to execute a zip file (http://bugs.python.org/issue1739468) |
| [2] | Feature is not documented (http://bugs.python.org/issue17359) |
| [3] | Discussion of adding a .pyz mime type on python-dev (https://mail.python.org/pipermail/python-dev/2015-February/138338.html) |
| [4] | Register of media types (http://www.iana.org/assignments/media-types/media-types.xhtml) |
| [5] | pyzzer - A tool for creating Python-executable archives (https://pypi.python.org/pypi/pyzzer) |
| [6] | pex - The PEX packaging toolchain (https://pypi.python.org/pypi/pex) |
The discussion of this PEP took place on the python-dev mailing list, in the thread starting at https://mail.python.org/pipermail/python-dev/2015-February/138277.html
Copyright
This document has been placed into the public domain.
pep-0442 Safe object finalization
| PEP: | 442 |
|---|---|
| Title: | Safe object finalization |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Antoine Pitrou <solipsis at pitrou.net> |
| BDFL-Delegate: | Benjamin Peterson <benjamin@python.org> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 2013-05-18 |
| Python-Version: | 3.4 |
| Post-History: | 2013-05-18 |
| Resolution: | http://mail.python.org/pipermail/python-dev/2013-June/126746.html |
Contents
Abstract
This PEP proposes to deal with the current limitations of object finalization. The goal is to be able to define and run finalizers for any object, regardless of its position in the object graph.
This PEP doesn't call for any change in Python code. Objects with existing finalizers will benefit automatically.
Definitions
- Reference
- A directional link from an object to another. The target of the reference is kept alive by the reference, as long as the source is itself alive and the reference isn't cleared.
- Weak reference
- A directional link from an object to another, which doesn't keep alive its target. This PEP focusses on non-weak references.
- Reference cycle
- A cyclic subgraph of directional links between objects, which keeps those objects from being collected in a pure reference-counting scheme.
- Cyclic isolate (CI)
- A standalone subgraph of objects in which no object is referenced from the outside, containing one or several reference cycles, and whose objects are still in a usable, non-broken state: they can access each other from their respective finalizers.
- Cyclic garbage collector (GC)
- A device able to detect cyclic isolates and turn them into cyclic trash. Objects in cyclic trash are eventually disposed of by the natural effect of the references being cleared and their reference counts dropping to zero.
- Cyclic trash (CT)
- A former cyclic isolate whose objects have started being cleared by the GC. Objects in cyclic trash are potential zombies; if they are accessed by Python code, the symptoms can vary from weird AttributeErrors to crashes.
- Zombie / broken object
- An object that is part of cyclic trash. The term stresses that the object is not safe: its outgoing references may have been cleared, or one of the objects it references may be a zombie. Therefore, it should not be accessed by arbitrary code (such as finalizers).
- Finalizer
- A function or method called when an object is intended to be disposed of. The finalizer can access the object and release any resource held by the object (for example mutexes or file descriptors). An example is a __del__ method.
- Resurrection
- The process by which a finalizer creates a new reference to an object in a CI. This can happen as a quirky but supported side-effect of __del__ methods.
Impact
While this PEP discusses CPython-specific implementation details, the change in finalization semantics is expected to affect the Python ecosystem as a whole. In particular, this PEP obsoletes the current guideline that "objects with a __del__ method should not be part of a reference cycle".
Benefits
The primary benefits of this PEP regard objects with finalizers, such as objects with a __del__ method and generators with a finally block. Those objects can now be reclaimed when they are part of a reference cycle.
The PEP also paves the way for further benefits:
- The module shutdown procedure may not need to set global variables to None anymore. This could solve a well-known class of irritating issues.
The PEP doesn't change the semantics of:
- Weak references caught in reference cycles.
- C extension types with a custom tp_dealloc function.
Description
Reference-counted disposal
In normal reference-counted disposal, an object's finalizer is called just before the object is deallocated. If the finalizer resurrects the object, deallocation is aborted.
However, if the object was already finalized, then the finalizer isn't called. This prevents us from finalizing zombies (see below).
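The finalize-once guarantee, including across resurrection, can be observed directly (a small illustrative script; class and variable names are invented here):

```python
calls = []
graveyard = []

class Phoenix:
    def __del__(self):
        calls.append("finalized")
        graveyard.append(self)   # resurrection: a new reference is created

obj = Phoenix()
del obj              # finalizer runs; deallocation is aborted
graveyard.clear()    # drop the new reference; no second finalization
print(calls)         # ['finalized']
```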
Disposal of cyclic isolates
Cyclic isolates are first detected by the garbage collector, and then disposed of. The detection phase doesn't change and won't be described here. Disposal of a CI traditionally works in the following order:
- Weakrefs to CI objects are cleared, and their callbacks called. At this point, the objects are still safe to use.
- The CI becomes a CT as the GC systematically breaks all known references inside it (using the tp_clear function).
- Nothing. All CT objects should have been disposed of in step 2 (as a side-effect of clearing references); this collection is finished.
This PEP proposes to turn CI disposal into the following sequence (new steps are in bold):
- Weakrefs to CI objects are cleared, and their callbacks called. At this point, the objects are still safe to use.
- The finalizers of all CI objects are called.
- The CI is traversed again to determine if it is still isolated. If it is determined that at least one object in CI is now reachable from outside the CI, this collection is aborted and the whole CI is resurrected. Otherwise, proceed.
- The CI becomes a CT as the GC systematically breaks all known references inside it (using the tp_clear function).
- Nothing. All CT objects should have been disposed of in step 4 (as a side-effect of clearing references); this collection is finished.
Note
The GC doesn't recalculate the CI after step 2 above, hence the need for step 3 to check that the whole subgraph is still isolated.
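Under the new sequence, a reference cycle whose objects define __del__ is finalized and reclaimed, and each finalizer can still safely reach its peers (an illustrative script; names are invented here):

```python
import gc

finalized = []

class Node:
    def __init__(self, name):
        self.name = name
        self.peer = None

    def __del__(self):
        # Safe: peers are still intact when finalizers run (step 2),
        # because references are only broken afterwards (step 4).
        finalized.append((self.name, self.peer.name))

a, b = Node("a"), Node("b")
a.peer, b.peer = b, a        # a reference cycle (a cyclic isolate)
del a, b
gc.collect()                 # the cycle is finalized and reclaimed
print(sorted(finalized))     # [('a', 'b'), ('b', 'a')]
```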
C-level changes
Type objects get a new tp_finalize slot to which __del__ methods are mapped (and reciprocally). Generators are modified to use this slot, rather than tp_del. A tp_finalize function is a normal C function which will be called with a valid and alive PyObject as its only argument. It doesn't need to manipulate the object's reference count, as this will be done by the caller. However, it must ensure that the original exception state is restored before returning to the caller.
For compatibility, tp_del is kept in the type structure. Handling of objects with a non-NULL tp_del is unchanged: when part of a CI, they are not finalized and end up in gc.garbage. However, a non-NULL tp_del is not encountered anymore in the CPython source tree (except for testing purposes).
Two new C API functions are provided to ease calling of tp_finalize, especially from custom deallocators.
On the internal side, a bit is reserved in the GC header for GC-managed objects to signal that they were finalized. This helps avoid finalizing an object twice (and, especially, finalizing a CT object after it was broken by the GC).
Note
Objects which are not GC-enabled can also have a tp_finalize slot. They don't need the additional bit since their tp_finalize function can only be called from the deallocator: it therefore cannot be called twice, except when resurrected.
Discussion
Predictability
Following this scheme, an object's finalizer is always called exactly once, even if it was resurrected afterwards.
For CI objects, the order in which finalizers are called (step 2 above) is undefined.
Safety
It is important to explain why the proposed change is safe. There are two aspects to be discussed:
- Can a finalizer access zombie objects (including the object being finalized)?
- What happens if a finalizer mutates the object graph so as to impact the CI?
Let's discuss the first issue. We will divide possible cases in two categories:
- If the object being finalized is part of the CI: by construction, no objects in CI are zombies yet, since CI finalizers are called before any reference breaking is done. Therefore, the finalizer cannot access zombie objects, which don't exist.
- If the object being finalized is not part of the CI/CT: by definition, objects in the CI/CT don't have any references pointing to them from outside the CI/CT. Therefore, the finalizer cannot reach any zombie object (that is, even if the object being finalized was itself referenced from a zombie object).
Now for the second issue. There are three potential cases:
- The finalizer clears an existing reference to a CI object. The CI object may be disposed of before the GC tries to break it, which is fine (the GC simply has to be aware of this possibility).
- The finalizer creates a new reference to a CI object. This can only happen from a CI object's finalizer (see above why). Therefore, the new reference will be detected by the GC after all CI finalizers are called (step 3 above), and collection will be aborted without any objects being broken.
- The finalizer clears or creates a reference to a non-CI object. By construction, this is not a problem.
Implementation
An implementation is available in branch finalize of the repository at http://hg.python.org/features/finalize/.
Validation
Besides running the normal Python test suite, the implementation adds test cases for various finalization possibilities including reference cycles, object resurrection and legacy tp_del slots.
The implementation has also been checked to not produce any regressions on the following test suites:
- Tulip, which makes an extensive use of generators
- Tornado
- SQLAlchemy
- Django
- zope.interface
References
Notes about reference cycle collection and weak reference callbacks: http://hg.python.org/cpython/file/4e687d53b645/Modules/gc_weakref.txt
Generator memory leak: http://bugs.python.org/issue17468
Allow objects to decide if they can be collected by GC: http://bugs.python.org/issue9141
Module shutdown procedure based on GC http://bugs.python.org/issue812369
Copyright
This document has been placed in the public domain.
pep-0443 Single-dispatch generic functions
| PEP: | 443 |
|---|---|
| Title: | Single-dispatch generic functions |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Łukasz Langa <lukasz at langa.pl> |
| Discussions-To: | Python-Dev <python-dev at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 22-May-2013 |
| Post-History: | 22-May-2013, 25-May-2013, 31-May-2013 |
| Replaces: | 245 246 3124 |
Contents
Abstract
This PEP proposes a new mechanism in the functools standard library module that provides a simple form of generic programming known as single-dispatch generic functions.
A generic function is composed of multiple functions implementing the same operation for different types. Which implementation should be used during a call is determined by the dispatch algorithm. When the implementation is chosen based on the type of a single argument, this is known as single dispatch.
Rationale and Goals
Python has always provided a variety of built-in and standard-library generic functions, such as len(), iter(), pprint.pprint(), copy.copy(), and most of the functions in the operator module. However, it currently:
- does not have a simple or straightforward way for developers to create new generic functions,
- does not have a standard way for methods to be added to existing generic functions (i.e., some are added using registration functions, others require defining __special__ methods, possibly by monkeypatching).
In addition, it is currently a common anti-pattern for Python code to inspect the types of received arguments, in order to decide what to do with the objects.
For example, code may wish to accept either an object of some type, or a sequence of objects of that type. Currently, the "obvious way" to do this is by type inspection, but this is brittle and closed to extension.
Abstract Base Classes make it easier to discover present behaviour, but don't help adding new behaviour. A developer using an already-written library may be unable to change how their objects are treated by such code, especially if the objects they are using were created by a third party.
Therefore, this PEP proposes a uniform API to address dynamic overloading using decorators.
User API
To define a generic function, decorate it with the @singledispatch decorator. Note that the dispatch happens on the type of the first argument. Create your function accordingly:
>>> from functools import singledispatch
>>> @singledispatch
... def fun(arg, verbose=False):
... if verbose:
... print("Let me just say,", end=" ")
... print(arg)
To add overloaded implementations to the function, use the register() attribute of the generic function. This is a decorator, taking a type parameter and decorating a function implementing the operation for that type:
>>> @fun.register(int)
... def _(arg, verbose=False):
... if verbose:
... print("Strength in numbers, eh?", end=" ")
... print(arg)
...
>>> @fun.register(list)
... def _(arg, verbose=False):
... if verbose:
... print("Enumerate this:")
... for i, elem in enumerate(arg):
... print(i, elem)
To enable registering lambdas and pre-existing functions, the register() attribute can be used in a functional form:
>>> def nothing(arg, verbose=False):
... print("Nothing.")
...
>>> fun.register(type(None), nothing)
The register() attribute returns the undecorated function. This enables decorator stacking, pickling, as well as creating unit tests for each variant independently:
>>> from decimal import Decimal
>>> @fun.register(float)
... @fun.register(Decimal)
... def fun_num(arg, verbose=False):
...     if verbose:
...         print("Half of your number:", end=" ")
...     print(arg / 2)
...
>>> fun_num is fun
False
When called, the generic function dispatches on the type of the first argument:
>>> fun("Hello, world.")
Hello, world.
>>> fun("test.", verbose=True)
Let me just say, test.
>>> fun(42, verbose=True)
Strength in numbers, eh? 42
>>> fun(['spam', 'spam', 'eggs', 'spam'], verbose=True)
Enumerate this:
0 spam
1 spam
2 eggs
3 spam
>>> fun(None)
Nothing.
>>> fun(1.23)
0.615
Where there is no registered implementation for a specific type, its method resolution order is used to find a more generic implementation. The original function decorated with @singledispatch is registered for the base object type, which means it is used if no better implementation is found.
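The MRO fallback described above can be sketched with functools.singledispatch itself (the function names here are invented for illustration):

```python
from functools import singledispatch

@singledispatch
def describe(arg):
    return "object"        # the decorated default, registered for object

@describe.register(int)
def _(arg):
    return "int"

# bool has no registered implementation of its own; its MRO
# (bool -> int -> object) selects the int implementation.
# dict matches nothing more specific than object, so the
# default implementation runs.
```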
To check which implementation the generic function will choose for a given type, use the dispatch() attribute:
>>> fun.dispatch(float)
<function fun_num at 0x104319058>
>>> fun.dispatch(dict)    # note: default implementation
<function fun at 0x103fe0000>
To access all registered implementations, use the read-only registry attribute:
>>> fun.registry.keys()
dict_keys([<class 'NoneType'>, <class 'int'>, <class 'object'>,
           <class 'decimal.Decimal'>, <class 'list'>,
           <class 'float'>])
>>> fun.registry[float]
<function fun_num at 0x1035a2840>
>>> fun.registry[object]
<function fun at 0x103fe0000>
The proposed API is intentionally limited and opinionated, so as to ensure it is easy to explain and use, and to maintain consistency with existing members of the functools module.
Implementation Notes
The functionality described in this PEP is already implemented in the pkgutil standard library module as simplegeneric. Because this implementation is mature, the goal is to move it largely as-is. The reference implementation is available on hg.python.org [1].
The dispatch type is specified as a decorator argument. An alternative form using function annotations was considered but its inclusion has been rejected. As of May 2013, this usage pattern is out of scope for the standard library [2], and the best practices for annotation usage are still debated.
Based on the current pkgutil.simplegeneric implementation, and following the convention on registering virtual subclasses on Abstract Base Classes, the dispatch registry will not be thread-safe.
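Since the registry is not thread-safe, an application that registers implementations from several threads could serialize those calls itself. A minimal sketch, with invented names (the lock and helper are not part of the proposed API):

```python
import threading
from functools import singledispatch

_register_lock = threading.Lock()  # hypothetical application-level lock

@singledispatch
def process(arg):
    return "default"

def register_locked(cls, impl):
    # register() mutates shared dispatch state; an application that
    # registers from multiple threads can guard those calls itself.
    with _register_lock:
        return process.register(cls, impl)

register_locked(int, lambda arg: "int")
```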
Abstract Base Classes
The pkgutil.simplegeneric implementation relied on several forms of method resolution order (MRO). @singledispatch removes special handling of old-style classes and Zope's ExtensionClasses. More importantly, it introduces support for Abstract Base Classes (ABC).
When a generic function implementation is registered for an ABC, the dispatch algorithm switches to an extended form of C3 linearization, which includes the relevant ABCs in the MRO of the provided argument. The algorithm inserts ABCs where their functionality is introduced, i.e. issubclass(cls, abc) returns True for the class itself but returns False for all its direct base classes. Implicit ABCs for a given class (either registered or inferred from the presence of a special method like __len__()) are inserted directly after the last ABC explicitly listed in the MRO of said class.
In its most basic form, this linearization returns the MRO for the given type:
>>> _compose_mro(dict, [])
[<class 'dict'>, <class 'object'>]
When the second argument contains ABCs that the specified type is a subclass of, they are inserted in a predictable order:
>>> _compose_mro(dict, [Sized, MutableMapping, str,
...                     Sequence, Iterable])
[<class 'dict'>, <class 'collections.abc.MutableMapping'>,
 <class 'collections.abc.Mapping'>, <class 'collections.abc.Sized'>,
 <class 'collections.abc.Iterable'>, <class 'collections.abc.Container'>,
 <class 'object'>]
While this mode of operation is significantly slower, all dispatch decisions are cached. The cache is invalidated on registering new implementations on the generic function or when user code calls register() on an ABC to implicitly subclass it. In the latter case, it is possible to create a situation with ambiguous dispatch, for instance:
>>> from collections import Iterable, Container
>>> class P:
...     pass
>>> Iterable.register(P)
<class '__main__.P'>
>>> Container.register(P)
<class '__main__.P'>
Faced with ambiguity, @singledispatch refuses the temptation to guess:
>>> @singledispatch
... def g(arg):
...     return "base"
...
>>> g.register(Iterable, lambda arg: "iterable")
<function <lambda> at 0x108b49110>
>>> g.register(Container, lambda arg: "container")
<function <lambda> at 0x108b491c8>
>>> g(P())
Traceback (most recent call last):
...
RuntimeError: Ambiguous dispatch: <class 'collections.abc.Container'> or <class 'collections.abc.Iterable'>
Note that this exception would not be raised if one or more ABCs had been provided explicitly as base classes during class definition. In this case dispatch happens in the MRO order:
>>> class Ten(Iterable, Container):
...     def __iter__(self):
...         for i in range(10):
...             yield i
...     def __contains__(self, value):
...         return value in range(10)
...
>>> g(Ten())
'iterable'
A similar conflict arises when subclassing an ABC is inferred from the presence of a special method like __len__() or __contains__():
>>> class Q:
...     def __contains__(self, value):
...         return False
...
>>> issubclass(Q, Container)
True
>>> Iterable.register(Q)
>>> g(Q())
Traceback (most recent call last):
...
RuntimeError: Ambiguous dispatch: <class 'collections.abc.Container'> or <class 'collections.abc.Iterable'>
An early version of the PEP contained a custom approach that was simpler but created a number of edge cases with surprising results [3].
Usage Patterns
This PEP proposes extending behaviour only of functions specifically marked as generic. Just as a base class method may be overridden by a subclass, so too a function may be overloaded to provide custom functionality for a given type.
Universal overloading does not equal arbitrary overloading, in the sense that we need not expect people to randomly redefine the behavior of existing functions in unpredictable ways. To the contrary, generic function usage in actual programs tends to follow very predictable patterns and registered implementations are highly-discoverable in the common case.
If a module is defining a new generic operation, it will usually also define any required implementations for existing types in the same place. Likewise, if a module is defining a new type, then it will usually define implementations there for any generic functions that it knows or cares about. As a result, the vast majority of registered implementations can be found adjacent to either the function being overloaded, or to a newly-defined type for which the implementation is adding support.
It is only in rather infrequent cases that one will have implementations registered in a module that contains neither the function nor the type(s) for which the implementation is added. In the absence of incompetence or deliberate intention to be obscure, the few implementations that are not registered adjacent to the relevant type(s) or function(s), will generally not need to be understood or known about outside the scope where those implementations are defined. (Except in the "support modules" case, where best practice suggests naming them accordingly.)
As mentioned earlier, single-dispatch generics are already prolific throughout the standard library. A clean, standard way of doing them provides a way forward to refactor those custom implementations to use a common one, opening them up for user extensibility at the same time.
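As a hypothetical illustration of such a refactoring, a closed isinstance ladder can become a set of registered implementations that third parties may extend (the helper functions are invented for this sketch):

```python
from functools import singledispatch

# Before: a closed isinstance ladder.
def to_bytes_old(obj):
    if isinstance(obj, bytes):
        return obj
    if isinstance(obj, str):
        return obj.encode('utf-8')
    raise TypeError(repr(obj))

# After: the same behaviour, but open to extension via register().
@singledispatch
def to_bytes(obj):
    raise TypeError(repr(obj))

@to_bytes.register(bytes)
def _(obj):
    return obj

@to_bytes.register(str)
def _(obj):
    return obj.encode('utf-8')
```

A user of the refactored version can register an implementation for their own type without touching the original module.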
Alternative approaches
In PEP 3124 [4] Phillip J. Eby proposes a full-grown solution with overloading based on arbitrary rule sets (with the default implementation dispatching on argument types), as well as interfaces, adaptation and method combining. PEAK-Rules [5] is a reference implementation of the concepts described in PJE's PEP.
Such a broad approach is inherently complex, which makes reaching a consensus hard. In contrast, this PEP focuses on a single piece of functionality that is simple to reason about. It's important to note this does not preclude the use of other approaches now or in the future.
In a 2005 article on Artima [6] Guido van Rossum presents a generic function implementation that dispatches on the types of all arguments of a function. The same approach was chosen in Andrey Popp's generic package available on PyPI [7], as well as David Mertz's gnosis.magic.multimethods [8].
While this seems desirable at first, I agree with Fredrik Lundh's comment that "if you design APIs with pages of logic just to sort out what code a function should execute, you should probably hand over the API design to someone else". In other words, the single argument approach proposed in this PEP is not only easier to implement but also clearly communicates that dispatching on a more complex state is an anti-pattern. It also has the virtue of corresponding directly with the familiar method dispatch mechanism in object oriented programming. The only difference is whether the custom implementation is associated more closely with the data (object-oriented methods) or the algorithm (single-dispatch overloading).
PyPy's RPython offers extendabletype [9], a metaclass which enables classes to be externally extended. In combination with pairtype() and pair() factories, this offers a form of single-dispatch generics.
Acknowledgements
Apart from Phillip J. Eby's work on PEP 3124 [4] and PEAK-Rules, influences include Paul Moore's original issue [10] that proposed exposing pkgutil.simplegeneric as part of the functools API, Guido van Rossum's article on multimethods [6], and discussions with Raymond Hettinger on a general pprint rewrite. Huge thanks to Nick Coghlan for encouraging me to create this PEP and providing initial feedback.
References
| [1] | http://hg.python.org/features/pep-443/file/tip/Lib/functools.py#l359 |
| [2] | PEP 8 states in the "Programming Recommendations" section that "the Python standard library will not use function annotations as that would result in a premature commitment to a particular annotation style". (http://www.python.org/dev/peps/pep-0008) |
| [3] | http://bugs.python.org/issue18244 |
| [4] | (1, 2) http://www.python.org/dev/peps/pep-3124/ |
| [5] | http://peak.telecommunity.com/DevCenter/PEAK_2dRules |
| [6] | (1, 2) http://www.artima.com/weblogs/viewpost.jsp?thread=101605 |
| [7] | http://pypi.python.org/pypi/generic |
| [8] | http://gnosis.cx/publish/programming/charming_python_b12.html |
| [9] | https://bitbucket.org/pypy/pypy/raw/default/rpython/tool/pairtype.py |
| [10] | http://bugs.python.org/issue5135 |
Copyright
This document has been placed in the public domain.
pep-0444 Python Web3 Interface
| PEP: | 444 |
|---|---|
| Title: | Python Web3 Interface |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Chris McDonough <chrism at plope.com>, Armin Ronacher <armin.ronacher at active-4.com> |
| Discussions-To: | Python Web-SIG <web-sig at python.org> |
| Status: | Deferred |
| Type: | Informational |
| Content-Type: | text/x-rst |
| Created: | 19-Jul-2010 |
Contents
- Abstract
- PEP Deferral
- Rationale and Goals
- Differences from WSGI
- Specification Overview
- Specification Details
- Implementation/Application Notes
- Open Questions
- Points of Contention
- WSGI 1.0 Compatibility
- Environ and Response Values as Bytes
- Applications Should be Allowed to Read web3.input Past CONTENT_LENGTH
- web3.input Unknown Length
- read() of web3.input Should Support No-Size Calling Convention
- headers as Literal List of Two-Tuples
- Removed Requirement that Middleware Not Block
- web3.script_name and web3.path_info
- Long Response Headers
- Request Trailers and Chunked Transfer Encoding
- References
- Copyright
Abstract
This document specifies a proposed second-generation standard interface between web servers and Python web applications or frameworks.
PEP Deferral
Further exploration of the concepts covered in this PEP has been deferred for lack of a current champion interested in promoting the goals of the PEP and collecting and incorporating feedback, and with sufficient available time to do so effectively.
Note that since this PEP was first created, PEP 3333 was created as a more incremental update that permitted use of WSGI on Python 3.2+. However, an alternative specification that furthers the Python 3 goals of a cleaner separation of binary and text data may still be valuable.
Rationale and Goals
This protocol and specification is influenced heavily by the Web Server Gateway Interface (WSGI) 1.0 standard described in PEP 333 [1]. The high-level rationale for having any standard that allows Python-based web servers and applications to interoperate is outlined in PEP 333. This document essentially uses PEP 333 as a template, and changes its wording in various places for the purpose of forming a different standard.
Python currently boasts a wide variety of web application frameworks which use the WSGI 1.0 protocol. However, due to changes in the language, the WSGI 1.0 protocol is not compatible with Python 3. This specification describes a standardized WSGI-like protocol that lets Python 2.6, 2.7 and 3.1+ applications communicate with web servers. Web3 is clearly a WSGI derivative; it only uses a different name than "WSGI" in order to indicate that it is not in any way backwards compatible.
Applications and servers written to this specification are meant to work properly under Python 2.6.X, Python 2.7.X and Python 3.1+. Neither applications nor servers implementing the Web3 specification can easily be written to work under Python 2 versions earlier than 2.6 or Python 3 versions earlier than 3.1.
Note
The true minimum Python 3 version is whichever release fixed http://bugs.python.org/issue4006, so that os.environ['foo'] returns surrogates (per PEP 383) when the value of 'foo' cannot be decoded using the current locale, instead of failing with a KeyError. In particular, Python 3.0 is not supported.
Note
Python 2.6 is the first Python version that supported an alias for bytes and the b"foo" literal syntax. This is why it is the minimum version supported by Web3.
Explicability and documentability are the main technical drivers for the decisions made within the standard.
Differences from WSGI
- All protocol-specific environment names are prefixed with web3. rather than wsgi., eg. web3.input rather than wsgi.input.
- All values present as environment dictionary values are explicitly bytes instances instead of native strings. (Environment keys however are native strings, always str regardless of platform).
- All values returned by an application must be bytes instances, including status code, header names and values, and the body.
- Wherever WSGI 1.0 referred to an app_iter, this specification refers to a body.
- No start_response() callback (and therefore no write() callable nor exc_info data).
- The readline() function of web3.input must support a size hint parameter.
- The read() function of web3.input must be length delimited. A call without a size argument must not read more than the content length header specifies. In case a content length header is absent the stream must not return anything on read. It must never request more data than specified from the client.
- No requirement for middleware to yield an empty string if it needs more information from an application to produce output (e.g. no "Middleware Handling of Block Boundaries").
- Filelike objects passed to a "file_wrapper" must have an __iter__ which returns bytes (never text).
- wsgi.file_wrapper is not supported.
- QUERY_STRING, SCRIPT_NAME, PATH_INFO values required to be placed in environ by server (each as the empty bytes instance if no associated value is received in the HTTP request).
- web3.path_info and web3.script_name should be put into the Web3 environment, if possible, by the origin Web3 server. When available, each is the original, plain 7-bit ASCII, URL-encoded variant of its CGI equivalent derived directly from the request URI (with %2F segment markers and other meta-characters intact). If the server cannot provide one (or both) of these values, it must omit the value(s) it cannot provide from the environment.
- This requirement was removed: "middleware components must not block iteration waiting for multiple values from an application iterable. If the middleware needs to accumulate more data from the application before it can produce any output, it must yield an empty string."
- SERVER_PORT must be a bytes instance (not an integer).
- The server must not inject an additional Content-Length header by guessing the length from the response iterable. This must be set by the application itself in all situations.
- If the origin server advertises that it has the web3.async capability, a Web3 application callable used by the server is permitted to return a callable that accepts no arguments. When it does so, this callable is to be called periodically by the origin server until it returns a non-None response, which must be a normal Web3 response tuple.
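The length-delimited read() behaviour described above can be sketched as a wrapper around a raw input stream (the class and its names are illustrative, not part of the specification):

```python
import io

class LengthDelimitedInput(object):
    """Illustrative wrapper enforcing the web3.input read() contract:
    never return data past CONTENT_LENGTH, and return nothing at all
    when no content length was supplied."""

    def __init__(self, stream, content_length):
        self._stream = stream
        self._remaining = content_length  # None when the header is absent

    def read(self, size=None):
        # Without a content length, the stream must not return anything.
        if self._remaining is None or self._remaining <= 0:
            return b''
        # Clamp every read to the number of bytes still permitted.
        if size is None or size > self._remaining:
            size = self._remaining
        data = self._stream.read(size)
        self._remaining -= len(data)
        return data

inp = LengthDelimitedInput(io.BytesIO(b'abcdef'), 4)
```

Calling inp.read() with no size argument then returns at most the four bytes allowed by the content length, and later reads return the empty bytes instance.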
Specification Overview
The Web3 interface has two sides: the "server" or "gateway" side, and the "application" or "framework" side. The server side invokes a callable object that is provided by the application side. The specifics of how that object is provided are up to the server or gateway. It is assumed that some servers or gateways will require an application's deployer to write a short script to create an instance of the server or gateway, and supply it with the application object. Other servers and gateways may use configuration files or other mechanisms to specify where an application object should be imported from, or otherwise obtained.
In addition to "pure" servers/gateways and applications/frameworks, it is also possible to create "middleware" components that implement both sides of this specification. Such components act as an application to their containing server, and as a server to a contained application, and can be used to provide extended APIs, content transformation, navigation, and other useful functions.
Throughout this specification, we will use the term "application callable" to mean "a function, a method, or an instance with a __call__ method". It is up to the server, gateway, or application implementing the application callable to choose the appropriate implementation technique for their needs. Conversely, a server, gateway, or application that is invoking a callable must not have any dependency on what kind of callable was provided to it. Application callables are only to be called, not introspected upon.
The Application/Framework Side
The application object is simply a callable object that accepts one argument. The term "object" should not be misconstrued as requiring an actual object instance: a function, method, or instance with a __call__ method are all acceptable for use as an application object. Application objects must be able to be invoked more than once, as virtually all servers/gateways (other than CGI) will make such repeated requests. If this cannot be guaranteed by the implementation of the actual application, it has to be wrapped in a function that creates a new instance on each call.
Note
Although we refer to it as an "application" object, this should not be construed to mean that application developers will use Web3 as a web programming API. It is assumed that application developers will continue to use existing, high-level framework services to develop their applications. Web3 is a tool for framework and server developers, and is not intended to directly support application developers.
An example of an application which is a function (simple_app):
def simple_app(environ):
    """Simplest possible application object"""
    status = b'200 OK'
    headers = [(b'Content-type', b'text/plain')]
    body = [b'Hello world!\n']
    return body, status, headers
An example of an application which is an instance (simple_app):
class AppClass(object):
    """Produce the same output, but using an instance.  An
    instance of this class must be instantiated before it is
    passed to the server.
    """
    def __call__(self, environ):
        status = b'200 OK'
        headers = [(b'Content-type', b'text/plain')]
        body = [b'Hello world!\n']
        return body, status, headers

simple_app = AppClass()
Alternatively, an application callable may return a callable instead of the tuple if the server supports asynchronous execution. See the information concerning web3.async for more information.
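A minimal sketch of such an asynchronous application, together with the polling loop a web3.async-capable server might run (all names and the readiness logic here are invented for illustration):

```python
def async_app(environ):
    """Sketch of an application that returns an argumentless
    callable instead of a Web3 response tuple."""
    state = {'polls': 0}

    def poll():
        state['polls'] += 1
        if state['polls'] < 3:   # pretend the work is not finished yet
            return None
        body = [b'Hello world!\n']
        return body, b'200 OK', [(b'Content-type', b'text/plain')]

    return poll

def run(app, environ):
    # A server advertising web3.async calls the returned callable
    # periodically until it yields a non-None Web3 response tuple.
    rv = app(environ)
    while callable(rv):
        response = rv()
        if response is not None:
            return response
    return rv
```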
The Server/Gateway Side
The server or gateway invokes the application callable once for each request it receives from an HTTP client, that is directed at the application. To illustrate, here is a simple CGI gateway, implemented as a function taking an application object. Note that this simple example has limited error handling, because by default an uncaught exception will be dumped to sys.stderr and logged by the web server.
import locale
import os
import sys

encoding = locale.getpreferredencoding()

stdout = sys.stdout
if hasattr(sys.stdout, 'buffer'):
    # Python 3 compatibility; we need to be able to push bytes out
    stdout = sys.stdout.buffer

def get_environ():
    d = {}
    for k, v in os.environ.items():
        # Python 3 compatibility
        if not isinstance(v, bytes):
            # We must explicitly encode the string to bytes under
            # Python 3.1+
            v = v.encode(encoding, 'surrogateescape')
        d[k] = v
    return d

def run_with_cgi(application):
    environ = get_environ()
    environ['web3.input'] = sys.stdin
    environ['web3.errors'] = sys.stderr
    environ['web3.version'] = (1, 0)
    environ['web3.multithread'] = False
    environ['web3.multiprocess'] = True
    environ['web3.run_once'] = True
    environ['web3.async'] = False
    if environ.get('HTTPS', b'off') in (b'on', b'1'):
        environ['web3.url_scheme'] = b'https'
    else:
        environ['web3.url_scheme'] = b'http'

    rv = application(environ)
    if hasattr(rv, '__call__'):
        raise TypeError('This webserver does not support asynchronous '
                        'responses.')
    body, status, headers = rv

    CRLF = b'\r\n'
    try:
        stdout.write(b'Status: ' + status + CRLF)
        for header_name, header_val in headers:
            stdout.write(header_name + b': ' + header_val + CRLF)
        stdout.write(CRLF)
        for chunk in body:
            stdout.write(chunk)
        stdout.flush()
    finally:
        if hasattr(body, 'close'):
            body.close()
Middleware: Components that Play Both Sides
A single object may play the role of a server with respect to some application(s), while also acting as an application with respect to some server(s). Such "middleware" components can perform such functions as:
- Routing a request to different application objects based on the target URL, after rewriting the environ accordingly.
- Allowing multiple applications or frameworks to run side-by-side in the same process.
- Load balancing and remote processing, by forwarding requests and responses over a network.
- Perform content postprocessing, such as applying XSL stylesheets.
The presence of middleware in general is transparent to both the "server/gateway" and the "application/framework" sides of the interface, and should require no special support. A user who desires to incorporate middleware into an application simply provides the middleware component to the server, as if it were an application, and configures the middleware component to invoke the application, as if the middleware component were a server. Of course, the "application" that the middleware wraps may in fact be another middleware component wrapping another application, and so on, creating what is referred to as a "middleware stack".
A middleware must support asynchronous execution if possible, or fall back to disabling itself.
Here is a middleware that changes the HTTP_HOST key if an X-Host header exists, and adds a comment to all HTML responses:
import time

def apply_filter(app, environ, filter_func):
    """Helper function that passes the return value from an
    application to a filter function when the results are
    ready.
    """
    app_response = app(environ)

    # synchronous response, filter now
    if not hasattr(app_response, '__call__'):
        return filter_func(*app_response)

    # asynchronous response; filter when results are ready
    def polling_function():
        rv = app_response()
        if rv is not None:
            return filter_func(*rv)
    return polling_function

def proxy_and_timing_support(app):
    def new_application(environ):
        def filter_func(body, status, headers):
            now = time.time()
            for key, value in headers:
                if key.lower() == b'content-type' and \
                   value.split(b';')[0] == b'text/html':
                    # assumes an ASCII-compatible encoding in body,
                    # but the middleware should actually parse the
                    # content type header and figure out the
                    # encoding when doing that.  Materialize the body
                    # first; augmenting a list with bytes would splice
                    # in individual byte values.
                    body = list(body) + [
                        ('<!-- Execution time: %.2fsec -->' %
                         (now - then)).encode('ascii')]
                    break
            return body, status, headers
        then = time.time()
        host = environ.get('HTTP_X_HOST')
        if host is not None:
            environ['HTTP_HOST'] = host
        # use the apply_filter function that applies a given filter
        # function for both async and sync responses.
        return apply_filter(app, environ, filter_func)
    return new_application

app = proxy_and_timing_support(app)
Specification Details
The application callable must accept one positional argument. For the sake of illustration, we have named it environ, but it is not required to have this name. A server or gateway must invoke the application object using a positional (not keyword) argument. (E.g. by calling body, status, headers = application(environ) as shown above.)
The environ parameter is a dictionary object, containing CGI-style environment variables. This object must be a builtin Python dictionary (not a subclass, UserDict or other dictionary emulation), and the application is allowed to modify the dictionary in any way it desires. The dictionary must also include certain Web3-required variables (described in a later section), and may also include server-specific extension variables, named according to a convention that will be described below.
When called by the server, the application object must return a tuple yielding three elements: status, headers and body, or, if supported by an async server, an argumentless callable which either returns None or a tuple of those three elements.
The status element is a status in bytes of the form b'999 Message here'.
headers is a Python list of (header_name, header_value) pairs describing the HTTP response header. The headers structure must be a literal Python list; it must yield two-tuples. Both header_name and header_value must be bytes values.
The body is an iterable yielding zero or more bytes instances. This can be accomplished in a variety of ways, such as by returning a list containing bytes instances as body, or by returning a generator function as body that yields bytes instances, or by the body being an instance of a class which is iterable. Regardless of how it is accomplished, the application object must always return a body iterable yielding zero or more bytes instances.
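For example, a body built from a generator function might look like this sketch (the application name is illustrative):

```python
def generator_body_app(environ):
    """Sketch of an application whose body is a generator
    yielding bytes instances."""
    def body():
        yield b'Hello '
        yield b'world!\n'
    status = b'200 OK'
    headers = [(b'Content-type', b'text/plain')]
    # body() returns a generator, which is a valid body iterable.
    return body(), status, headers
```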
The server or gateway must transmit the yielded bytes to the client in an unbuffered fashion, completing the transmission of each set of bytes before requesting another one. (In other words, applications should perform their own buffering. See the Buffering and Streaming section below for more on how application output must be handled.)
The server or gateway should treat the yielded bytes as binary byte sequences: in particular, it should ensure that line endings are not altered. The application is responsible for ensuring that the string(s) to be written are in a format suitable for the client. (The server or gateway may apply HTTP transfer encodings, or perform other transformations for the purpose of implementing HTTP features such as byte-range transmission. See Other HTTP Features, below, for more details.)
If the body iterable returned by the application has a close() method, the server or gateway must call that method upon completion of the current request, whether the request was completed normally, or terminated early due to an error. This is to support resource release by the application and is intended to complement PEP 325's generator support, and other common iterables with close() methods.
Finally, servers and gateways must not directly use any other attributes of the body iterable returned by the application.
environ Variables
The environ dictionary is required to contain various CGI environment variables, as defined by the Common Gateway Interface specification [2].
The following CGI variables must be present. Each key is a native string. Each value is a bytes instance.
Note
In Python 3.1+, a "native string" is a str instance decoded using the surrogateescape error handler, as done by os.environ.__getitem__. In Python 2.6 and 2.7, a "native string" is a str instance representing a sequence of bytes.
- REQUEST_METHOD
- The HTTP request method, such as "GET" or "POST".
- SCRIPT_NAME
- The initial portion of the request URL's "path" that corresponds to the application object, so that the application knows its virtual "location". This may be the empty bytes instance if the application corresponds to the "root" of the server. SCRIPT_NAME will be a bytes instance representing a sequence of URL-encoded segments separated by the slash character (/). It is assumed that %2F characters will be decoded into literal slash characters within PATH_INFO, as per CGI.
- PATH_INFO
- The remainder of the request URL's "path", designating the virtual "location" of the request's target within the application. This may be the empty bytes instance, if the request URL targets the application root and does not have a trailing slash. PATH_INFO will be a bytes instance representing a sequence of URL-encoded segments separated by the slash character (/). It is assumed that %2F characters will be decoded into literal slash characters within PATH_INFO, as per CGI.
- QUERY_STRING
- The portion of the request URL (in bytes) that follows the "?", if any, or the empty bytes instance.
- SERVER_NAME, SERVER_PORT
- When combined with SCRIPT_NAME and PATH_INFO (or their raw equivalents), these variables can be used to complete the URL. Note, however, that HTTP_HOST, if present, should be used in preference to SERVER_NAME for reconstructing the request URL. See the URL Reconstruction section below for more detail. SERVER_PORT should be a bytes instance, not an integer.
- SERVER_PROTOCOL
- The version of the protocol the client used to send the request. Typically this will be something like "HTTP/1.0" or "HTTP/1.1" and may be used by the application to determine how to treat any HTTP request headers. (This variable should probably be called REQUEST_PROTOCOL, since it denotes the protocol used in the request, and is not necessarily the protocol that will be used in the server's response. However, for compatibility with CGI we have to keep the existing name.)
The following CGI values may be present in the Web3 environment. Each key is a native string. Each value is a bytes instance.
- CONTENT_TYPE
- The contents of any Content-Type fields in the HTTP request.
- CONTENT_LENGTH
- The contents of any Content-Length fields in the HTTP request.
- HTTP_ Variables
- Variables corresponding to the client-supplied HTTP request headers (i.e., variables whose names begin with "HTTP_"). The presence or absence of these variables should correspond with the presence or absence of the appropriate HTTP header in the request.
A server or gateway should attempt to provide as many other CGI variables as are applicable, each with a string for its key and a bytes instance for its value. In addition, if SSL is in use, the server or gateway should also provide as many of the Apache SSL environment variables [5] as are applicable, such as HTTPS=on and SSL_PROTOCOL. Note, however, that an application that uses any CGI variables other than the ones listed above is necessarily non-portable to web servers that do not support the relevant extensions. (For example, web servers that do not publish files will not be able to provide a meaningful DOCUMENT_ROOT or PATH_TRANSLATED.)
A Web3-compliant server or gateway should document what variables it provides, along with their definitions as appropriate. Applications should check for the presence of any variables they require, and have a fallback plan in the event such a variable is absent.
Note that CGI variable values must be bytes instances, if they are present at all. It is a violation of this specification for a CGI variable's value to be of any type other than bytes. On Python 2, this means they will be of type str. On Python 3, this means they will be of type bytes.
The keys of all CGI and non-CGI variables in the environ, however, must be "native strings" (on both Python 2 and Python 3, they will be of type str).
In addition to the CGI-defined variables, the environ dictionary may also contain arbitrary operating-system "environment variables", and must contain the following Web3-defined variables.
| Variable | Value |
|---|---|
| web3.version | The tuple (1, 0), representing Web3 version 1.0. |
| web3.url_scheme | A bytes value representing the "scheme" portion of the URL at which the application is being invoked. Normally, this will have the value b"http" or b"https", as appropriate. |
| web3.input | An input stream (file-like object) from which bytes constituting the HTTP request body can be read. (The server or gateway may perform reads on-demand as requested by the application, or it may pre- read the client's request body and buffer it in-memory or on disk, or use any other technique for providing such an input stream, according to its preference.) |
| web3.errors | An output stream (file-like object) to which error output text can be written, for the purpose of recording program or other errors in a standardized and possibly centralized location. This should be a "text mode" stream; i.e., applications should use "\n" as a line ending, and assume that it will be converted to the correct line ending by the server/gateway. Applications may not send bytes to the 'write' method of this stream; they may only send text. For many servers, web3.errors will be the server's main error log. Alternatively, this may be sys.stderr, or a log file of some sort. The server's documentation should include an explanation of how to configure this or where to find the recorded output. A server or gateway may supply different error streams to different applications, if this is desired. |
| web3.multithread | This value should evaluate true if the application object may be simultaneously invoked by another thread in the same process, and should evaluate false otherwise. |
| web3.multiprocess | This value should evaluate true if an equivalent application object may be simultaneously invoked by another process, and should evaluate false otherwise. |
| web3.run_once | This value should evaluate true if the server or gateway expects (but does not guarantee!) that the application will only be invoked this one time during the life of its containing process. Normally, this will only be true for a gateway based on CGI (or something similar). |
| web3.script_name | The non-URL-decoded SCRIPT_NAME value. Through a historical inequity, by virtue of the CGI specification, SCRIPT_NAME is present within the environment as an already URL-decoded string. This is the original URL-encoded value derived from the request URI. If the server cannot provide this value, it must omit it from the environ. |
| web3.path_info | The non-URL-decoded PATH_INFO value. Through a historical inequity, by virtue of the CGI specification, PATH_INFO is present within the environment as an already URL-decoded string. This is the original URL-encoded value derived from the request URI. If the server cannot provide this value, it must omit it from the environ. |
| web3.async | This is True if the webserver supports async invocation. In that case an application is allowed to return a callable instead of a tuple with the response. The exact semantics are not specified by this specification. |
Finally, the environ dictionary may also contain server-defined variables. These variables should have names which are native strings, composed of only lower-case letters, numbers, dots, and underscores, and should be prefixed with a name that is unique to the defining server or gateway. For example, mod_web3 might define variables with names like mod_web3.some_variable.
Input Stream
The input stream (web3.input) provided by the server must support the following methods:
| Method | Notes |
|---|---|
| read(size) | 1,4 |
| readline([size]) | 1,2,4 |
| readlines([size]) | 1,3,4 |
| __iter__() | 4 |
The semantics of each method are as documented in the Python Library Reference, except for these notes as listed in the table above:
- The server is not required to read past the client's specified Content-Length, and is allowed to simulate an end-of-file condition if the application attempts to read past that point. The application should not attempt to read more data than is specified by the CONTENT_LENGTH variable.
- The implementation must support the optional size argument to readline().
- The application is free to not supply a size argument to readlines(), and the server or gateway is free to ignore the value of any supplied size argument.
- The read, readline and __iter__ methods must return a bytes instance. The readlines method must return a sequence which contains instances of bytes.
The methods listed in the table above must be supported by all servers conforming to this specification. Applications conforming to this specification must not use any other methods or attributes of the input object. In particular, applications must not attempt to close this stream, even if it possesses a close() method.
The input stream should silently ignore attempts to read more than the content length of the request. If no content length is specified, the stream must behave as an empty (zero-length) stream that returns no data.
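One way a server might satisfy these requirements is a wrapper that bounds reads at the declared content length and simulates end-of-file beyond it. This is a sketch under assumed names (BoundedInput is illustrative, not mandated by the specification):

```python
import io

class BoundedInput:
    """Illustrative web3.input: never reads past Content-Length and
    simulates end-of-file for any attempt to read beyond it."""

    def __init__(self, raw, content_length):
        self._raw = raw                    # underlying binary stream
        self._remaining = content_length   # bytes still readable

    def read(self, size=-1):
        if size < 0 or size > self._remaining:
            size = self._remaining
        data = self._raw.read(size)
        self._remaining -= len(data)
        return data

    def readline(self, size=-1):
        # Note 2: the optional size argument must be supported.
        if size < 0 or size > self._remaining:
            size = self._remaining
        line = self._raw.readline(size)
        self._remaining -= len(line)
        return line

    def readlines(self, hint=-1):
        # Note 3: the server is free to ignore any supplied size hint.
        return list(self)

    def __iter__(self):
        while True:
            line = self.readline()
            if not line:
                return
            yield line

# Extra bytes past Content-Length (here 7) are never exposed.
body = BoundedInput(io.BytesIO(b'a=1&b=2EXTRA'), 7)
```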
Error Stream
The error stream (web3.errors) provided by the server must support the following methods:
| Method | Stream | Notes |
|---|---|---|
| flush() | errors | 1 |
| write(str) | errors | 2 |
| writelines(seq) | errors | 2 |
The semantics of each method are as documented in the Python Library Reference, except for these notes as listed in the table above:
- Since the errors stream may not be rewound, servers and gateways are free to forward write operations immediately, without buffering. In this case, the flush() method may be a no-op. Portable applications, however, cannot assume that output is unbuffered or that flush() is a no-op. They must call flush() if they need to ensure that output has in fact been written. (For example, to minimize intermingling of data from multiple processes writing to the same error log.)
- The write() method must accept a string argument, but needn't necessarily accept a bytes argument. The writelines() method must accept a sequence argument that consists entirely of strings, but needn't necessarily accept any bytes instance as a member of the sequence.
The methods listed in the table above must be supported by all servers conforming to this specification. Applications conforming to this specification must not use any other methods or attributes of the errors object. In particular, applications must not attempt to close this stream, even if it possesses a close() method.
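In application code, use of the error stream might look like the following sketch (io.StringIO stands in for whatever text-mode stream the server actually supplies):

```python
import io

# Stand-in for a server-provided web3.errors stream; real servers may
# use their main error log, sys.stderr, or a log file.
errors = io.StringIO()

# Applications write text (never bytes), using "\n" as the line ending.
errors.write("request failed: bad form data\n")

# flush() may be a no-op, but portable applications call it anyway to
# ensure the output has actually been written.
errors.flush()
```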
Values Returned by A Web3 Application
Web3 applications return a tuple in the form (status, headers, body). If the server supports asynchronous applications (web3.async), the response may be a callable object (which accepts no arguments).
The status value is assumed by a gateway or server to be an HTTP "status" bytes instance like b'200 OK' or b'404 Not Found'. That is, it is a string consisting of a Status-Code and a Reason-Phrase, in that order and separated by a single space, with no surrounding whitespace or other characters. (See RFC 2616, Section 6.1.1 for more information.) The string must not contain control characters, and must not be terminated with a carriage return, linefeed, or combination thereof.
The headers value is assumed by a gateway or server to be a literal Python list of (header_name, header_value) tuples. Each header_name must be a bytes instance representing a valid HTTP header field-name (as defined by RFC 2616, Section 4.2), without a trailing colon or other punctuation. Each header_value must be a bytes instance and must not include any control characters, including carriage returns or linefeeds, either embedded or at the end. (These requirements are to minimize the complexity of any parsing that must be performed by servers, gateways, and intermediate response processors that need to inspect or modify response headers.)
In general, the server or gateway is responsible for ensuring that correct headers are sent to the client: if the application omits a header required by HTTP (or other relevant specifications that are in effect), the server or gateway must add it. For example, the HTTP Date: and Server: headers would normally be supplied by the server or gateway. The gateway must however not override values with the same name if they are emitted by the application.
(A reminder for server/gateway authors: HTTP header names are case-insensitive, so be sure to take that into consideration when examining application-supplied headers!)
Applications and middleware are forbidden from using HTTP/1.1 "hop-by-hop" features or headers, any equivalent features in HTTP/1.0, or any headers that would affect the persistence of the client's connection to the web server. These features are the exclusive province of the actual web server, and a server or gateway should consider it a fatal error for an application to attempt sending them, and raise an error if they are supplied as return values from an application in the headers structure. (For more specifics on "hop-by-hop" features and headers, please see the Other HTTP Features section below.)
Dealing with Compatibility Across Python Versions
Creating Web3 code that runs under both Python 2.6/2.7 and Python 3.1+ requires some care on the part of the developer. In general, the Web3 specification assumes a certain level of equivalence between the Python 2 str type and the Python 3 bytes type. For example, under Python 2, the values present in the Web3 environ will be instances of the str type; in Python 3, these will be instances of the bytes type. The Python 3 bytes type does not possess all the methods of the Python 2 str type, and some methods which it does possess behave differently than the Python 2 str type. Effectively, to ensure that Web3 middleware and applications work across Python versions, developers must do these things:
- Do not assume comparison equivalence between text values and bytes values. If you do so, your code may work under Python 2, but it will not work properly under Python 3. For example, don't write somebytes == 'abc'. This will sometimes be true on Python 2 but it will never be true on Python 3, because a sequence of bytes never compares equal to a string under Python 3. Instead, always compare a bytes value with a bytes value, e.g. "somebytes == b'abc'". Code which does this is compatible with and works the same in Python 2.6, 2.7, and 3.1. The b in front of 'abc' signals to Python 3 that the value is a literal bytes instance; under Python 2 it's a forward compatibility placebo.
- Don't use the __contains__ method (directly or indirectly) of items that are meant to be byteslike without ensuring that its argument is also a bytes instance. If you do so, your code may work under Python 2, but it will not work properly under Python 3. For example, 'abc' in somebytes will raise a TypeError under Python 3, but it will return True under Python 2.6 and 2.7. However, b'abc' in somebytes will work the same on both versions. In Python 3.2, this restriction may be partially removed, as it's rumored that bytes types may obtain a __mod__ implementation.
- Don't rely on the __getitem__ method of byteslike items. Indexing a byteslike value returns a one-character str under Python 2 but an integer under Python 3, so code that indexes bytes directly will behave differently across versions.
- Don't try to use the format method or the __mod__ method of instances of bytes (directly or indirectly). In Python 2, the str type (which we treat as equivalent to Python 3's bytes) supports these methods, but actual Python 3 bytes instances don't. If you use these methods, your code will work under Python 2, but not under Python 3.
- Do not try to concatenate a bytes value with a string value. This may work under Python 2, but it will not work under Python 3. For example, doing 'abc' + somebytes will work under Python 2, but it will result in a TypeError under Python 3. Instead, always make sure you're concatenating two items of the same type, e.g. b'abc' + somebytes.
Web3 expects byte values in other places, such as in all the values returned by an application.
In short, to ensure compatibility of Web3 application code between Python 2 and Python 3, in Python 2, treat CGI and server variable values in the environment as if they had the Python 3 bytes API even though they actually have a more capable API. Likewise for all stringlike values returned by a Web3 application.
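The rules above can be checked directly on Python 3; each statement below mirrors one of the bullets (somebytes is an illustrative value):

```python
somebytes = b'abc'

# Comparison: a bytes value never compares equal to text on Python 3.
text_equal = (somebytes == 'abc')        # False on Python 3

# Containment: a text needle raises TypeError on Python 3, so always
# use a bytes needle (b'ab' in somebytes).
try:
    'ab' in somebytes
    text_contains_ok = True
except TypeError:
    text_contains_ok = False

# Indexing: bytes indexing yields an int on Python 3 (a str on Python 2).
first_item = somebytes[0]

# Concatenation: both operands must be bytes.
joined = b'abc' + somebytes
```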
Buffering and Streaming
Generally speaking, applications will achieve the best throughput by buffering their (modestly-sized) output and sending it all at once. This is a common approach in existing frameworks: the output is buffered in a StringIO or similar object, then transmitted all at once, along with the response headers.
The corresponding approach in Web3 is for the application to simply return a single-element body iterable (such as a list) containing the response body as a single string. This is the recommended approach for the vast majority of application functions that render HTML pages whose text easily fits in memory.
For large files, however, or for specialized uses of HTTP streaming (such as multipart "server push"), an application may need to provide output in smaller blocks (e.g. to avoid loading a large file into memory). It's also sometimes the case that part of a response may be time-consuming to produce, but it would be useful to send ahead the portion of the response that precedes it.
In these cases, applications will usually return a body iterator (often a generator-iterator) that produces the output in a block-by-block fashion. These blocks may be broken to coincide with multipart boundaries (for "server push"), or just before time-consuming tasks (such as reading another block of an on-disk file).
Web3 servers, gateways, and middleware must not delay the transmission of any block; they must either fully transmit the block to the client, or guarantee that they will continue transmission even while the application is producing its next block. A server/gateway or middleware may provide this guarantee in one of three ways:
- Send the entire block to the operating system (and request that any O/S buffers be flushed) before returning control to the application, OR
- Use a different thread to ensure that the block continues to be transmitted while the application produces the next block, OR
- (Middleware only) send the entire block to its parent gateway/server.
By providing this guarantee, Web3 allows applications to ensure that transmission will not become stalled at an arbitrary point in their output data. This is critical for proper functioning of e.g. multipart "server push" streaming, where data between multipart boundaries should be transmitted in full to the client.
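A block-by-block body is naturally expressed as a generator. The sketch below streams a hypothetical large response in fixed-size chunks (the chunk size and names are illustrative):

```python
import io

CHUNK = 4096  # illustrative block size

def file_chunks(fileobj, chunk_size=CHUNK):
    """Yield successive bytes blocks so the server can transmit each
    one before the next is produced."""
    while True:
        block = fileobj.read(chunk_size)
        if not block:
            return
        yield block

def application(environ):
    # In a real application this would be an on-disk file opened in
    # binary mode; BytesIO stands in here so the sketch is runnable.
    bigfile = io.BytesIO(b'x' * 10000)
    headers = [(b'Content-Type', b'application/octet-stream')]
    return b'200 OK', headers, file_chunks(bigfile)

status, headers, body = application({})
blocks = list(body)
```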
Unicode Issues
HTTP does not directly support Unicode, and neither does this interface. All encoding/decoding must be handled by the application; all values passed to or from the server must be of the Python 3 type bytes or instances of the Python 2 type str, not Python 2 unicode or Python 3 str objects.
All "bytes instances" referred to in this specification must:
- On Python 2, be of type str.
- On Python 3, be of type bytes.
All "bytes instances" must not:
- On Python 2, be of type unicode.
- On Python 3, be of type str.
The result of using a textlike object where a byteslike object is required is undefined.
Values returned from a Web3 app as a status or as response headers must follow RFC 2616 with respect to encoding. That is, the bytes returned must contain a character stream of ISO-8859-1 characters, or the character stream should use RFC 2047 MIME encoding.
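On Python 3, this means text destined for the status line or headers is encoded with ISO-8859-1 (latin-1), which maps code points U+0000 through U+00FF one-to-one onto bytes. A brief illustration:

```python
# Encoding text values for the status and headers (Python 3).
status = '200 OK'.encode('iso-8859-1')
header_value = 'text/html; charset=utf-8'.encode('iso-8859-1')

# latin-1 round-trips losslessly for any code point <= U+00FF.
roundtrip = header_value.decode('iso-8859-1')
```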
On Python platforms which do not have a native bytes-like type (e.g. IronPython, etc.), but instead which generally use textlike strings to represent bytes data, the definition of "bytes instance" can be changed: their "bytes instances" must be native strings that contain only code points representable in ISO-8859-1 encoding (\u0000 through \u00FF, inclusive). It is a fatal error for an application on such a platform to supply strings containing any other Unicode character or code point. Similarly, servers and gateways on those platforms must not supply strings to an application containing any other Unicode characters.
HTTP 1.1 Expect/Continue
Servers and gateways that implement HTTP 1.1 must provide transparent support for HTTP 1.1's "expect/continue" mechanism. This may be done in any of several ways:
- Respond to requests containing an Expect: 100-continue header with an immediate "100 Continue" response, and proceed normally.
- Proceed with the request normally, but provide the application with a web3.input stream that will send the "100 Continue" response if/when the application first attempts to read from the input stream. The read request must then remain blocked until the client responds.
- Wait until the client decides that the server does not support expect/continue, and sends the request body on its own. (This is suboptimal, and is not recommended.)
Note that these behavior restrictions do not apply for HTTP 1.0 requests, or for requests that are not directed to an application object. For more information on HTTP 1.1 Expect/Continue, see RFC 2616, sections 8.2.3 and 10.1.1.
Other HTTP Features
In general, servers and gateways should "play dumb" and allow the application complete control over its output. They should only make changes that do not alter the effective semantics of the application's response. It is always possible for the application developer to add middleware components to supply additional features, so server/gateway developers should be conservative in their implementation. In a sense, a server should consider itself to be like an HTTP "gateway server", with the application being an HTTP "origin server". (See RFC 2616, section 1.3, for the definition of these terms.)
However, because Web3 servers and applications do not communicate via HTTP, what RFC 2616 calls "hop-by-hop" headers do not apply to Web3 internal communications. Web3 applications must not generate any "hop-by-hop" headers [4], attempt to use HTTP features that would require them to generate such headers, or rely on the content of any incoming "hop-by-hop" headers in the environ dictionary. Web3 servers must handle any supported inbound "hop-by-hop" headers on their own, such as by decoding any inbound Transfer-Encoding, including chunked encoding if applicable.
Applying these principles to a variety of HTTP features, it should be clear that a server may handle cache validation via the If-None-Match and If-Modified-Since request headers and the Last-Modified and ETag response headers. However, it is not required to do this, and the application should perform its own cache validation if it wants to support that feature, since the server/gateway is not required to do such validation.
Similarly, a server may re-encode or transport-encode an application's response, but the application should use a suitable content encoding on its own, and must not apply a transport encoding. A server may transmit byte ranges of the application's response if requested by the client, and the application doesn't natively support byte ranges. Again, however, the application should perform this function on its own if desired.
Note that these restrictions on applications do not necessarily mean that every application must reimplement every HTTP feature; many HTTP features can be partially or fully implemented by middleware components, thus freeing both server and application authors from implementing the same features over and over again.
Thread Support
Thread support, or lack thereof, is also server-dependent. Servers that can run multiple requests in parallel should also provide the option of running an application in a single-threaded fashion, so that applications or frameworks that are not thread-safe may still be used with that server.
Implementation/Application Notes
Server Extension APIs
Some server authors may wish to expose more advanced APIs that application or framework authors can use for specialized purposes. For example, a gateway based on mod_python might wish to expose part of the Apache API as a Web3 extension.
In the simplest case, this requires nothing more than defining an environ variable, such as mod_python.some_api. But, in many cases, the possible presence of middleware can make this difficult. For example, an API that offers access to the same HTTP headers that are found in environ variables, might return different data if environ has been modified by middleware.
In general, any extension API that duplicates, supplants, or bypasses some portion of Web3 functionality runs the risk of being incompatible with middleware components. Server/gateway developers should not assume that nobody will use middleware, because some framework developers specifically organize their frameworks to function almost entirely as middleware of various kinds.
So, to provide maximum compatibility, servers and gateways that provide extension APIs that replace some Web3 functionality, must design those APIs so that they are invoked using the portion of the API that they replace. For example, an extension API to access HTTP request headers must require the application to pass in its current environ, so that the server/gateway may verify that HTTP headers accessible via the API have not been altered by middleware. If the extension API cannot guarantee that it will always agree with environ about the contents of HTTP headers, it must refuse service to the application, e.g. by raising an error, returning None instead of a header collection, or whatever is appropriate to the API.
These guidelines also apply to middleware that adds information such as parsed cookies, form variables, sessions, and the like to environ. Specifically, such middleware should provide these features as functions which operate on environ, rather than simply stuffing values into environ. This helps ensure that information is calculated from environ after any middleware has done any URL rewrites or other environ modifications.
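For instance, a hypothetical cookie-parsing extension would be exposed as a function that reads the current environ on every call, rather than stuffing a parsed result into environ at middleware setup time (the function name and parsing are illustrative):

```python
def parse_cookies(environ):
    """Illustrative extension: compute cookies from the *current*
    environ, so any middleware rewrites are respected."""
    raw = environ.get('HTTP_COOKIE', b'')
    cookies = {}
    for part in raw.split(b';'):
        if b'=' in part:
            name, _, value = part.strip().partition(b'=')
            cookies[name] = value
    return cookies

environ = {'HTTP_COOKIE': b'session=abc123; theme=dark'}
cookies = parse_cookies(environ)
```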
It is very important that these "safe extension" rules be followed by both server/gateway and middleware developers, in order to avoid a future in which middleware developers are forced to delete any and all extension APIs from environ to ensure that their mediation isn't being bypassed by applications using those extensions!
Application Configuration
This specification does not define how a server selects or obtains an application to invoke. These and other configuration options are highly server-specific matters. It is expected that server/gateway authors will document how to configure the server to execute a particular application object, and with what options (such as threading options).
Framework authors, on the other hand, should document how to create an application object that wraps their framework's functionality. The user, who has chosen both the server and the application framework, must connect the two together. However, since both the framework and the server have a common interface, this should be merely a mechanical matter, rather than a significant engineering effort for each new server/framework pair.
Finally, some applications, frameworks, and middleware may wish to use the environ dictionary to receive simple string configuration options. Servers and gateways should support this by allowing an application's deployer to specify name-value pairs to be placed in environ. In the simplest case, this support can consist merely of copying all operating system-supplied environment variables from os.environ into the environ dictionary, since the deployer in principle can configure these externally to the server, or in the CGI case they may be able to be set via the server's configuration files.
Applications should try to keep such required variables to a minimum, since not all servers will support easy configuration of them. Of course, even in the worst case, persons deploying an application can create a script to supply the necessary configuration values:
from the_app import application

def new_app(environ):
    environ['the_app.configval1'] = b'something'
    return application(environ)
But, most existing applications and frameworks will probably only need a single configuration value from environ, to indicate the location of their application or framework-specific configuration file(s). (Of course, applications should cache such configuration, to avoid having to re-read it upon each invocation.)
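One way to honor that caching advice is to memoize on the configuration path the first time it is seen. The environ key and helper below are purely illustrative:

```python
_config_cache = {}  # maps config path -> parsed configuration

def load_config(environ):
    """Read the application's configuration once per path, then
    reuse the cached result on subsequent invocations."""
    path = environ.get('myapp.config', b'/etc/myapp.conf')
    if path not in _config_cache:
        # A real application would parse the file here; the sketch
        # just records where the configuration came from.
        _config_cache[path] = {'loaded_from': path}
    return _config_cache[path]

cfg1 = load_config({'myapp.config': b'/tmp/app.conf'})
cfg2 = load_config({'myapp.config': b'/tmp/app.conf'})
```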
URL Reconstruction
If an application wishes to reconstruct a request's complete URL (as a bytes object), it may do so using the following algorithm:
host = environ.get('HTTP_HOST')
scheme = environ['web3.url_scheme']
port = environ['SERVER_PORT']
query = environ['QUERY_STRING']

url = scheme + b'://'

if host:
    # HTTP_HOST already includes any non-default port.
    url += host
else:
    url += environ['SERVER_NAME']

    if scheme == b'https':
        if port != b'443':
            url += b':' + port
    else:
        if port != b'80':
            url += b':' + port

# url_quote is assumed to be a bytes-aware percent-encoder.
if 'web3.script_name' in environ:
    url += url_quote(environ['web3.script_name'])
else:
    url += environ['SCRIPT_NAME']

if 'web3.path_info' in environ:
    url += url_quote(environ['web3.path_info'])
else:
    url += environ['PATH_INFO']

if query:
    url += b'?' + query
Note that such a reconstructed URL may not be precisely the same URI as requested by the client. Server rewrite rules, for example, may have modified the client's originally requested URL to place it in a canonical form.
Open Questions
- file_wrapper replacement. Currently nothing is specified here, but it's clear that the old system of in-band signalling is broken: it provides no way for middleware in the process to detect that a response is a file wrapper.
Points of Contention
Outlined below are potential points of contention regarding this specification.
WSGI 1.0 Compatibility
Components written using the WSGI 1.0 specification will not transparently interoperate with components written using this specification. That's because the goals of this proposal and the goals of WSGI 1.0 are not directly aligned.
WSGI 1.0 is obliged to provide specification-level backwards compatibility with versions of Python between 2.2 and 2.7. This specification, however, drops compatibility with Python 2.5 and earlier in order to provide compatibility between relatively recent versions of Python 2 (2.6 and 2.7) as well as relatively recent versions of Python 3 (3.1).
It is currently impossible to write components which work reliably under both Python 2 and Python 3 using the WSGI 1.0 specification, because the specification implicitly posits that CGI and server variable values in the environ and values returned via start_response represent a sequence of bytes that can be addressed using the Python 2 string API. It posits such a thing because that sort of data type was the sensible way to represent bytes in all Python 2 versions, and WSGI 1.0 was conceived before Python 3 existed.
Python 3's str type supports the full API provided by the Python 2 str type, but Python 3's str type does not represent a sequence of bytes, it instead represents text. Therefore, using it to represent environ values also requires that the environ byte sequence be decoded to text via some encoding. We cannot decode these bytes to text (at least in any way where the decoding has any meaning other than as a tunnelling mechanism) without widening the scope of WSGI to include server and gateway knowledge of decoding policies and mechanics. WSGI 1.0 never concerned itself with encoding and decoding. It made statements about allowable transport values, and suggested that various values might be best decoded as one encoding or another, but it never required a server to perform any decoding before passing values to the application.
Python 3 has no stringlike type that can be used instead to represent bytes; it has only the bytes type. In Python 3.1+ a bytes instance operates quite a bit like a Python 2 str, but it lacks behavior equivalent to str.__mod__ and its iteration protocol, and its containment, sequence treatment, and equivalence comparisons differ.
In either case, there is no type in Python 3 that behaves just like the Python 2 str type, and a way to create such a type doesn't exist because there is no such thing as a "String ABC" which would allow a suitable type to be built. Due to this design incompatibility, existing WSGI 1.0 servers, middleware, and applications will not work under Python 3, even after they are run through 2to3.
Existing Web-SIG discussions about updating the WSGI specification so that it is possible to write a WSGI application that runs in both Python 2 and Python 3 tend to revolve around creating a specification-level equivalence between the Python 2 str type (which represents a sequence of bytes) and the Python 3 str type (which represents text). Such an equivalence becomes strained in various areas, given the different roles of these types. An arguably more straightforward equivalence exists between the Python 3 bytes type API and a subset of the Python 2 str type API. This specification exploits this subset equivalence.
In the meantime, aside from any Python 2 vs. Python 3 compatibility issue, as various discussions on Web-SIG have pointed out, the WSGI 1.0 specification is too general, providing support (via .write) for asynchronous applications at the expense of implementation complexity. This specification uses the fundamental incompatibility between WSGI 1.0 and Python 3 as a natural divergence point to create a specification with reduced complexity, by changing how asynchronous applications are supported.
To provide backwards compatibility for older WSGI 1.0 applications, so that they may run on a Web3 stack, it is presumed that Web3 middleware will be created which can be used "in front" of existing WSGI 1.0 applications, allowing those existing WSGI 1.0 applications to run under a Web3 stack. This middleware will require, when under Python 3, an equivalence to be drawn between Python 3 str types and the bytes values represented by the HTTP request and all the attendant encoding-guessing (or configuration) it implies.
Note
Such middleware might in the future, instead of drawing an equivalence between Python 3 str and HTTP byte values, make use of a yet-to-be-created "ebytes" type (aka "bytes-with-benefits"), particularly if a String ABC proposal is accepted into the Python core and implemented.
Conversely, it is presumed that WSGI 1.0 middleware will be created which will allow a Web3 application to run behind a WSGI 1.0 stack on the Python 2 platform.
Environ and Response Values as Bytes
Casual middleware and application writers may consider the use of bytes as environment values and response values inconvenient. In particular, they won't be able to use common string formatting functions such as ('%s' % bytes_val) or bytes_val.format('123') because bytes don't have the same API as strings on platforms such as Python 3 where the two types differ. Likewise, on such platforms, stdlib HTTP-related API support for using bytes interchangeably with text can be spotty. In places where bytes are inconvenient or incompatible with library APIs, middleware and application writers will have to decode such bytes to text explicitly. This is particularly inconvenient for middleware writers: to work with environment values as strings, they'll have to decode them from an implied encoding and if they need to mutate an environ value, they'll then need to encode the value into a byte stream before placing it into the environ. While the use of bytes by the specification as environ values might be inconvenient for casual developers, it provides several benefits.
Using bytes types to represent HTTP and server values to an application most closely matches reality because HTTP is fundamentally a bytes-oriented protocol. If the environ values are mandated to be strings, each server will need to use heuristics to guess about the encoding of various values provided by the HTTP environment. Using all strings might increase casual middleware writer convenience, but will also lead to ambiguity and confusion when a value cannot be decoded to a meaningful non-surrogate string.
Use of bytes as environ values avoids any potential for the need for the specification to mandate that a participating server be informed of encoding configuration parameters. If environ values are treated as strings, and so must be decoded from bytes, configuration parameters may eventually become necessary as policy clues from the application deployer. Such a policy would be used to guess an appropriate decoding strategy in various circumstances, effectively placing the burden for enforcing a particular application encoding policy upon the server. If the server must serve more than one application, such configuration would quickly become complex. Many policies would also be impossible to express declaratively.
In reality, HTTP is a complicated and legacy-fraught protocol which requires a complex set of heuristics to make sense of. It would be nice if we could allow this protocol to protect us from this complexity, but we cannot do so reliably while still providing to application writers a level of control commensurate with reality. Python applications must often deal with data embedded in the environment which not only must be parsed by legacy heuristics, but does not conform even to any existing HTTP specification. While these eventualities are unpleasant, they crop up with regularity, making it impossible and undesirable to hide them from application developers, as application developers are the only people who are able to decide upon an appropriate action when an HTTP specification violation is detected.
Some have argued for mixed use of bytes and string values as environ values. This proposal avoids that strategy. Sole use of bytes as environ values makes it possible to fit this specification entirely in one's head; you won't need to guess about which values are strings and which are bytes.
This protocol would also fit in a developer's head if all environ values were strings, but this specification doesn't use that strategy. This will likely be the point of greatest contention regarding the use of bytes. In defense of bytes: developers often prefer protocols with consistent contracts, even if the contracts themselves are suboptimal. If we hide encoding issues from a developer until a value that contains surrogates causes problems after it has already reached beyond the I/O boundary of their application, they will need to do a lot more work to fix the assumptions made by their application than if we had presented the problem much earlier in terms of "here's some bytes, you decode them". This is also a counter-argument to the "bytes are inconvenient" assumption: while presenting bytes may be inconvenient for a casual application developer who doesn't care about edge cases, they are extremely convenient for the application developer who needs to deal with complex, dirty eventualities, because the use of bytes allows them the appropriate level of control with a clear separation of responsibility.
If the protocol uses bytes, it is presumed that libraries will be created to make working with bytes-only in the environ and within return values more pleasant; for example, analogues of the WSGI 1.0 libraries named "WebOb" and "Werkzeug". Such libraries will fill the gap between convenience and control, allowing the spec to remain simple and regular while still allowing casual authors a convenient way to create Web3 middleware and application components. This seems to be a reasonable alternative to baking encoding policy into the protocol, because many such libraries can be created independently from the protocol, and application developers can choose the one that provides them the appropriate levels of control and convenience for a particular job.
Here are some alternatives to using all bytes:
- Have the server decode all values representing CGI and server environ values into strings using the latin-1 encoding, which is lossless. Smuggle any undecodable bytes within the resulting string.
- Decode all CGI and server environ values to strings using the utf-8 encoding with the surrogateescape error handler. This does not work under any existing Python 2 version, which lacks surrogateescape.
- Encode some values into bytes and other values into strings, as decided by their typical usages.
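The first two alternatives above can be illustrated with a short sketch (Python 3 only; the function names are illustrative, not part of any spec):

```python
def decode_latin1(raw: bytes) -> str:
    # latin-1 maps every byte 0x00-0xFF to a code point, so the
    # round-trip is lossless: undecodable bytes are "smuggled".
    return raw.decode('latin-1')

def decode_utf8_surrogateescape(raw: bytes) -> str:
    # Undecodable bytes become lone surrogates (U+DC80-U+DCFF) and
    # survive a round-trip with the same error handler.
    return raw.decode('utf-8', 'surrogateescape')

raw = b'/caf\xe9'                     # latin-1 encoded path, invalid UTF-8
s1 = decode_latin1(raw)
assert s1.encode('latin-1') == raw    # lossless round-trip

s2 = decode_utf8_surrogateescape(raw)
assert s2.encode('utf-8', 'surrogateescape') == raw
```

Note that the surrogate produced by the second strategy cannot be encoded with a strict error handler, which is exactly the kind of late failure the bytes-only design avoids.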
Applications Should be Allowed to Read web3.input Past CONTENT_LENGTH
At [6], Graham Dumpleton makes the assertion that wsgi.input should be required to return the empty string as a signifier of out-of-data, and that applications should be allowed to read past the number of bytes specified in CONTENT_LENGTH, depending only upon the empty string as an EOF marker. WSGI relies on an application "being well behaved and once all data specified by CONTENT_LENGTH is read, that it processes the data and returns any response. That same socket connection could then be used for a subsequent request." Graham would like WSGI adapters to be required to wrap raw socket connections: "this wrapper object will need to count how much data has been read, and when the amount of data reaches that as defined by CONTENT_LENGTH, any subsequent reads should return an empty string instead." This may be useful to support chunked encoding and input filters.
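The wrapper Graham describes can be sketched as follows (the class name is hypothetical, not part of either specification): it counts bytes read and returns the empty bytes string once CONTENT_LENGTH bytes have been consumed, even if the underlying socket holds data from a subsequent request.

```python
import io

class LengthLimitedInput:
    """Hypothetical input wrapper: enforces CONTENT_LENGTH as EOF."""

    def __init__(self, raw, content_length):
        self._raw = raw
        self._remaining = content_length

    def read(self, size=-1):
        if self._remaining <= 0:
            return b''                      # empty string signals EOF
        if size is None or size < 0 or size > self._remaining:
            size = self._remaining          # clamp reads to CONTENT_LENGTH
        data = self._raw.read(size)
        self._remaining -= len(data)
        return data

body = io.BytesIO(b'hello world...data from the next pipelined request')
inp = LengthLimitedInput(body, 11)
assert inp.read(5) == b'hello'
assert inp.read() == b' world'
assert inp.read() == b''   # EOF even though the stream holds more bytes
```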
web3.input Unknown Length
There's no documented way to indicate that there is content in environ['web3.input'], but the content length is unknown.
read() of web3.input Should Support No-Size Calling Convention
At [6], Graham Dumpleton makes the assertion that the read() method of wsgi.input should be callable without arguments, and that the result should be "all available request content". Needs discussion.
Comment Armin: I changed the spec to require that from an implementation. I had too much pain with that in the past already. Open for discussions though.
Input Filters should set environ CONTENT_LENGTH to -1
At [6], Graham Dumpleton suggests that an input filter might set environ['CONTENT_LENGTH'] to -1 to indicate that it mutated the input.
headers as Literal List of Two-Tuples
Why do we make applications return a headers structure that is a literal list of two-tuples? I think the iterability of headers needs to be maintained while it moves up the stack, but I don't think we need to be able to mutate it in place at all times. Could we loosen that requirement?
Comment Armin: Strong yes
Removed Requirement that Middleware Not Block
This requirement was removed: "middleware components must not block iteration waiting for multiple values from an application iterable. If the middleware needs to accumulate more data from the application before it can produce any output, it must yield an empty string." This requirement existed to support asynchronous applications and servers (see PEP 333's "Middleware Handling of Block Boundaries"). Asynchronous applications are now serviced explicitly by the web3.async-capable protocol (a Web3 application callable may itself return a callable).
web3.script_name and web3.path_info
These values are required to be placed into the environment by an origin server under this specification. Unlike SCRIPT_NAME and PATH_INFO, these must be the original URL-encoded variants derived from the request URI. We probably need to figure out how these should be computed originally, and what their values should be if the server performs URL rewriting.
Long Response Headers
Bob Brewer notes on Web-SIG [7]:
Each header_value must not include any control characters, including carriage returns or linefeeds, either embedded or at the end. (These requirements are to minimize the complexity of any parsing that must be performed by servers, gateways, and intermediate response processors that need to inspect or modify response headers.) [1]
That's understandable, but HTTP headers are defined as (mostly) *TEXT, and "words of *TEXT MAY contain characters from character sets other than ISO-8859-1 only when encoded according to the rules of RFC 2047." [2] And RFC 2047 specifies that "an 'encoded-word' may not be more than 75 characters long... If it is desirable to encode more text than will fit in an 'encoded-word' of 75 characters, multiple 'encoded-word's (separated by CRLF SPACE) may be used." [3] This satisfies HTTP header folding rules, as well: "Header fields can be extended over multiple lines by preceding each extra line with at least one SP or HT." [1]
So in my reading of HTTP, some code somewhere should introduce newlines in longish, encoded response header values. I see three options:
- Keep things as they are and disallow response header values if they contain words over 75 chars that are outside the ISO-8859-1 character set.
- Allow newline characters in WSGI response headers.
- Require/strongly suggest WSGI servers to do the encoding and folding before sending the value over HTTP.
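As an illustration of the third option, the stdlib can already produce RFC 2047 encoded-words folded across lines (this uses email.header, which is not part of WSGI or Web3; it merely shows the encoding/folding a server could perform):

```python
from email.header import Header

value = 'na\u00efve caf\u00e9 r\u00e9sum\u00e9 ' * 8   # long, non-ISO-8859-1 value
folded = Header(value, 'utf-8', maxlinelen=76, header_name='X-Note').encode()

assert '=?utf-8?' in folded                  # RFC 2047 encoded-words
assert len(folded.splitlines()) > 1          # folded over several lines
for line in folded.splitlines()[1:]:
    assert line.startswith(' ')              # continuations begin with SP
```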
Request Trailers and Chunked Transfer Encoding
When using chunked transfer encoding on request content, the RFCs allow there to be request trailers. These are like request headers but come after the final null data chunk. These trailers are only available when the chunked data stream is finite length and when it has all been read in. Neither WSGI nor Web3 currently supports them.
References
| [1] | (1, 2, 3) PEP 333: Python Web Services Gateway Interface (http://www.python.org/dev/peps/pep-0333/) |
| [2] | (1, 2) The Common Gateway Interface Specification, v 1.1, 3rd Draft (http://cgi-spec.golux.com/draft-coar-cgi-v11-03.txt) |
| [3] | "Chunked Transfer Coding" -- HTTP/1.1, section 3.6.1 (http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.6.1) |
| [4] | "End-to-end and Hop-by-hop Headers" -- HTTP/1.1, Section 13.5.1 (http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.5.1) |
| [5] | mod_ssl Reference, "Environment Variables" (http://www.modssl.org/docs/2.8/ssl_reference.html#ToC25) |
| [6] | (1, 2, 3) Details on WSGI 1.0 amendments/clarifications. (http://blog.dscpl.com.au/2009/10/details-on-wsgi-10-amendmentsclarificat.html) |
| [7] | [Web-SIG] WSGI and long response header values http://mail.python.org/pipermail/web-sig/2006-September/002244.html |
Copyright
This document has been placed in the public domain.
pep-0445 Add new APIs to customize Python memory allocators
| PEP: | 445 |
|---|---|
| Title: | Add new APIs to customize Python memory allocators |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Victor Stinner <victor.stinner at gmail.com> |
| BDFL-Delegate: | Antoine Pitrou <solipsis@pitrou.net> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 15-Jun-2013 |
| Python-Version: | 3.4 |
| Resolution: | http://mail.python.org/pipermail/python-dev/2013-July/127222.html |
Contents
- Abstract
- Rationale
- Proposal
- Examples
- Performances
- Rejected Alternatives
- More specific functions to get/set memory allocators
- Make PyMem_Malloc() reuse PyMem_RawMalloc() by default
- Add a new PYDEBUGMALLOC environment variable
- Use macros to get customizable allocators
- Pass the C filename and line number
- GIL-free PyMem_Malloc()
- Don't add PyMem_RawMalloc()
- Use existing debug tools to analyze memory use
- Add a msize() function
- No context argument
- External Libraries
- Memory Allocators
- Links
- Copyright
Abstract
This PEP proposes new Application Programming Interfaces (API) to customize Python memory allocators. The only implementation required to conform to this PEP is CPython, but other implementations may choose to be compatible, or to re-use a similar scheme.
Rationale
Use cases:
- Applications embedding Python which want to isolate Python memory from the memory of the application, or want to use a different memory allocator optimized for its Python usage
- Python running on embedded devices with low memory and slow CPU. A custom memory allocator can be used for efficiency and/or to get access to all the memory of the device.
- Debug tools for memory allocators:
- track the memory usage (find memory leaks)
- get the location of a memory allocation: Python filename and line number, and the size of a memory block
- detect buffer underflow, buffer overflow and misuse of Python allocator APIs (see Redesign Debug Checks on Memory Block Allocators as Hooks)
- force memory allocations to fail to test handling of the MemoryError exception
Proposal
New Functions and Structures
Add a new GIL-free (no need to hold the GIL) memory allocator:
- void* PyMem_RawMalloc(size_t size)
- void* PyMem_RawRealloc(void *ptr, size_t new_size)
- void PyMem_RawFree(void *ptr)
- The newly allocated memory will not have been initialized in any way.
- Requesting zero bytes returns a distinct non-NULL pointer if possible, as if PyMem_Malloc(1) had been called instead.
Add a new PyMemAllocator structure:
typedef struct {
    /* user context passed as the first argument to the 3 functions */
    void *ctx;
    /* allocate a memory block */
    void* (*malloc) (void *ctx, size_t size);
    /* allocate or resize a memory block */
    void* (*realloc) (void *ctx, void *ptr, size_t new_size);
    /* release a memory block */
    void (*free) (void *ctx, void *ptr);
} PyMemAllocator;

Add a new PyMemAllocatorDomain enum to choose the Python allocator domain. Domains:
- PYMEM_DOMAIN_RAW: PyMem_RawMalloc(), PyMem_RawRealloc() and PyMem_RawFree()
- PYMEM_DOMAIN_MEM: PyMem_Malloc(), PyMem_Realloc() and PyMem_Free()
- PYMEM_DOMAIN_OBJ: PyObject_Malloc(), PyObject_Realloc() and PyObject_Free()
Add new functions to get and set memory block allocators:
- void PyMem_GetAllocator(PyMemAllocatorDomain domain, PyMemAllocator *allocator)
- void PyMem_SetAllocator(PyMemAllocatorDomain domain, PyMemAllocator *allocator)
- The new allocator must return a distinct non-NULL pointer when requesting zero bytes
- For the PYMEM_DOMAIN_RAW domain, the allocator must be thread-safe: the GIL is not held when the allocator is called.
Add a new PyObjectArenaAllocator structure:
typedef struct {
    /* user context passed as the first argument to the 2 functions */
    void *ctx;
    /* allocate an arena */
    void* (*alloc) (void *ctx, size_t size);
    /* release an arena */
    void (*free) (void *ctx, void *ptr, size_t size);
} PyObjectArenaAllocator;

Add new functions to get and set the arena allocator used by pymalloc:
- void PyObject_GetArenaAllocator(PyObjectArenaAllocator *allocator)
- void PyObject_SetArenaAllocator(PyObjectArenaAllocator *allocator)
Add a new function to reinstall the debug checks on memory allocators when a memory allocator is replaced with PyMem_SetAllocator():
- void PyMem_SetupDebugHooks(void)
- Install the debug hooks on all memory block allocators. The function can be called more than once; hooks are only installed once.
- The function does nothing if Python is not compiled in debug mode.
Memory block allocators always return NULL if size is greater than PY_SSIZE_T_MAX. The check is done before calling the inner function.
Note
The pymalloc allocator is optimized for objects smaller than 512 bytes with a short lifetime. It uses memory mappings with a fixed size of 256 KB called "arenas".
Here is how the allocators are set up by default:
- PYMEM_DOMAIN_RAW, PYMEM_DOMAIN_MEM: malloc(), realloc() and free(); call malloc(1) when requesting zero bytes
- PYMEM_DOMAIN_OBJ: pymalloc allocator which falls back on PyMem_Malloc() for allocations larger than 512 bytes
- pymalloc arena allocator: VirtualAlloc() and VirtualFree() on Windows, mmap() and munmap() when available, or malloc() and free()
Redesign Debug Checks on Memory Block Allocators as Hooks
Since Python 2.3, Python implements different checks on memory allocators in debug mode:
- Newly allocated memory is filled with the byte 0xCB, freed memory is filled with the byte 0xDB.
- Detect API violations, ex: PyObject_Free() called on a memory block allocated by PyMem_Malloc()
- Detect write before the start of the buffer (buffer underflow)
- Detect write after the end of the buffer (buffer overflow)
In Python 3.3, the checks are installed by replacing PyMem_Malloc(), PyMem_Realloc(), PyMem_Free(), PyObject_Malloc(), PyObject_Realloc() and PyObject_Free() using macros. The new allocator allocates a larger buffer and writes a pattern to detect buffer underflow, buffer overflow and use after free (by filling the buffer with the byte 0xDB). It uses the original PyObject_Malloc() function to allocate memory. So PyMem_Malloc() and PyMem_Realloc() indirectly call PyObject_Malloc() and PyObject_Realloc().
This PEP redesigns the debug checks as hooks on the existing allocators in debug mode. Examples of call traces without the hooks:
- PyMem_RawMalloc() => _PyMem_RawMalloc() => malloc()
- PyMem_Realloc() => _PyMem_RawRealloc() => realloc()
- PyObject_Free() => _PyObject_Free()
Call traces when the hooks are installed (debug mode):
- PyMem_RawMalloc() => _PyMem_DebugMalloc() => _PyMem_RawMalloc() => malloc()
- PyMem_Realloc() => _PyMem_DebugRealloc() => _PyMem_RawRealloc() => realloc()
- PyObject_Free() => _PyMem_DebugFree() => _PyObject_Free()
As a result, PyMem_Malloc() and PyMem_Realloc() now call malloc() and realloc() in both release mode and debug mode, instead of calling PyObject_Malloc() and PyObject_Realloc() in debug mode.
When at least one memory allocator is replaced with PyMem_SetAllocator(), the PyMem_SetupDebugHooks() function must be called to reinstall the debug hooks on top of the new allocator.
Don't call malloc() directly anymore
PyObject_Malloc() falls back on PyMem_Malloc() instead of malloc() if size is greater than or equal to 512 bytes, and PyObject_Realloc() falls back on PyMem_Realloc() instead of realloc().
Direct calls to malloc() are replaced with PyMem_Malloc(), or PyMem_RawMalloc() if the GIL is not held.
External libraries like zlib or OpenSSL can be configured to allocate memory using PyMem_Malloc() or PyMem_RawMalloc(). If the allocator of a library can only be replaced globally (rather than on an object-by-object basis), it shouldn't be replaced when Python is embedded in an application.
For the "track memory usage" use case, it is important to track memory allocated in external libraries to have accurate reports, because these allocations can be large (e.g. they can raise a MemoryError exception) and would otherwise be missed in memory usage reports.
Examples
Use case 1: Replace Memory Allocators, keep pymalloc
Dummy example wasting 2 bytes per memory block, and 10 bytes per pymalloc arena:
#include <stdlib.h>
size_t alloc_padding = 2;
size_t arena_padding = 10;
void* my_malloc(void *ctx, size_t size)
{
int padding = *(int *)ctx;
return malloc(size + padding);
}
void* my_realloc(void *ctx, void *ptr, size_t new_size)
{
int padding = *(int *)ctx;
return realloc(ptr, new_size + padding);
}
void my_free(void *ctx, void *ptr)
{
free(ptr);
}
void* my_alloc_arena(void *ctx, size_t size)
{
int padding = *(int *)ctx;
return malloc(size + padding);
}
void my_free_arena(void *ctx, void *ptr, size_t size)
{
free(ptr);
}
void setup_custom_allocator(void)
{
PyMemAllocator alloc;
PyObjectArenaAllocator arena;
alloc.ctx = &alloc_padding;
alloc.malloc = my_malloc;
alloc.realloc = my_realloc;
alloc.free = my_free;
PyMem_SetAllocator(PYMEM_DOMAIN_RAW, &alloc);
PyMem_SetAllocator(PYMEM_DOMAIN_MEM, &alloc);
/* leave PYMEM_DOMAIN_OBJ unchanged, use pymalloc */
arena.ctx = &arena_padding;
arena.alloc = my_alloc_arena;
arena.free = my_free_arena;
PyObject_SetArenaAllocator(&arena);
PyMem_SetupDebugHooks();
}
Use case 2: Replace Memory Allocators, override pymalloc
If you have a dedicated allocator optimized for allocations of objects smaller than 512 bytes with a short lifetime, pymalloc can be overridden (replace PyObject_Malloc()).
Dummy example wasting 2 bytes per memory block:
#include <stdlib.h>
size_t padding = 2;
void* my_malloc(void *ctx, size_t size)
{
int padding = *(int *)ctx;
return malloc(size + padding);
}
void* my_realloc(void *ctx, void *ptr, size_t new_size)
{
int padding = *(int *)ctx;
return realloc(ptr, new_size + padding);
}
void my_free(void *ctx, void *ptr)
{
free(ptr);
}
void setup_custom_allocator(void)
{
PyMemAllocator alloc;
alloc.ctx = &padding;
alloc.malloc = my_malloc;
alloc.realloc = my_realloc;
alloc.free = my_free;
PyMem_SetAllocator(PYMEM_DOMAIN_RAW, &alloc);
PyMem_SetAllocator(PYMEM_DOMAIN_MEM, &alloc);
PyMem_SetAllocator(PYMEM_DOMAIN_OBJ, &alloc);
PyMem_SetupDebugHooks();
}
The pymalloc arena allocator does not need to be replaced, because it is no longer used by the new allocator.
Use case 3: Setup Hooks On Memory Block Allocators
Example to setup hooks on all memory block allocators:
struct {
PyMemAllocator raw;
PyMemAllocator mem;
PyMemAllocator obj;
/* ... */
} hook;
static void* hook_malloc(void *ctx, size_t size)
{
PyMemAllocator *alloc = (PyMemAllocator *)ctx;
void *ptr;
/* ... */
ptr = alloc->malloc(alloc->ctx, size);
/* ... */
return ptr;
}
static void* hook_realloc(void *ctx, void *ptr, size_t new_size)
{
PyMemAllocator *alloc = (PyMemAllocator *)ctx;
void *ptr2;
/* ... */
ptr2 = alloc->realloc(alloc->ctx, ptr, new_size);
/* ... */
return ptr2;
}
static void hook_free(void *ctx, void *ptr)
{
PyMemAllocator *alloc = (PyMemAllocator *)ctx;
/* ... */
alloc->free(alloc->ctx, ptr);
/* ... */
}
void setup_hooks(void)
{
PyMemAllocator alloc;
static int installed = 0;
if (installed)
return;
installed = 1;
alloc.malloc = hook_malloc;
alloc.realloc = hook_realloc;
alloc.free = hook_free;
PyMem_GetAllocator(PYMEM_DOMAIN_RAW, &hook.raw);
PyMem_GetAllocator(PYMEM_DOMAIN_MEM, &hook.mem);
PyMem_GetAllocator(PYMEM_DOMAIN_OBJ, &hook.obj);
alloc.ctx = &hook.raw;
PyMem_SetAllocator(PYMEM_DOMAIN_RAW, &alloc);
alloc.ctx = &hook.mem;
PyMem_SetAllocator(PYMEM_DOMAIN_MEM, &alloc);
alloc.ctx = &hook.obj;
PyMem_SetAllocator(PYMEM_DOMAIN_OBJ, &alloc);
}
Note
PyMem_SetupDebugHooks() does not need to be called because the memory allocators are not replaced: the debug checks on memory block allocators are installed automatically at startup.
Performances
The implementation of this PEP (issue #3329) has no visible overhead on the Python benchmark suite.
Results of the Python benchmark suite (-b 2n3): some tests are 1.04x faster, some tests are 1.04x slower. Results of the pybench microbenchmark: "+0.1%" slower globally (diff between -4.9% and +5.6%).
The full output of benchmarks is attached to the issue #3329.
Rejected Alternatives
More specific functions to get/set memory allocators
A larger set of C API functions was originally proposed, with one pair of functions for each allocator domain:
- void PyMem_GetRawAllocator(PyMemAllocator *allocator)
- void PyMem_GetAllocator(PyMemAllocator *allocator)
- void PyObject_GetAllocator(PyMemAllocator *allocator)
- void PyMem_SetRawAllocator(PyMemAllocator *allocator)
- void PyMem_SetAllocator(PyMemAllocator *allocator)
- void PyObject_SetAllocator(PyMemAllocator *allocator)
This alternative was rejected because it is not possible to write generic code with more specific functions: code must be duplicated for each memory allocator domain.
Make PyMem_Malloc() reuse PyMem_RawMalloc() by default
If PyMem_Malloc() called PyMem_RawMalloc() by default, calling PyMem_SetAllocator(PYMEM_DOMAIN_RAW, alloc) would also patch PyMem_Malloc() indirectly.
This alternative was rejected because PyMem_SetAllocator() would have a different behaviour depending on the domain. Always having the same behaviour is less error-prone.
Add a new PYDEBUGMALLOC environment variable
It was proposed to add a new PYDEBUGMALLOC environment variable to enable debug checks on memory block allocators. It would have had the same effect as calling PyMem_SetupDebugHooks(), without the need to write any C code. Another advantage is that it would allow enabling debug checks even in release mode: debug checks would always be compiled in, but only enabled when the environment variable is present and non-empty.
This alternative was rejected because a new environment variable would make Python initialization even more complex. PEP 432 tries to simplify the CPython startup sequence.
Use macros to get customizable allocators
To have no overhead in the default configuration, customizable allocators would be an optional feature enabled by a configuration option or by macros.
This alternative was rejected because the use of macros implies having to recompile extension modules to use the new allocator and allocator hooks. Not having to recompile Python or extension modules makes debug hooks easier to use in practice.
Pass the C filename and line number
Define allocator functions as macros using __FILE__ and __LINE__ to get the C filename and line number of a memory allocation.
Example of PyMem_Malloc macro with the modified PyMemAllocator structure:
typedef struct {
/* user context passed as the first argument
to the 3 functions */
void *ctx;
/* allocate a memory block */
void* (*malloc) (void *ctx, const char *filename, int lineno,
size_t size);
/* allocate or resize a memory block */
void* (*realloc) (void *ctx, const char *filename, int lineno,
void *ptr, size_t new_size);
/* release a memory block */
void (*free) (void *ctx, const char *filename, int lineno,
void *ptr);
} PyMemAllocator;
void* _PyMem_MallocTrace(const char *filename, int lineno,
size_t size);
/* the function is still needed for the Python stable ABI */
void* PyMem_Malloc(size_t size);
#define PyMem_Malloc(size) \
_PyMem_MallocTrace(__FILE__, __LINE__, size)
The GC allocator functions would also have to be patched. For example, _PyObject_GC_Malloc() is used in many C functions and so objects of different types would have the same allocation location.
This alternative was rejected because passing a filename and a line number to each allocator makes the API more complex: pass 3 new arguments (ctx, filename, lineno) to each allocator function, instead of just a context argument (ctx). Having to also modify GC allocator functions adds too much complexity for a little gain.
GIL-free PyMem_Malloc()
In Python 3.3, when Python is compiled in debug mode, PyMem_Malloc() indirectly calls PyObject_Malloc() which requires the GIL to be held (it isn't thread-safe). That's why PyMem_Malloc() must be called with the GIL held.
This PEP changes PyMem_Malloc(): it now always calls malloc() rather than PyObject_Malloc(). The "GIL must be held" restriction could therefore be removed from PyMem_Malloc().
This alternative was rejected because allowing PyMem_Malloc() to be called without holding the GIL could break applications which set up their own allocators or allocator hooks. Holding the GIL is convenient when developing a custom allocator: there is no need to care about other threads. It is also convenient for a debug allocator hook: Python objects can be safely inspected, and the C API may be used for reporting.
Moreover, calling PyGILState_Ensure() in a memory allocator has unexpected behaviour, especially at Python startup and when creating a new Python thread state. It is better to relieve custom allocators of the responsibility of acquiring the GIL.
Don't add PyMem_RawMalloc()
Replace malloc() with PyMem_Malloc(), but only if the GIL is held. Otherwise, keep malloc() unchanged.
PyMem_Malloc() is used without the GIL held in some Python functions. For example, the main() and Py_Main() functions call PyMem_Malloc() before the GIL exists. In this case, PyMem_Malloc() would have to be replaced with malloc() (or PyMem_RawMalloc()).
This alternative was rejected because PyMem_RawMalloc() is required for accurate reports of memory usage. When a debug hook is used to track memory usage, the memory allocated by direct calls to malloc() cannot be tracked. PyMem_RawMalloc() can be hooked, so all the memory allocated by Python can be tracked, including memory allocated without holding the GIL.
Use existing debug tools to analyze memory use
There are many existing debug tools to analyze memory use. Some examples: Valgrind, Purify, Clang AddressSanitizer, failmalloc, etc.
The problem is to retrieve the Python object related to a memory pointer in order to read its type and/or its content. Another issue is retrieving the source of the memory allocation: the C backtrace is usually useless (for the same reason as macros using __FILE__ and __LINE__, see Pass the C filename and line number), whereas the Python filename and line number (or even the Python traceback) are more useful.
This alternative was rejected because classic tools are unable to introspect Python internals to collect such information. Being able to set up a hook on allocators called with the GIL held makes it possible to collect a lot of useful data from Python internals.
Add a msize() function
Add another function to PyMemAllocator and PyObjectArenaAllocator structures:
size_t msize(void *ptr);
This function returns the size of a memory block or a memory mapping. It returns (size_t)-1 if the function is not implemented or if the pointer is unknown (e.g. a NULL pointer).
On Windows, this function can be implemented using _msize() and VirtualQuery().
The function can be used to implement a hook tracking the memory usage. The free() method of an allocator only gets the address of a memory block, whereas the size of the memory block is required to update the memory usage.
The additional msize() function was rejected because only a few platforms implement it. For example, Linux with the GNU libc does not provide a function to get the size of a memory block. msize() is not currently used in the Python source code. The function would only be used to track memory use, and would make the API more complex. A debug hook can implement the function internally; there is no need to add it to the PyMemAllocator and PyObjectArenaAllocator structures.
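A tracking hook can indeed recover block sizes without msize(), as the rejection above suggests: store the size in a small header in front of each block. A minimal sketch in plain C (no Python headers; the function names and global counter are illustrative, and alignment of the header is ignored for brevity):

```c
#include <assert.h>
#include <stdlib.h>

static size_t total_allocated = 0;   /* running memory-usage counter */

static void *track_malloc(size_t size)
{
    /* allocate room for a size_t header plus the user data */
    size_t *block = malloc(sizeof(size_t) + size);
    if (block == NULL)
        return NULL;
    block[0] = size;                 /* remember the requested size */
    total_allocated += size;
    return block + 1;                /* user data starts after header */
}

static void track_free(void *ptr)
{
    if (ptr == NULL)
        return;
    size_t *block = (size_t *)ptr - 1;
    total_allocated -= block[0];     /* size recovered from the header */
    free(block);
}
```

In a real debug hook the same header technique would wrap the allocator stored by PyMem_GetAllocator(), as in "Use case 3" above.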
No context argument
Simplify the signature of allocator functions, remove the context argument:
- void* malloc(size_t size)
- void* realloc(void *ptr, size_t new_size)
- void free(void *ptr)
It is likely that an allocator hook will be reused for PyMem_SetAllocator() and PyObject_SetAllocator(), or even PyMem_SetRawAllocator(), but the hook must call a different function depending on the allocator. The context is a convenient way to reuse the same custom allocator or hook for different Python allocators.
In C++, the context can be used to pass this.
External Libraries
Examples of API used to customize memory allocators.
Libraries used by Python:
- OpenSSL: CRYPTO_set_mem_functions() to set memory management functions globally
- expat: parserCreate() has a per-instance memory handler
- zlib: zlib 1.2.8 Manual, pass an opaque pointer
- bz2: bzip2 and libbzip2, version 1.0.5, pass an opaque pointer
- lzma: LZMA SDK - How to Use, pass an opaque pointer
- libmpdec: no opaque pointer (classic malloc API)
Other libraries:
- glib: g_mem_set_vtable()
- libxml2: xmlGcMemSetup(), global
- Oracle's OCI: Oracle Call Interface Programmer's Guide, Release 2 (9.2), pass an opaque pointer
The new ctx parameter of this PEP was inspired by the API of zlib and Oracle's OCI libraries.
See also the GNU libc: Memory Allocation Hooks which uses a different approach to hook memory allocators.
Memory Allocators
The C standard library provides the well known malloc() function. Its implementation depends on the platform and of the C library. The GNU C library uses a modified ptmalloc2, based on "Doug Lea's Malloc" (dlmalloc). FreeBSD uses jemalloc. Google provides tcmalloc which is part of gperftools.
malloc() uses two kinds of memory: heap and memory mappings. Memory mappings are usually used for large allocations (e.g. larger than 256 KB), whereas the heap is used for small allocations.
On UNIX, the heap is handled by the brk() and sbrk() system calls, and it is contiguous. On Windows, the heap is handled by HeapAlloc() and can be discontiguous. Memory mappings are handled by mmap() on UNIX and VirtualAlloc() on Windows; they can be discontiguous.
Releasing a memory mapping immediately gives the memory back to the system. On UNIX, heap memory is only given back to the system if the released block is located at the end of the heap. Otherwise, the memory is only given back to the system once all the memory located after the released block has also been released.
To allocate memory on the heap, an allocator tries to reuse free space. If there is no contiguous space big enough, the heap must be enlarged, even if the total free space is larger than the requested size. This issue is called "memory fragmentation": the memory usage seen by the system is higher than the real usage. On Windows, HeapAlloc() creates a new memory mapping with VirtualAlloc() if there is not enough free contiguous memory.
CPython has a pymalloc allocator for allocations smaller than 512 bytes. This allocator is optimized for small objects with a short lifetime. It uses memory mappings called "arenas" with a fixed size of 256 KB.
Other allocators:
- Windows provides a Low-fragmentation Heap.
- The Linux kernel uses slab allocation.
- The glib library has a Memory Slice API: efficient way to allocate groups of equal-sized chunks of memory
This PEP allows choosing exactly which memory allocator is used for your application, depending on how it uses memory (number of allocations, size of allocations, lifetime of objects, etc.).
Links
CPython issues related to memory allocation:
- Issue #3329: Add new APIs to customize memory allocators
- Issue #13483: Use VirtualAlloc to allocate memory arenas
- Issue #16742: PyOS_Readline drops GIL and calls PyOS_StdioReadline, which isn't thread safe
- Issue #18203: Replace calls to malloc() with PyMem_Malloc() or PyMem_RawMalloc()
- Issue #18227: Use Python memory allocators in external libraries like zlib or OpenSSL
Projects analyzing the memory usage of Python applications:
Copyright
This document has been placed into the public domain.
pep-0446 Make newly created file descriptors non-inheritable
| PEP: | 446 |
|---|---|
| Title: | Make newly created file descriptors non-inheritable |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Victor Stinner <victor.stinner at gmail.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 5-August-2013 |
| Python-Version: | 3.4 |
Contents
- Abstract
- Rationale
- Inheritance of File Descriptors
- Inheritance of File Descriptors on Windows
- Only Inherit Some Handles on Windows
- Inheritance of File Descriptors on UNIX
- Issues with Inheritable File Descriptors
- Security Vulnerability
- Issues fixed in the subprocess module
- Atomic Creation of non-inheritable File Descriptors
- Status of Python 3.3
- Closing All Open File Descriptors
- Proposal
- Backward Compatibility
- Related Work
- Rejected Alternatives
- Python Issues
- Copyright
Abstract
Leaking file descriptors in child processes causes various annoying issues and is a known major security vulnerability. Using the subprocess module with the close_fds parameter set to True is not possible in all cases.
This PEP proposes to make all file descriptors created by Python non-inheritable by default to reduce the risk of these issues. This PEP also fixes a race condition in multi-threaded applications on operating systems supporting atomic flags to create non-inheritable file descriptors.
We are aware of the code breakage this is likely to cause, and doing it anyway for the good of mankind. (Details in the section "Backward Compatibility" below.)
Rationale
Inheritance of File Descriptors
Each operating system handles the inheritance of file descriptors differently. Windows creates non-inheritable handles by default, whereas UNIX and the POSIX API on Windows create inheritable file descriptors by default. Python prefers the POSIX API over the native Windows API, to have a single code base and to use the same type for file descriptors, and so it creates inheritable file descriptors.
There is one exception: os.pipe() creates non-inheritable pipes on Windows, whereas it creates inheritable pipes on UNIX. The reason is an implementation artifact: os.pipe() calls CreatePipe() on Windows (native API), whereas it calls pipe() on UNIX (POSIX API). The call to CreatePipe() was added to Python in 1994, before the introduction of pipe() in the POSIX API in Windows 98. Issue #4708 proposes to change os.pipe() on Windows to create inheritable pipes.
Inheritance of File Descriptors on Windows
On Windows, the native type of file objects is handles (C type HANDLE). These handles have a HANDLE_FLAG_INHERIT flag which defines whether a handle can be inherited by a child process or not. For the POSIX API, the C runtime (CRT) also provides file descriptors (C type int). The handle of a file descriptor can be retrieved using the function _get_osfhandle(fd). A file descriptor can be created from a handle using the function _open_osfhandle(handle).
Using CreateProcess(), handles are only inherited if their inheritable flag (HANDLE_FLAG_INHERIT) is set and the bInheritHandles parameter of CreateProcess() is TRUE; all file descriptors except standard streams (0, 1, 2) are closed in the child process, even if bInheritHandles is TRUE. Using the spawnv() function, all inheritable handles and all inheritable file descriptors are inherited in the child process. This function uses the undocumented fields cbReserved2 and lpReserved2 of the STARTUPINFO structure to pass an array of file descriptors.
To replace standard streams (stdin, stdout, stderr) using CreateProcess(), the STARTF_USESTDHANDLES flag must be set in the dwFlags field of the STARTUPINFO structure and the bInheritHandles parameter of CreateProcess() must be set to TRUE. So when at least one standard stream is replaced, all inheritable handles are inherited by the child process.
The default value of the close_fds parameter of the subprocess module is True (bInheritHandles=FALSE) if the stdin, stdout and stderr parameters are None, and False (bInheritHandles=TRUE) otherwise.
See also:
Only Inherit Some Handles on Windows
Since Windows Vista, CreateProcess() supports an extension of the STARTUPINFO structure: the STARTUPINFOEX structure. Using this new structure, it is possible to specify a list of handles to inherit: PROC_THREAD_ATTRIBUTE_HANDLE_LIST. Read Programmatically controlling which handles are inherited by new processes in Win32 (Raymond Chen, Dec 2011) for more information.
Before Windows Vista, it is possible to make handles inheritable and call CreateProcess() with bInheritHandles=TRUE. This option works if all other handles are non-inheritable. There is a race condition: if another thread calls CreateProcess() with bInheritHandles=TRUE, handles will also be inherited in the second process.
Microsoft suggests using a lock to avoid the race condition: read Q315939: PRB: Child Inherits Unintended Handles During CreateProcess Call (last review: November 2006). The Python issue #16500 "Add an atfork module" proposes to add such a lock, which can be used to make handles non-inheritable without the race condition. Such a lock only protects against a race condition between Python threads; C threads are not protected.
Another option is to duplicate handles that must be inherited, passing the values of the duplicated handles to the child process, so the child process can steal duplicated handles using DuplicateHandle() with DUPLICATE_CLOSE_SOURCE. Handle values change between the parent and the child process because the handles are duplicated (twice); the parent and/or the child process must be adapted to handle this change. If the child program cannot be modified, an intermediate program can be used to steal handles from the parent process before spawning the final child program. The intermediate program has to pass the handle from the child process to the parent process. The parent may have to close duplicated handles if all handles were not stolen, for example if the intermediate process fails. If the command line is used to pass the handle values, the command line must be modified when handles are duplicated, because their values are modified.
This PEP does not include a solution to this problem because there is no perfect solution working on all Windows versions. This point is deferred until use cases relying on handle or file descriptor inheritance on Windows are well known, so we can choose the best solution and carefully test its implementation.
Inheritance of File Descriptors on UNIX
POSIX provides a close-on-exec flag on file descriptors to automatically close a file descriptor when the C function execv() is called. File descriptors with the close-on-exec flag cleared are inherited by the child process; file descriptors with the flag set are closed in the child process.
The flag can be set in two syscalls (one to get current flags, a second to set new flags) using fcntl():
int flags, res;
flags = fcntl(fd, F_GETFD);
if (flags == -1) { /* handle the error */ }
flags |= FD_CLOEXEC;
/* or "flags &= ~FD_CLOEXEC;" to clear the flag */
res = fcntl(fd, F_SETFD, flags);
if (res == -1) { /* handle the error */ }
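The same two-step sequence can be written in Python with the standard fcntl module (a sketch; the helper name set_cloexec is illustrative):

```python
import fcntl
import os

def set_cloexec(fd):
    # Two syscalls, as in the C snippet above: one fcntl() to read the
    # current flags, a second to write them back with FD_CLOEXEC added.
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
    # (use "flags & ~fcntl.FD_CLOEXEC" to clear the flag instead)

fd = os.open(os.devnull, os.O_RDONLY)
set_cloexec(fd)
assert fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC
os.close(fd)
```

The gap between the two calls is exactly the race window discussed later in this PEP: another thread may fork and exec between them.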
FreeBSD, Linux, Mac OS X, NetBSD, OpenBSD and QNX also support setting the flag in a single syscall using ioctl():
int res;
res = ioctl(fd, FIOCLEX, 0);
if (res == -1) { /* handle the error */ }
NOTE: The close-on-exec flag has no effect on fork(): all file descriptors are inherited by the child process. The Python issue #16500 "Add an atfork module" proposes to add a new atfork module to execute code at fork, which may be used to automatically close file descriptors.
Issues with Inheritable File Descriptors
Most of the time, inheritable file descriptors "leaked" to child processes go unnoticed, because they don't cause major bugs. That does not mean these bugs should not be fixed.
Two common issues with inherited file descriptors:
- On Windows, a directory cannot be removed before all file handles open in the directory are closed. The same issue can be seen with files, except if the file was created with the FILE_SHARE_DELETE flag (O_TEMPORARY mode for open()).
- If a listening socket is leaked to a child process, the socket address cannot be reused before both the parent and child processes have terminated. For example, if a web server spawns a new program to handle a request, and the server restarts while the program is still running, the server cannot start because the TCP port is still in use.
Example of issues in open source projects:
- Mozilla (Firefox): open since 2002-05
- dbus library: fixed in 2008-05 (dbus commit), close file descriptors in the child process
- autofs: fixed in 2009-02, set the CLOEXEC flag
- qemu: fixed in 2009-12 (qemu commit), set CLOEXEC flag
- Tor: fixed in 2010-12, set CLOEXEC flag
- OCaml: open since 2011-04, "PR#5256: Processes opened using Unix.open_process* inherit all opened file descriptors (including sockets)"
- ØMQ: open since 2012-08
- Squid: open since 2012-07
See also: Excuse me son, but your code is leaking !!! (Dan Walsh, March 2012) for SELinux issues with leaked file descriptors.
Security Vulnerability
Leaking sensitive file handles and file descriptors can lead to security vulnerabilities. An untrusted child process might read sensitive data like passwords or take control of the parent process through a leaked file descriptor. With a leaked listening socket, a child process can accept new connections to read sensitive data.
Example of vulnerabilities:
- Hijacking Apache https by mod_php (2003)
- Apache: Apr should set FD_CLOEXEC if APR_FOPEN_NOCLEANUP is not set: fixed in 2009
- PHP: system() (and similar) don't cleanup opened handles of Apache: open since 2006
- CWE-403: Exposure of File Descriptor to Unintended Control Sphere (2008)
- OpenSSH Security Advisory: portable-keysign-rand-helper.adv (2011)
Read also the CERT Secure Coding Standards: FIO42-C. Ensure files are properly closed when they are no longer needed.
Issues fixed in the subprocess module
Inherited file descriptors caused 4 issues in the subprocess module:
- Issue #2320: Race condition in subprocess using stdin (opened in 2008)
- Issue #3006: subprocess.Popen causes socket to remain open after close (opened in 2008)
- Issue #7213: subprocess leaks open file descriptors between Popen instances causing hangs (opened in 2009)
- Issue #12786: subprocess wait() hangs when stdin is closed (opened in 2011)
These issues were fixed in Python 3.2 by 4 different changes in the subprocess module:
- Pipes are now non-inheritable;
- The default value of the close_fds parameter is now True, with one exception on Windows: the default value is False if at least one standard stream is replaced;
- A new pass_fds parameter has been added;
- Creation of a _posixsubprocess module implemented in C.
Atomic Creation of non-inheritable File Descriptors
In a multi-threaded application, an inheritable file descriptor may be created just before a new program is spawned, before the file descriptor is made non-inheritable. In this case, the file descriptor is leaked to the child process. This race condition could be avoided if the file descriptor is created directly non-inheritable.
FreeBSD, Linux, Mac OS X, Windows and many other operating systems support creating non-inheritable file descriptors with the inheritable flag cleared atomically at the creation of the file descriptor.
A new WSA_FLAG_NO_HANDLE_INHERIT flag for WSASocket() was added in Windows 7 SP1 and Windows Server 2008 R2 SP1 to create non-inheritable sockets. If this flag is used on an older Windows version (ex: Windows XP SP3), WSASocket() fails with WSAEPROTOTYPE.
On UNIX, new flags were added for files and sockets:
- O_CLOEXEC: available on Linux (2.6.23), FreeBSD (8.3), Mac OS 10.8, OpenBSD 5.0, Solaris 11, QNX, BeOS, next NetBSD release (6.1?). This flag is part of POSIX.1-2008.
- SOCK_CLOEXEC flag for socket() and socketpair(), available on Linux 2.6.27, OpenBSD 5.2, NetBSD 6.0.
- fcntl(): F_DUPFD_CLOEXEC flag, available on Linux 2.6.24, OpenBSD 5.0, FreeBSD 9.1, NetBSD 6.0, Solaris 11. This flag is part of POSIX.1-2008.
- fcntl(): F_DUP2FD_CLOEXEC flag, available on FreeBSD 9.1 and Solaris 11.
- recvmsg(): MSG_CMSG_CLOEXEC, available on Linux 2.6.23, NetBSD 6.0.
On Linux older than 2.6.23, the O_CLOEXEC flag is simply ignored. So fcntl() must be called to check whether the file descriptor is non-inheritable: O_CLOEXEC is not supported if the FD_CLOEXEC flag is missing. On Linux older than 2.6.27, socket() and socketpair() fail with errno set to EINVAL if the SOCK_CLOEXEC flag is set in the socket type.
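The check can be sketched in Python (assuming a UNIX platform with os.O_CLOEXEC, available in Python 3.3+): open with O_CLOEXEC, then verify with fcntl() that the kernel actually honoured the flag.

```python
import fcntl
import os

# Open a file requesting the atomic close-on-exec flag, then verify with
# fcntl() that the kernel applied it: old Linux kernels silently ignore
# unknown open() flags instead of failing.
fd = os.open(os.devnull, os.O_RDONLY | os.O_CLOEXEC)
flags = fcntl.fcntl(fd, fcntl.F_GETFD)
if flags & fcntl.FD_CLOEXEC:
    atomic_cloexec = True    # O_CLOEXEC was applied atomically
else:
    atomic_cloexec = False   # kernel too old: fall back to fcntl(F_SETFD)
os.close(fd)
```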
New functions:
- dup3(): available on Linux 2.6.27 (and glibc 2.9)
- pipe2(): available on Linux 2.6.27 (and glibc 2.9)
- accept4(): available on Linux 2.6.28 (and glibc 2.10)
On Linux older than 2.6.28, accept4() fails with errno set to ENOSYS.
Summary:
| Operating System | Atomic File | Atomic Socket |
|---|---|---|
| FreeBSD | 8.3 (2012) | X |
| Linux | 2.6.23 (2007) | 2.6.27 (2008) |
| Mac OS X | 10.8 (2012) | X |
| NetBSD | 6.1 (?) | 6.0 (2012) |
| OpenBSD | 5.0 (2011) | 5.2 (2012) |
| Solaris | 11 (2011) | X |
| Windows | XP (2001) | Seven SP1 (2011), 2008 R2 SP1 (2011) |
Legend:
- "Atomic File": first version of the operating system supporting creating atomically a non-inheritable file descriptor using open()
- "Atomic Socket": first version of the operating system supporting creating atomically a non-inheritable socket
- "X": not supported yet
See also:
- Secure File Descriptor Handling (Ulrich Drepper, 2008)
- Ghosts of Unix past, part 2: Conflated designs (Neil Brown, 2010) explains the history of O_CLOEXEC and O_NONBLOCK flags
- File descriptor handling changes in 2.6.27
- FreeBSD: atomic close on exec
Status of Python 3.3
Python 3.3 creates inheritable file descriptors on all platforms, except os.pipe() which creates non-inheritable file descriptors on Windows.
New constants and functions related to the atomic creation of non-inheritable file descriptors were added to Python 3.3: os.O_CLOEXEC, os.pipe2() and socket.SOCK_CLOEXEC.
On UNIX, the subprocess module closes all file descriptors in the child process by default, except standard streams (0, 1, 2) and file descriptors of the pass_fds parameter. If the close_fds parameter is set to False, all inheritable file descriptors are inherited in the child process.
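For example (a sketch, UNIX only): a pipe write end listed in pass_fds survives in the child, while the default close_fds=True closes every other descriptor there.

```python
import os
import subprocess
import sys

# Create a pipe and pass only its write end to the child process;
# close_fds=True (the default) closes all other descriptors in the child.
r, w = os.pipe()
child = subprocess.Popen(
    [sys.executable, "-c",
     "import os, sys; os.write(int(sys.argv[1]), b'hi')",
     str(w)],
    pass_fds=(w,))
child.wait()
os.close(w)                      # close the parent's copy so read() sees EOF
assert os.read(r, 16) == b'hi'   # data written through the inherited fd
os.close(r)
```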
On Windows, the subprocess module closes all handles and file descriptors in the child process by default. If at least one standard stream (stdin, stdout or stderr) is replaced (ex: redirected into a pipe), all inheritable handles and file descriptors 0, 1 and 2 are inherited by the child process.
Using the functions of the os.execv*() and os.spawn*() families, all inheritable handles and all inheritable file descriptors are inherited by the child process.
On UNIX, the multiprocessing module uses os.fork() and so all file descriptors are inherited by child processes.
On Windows, all inheritable handles and file descriptors 0, 1 and 2 are inherited by the child process when using the multiprocessing module; all other file descriptors are closed.
Summary:
| Module | FD on UNIX | Handles on Windows | FD on Windows |
|---|---|---|---|
| subprocess, default | STD, pass_fds | none | STD |
| subprocess, replace stdout | STD, pass_fds | all | STD |
| subprocess, close_fds=False | all | all | STD |
| multiprocessing | not applicable | all | STD |
| os.execv(), os.spawn() | all | all | all |
Legend:
- "all": all inheritable file descriptors or handles are inherited in the child process
- "none": all handles are closed in the child process
- "STD": only file descriptors 0 (stdin), 1 (stdout) and 2 (stderr) are inherited in the child process
- "pass_fds": file descriptors of the pass_fds parameter of the subprocess are inherited
- "not applicable": on UNIX, the multiprocessing uses fork(), so this case is not affected by this PEP.
Closing All Open File Descriptors
On UNIX, the subprocess module closes almost all file descriptors in the child process. This operation requires MAXFD system calls, where MAXFD is the maximum number of file descriptors, even if only a few file descriptors are open. This maximum can be read using: os.sysconf("SC_OPEN_MAX").
The operation can be slow if MAXFD is large. For example, on a FreeBSD buildbot with MAXFD=655,000, the operation took 300 ms: see issue #11284: slow close file descriptors.
On Linux, Python 3.3 gets the list of all open file descriptors from /proc/<PID>/fd/, and so performance depends on the number of open file descriptors, not on MAXFD.
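The Linux approach can be sketched in Python (assumes a mounted /proc; the helper name open_fds is illustrative):

```python
import os

def open_fds():
    # List the file descriptors currently open in this process by reading
    # /proc/self/fd/ (Linux only), instead of probing every fd up to MAXFD.
    return sorted(int(name) for name in os.listdir("/proc/self/fd"))

f = open(os.devnull)
assert f.fileno() in open_fds()   # a newly opened fd shows up in the list
f.close()
```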
See also:
- Python issue #1663329: subprocess close_fds perform poor if SC_OPEN_MAX is high
- Squid Bug #837033: Squid should set CLOEXEC on opened FDs. "32k+ close() calls in each child process take a long time ([12-56] seconds) in Xen PV guests."
Proposal
Non-inheritable File Descriptors
The following functions are modified to make newly created file descriptors non-inheritable by default:
- asyncore.dispatcher.create_socket()
- io.FileIO
- io.open()
- open()
- os.dup()
- os.fdopen()
- os.open()
- os.openpty()
- os.pipe()
- select.devpoll()
- select.epoll()
- select.kqueue()
- socket.socket()
- socket.socket.accept()
- socket.socket.dup()
- socket.socket.fromfd()
- socket.socketpair()
os.dup2() still creates inheritable file descriptors by default; see below.
When available, atomic flags are used to make file descriptors non-inheritable. The atomicity is not guaranteed because a fallback is required when atomic flags are not available.
New Functions And Methods
New functions available on all platforms:
- os.get_inheritable(fd: int): return True if the file descriptor can be inherited by child processes, False otherwise.
- os.set_inheritable(fd: int, inheritable: bool): set the inheritable flag of the specified file descriptor.
New functions only available on Windows:
- os.get_handle_inheritable(handle: int): return True if the handle can be inherited by child processes, False otherwise.
- os.set_handle_inheritable(handle: int, inheritable: bool): set the inheritable flag of the specified handle.
New methods:
- socket.socket.get_inheritable(): return True if the socket can be inherited by child processes, False otherwise.
- socket.socket.set_inheritable(inheritable: bool): set the inheritable flag of the specified socket.
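A usage sketch of the new os-level functions, as they ended up in Python 3.4:

```python
import os

# Since Python 3.4, file descriptors created by os.open() are
# non-inheritable by default; the flag can be queried and changed.
fd = os.open(os.devnull, os.O_RDONLY)
assert os.get_inheritable(fd) is False   # non-inheritable by default

os.set_inheritable(fd, True)             # opt in for a child process
assert os.get_inheritable(fd) is True

os.close(fd)
```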
Other Changes
On UNIX, subprocess makes file descriptors of the pass_fds parameter inheritable. The file descriptor is made inheritable in the child process after the fork() and before execv(), so the inheritable flag of file descriptors is unchanged in the parent process.
os.dup2() has a new optional inheritable parameter: os.dup2(fd, fd2, inheritable=True). fd2 is created inheritable by default, but non-inheritable if inheritable is False.
os.dup2() behaves differently than os.dup() because the most common use case of os.dup2() is to replace the file descriptors of the standard streams: stdin (0), stdout (1) and stderr (2). Standard streams are expected to be inherited by child processes.
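The asymmetry between os.dup() and os.dup2() can be sketched as follows (Python 3.4+ behaviour):

```python
import os

fd = os.open(os.devnull, os.O_RDONLY)

# os.dup() follows the new default: the duplicate is non-inheritable.
copy = os.dup(fd)
assert os.get_inheritable(copy) is False

# os.dup2() keeps its target inheritable by default, matching the common
# use case of replacing the standard streams (0, 1, 2).
os.dup2(fd, copy)
assert os.get_inheritable(copy) is True

os.close(fd)
os.close(copy)
```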
Backward Compatibility
This PEP breaks applications relying on the inheritance of file descriptors. Developers are encouraged to use the high-level subprocess module, which handles the inheritance of file descriptors in a portable way.
Applications using the subprocess module with the pass_fds parameter or using only os.dup2() to redirect standard streams should not be affected.
Python no longer conforms to POSIX, since file descriptors are now made non-inheritable by default. Python was not designed to conform to POSIX; it was designed to support developing portable applications.
Rejected Alternatives
Add a new open_noinherit() function
In June 2007, Henning von Bargen proposed on the python-dev mailing list to add a new open_noinherit() function to fix issues of inherited file descriptors in child processes. At that time, the default value of the close_fds parameter of the subprocess module was False.
Read the mail thread: [Python-Dev] Proposal for a new function "open_noinherit" to avoid problems with subprocesses and security risks.
Python Issues
- #10115: Support accept4() for atomic setting of flags at socket creation
- #12105: open() does not able to set flags, such as O_CLOEXEC
- #12107: TCP listening sockets created without FD_CLOEXEC flag
- #16850: Add "e" mode to open(): close-and-exec (O_CLOEXEC) / O_NOINHERIT
- #16860: Use O_CLOEXEC in the tempfile module
- #16946: subprocess: _close_open_fd_range_safe() does not set close-on-exec flag on Linux < 2.6.23 if O_CLOEXEC is defined
- #17070: Use the new cloexec to improve security and avoid bugs
- #18571: Implementation of the PEP 446: non-inheritable file descriptors
Copyright
This document has been placed into the public domain.
pep-0447 Add __getdescriptor__ method to metaclass
| PEP: | 447 |
|---|---|
| Title: | Add __getdescriptor__ method to metaclass |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Ronald Oussoren <ronaldoussoren at mac.com> |
| Status: | Draft |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 12-Jun-2013 |
| Post-History: | 2-Jul-2013, 15-Jul-2013, 29-Jul-2013 |
Abstract
Currently object.__getattribute__ and super.__getattribute__ peek in the __dict__ of classes on the MRO for a class when looking for an attribute. This PEP adds an optional __getdescriptor__ method to a metaclass that can be used to override this behavior.
That is, the MRO walking loop in _PyType_Lookup and super.__getattribute__ gets changed from:
def lookup(mro_list, name):
for cls in mro_list:
if name in cls.__dict__:
return cls.__dict__[name]
return NotFound
to:
def lookup(mro_list, name):
for cls in mro_list:
try:
return cls.__getdescriptor__(name)
except AttributeError:
pass
return NotFound
Rationale
It is currently not possible to influence how the super class [2] looks up attributes (that is, super.__getattribute__ unconditionally peeks in the class __dict__), and that can be problematic for dynamic classes that can grow new methods on demand.
The __getdescriptor__ method makes it possible to dynamically add attributes even when looking them up using the super class [2].
The new method affects object.__getattribute__ (and PyObject_GenericGetAttr [3]) as well for consistency and to have a single place to implement dynamic attribute resolution for classes.
Background
The current behavior of super.__getattribute__ causes problems for classes that are dynamic proxies for other (non-Python) classes or types, an example of which is PyObjC [6]. PyObjC creates a Python class for every class in the Objective-C runtime, and looks up methods in the Objective-C runtime when they are used. This works fine for normal access, but doesn't work for access with super objects. Because of this PyObjC currently includes a custom super that must be used with its classes.
The API in this PEP makes it possible to remove the custom super and simplifies the implementation because the custom lookup behavior can be added in a central location.
The superclass attribute lookup hook
Both super.__getattribute__ and object.__getattribute__ (or PyObject_GenericGetAttr [3] and in particular _PyType_Lookup in C code) walk an object's MRO and currently peek in the class' __dict__ to look up attributes.
With this proposal both lookup methods no longer peek in the class __dict__ but call the special method __getdescriptor__, which is a slot defined on the metaclass. The default implementation of that method looks up the name in the class __dict__, which means that attribute lookup is unchanged unless a metatype actually defines the new special method.
Aside: Attribute resolution algorithm in Python
The attribute resolution process as implemented by object.__getattribute__ (or PyObject_GenericGetAttr in CPython's implementation) is fairly straightforward, but not entirely so without reading C code.
The current CPython implementation of object.__getattribute__ is basically equivalent to the following (pseudo-) Python code (excluding some housekeeping and speed tricks):
def _PyType_Lookup(tp, name):
mro = tp.mro()
assert isinstance(mro, tuple)
for base in mro:
assert isinstance(base, type)
# PEP 447 will change these lines:
try:
return base.__dict__[name]
except KeyError:
pass
return None
class object:
def __getattribute__(self, name):
assert isinstance(name, str)
tp = type(self)
descr = _PyType_Lookup(tp, name)
f = None
if descr is not None:
f = descr.__get__
if f is not None and descr.__set__ is not None:
# Data descriptor
return f(descr, self, type(self))
dict = self.__dict__
if dict is not None:
try:
return self.__dict__[name]
except KeyError:
pass
if f is not None:
# Non-data descriptor
return f(descr, self, type(self))
if descr is not None:
# Regular class attribute
return descr
raise AttributeError(name)
class super:
def __getattribute__(self, name):
assert isinstance(name, unicode)
if name != '__class__':
starttype = self.__self_type__
mro = starttype.mro()
try:
idx = mro.index(self.__thisclass__)
except ValueError:
pass
else:
for base in mro[idx+1:]:
# PEP 447 will change these lines:
try:
descr = base.__dict__[name]
except KeyError:
continue
f = descr.__get__
if f is not None:
return f(descr,
None if (self.__self__ is self.__self_type__) else self.__self__,
starttype)
else:
return descr
return object.__getattribute__(self, name)
This PEP changes the dict lookup at the lines starting with "# PEP 447" into a method call that performs the actual lookup, making it possible to affect that lookup both for normal attribute access and for access through the super proxy [2].
Note that specific classes can already completely override the default behaviour by implementing their own __getattribute__ slot (with or without calling the super class implementation).
In Python code
A meta type can define a method __getdescriptor__ that is called during attribute resolution by both super.__getattribute__ and object.__getattribute__:
class MetaType(type):
def __getdescriptor__(cls, name):
try:
return cls.__dict__[name]
except KeyError:
raise AttributeError(name) from None
The __getdescriptor__ method has as its arguments a class (which is an instance of the meta type) and the name of the attribute that is looked up. It should return the value of the attribute without invoking descriptors, and should raise AttributeError [5] when the name cannot be found.
The type [4] class provides a default implementation for __getdescriptor__, that looks up the name in the class dictionary.
Example usage
The code below implements a silly metaclass that redirects attribute lookup to uppercase versions of names:
class UpperCaseAccess (type):
def __getdescriptor__(cls, name):
try:
return cls.__dict__[name.upper()]
except KeyError:
raise AttributeError(name) from None
class SillyObject (metaclass=UpperCaseAccess):
def m(self):
return 42
def M(self):
return "fourtytwo"
obj = SillyObject()
assert obj.m() == "fourtytwo"
As mentioned earlier in this PEP, a more realistic use case of this functionality is a __getdescriptor__ method that dynamically populates the class __dict__ based on attribute access, primarily when it is not possible to reliably keep the class dict in sync with its source, for example because the source used to populate __dict__ is itself dynamic and has no triggers that can be used to detect changes to that source.
An example of this is the class bridge in PyObjC: the class bridge is a Python object (class) that represents an Objective-C class and conceptually has a Python method for every Objective-C method in the Objective-C class. As in Python, it is possible to add new methods to an Objective-C class, or replace existing ones, and there are no callbacks that can be used to detect this.
In C code
A new slot tp_getdescriptor is added to the PyTypeObject struct; this slot corresponds to the __getdescriptor__ method on type [4].
The slot has the following prototype:
PyObject* (*getdescriptorfunc)(PyTypeObject* cls, PyObject* name);
This method should look up name in the namespace of cls, without looking at superclasses, and should not invoke descriptors. The method returns NULL without setting an exception when the name cannot be found, and returns a new reference otherwise (not a borrowed reference).
Use of this hook by the interpreter
The new method is required for metatypes and as such is defined on type. Both super.__getattribute__ and object.__getattribute__/PyObject_GenericGetAttr [3] (through _PyType_Lookup) use this __getdescriptor__ method when walking the MRO.
Other changes to the implementation
The change for PyObject_GenericGetAttr [3] will be done by changing the private function _PyType_Lookup. This currently returns a borrowed reference, but must return a new reference when the __getdescriptor__ method is present. Because of this, _PyType_Lookup will be renamed to _PyType_LookupName; this will cause compile-time errors for all out-of-tree users of this private API.
The attribute lookup cache in Objects/typeobject.c is disabled for classes that have a metaclass that overrides __getdescriptor__, because using the cache might not be valid for such classes.
Impact of this PEP on introspection
Use of the method introduced in this PEP can affect introspection of classes with a metaclass that uses a custom __getdescriptor__ method. This section lists those changes.
dir might not show all attributes
As with a custom __getattribute__ method, dir() might not see all (instance) attributes when the __getdescriptor__() method is used to dynamically resolve attributes.
The solution is quite simple: classes using __getdescriptor__ should also implement __dir__ if they want full support for the builtin dir() function.
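Since __getdescriptor__ is only proposed by this PEP, the same caveat and its fix can be illustrated today with the existing __getattr__ hook (the class and attribute names below are illustrative):

```python
# Sketch: a class that resolves some attributes dynamically via the
# existing __getattr__ hook; without __dir__, dir() would miss them.
class Dynamic:
    def __getattr__(self, name):
        if name.startswith("dyn_"):
            return 42
        raise AttributeError(name)

    def __dir__(self):
        # advertise the dynamic names so dir() can list them
        return sorted(set(super().__dir__()) | {"dyn_value"})

obj = Dynamic()
assert obj.dyn_value == 42
assert "dyn_value" in dir(obj)
```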
inspect.getattr_static might not show all attributes
The function inspect.getattr_static intentionally does not invoke __getattribute__ and descriptors to avoid invoking user code during introspection with this function. The __getdescriptor__ method will also be ignored and is another way in which the result of inspect.getattr_static can be different from that of builtin.getattr.
inspect.getmembers and inspect.get_class_attrs
Both of these functions directly access the class __dict__ of classes along the MRO, and hence can be affected by a custom __getdescriptor__ method.
TODO: I haven't fully worked out the impact of this, nor whether there are mitigations for those affected, either through updates to these functions or through additional methods that users should implement to be fully compatible with them.
Performance impact
The pybench output below compares an implementation of this PEP with the regular source tree, both based on changeset a5681f50bae2, run on an otherwise idle machine with a Core i7 processor running CentOS 6.4.
Even though the machine was idle, there were clear differences between runs; I've seen the difference in "minimum time" vary from -0.1% to +1.5%, with similar (but slightly smaller) variation in the "average time" difference.
-------------------------------------------------------------------------------
PYBENCH 2.1
-------------------------------------------------------------------------------
* using CPython 3.4.0a0 (default, Jul 29 2013, 13:01:34) [GCC 4.4.7 20120313 (Red Hat 4.4.7-3)]
* disabled garbage collection
* system check interval set to maximum: 2147483647
* using timer: time.perf_counter
* timer: resolution=1e-09, implementation=clock_gettime(CLOCK_MONOTONIC)
-------------------------------------------------------------------------------
Benchmark: pep447.pybench
-------------------------------------------------------------------------------
Rounds: 10
Warp: 10
Timer: time.perf_counter
Machine Details:
Platform ID: Linux-2.6.32-358.114.1.openstack.el6.x86_64-x86_64-with-centos-6.4-Final
Processor: x86_64
Python:
Implementation: CPython
Executable: /tmp/default-pep447/bin/python3
Version: 3.4.0a0
Compiler: GCC 4.4.7 20120313 (Red Hat 4.4.7-3)
Bits: 64bit
Build: Jul 29 2013 14:09:12 (#default)
Unicode: UCS4
-------------------------------------------------------------------------------
Comparing with: default.pybench
-------------------------------------------------------------------------------
Rounds: 10
Warp: 10
Timer: time.perf_counter
Machine Details:
Platform ID: Linux-2.6.32-358.114.1.openstack.el6.x86_64-x86_64-with-centos-6.4-Final
Processor: x86_64
Python:
Implementation: CPython
Executable: /tmp/default/bin/python3
Version: 3.4.0a0
Compiler: GCC 4.4.7 20120313 (Red Hat 4.4.7-3)
Bits: 64bit
Build: Jul 29 2013 13:01:34 (#default)
Unicode: UCS4
Test minimum run-time average run-time
this other diff this other diff
-------------------------------------------------------------------------------
BuiltinFunctionCalls: 45ms 44ms +1.3% 45ms 44ms +1.3%
BuiltinMethodLookup: 26ms 27ms -2.4% 27ms 27ms -2.2%
CompareFloats: 33ms 34ms -0.7% 33ms 34ms -1.1%
CompareFloatsIntegers: 66ms 67ms -0.9% 66ms 67ms -0.8%
CompareIntegers: 51ms 50ms +0.9% 51ms 50ms +0.8%
CompareInternedStrings: 34ms 33ms +0.4% 34ms 34ms -0.4%
CompareLongs: 29ms 29ms -0.1% 29ms 29ms -0.0%
CompareStrings: 43ms 44ms -1.8% 44ms 44ms -1.8%
ComplexPythonFunctionCalls: 44ms 42ms +3.9% 44ms 42ms +4.1%
ConcatStrings: 33ms 33ms -0.4% 33ms 33ms -1.0%
CreateInstances: 47ms 48ms -2.9% 47ms 49ms -3.4%
CreateNewInstances: 35ms 36ms -2.5% 36ms 36ms -2.5%
CreateStringsWithConcat: 69ms 70ms -0.7% 69ms 70ms -0.9%
DictCreation: 52ms 50ms +3.1% 52ms 50ms +3.0%
DictWithFloatKeys: 40ms 44ms -10.1% 43ms 45ms -5.8%
DictWithIntegerKeys: 32ms 36ms -11.2% 35ms 37ms -4.6%
DictWithStringKeys: 29ms 34ms -15.7% 35ms 40ms -11.0%
ForLoops: 30ms 29ms +2.2% 30ms 29ms +2.2%
IfThenElse: 38ms 41ms -6.7% 38ms 41ms -6.9%
ListSlicing: 36ms 36ms -0.7% 36ms 37ms -1.3%
NestedForLoops: 43ms 45ms -3.1% 43ms 45ms -3.2%
NestedListComprehensions: 39ms 40ms -1.7% 39ms 40ms -2.1%
NormalClassAttribute: 86ms 82ms +5.1% 86ms 82ms +5.0%
NormalInstanceAttribute: 42ms 42ms +0.3% 42ms 42ms +0.0%
PythonFunctionCalls: 39ms 38ms +3.5% 39ms 38ms +2.8%
PythonMethodCalls: 51ms 49ms +3.0% 51ms 50ms +2.8%
Recursion: 67ms 68ms -1.4% 67ms 68ms -1.4%
SecondImport: 41ms 36ms +12.5% 41ms 36ms +12.6%
SecondPackageImport: 45ms 40ms +13.1% 45ms 40ms +13.2%
SecondSubmoduleImport: 92ms 95ms -2.4% 95ms 98ms -3.6%
SimpleComplexArithmetic: 28ms 28ms -0.1% 28ms 28ms -0.2%
SimpleDictManipulation: 57ms 57ms -1.0% 57ms 58ms -1.0%
SimpleFloatArithmetic: 29ms 28ms +4.7% 29ms 28ms +4.9%
SimpleIntFloatArithmetic: 37ms 41ms -8.5% 37ms 41ms -8.7%
SimpleIntegerArithmetic: 37ms 41ms -9.4% 37ms 42ms -10.2%
SimpleListComprehensions: 33ms 33ms -1.9% 33ms 34ms -2.9%
SimpleListManipulation: 28ms 30ms -4.3% 29ms 30ms -4.1%
SimpleLongArithmetic: 26ms 26ms +0.5% 26ms 26ms +0.5%
SmallLists: 40ms 40ms +0.1% 40ms 40ms +0.1%
SmallTuples: 46ms 47ms -2.4% 46ms 48ms -3.0%
SpecialClassAttribute: 126ms 120ms +4.7% 126ms 121ms +4.4%
SpecialInstanceAttribute: 42ms 42ms +0.6% 42ms 42ms +0.8%
StringMappings: 94ms 91ms +3.9% 94ms 91ms +3.8%
StringPredicates: 48ms 49ms -1.7% 48ms 49ms -2.1%
StringSlicing: 45ms 45ms +1.4% 46ms 45ms +1.5%
TryExcept: 23ms 22ms +4.9% 23ms 22ms +4.8%
TryFinally: 32ms 32ms -0.1% 32ms 32ms +0.1%
TryRaiseExcept: 17ms 17ms +0.9% 17ms 17ms +0.5%
TupleSlicing: 49ms 48ms +1.1% 49ms 49ms +1.0%
WithFinally: 48ms 47ms +2.3% 48ms 47ms +2.4%
WithRaiseExcept: 45ms 44ms +0.8% 45ms 45ms +0.5%
-------------------------------------------------------------------------------
Totals: 2284ms 2287ms -0.1% 2306ms 2308ms -0.1%
(this=pep447.pybench, other=default.pybench)
A run of the benchmark suite (with option "-b 2n3") also seems to indicate that the performance impact is minimal:
Report on Linux fangorn.local 2.6.32-358.114.1.openstack.el6.x86_64 #1 SMP Wed Jul 3 02:11:25 EDT 2013 x86_64 x86_64
Total CPU cores: 8

### call_method_slots ###
Min: 0.304120 -> 0.282791: 1.08x faster
Avg: 0.304394 -> 0.282906: 1.08x faster
Significant (t=2329.92)
Stddev: 0.00016 -> 0.00004: 4.1814x smaller

### call_simple ###
Min: 0.249268 -> 0.221175: 1.13x faster
Avg: 0.249789 -> 0.221387: 1.13x faster
Significant (t=2770.11)
Stddev: 0.00012 -> 0.00013: 1.1101x larger

### django_v2 ###
Min: 0.632590 -> 0.601519: 1.05x faster
Avg: 0.635085 -> 0.602653: 1.05x faster
Significant (t=321.32)
Stddev: 0.00087 -> 0.00051: 1.6933x smaller

### fannkuch ###
Min: 1.033181 -> 0.999779: 1.03x faster
Avg: 1.036457 -> 1.001840: 1.03x faster
Significant (t=260.31)
Stddev: 0.00113 -> 0.00070: 1.6112x smaller

### go ###
Min: 0.526714 -> 0.544428: 1.03x slower
Avg: 0.529649 -> 0.547626: 1.03x slower
Significant (t=-93.32)
Stddev: 0.00136 -> 0.00136: 1.0028x smaller

### iterative_count ###
Min: 0.109748 -> 0.116513: 1.06x slower
Avg: 0.109816 -> 0.117202: 1.07x slower
Significant (t=-357.08)
Stddev: 0.00008 -> 0.00019: 2.3664x larger

### json_dump_v2 ###
Min: 2.554462 -> 2.609141: 1.02x slower
Avg: 2.564472 -> 2.620013: 1.02x slower
Significant (t=-76.93)
Stddev: 0.00538 -> 0.00481: 1.1194x smaller

### meteor_contest ###
Min: 0.196336 -> 0.191925: 1.02x faster
Avg: 0.196878 -> 0.192698: 1.02x faster
Significant (t=61.86)
Stddev: 0.00053 -> 0.00041: 1.2925x smaller

### nbody ###
Min: 0.228039 -> 0.235551: 1.03x slower
Avg: 0.228857 -> 0.236052: 1.03x slower
Significant (t=-54.15)
Stddev: 0.00130 -> 0.00029: 4.4810x smaller

### pathlib ###
Min: 0.108501 -> 0.105339: 1.03x faster
Avg: 0.109084 -> 0.105619: 1.03x faster
Significant (t=311.08)
Stddev: 0.00022 -> 0.00011: 1.9314x smaller

### regex_effbot ###
Min: 0.057905 -> 0.056447: 1.03x faster
Avg: 0.058055 -> 0.056760: 1.02x faster
Significant (t=79.22)
Stddev: 0.00006 -> 0.00015: 2.7741x larger

### silent_logging ###
Min: 0.070810 -> 0.072436: 1.02x slower
Avg: 0.070899 -> 0.072609: 1.02x slower
Significant (t=-191.59)
Stddev: 0.00004 -> 0.00008: 2.2640x larger

### spectral_norm ###
Min: 0.290255 -> 0.299286: 1.03x slower
Avg: 0.290335 -> 0.299541: 1.03x slower
Significant (t=-572.10)
Stddev: 0.00005 -> 0.00015: 2.8547x larger

### threaded_count ###
Min: 0.107215 -> 0.115206: 1.07x slower
Avg: 0.107488 -> 0.115996: 1.08x slower
Significant (t=-109.39)
Stddev: 0.00016 -> 0.00076: 4.8665x larger

The following not significant results are hidden, use -v to show them:
call_method, call_method_unknown, chaos, fastpickle, fastunpickle, float, formatted_logging, hexiom2, json_load, normal_startup, nqueens, pidigits, raytrace, regex_compile, regex_v8, richards, simple_logging, startup_nosite, telco, unpack_sequence.
Alternative proposals
__getattribute_super__
An earlier version of this PEP used the following static method on classes:
def __getattribute_super__(cls, name, object, owner): pass
This method performed name lookup as well as invoking descriptors and was necessarily limited to working only with super.__getattribute__.
Reuse tp_getattro
It would be nice to avoid adding a new slot, thus keeping the API simpler and easier to understand. A comment on Issue 18181 [1] asked about reusing the tp_getattro slot; that is, super could call the tp_getattro slot of all classes along the MRO.
That won't work because tp_getattro will look in the instance __dict__ before it tries to resolve attributes using classes in the MRO. This means that using tp_getattro instead of peeking in the class dictionaries would change the semantics of the super class [2].
References
- Issue 18181 [1] contains a prototype implementation
| [1] | (1, 2) http://bugs.python.org/issue18181 |
| [2] | (1, 2, 3, 4) http://docs.python.org/3/library/functions.html#super |
| [3] | (1, 2, 3, 4) http://docs.python.org/3/c-api/object.html#PyObject_GenericGetAttr |
| [4] | (1, 2) http://docs.python.org/3/library/functions.html#type |
| [5] | http://docs.python.org/3/library/exceptions.html#AttributeError |
| [6] | http://pyobjc.sourceforge.net/ |
Copyright
This document has been placed in the public domain.
pep-0448 Additional Unpacking Generalizations
| PEP: | 448 |
|---|---|
| Title: | Additional Unpacking Generalizations |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Joshua Landau <joshua at landau.ws> |
| Discussions-To: | python-ideas at python.org |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 29-Jun-2013 |
| Python-Version: | 3.5 |
| Post-History: |
Contents
Abstract
This PEP proposes extended usages of the * iterable unpacking operator and the ** dictionary unpacking operator to allow unpacking in more positions, an arbitrary number of times, and in additional circumstances. Specifically, in function calls, in comprehensions and generator expressions, and in displays.
Function calls are proposed to support an arbitrary number of unpackings rather than just one:
>>> print(*[1], *[2], 3)
1 2 3
>>> dict(**{'x': 1}, y=2, **{'z': 3})
{'x': 1, 'y': 2, 'z': 3}
Unpacking is proposed to be allowed inside tuple, list, set, and dictionary displays:
>>> *range(4), 4
(0, 1, 2, 3, 4)
>>> [*range(4), 4]
[0, 1, 2, 3, 4]
>>> {*range(4), 4}
{0, 1, 2, 3, 4}
>>> {'x': 1, **{'y': 2}}
{'x': 1, 'y': 2}
In dictionaries, later values will always override earlier ones:
>>> {'x': 1, **{'x': 2}}
{'x': 2}
>>> {**{'x': 2}, 'x': 1}
{'x': 1}
This PEP does not include unpacking operators inside list, set and dictionary comprehensions, although this has not been ruled out for future proposals.
Rationale
Current usage of the * iterable unpacking operator features unnecessary restrictions that can harm readability.
Unpacking multiple times has an obvious rationale. When you want to unpack several iterables into a function call or follow an unpack with more positional arguments, the most natural way would be to write:
function(**kw_arguments, **more_arguments)

function(*arguments, argument)
Simple examples where this is useful are print and str.format. Instead, you could be forced to write:
kwargs = dict(kw_arguments)
kwargs.update(more_arguments)
function(**kwargs)

args = list(arguments)
args.append(arg)
function(*args)
or, if you know to do so:
from collections import ChainMap
function(**ChainMap(more_arguments, arguments))

from itertools import chain
function(*chain(args, [arg]))
which adds unnecessary line noise and, with the first approach, duplicates work.
There are two primary rationales for unpacking inside of containers. Firstly there is a symmetry of assignment, where fst, *other, lst = elems and elems = fst, *other, lst are approximate inverses, ignoring the specifics of types. This, in effect, simplifies the language by removing special cases.
Secondly, it vastly simplifies types of "addition" such as combining dictionaries, and does so in an unambiguous and well-defined way:
combination = {**first_dictionary, "x": 1, "y": 2}
instead of:
combination = first_dictionary.copy()
combination.update({"x": 1, "y": 2})
which is especially important in contexts where expressions are preferred. This is also useful as a more readable way of summing iterables into a list, such as my_list + list(my_tuple) + list(my_range) which is now equivalent to just [*my_list, *my_tuple, *my_range].
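A runnable illustration of this equivalence (the variable names are only illustrative):

```python
my_list, my_tuple, my_range = [1, 2], (3, 4), range(5, 7)

# The pre-PEP spelling needs explicit list() conversions:
summed = my_list + list(my_tuple) + list(my_range)

# The PEP 448 spelling unpacks each iterable in place:
unpacked = [*my_list, *my_tuple, *my_range]

assert summed == unpacked == [1, 2, 3, 4, 5, 6]
```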
Specification
Function calls may accept an unbounded number of * and ** unpackings. There will be no restriction of the order of positional arguments with relation to * unpackings nor any restriction of the order of keyword arguments with relation to ** unpackings.
Function calls continue to have the restriction that keyword arguments must follow positional arguments and ** unpackings must additionally follow * unpackings.
Currently, if an argument is given multiple times, such as a positional argument given both positionally and by keyword, a TypeError is raised. This remains true for duplicate arguments provided through multiple ** unpackings, e.g. f(**{'x': 2}, **{'x': 3}), except that the error will be detected at runtime.
A function looks like this:
function(
argument or *args, argument or *args, ...,
kwargument or *args, kwargument or *args, ...,
kwargument or **kwargs, kwargument or **kwargs, ...
)
Tuples, lists, sets and dictionaries will allow unpacking. This will act as if the elements from unpacked items were inserted in order at the site of unpacking, much as happens in unpacking in a function-call. Dictionaries require ** unpacking; all the others require * unpacking.
The keys in a dictionary remain in a right-to-left priority order, so {**{'a': 1}, 'a': 2, **{'a': 3}} evaluates to {'a': 3}. There is no restriction on the number or position of unpackings.
Disadvantages
The allowable orders for arguments in a function call are more complicated than before. The simplest explanation for the rules may be "positional arguments precede keyword arguments and ** unpacking; * unpacking precedes ** unpacking".
Whilst *elements, = iterable causes elements to be a list, elements = *iterable, causes elements to be a tuple. The reason for this may confuse people unfamiliar with the construct.
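The asymmetry can be checked directly:

```python
# A starred name in an assignment *target* always produces a list:
*elements, = range(3)
assert elements == [0, 1, 2]

# A starred expression in a tuple display produces a tuple:
elements = *range(3),
assert elements == (0, 1, 2)
```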
Concerns have been raised about the unexpected difference between duplicate keys in dictionaries being allowed but duplicate keys in function call syntax raising an error. Although this is already the case with current syntax, this proposal might exacerbate the issue. It remains to be seen how much of an issue this is in practice.
Variations
The PEP originally considered whether the ordering of argument types in a function call (positional, keyword, * or **) could become less strict. This met little support so the idea was shelved.
Earlier iterations of this PEP allowed unpacking operators inside list, set, and dictionary comprehensions as a flattening operator over iterables of containers:
>>> ranges = [range(i) for i in range(5)]
>>> [*item for item in ranges]
[0, 0, 1, 0, 1, 2, 0, 1, 2, 3]
>>> {*item for item in ranges}
{0, 1, 2, 3}
This was met with a mix of strong concerns about readability and mild support. In order not to disadvantage the less controversial aspects of the PEP, this was not accepted with the rest of the proposal.
Unbracketed comprehensions in function calls, such as f(x for x in it), are already valid. These could be extended to:
f(*x for x in it) == f((*x for x in it))
f(**x for x in it) == f({**x for x in it})
However, it wasn't clear if this was the best behaviour or if it should unpack into the arguments of the call to f. Since this is likely to be confusing and is of only very marginal utility, it is not included in this PEP. Instead, these will throw a SyntaxError and comprehensions with explicit brackets should be used instead.
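Since the starred comprehension forms raise SyntaxError, flattening has to be spelled explicitly today, for example with itertools.chain (a sketch; ranges is an invented name):

```python
from itertools import chain

ranges = [range(2), range(3)]

# f(*item for item in ranges) is a SyntaxError under this PEP;
# chain.from_iterable flattens the iterables instead:
flat = list(chain.from_iterable(ranges))
assert flat == [0, 1, 0, 1, 2]
```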
Implementation
An implementation for Python 3.5 can be found at Issue 2292 on the bug tracker [2]. It currently includes support for unpacking inside comprehensions, which should be removed.
References
| [1] | PEP accepted, "PEP 448 review", Guido van Rossum (https://mail.python.org/pipermail/python-dev/2015-February/138564.html) |
| [2] | Issue 2292, "Missing *-unpacking generalizations", Thomas Wouters (http://bugs.python.org/issue2292) |
| [3] | Discussion on Python-ideas list, "list / array comprehensions extension", Alexander Heger (http://mail.python.org/pipermail/python-ideas/2011-December/013097.html) |
Copyright
This document has been placed in the public domain.
pep-0449 Removal of the PyPI Mirror Auto Discovery and Naming Scheme
| PEP: | 449 |
|---|---|
| Title: | Removal of the PyPI Mirror Auto Discovery and Naming Scheme |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Donald Stufft <donald at stufft.io> |
| BDFL-Delegate: | Richard Jones <richard@python.org> |
| Discussions-To: | distutils-sig at python.org |
| Status: | Accepted |
| Type: | Process |
| Content-Type: | text/x-rst |
| Created: | 04-Aug-2013 |
| Post-History: | 04-Aug-2013 |
| Replaces: | 381 |
| Resolution: | http://mail.python.org/pipermail/distutils-sig/2013-August/022518.html |
Contents
Abstract
This PEP provides a path to deprecate and ultimately remove the auto discovery of PyPI mirrors as well as the hard coded naming scheme which requires delegating a domain name under pypi.python.org to a third party.
Rationale
The PyPI mirroring infrastructure (defined in PEP 381 [1]) provides a means to mirror the content of PyPI for use by the automatic installers. It also provides a method for auto discovery of mirrors and a consistent naming scheme.
There are a number of problems with the auto discovery protocol and the naming scheme:
- They give control over a *.python.org domain name to a third party, allowing that third party to set or read cookies on the pypi.python.org and python.org domain name.
- The use of a subdomain of pypi.python.org means that the mirror operators will never be able to get an SSL certificate of their own, and giving them one for a python.org domain name is unlikely to happen.
- The auto discovery uses an unauthenticated protocol (DNS).
- The lack of a TLS certificate on these domains means that clients can not be sure that they have not been a victim of DNS poisoning or a MITM attack.
- The auto discovery protocol was designed to enable a client to automatically select a mirror for use. This is no longer a requirement because PyPI is now served through a CDN, a globally distributed network of servers that automatically directs clients to a nearby server without any effort on the client's part.
- The auto discovery protocol and use of the consistent naming scheme has only ever been implemented by one installer (pip), and its implementation, besides being insecure, has serious performance issues and is slated for removal in its next release (1.5).
- While there are provisions in PEP 381 [1] that would solve some of these issues for a dedicated client, they would not solve the issues that affect a user's browser. Additionally, these provisions have not been implemented by any installer to date.
Due to the number of issues, some of them very serious, and because the CDN provides most of the benefit of the auto discovery and consistent naming scheme, this PEP proposes to first deprecate and then remove the [a..z].pypi.python.org names for mirrors and the last.pypi.python.org name for the auto discovery protocol. The ability to mirror and the method of mirroring will not be affected and will continue to exist as written in PEP 381 [1]. Operators of existing mirrors are encouraged to acquire their own domains and certificates to use for their mirrors if they wish to continue hosting them.
Plan for Deprecation & Removal
Immediately upon acceptance of this PEP, documentation on PyPI will be updated to reflect the deprecated nature of the official public mirrors and will direct users to external resources like http://www.pypi-mirrors.org/ to discover unofficial public mirrors if they wish to use one.
Mirror operators, if they wish to continue operating their mirror, should acquire a domain name to represent their mirror and, if they are able, a TLS certificate. Once they have acquired a domain they should redirect their assigned N.pypi.python.org domain name to their new domain. On Feb 15th, 2014 the DNS entries for [a..z].pypi.python.org and last.pypi.python.org will be removed. At any time prior to Feb 15th, 2014 a mirror operator may request that their domain name be reclaimed by PyPI and pointed back at the master.
Why Feb 15th, 2014
The most critical decision of this PEP is the final cut off date. If the date is too soon then it needlessly punishes people by forcing them to drop everything to update their deployment scripts. If the date is too far away then the extended period of time does not help with the migration effort and merely puts off the migration until a later date.
The date of Feb 15th, 2014 has been chosen because it is roughly 6 months from the date of the PEP. This should ensure a lengthy period of time for people to update their deployment procedures to point to the new domain names, without merely padding the cut-off date.
Why the DNS entries must be removed
While it would be possible to simply reclaim the domain names used by mirrors and direct them back at PyPI, in order to spare users from updating configurations that point at those domains, this has a number of issues.
- Anyone who currently has these names hard coded in their configuration has them hard coded as HTTP. This means that by allowing these names to continue resolving, we make it simple for a MITM operator to attack users by rewriting the redirect to HTTPS before it reaches the client.
- The overhead of maintaining several domains pointing at PyPI has proved troublesome for the small number of N.pypi.python.org domains that have already been reclaimed. They oftentimes get misconfigured when things change on the service, which leaves them broken for months at a time until somebody notices. By leaving them in, we leave users of these domains open to random breakages which are less likely to get caught or noticed.
- People using these domains have explicitly chosen to use them for one reason or another. One such reason may be because they do not wish to deploy from a host located in a particular country. If these domains continue to resolve but do not point at their existing locations we have silently removed this choice from the existing users of those domains.
That being said, removing the entries will require users who have modified their configuration either to point back at the master (PyPI) or to select a new mirror to point at. This is regarded as a regrettable but necessary step to protect PyPI itself and the users of the mirrors from the attacks outlined above or, at the very least, to require them to make an informed decision about the insecurity.
Public or Private Mirrors
The mirroring protocol will continue to exist as defined in PEP 381 [1], and people are encouraged to host public and private mirrors if they so desire. The recommended mirroring client is Bandersnatch [2].
References
| [1] | (1, 2, 3, 4) http://www.python.org/dev/peps/pep-0381/ |
| [2] | https://pypi.python.org/pypi/bandersnatch |
Copyright
This document has been placed in the public domain.
pep-0450 Adding A Statistics Module To The Standard Library
| PEP: | 450 |
|---|---|
| Title: | Adding A Statistics Module To The Standard Library |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Steven D'Aprano <steve at pearwood.info> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 01-Aug-2013 |
| Python-Version: | 3.4 |
| Post-History: | 13-Sep-2013 |
Abstract
This PEP proposes the addition of a module for common statistics functions
such as mean, median, variance and standard deviation to the Python
standard library. See also http://bugs.python.org/issue18606
Rationale
The proposed statistics module is motivated by the "batteries included"
philosophy towards the Python standard library. Raymond Hettinger and
other senior developers have requested a quality statistics library that
falls somewhere in between high-end statistics libraries and ad hoc
code.[1] Statistical functions such as mean, standard deviation and others
are obvious and useful batteries, familiar to any Secondary School student.
Even cheap scientific calculators typically include multiple statistical
functions such as:
- mean
- population and sample variance
- population and sample standard deviation
- linear regression
- correlation coefficient
Graphing calculators aimed at Secondary School students typically
include all of the above, plus some or all of:
- median
- mode
- functions for calculating the probability of random variables
from the normal, t, chi-squared, and F distributions
- inference on the mean
and others[2]. Likewise spreadsheet applications such as Microsoft Excel,
LibreOffice and Gnumeric include rich collections of statistical
functions[3].
In contrast, Python currently has no standard way to calculate even the
simplest and most obvious statistical functions such as mean. For those
who need statistical functions in Python, there are two obvious solutions:
- install numpy and/or scipy[4];
- or use a Do It Yourself solution.
Numpy is perhaps the most full-featured solution, but it has a few
disadvantages:
- It may be overkill for many purposes. The documentation for numpy even
warns
"It can be hard to know what functions are available in
numpy. This is not a complete list, but it does cover
most of them."[5]
and then goes on to list over 270 functions, only a small number of
which are related to statistics.
- Numpy is aimed at those doing heavy numerical work, and may be
intimidating to those who don't have a background in computational
mathematics and computer science. For example, numpy.mean takes four
arguments:
mean(a, axis=None, dtype=None, out=None)
although fortunately for the beginner or casual numpy user, three are
optional and numpy.mean does the right thing in simple cases:
>>> numpy.mean([1, 2, 3, 4])
2.5
- For many people, installing numpy may be difficult or impossible. For
example, people in corporate environments may have to go through a
difficult, time-consuming process before being permitted to install
third-party software. For the casual Python user, having to learn about
installing third-party packages in order to average a list of numbers is
unfortunate.
This leads to option number 2, DIY statistics functions. At first glance,
this appears to be an attractive option, due to the apparent simplicity of
common statistical functions. For example:
def mean(data):
    return sum(data)/len(data)

def variance(data):
    # Use the Computational Formula for Variance.
    n = len(data)
    ss = sum(x**2 for x in data) - (sum(data)**2)/n
    return ss/(n-1)

def standard_deviation(data):
    return math.sqrt(variance(data))
The above appears to be correct with a casual test:
>>> data = [1, 2, 4, 5, 8]
>>> variance(data)
7.5
But adding a constant to every data point should not change the variance:
>>> data = [x+1e12 for x in data]
>>> variance(data)
0.0
And variance should *never* be negative:
>>> variance(data*100)
-1239429440.1282566
By contrast, the proposed reference implementation gets the exactly correct
answer 7.5 for the first two examples, and a reasonably close answer for
the third: 6.012. numpy does no better[6].
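The module this PEP proposes (shipped as statistics in Python 3.4) handles the first two cases exactly, which is easy to check:

```python
import statistics

data = [1, 2, 4, 5, 8]
assert statistics.variance(data) == 7.5

# Shifting every point by a large constant must not change the variance;
# the exact internal arithmetic keeps the answer at 7.5:
shifted = [x + 1e12 for x in data]
assert statistics.variance(shifted) == 7.5
```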
Even simple statistical calculations contain traps for the unwary, starting
with the Computational Formula itself. Despite the name, it is numerically
unstable and can be extremely inaccurate, as can be seen above. It is
completely unsuitable for computation by computer[7]. This problem plagues
users of many programming languages, not just Python[8], as coders reinvent
the same numerically inaccurate code over and over again[9], or advise
others to do so[10].
It isn't just the variance and standard deviation. Even the mean is not
quite as straightforward as it might appear. The above implementation
seems too simple to have problems, but it does:
- The built-in sum can lose accuracy when dealing with floats of wildly
differing magnitude. Consequently, the above naive mean fails this
"torture test":
assert mean([1e30, 1, 3, -1e30]) == 1
returning 0 instead of 1, a purely computational error of 100%.
- Using math.fsum inside mean will make it more accurate with float data,
but it also has the side-effect of converting any arguments to float
even when unnecessary. E.g. we should expect the mean of a list of
Fractions to be a Fraction, not a float.
While the above mean implementation does not fail quite as catastrophically
as the naive variance does, a standard library function can do much better
than the DIY versions.
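Both sides of this trade-off can be demonstrated with a hypothetical fsum_mean helper (the name is invented for illustration):

```python
import math
from fractions import Fraction

def fsum_mean(data):
    # math.fsum tracks partial sums without rounding error, so values
    # of wildly differing magnitude no longer cancel incorrectly.
    data = list(data)
    return math.fsum(data) / len(data)

# The torture test now passes:
assert fsum_mean([1e30, 1, 3, -1e30]) == 1.0

# The drawback described above: the result is coerced to float even
# for exact input types such as Fraction.
assert isinstance(fsum_mean([Fraction(1, 3)] * 3), float)
```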
The example above involves an especially bad set of data, but even for
more realistic data sets accuracy is important. The first step in
interpreting variation in data (including dealing with ill-conditioned
data) is often to standardize it to a series with variance 1 (and often
mean 0). This standardization requires accurate computation of the mean
and variance of the raw series. Naive computation of mean and variance
can lose precision very quickly. Because precision bounds accuracy, it is
important to use the most precise algorithms for computing mean and
variance that are practical, or the results of standardization are
themselves useless.
Comparison To Other Languages/Packages
The proposed statistics library is not intended to be a competitor to such
third-party libraries as numpy/scipy, or of proprietary full-featured
statistics packages aimed at professional statisticians such as Minitab,
SAS and Matlab. It is aimed at the level of graphing and scientific
calculators.
Most programming languages have little or no built-in support for
statistics functions. Some exceptions:
R
R (and its proprietary cousin, S) is a programming language designed
for statistics work. It is extremely popular with statisticians and
is extremely feature-rich[11].
C#
The C# LINQ package includes extension methods to calculate the
average of enumerables[12].
Ruby
Ruby does not ship with a standard statistics module, despite some
apparent demand[13]. Statsample appears to be a feature-rich third-
party library, aiming to compete with R[14].
PHP
PHP has an extremely feature-rich (although mostly undocumented) set
of advanced statistical functions[15].
Delphi
Delphi includes standard statistical functions including Mean, Sum,
Variance, TotalVariance, MomentSkewKurtosis in its Math library[16].
GNU Scientific Library
The GNU Scientific Library includes standard statistical functions,
percentiles, median and others[17]. One innovation I have borrowed
from the GSL is to allow the caller to optionally specify the pre-
calculated mean of the sample (or an a priori known population mean)
when calculating the variance and standard deviation[18].
Design Decisions Of The Module
My intention is to start small and grow the library as needed, rather than
try to include everything from the start. Consequently, the current
reference implementation includes only a small number of functions: mean,
variance, standard deviation, median, mode. (See the reference
implementation for a full list.)
I have aimed for the following design features:
- Correctness over speed. It is easier to speed up a correct but slow
function than to correct a fast but buggy one.
- Concentrate on data in sequences, allowing two passes over the data,
rather than potentially compromise on accuracy for the sake of a one-pass
algorithm. Functions expect that data will be passed as a list or other
sequence; if given an iterator, they may internally convert to a list.
- Functions should, as much as possible, honour any type of numeric data.
E.g. the mean of a list of Decimals should be a Decimal, not a float.
When this is not possible, treat float as the "lowest common data type".
- Although functions support data sets of floats, Decimals or Fractions,
there is no guarantee that *mixed* data sets will be supported. (But on
the other hand, they aren't explicitly rejected either.)
- Plenty of documentation, aimed at readers who understand the basic
concepts but may not know (for example) which variance they should use
(population or sample?). Mathematicians and statisticians have a terrible
habit of being inconsistent with both notation and terminology[19], and
having spent many hours making sense of the contradictory/confusing
definitions in use, it is only fair that I do my best to clarify rather
than obfuscate the topic.
- But avoid going into tedious[20] mathematical detail.
API
The initial version of the library will provide univariate (single
variable) statistics functions. The general API will be based on a
functional model ``function(data, ...) -> result``, where ``data``
is a mandatory iterable of (usually) numeric data.
The author expects that lists will be the most common data type used,
but any iterable type should be acceptable. Where necessary, functions
may convert to lists internally. Where possible, functions are
expected to conserve the type of the data values, for example, the mean
of a list of Decimals should be a Decimal rather than float.
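As a quick illustration of the type-conservation rule (using the module as it was eventually released in Python 3.4; the behaviour matches this proposal), exact numeric types pass through unchanged:

```python
from decimal import Decimal
from fractions import Fraction
import statistics

# Exact types are conserved rather than coerced to float:
print(statistics.mean([Decimal("1.1"), Decimal("2.3")]))  # Decimal('1.7')
print(statistics.mean([Fraction(1, 3), Fraction(2, 3)]))  # Fraction(1, 2)
```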
Calculating mean, median and mode
The ``mean``, ``median`` and ``mode`` functions take a single
mandatory argument and return the appropriate statistic, e.g.:
>>> mean([1, 2, 3])
2.0
Functions provided are:
* mean(data) -> arithmetic mean of data.
* median(data) -> median (middle value) of data, taking the
average of the two middle values when there are an even
number of values.
* median_high(data) -> high median of data, taking the
larger of the two middle values when the number of items
is even.
* median_low(data) -> low median of data, taking the smaller
of the two middle values when the number of items is even.
* median_grouped(data, interval=1) -> 50th percentile of
grouped data, using interpolation.
* mode(data) -> most common data point.
``mode`` is the sole exception to the rule that the data argument
must be numeric. It will also accept an iterable of nominal data,
such as strings.
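A short sketch of the proposed API, demonstrating the three median variants on an even-length data set and ``mode`` on nominal data (this matches the module as shipped in Python 3.4):

```python
import statistics

data = [1, 2, 3, 4]                  # even number of values
print(statistics.median(data))       # 2.5 -- average of the two middle values
print(statistics.median_low(data))   # 2
print(statistics.median_high(data))  # 3
print(statistics.mode(["red", "blue", "red"]))  # 'red' -- nominal data is accepted
```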
Calculating variance and standard deviation
In order to be similar to scientific calculators, the statistics
module will include separate functions for population and sample
variance and standard deviation. All four functions have similar
signatures, with a single mandatory argument, an iterable of
numeric data, e.g.:
>>> variance([1, 2, 2, 2, 3])
0.5
All four functions also accept a second, optional, argument, the
mean of the data. This is modelled on a similar API provided by
the GNU Scientific Library[18]. There are three use-cases for
using this argument, in no particular order:
1) The value of the mean is known *a priori*.
2) You have already calculated the mean, and wish to avoid
calculating it again.
3) You wish to (ab)use the variance functions to calculate
the second moment about some given point other than the
mean.
In each case, it is the caller's responsibility to ensure that the
given argument is meaningful.
Functions provided are:
* variance(data, xbar=None) -> sample variance of data,
optionally using xbar as the sample mean.
* stdev(data, xbar=None) -> sample standard deviation of
data, optionally using xbar as the sample mean.
* pvariance(data, mu=None) -> population variance of data,
optionally using mu as the population mean.
* pstdev(data, mu=None) -> population standard deviation of
data, optionally using mu as the population mean.
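A sketch of the sample/population distinction and the optional pre-computed mean, using the module as released in Python 3.4:

```python
import statistics

data = [1, 2, 2, 2, 3]
xbar = statistics.mean(data)            # 2.0

print(statistics.variance(data))        # 0.5 -- sample variance (divides by n-1)
print(statistics.variance(data, xbar))  # 0.5 -- same result, mean supplied
print(statistics.pvariance(data))       # 0.4 -- population variance (divides by n)
```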
Other functions
There is one other public function:
* sum(data, start=0) -> high-precision sum of numeric data.
Specification
As the proposed reference implementation is in pure Python,
other Python implementations can easily make use of the module
unchanged, or adapt it as they see fit.
What Should Be The Name Of The Module?
This will be a top-level module "statistics".
There was some interest in turning math into a package, and making this a
sub-module of math, but the general consensus eventually agreed on a
top-level module. Other potential but rejected names included "stats" (too
much risk of confusion with existing "stat" module), and "statslib"
(described as "too C-like").
Discussion And Resolved Issues
This proposal has been previously discussed here[21].
A number of design issues were resolved during the discussion on
Python-Ideas and the initial code review. There was a lot of concern
about the addition of yet another ``sum`` function to the standard
library, see the FAQs below for more details. In addition, the
initial implementation of ``sum`` suffered from some rounding issues
and other design problems when dealing with Decimals. Oscar
Benjamin's assistance in resolving this was invaluable.
Another issue was the handling of data in the form of iterators. The
first implementation of variance silently swapped between a one- and
two-pass algorithm, depending on whether the data was in the form of
an iterator or sequence. This proved to be a design mistake, as the
calculated variance could differ slightly depending on the algorithm
used, and ``variance`` etc. were changed to internally generate a list
and always use the more accurate two-pass implementation.
One controversial design involved the functions to calculate median,
which were implemented as attributes on the ``median`` callable, e.g.
``median``, ``median.low``, ``median.high`` etc. Although there is
at least one existing use of this style in the standard library, in
``unittest.mock``, the code reviewers felt that this was too unusual
for the standard library. Consequently, the design has been changed
to a more traditional design of separate functions with a pseudo-
namespace naming convention, ``median_low``, ``median_high``, etc.
Another issue that was of concern to code reviewers was the existence
of a function calculating the sample mode of continuous data, with
some people questioning the choice of algorithm, and whether it was
a sufficiently common need to be included. So it was dropped from
the API, and ``mode`` now implements only the basic schoolbook
algorithm based on counting unique values.
Another significant point of discussion was calculating statistics of
timedelta objects. Although the statistics module will not directly
support timedelta objects, it is possible to support this use-case by
converting them to numbers first using the ``timedelta.total_seconds``
method.
Frequently Asked Questions
Q: Shouldn't this module spend time on PyPI before being considered for
the standard library?
A: Older versions of this module have been available on PyPI[22] since
2010. Being much simpler than numpy, it does not require many years of
external development.
Q: Does the standard library really need yet another version of ``sum``?
A: This proved to be the most controversial part of the reference
implementation. In one sense, clearly three sums is two too many. But
in another sense, yes. The reasons why the two existing versions are
unsuitable are described here[23] but the short summary is:
- the built-in sum can lose precision with floats;
- the built-in sum accepts any non-numeric data type that supports
the + operator, apart from strings and bytes;
- math.fsum is high-precision, but coerces all arguments to float.
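Both shortcomings are easy to demonstrate with the standard library as it stands:

```python
import math
from decimal import Decimal

print(sum([0.1] * 10))        # 0.9999999999999999 -- built-in sum loses precision
print(math.fsum([0.1] * 10))  # 1.0 -- high precision...
print(math.fsum([Decimal("0.1")] * 10))  # 1.0 -- ...but the Decimals are coerced to float
```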
There was some interest in "fixing" one or the other of the existing
sums. If this occurs before 3.4 feature-freeze, the decision to keep
statistics.sum can be re-considered.
Q: Will this module be backported to older versions of Python?
A: The module currently targets 3.3, and I will make it available on PyPI
for 3.3 for the foreseeable future. Backporting to older versions of
the 3.x series is likely (but not yet decided). Backporting to 2.7 is
less likely but not ruled out.
Q: Is this supposed to replace numpy?
A: No. While it is likely to grow over the years (see open issues below)
it is not intended to replace, or even compete directly with, numpy. Numpy
is a full-featured numeric library aimed at professionals, the nuclear
reactor of numeric libraries in the Python ecosystem. This is just a
battery, as in "batteries included", and is aimed at an intermediate
level somewhere between "use numpy" and "roll your own version".
Future Work
- At this stage, I am unsure of the best API for multivariate statistical
functions such as linear regression, correlation coefficient, and
covariance. Possible APIs include:
* Separate arguments for x and y data:
function([x0, x1, ...], [y0, y1, ...])
* A single argument for (x, y) data:
function([(x0, y0), (x1, y1), ...])
This API is preferred by GvR[24].
* Selecting arbitrary columns from a 2D array:
function([[a0, x0, y0, z0], [a1, x1, y1, z1], ...], x=1, y=2)
* Some combination of the above.
In the absence of a consensus on the preferred API for multivariate stats,
I will defer including such multivariate functions until Python 3.5.
- Likewise, functions for calculating probability of random variables and
inference testing (e.g. Student's t-test) will be deferred until 3.5.
- There is considerable interest in including one-pass functions that can
calculate multiple statistics from data in iterator form, without having
to convert to a list. The experimental "stats" package on PyPI includes
co-routine versions of statistics functions. Including these will be
deferred to 3.5.
References
[1] http://mail.python.org/pipermail/python-dev/2010-October/104721.html
[2] http://support.casio.com/pdf/004/CP330PLUSver310_Soft_E.pdf
[3] Gnumeric:
https://projects.gnome.org/gnumeric/functions.shtml
LibreOffice:
https://help.libreoffice.org/Calc/Statistical_Functions_Part_One
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Two
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Three
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Four
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Five
[4] Scipy: http://scipy-central.org/
Numpy: http://www.numpy.org/
[5] http://wiki.scipy.org/Numpy_Functions_by_Category
[6] Tested with numpy 1.6.1 and Python 2.7.
[7] http://www.johndcook.com/blog/2008/09/26/comparing-three-methods-of-computing-standard-deviation/
[8] http://rosettacode.org/wiki/Standard_deviation
[9] https://bitbucket.org/larsyencken/simplestats/src/c42e048a6625/src/basic.py
[10] http://stackoverflow.com/questions/2341340/calculate-mean-and-variance-with-one-iteration
[11] http://www.r-project.org/
[12] http://msdn.microsoft.com/en-us/library/system.linq.enumerable.average.aspx
[13] https://www.bcg.wisc.edu/webteam/support/ruby/standard_deviation
[14] http://ruby-statsample.rubyforge.org/
[15] http://www.php.net/manual/en/ref.stats.php
[16] http://www.ayton.id.au/gary/it/Delphi/D_maths.htm#Delphi%20Statistical%20functions.
[17] http://www.gnu.org/software/gsl/manual/html_node/Statistics.html
[18] http://www.gnu.org/software/gsl/manual/html_node/Mean-and-standard-deviation-and-variance.html
[19] http://mathworld.wolfram.com/Skewness.html
[20] At least, tedious to those who don't like this sort of thing.
[21] http://mail.python.org/pipermail/python-ideas/2011-September/011524.html
[22] https://pypi.python.org/pypi/stats/
[23] http://mail.python.org/pipermail/python-ideas/2013-August/022630.html
[24] https://mail.python.org/pipermail/python-dev/2013-September/128429.html
Copyright
This document has been placed in the public domain.
pep-0451 A ModuleSpec Type for the Import System
| PEP: | 451 |
|---|---|
| Title: | A ModuleSpec Type for the Import System |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Eric Snow <ericsnowcurrently at gmail.com> |
| BDFL-Delegate: | Brett Cannon <brett@python.org>, Nick Coghlan <ncoghlan@gmail.com> |
| Discussions-To: | import-sig at python.org |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 8-Aug-2013 |
| Python-Version: | 3.4 |
| Post-History: | 8-Aug-2013, 28-Aug-2013, 18-Sep-2013, 24-Sep-2013, 4-Oct-2013 |
| Resolution: | https://mail.python.org/pipermail/python-dev/2013-November/130104.html |
Contents
Abstract
This PEP proposes to add a new class to importlib.machinery called "ModuleSpec". It will provide all the import-related information used to load a module and will be available without needing to load the module first. Finders will directly provide a module's spec instead of a loader (which they will continue to provide indirectly). The import machinery will be adjusted to take advantage of module specs, including using them to load modules.
Terms and Concepts
The changes in this proposal are an opportunity to make several existing terms and concepts more clear, whereas currently they are (unfortunately) ambiguous. New concepts are also introduced in this proposal. Finally, it's worth explaining a few other existing terms with which people may not be so familiar. For the sake of context, here is a brief summary of all three groups of terms and concepts. A more detailed explanation of the import system is found at [2].
name
In this proposal, a module's "name" refers to its fully-qualified name, meaning the fully-qualified name of the module's parent (if any) joined to the simple name of the module by a period.
finder
A "finder" is an object that identifies the loader that the import system should use to load a module. Currently this is accomplished by calling the finder's find_module() method, which returns the loader.
Finders are strictly responsible for providing the loader, which they do through their find_module() method. The import system then uses that loader to load the module.
loader
A "loader" is an object that is used to load a module during import. Currently this is done by calling the loader's load_module() method. A loader may also provide APIs for getting information about the modules it can load, as well as about data from sources associated with such a module.
Right now loaders (via load_module()) are responsible for certain boilerplate, import-related operations. These are:
- Perform some (module-related) validation
- Create the module object
- Set import-related attributes on the module
- "Register" the module to sys.modules
- Exec the module
- Clean up in the event of failure while loading the module
This all takes place during the import system's call to Loader.load_module().
origin
This is a new term and concept. The idea of it exists subtly in the import system already, but this proposal makes the concept explicit.
"origin" in an import context means the system (or resource within a system) from which a module originates. For the purposes of this proposal, "origin" is also a string which identifies such a resource or system. "origin" is applicable to all modules.
For example, the origin for built-in and frozen modules is the interpreter itself. The import system already identifies this origin as "built-in" and "frozen", respectively. This is demonstrated in the following module repr: "<module 'sys' (built-in)>".
In fact, the module repr is already a relatively reliable, though implicit, indicator of a module's origin. Other modules also indicate their origin through other means, as described in the entry for "location".
It is up to the loader to decide on how to interpret and use a module's origin, if at all.
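The module repr mentioned above can be checked interactively:

```python
import sys

# The repr of a built-in module already carries its origin:
print(repr(sys))  # <module 'sys' (built-in)>
```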
location
This is a new term. However the concept already exists clearly in the import system, as associated with the __file__ and __path__ attributes of modules, as well as the name/term "path" elsewhere.
A "location" is a resource or "place", rather than a system at large, from which a module is loaded. It qualifies as an "origin". Examples of locations include filesystem paths and URLs. A location is identified by the name of the resource, but may not necessarily identify the system to which the resource pertains. In such cases the loader would have to identify the system itself.
In contrast to other kinds of module origin, a location cannot be inferred by the loader from the module name alone. Instead, the loader must be provided with a string to identify the location, usually by the finder that generates the loader. The loader then uses this information to locate the resource from which it will load the module. In theory you could load the module at a given location under various names.
The most common example of locations in the import system are the files from which source and extension modules are loaded. For these modules the location is identified by the string in the __file__ attribute. Although __file__ isn't particularly accurate for some modules (e.g. zipped), it is currently the only way that the import system indicates that a module has a location.
A module that has a location may be called "locatable".
cache
The import system stores compiled modules in the __pycache__ directory as an optimization. This module cache that we use today was provided by PEP 3147. For this proposal, the relevant API for module caching is the __cache__ attribute of modules and the cache_from_source() function in importlib.util. Loaders are responsible for putting modules into the cache (and loading out of the cache). Currently the cache is only used for compiled source modules. However, loaders may take advantage of the module cache for other kinds of modules.
package
The concept does not change, nor does the term. However, the distinction between modules and packages is mostly superficial. Packages are modules. They simply have a __path__ attribute and import may add attributes bound to submodules. The typically perceived difference is a source of confusion. This proposal explicitly de-emphasizes the distinction between packages and modules where it makes sense to do so.
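The "packages are modules" point is easy to verify directly:

```python
import json   # a package (has __path__)
import math   # a plain (built-in) module

# Same type; the only structural difference is the extra attribute:
print(type(json) is type(math))    # True
print(hasattr(json, "__path__"))   # True -- package
print(hasattr(math, "__path__"))   # False -- plain module
```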
Motivation
The import system has evolved over the lifetime of Python. In late 2002 PEP 302 introduced standardized import hooks via finders and loaders and sys.meta_path. The importlib module, introduced with Python 3.1, now exposes a pure Python implementation of the APIs described by PEP 302, as well as of the full import system. It is now much easier to understand and extend the import system. While a benefit to the Python community, this greater accessibility also presents a challenge.
As more developers come to understand and customize the import system, any weaknesses in the finder and loader APIs will be more impactful. So the sooner we can address any such weaknesses in the import system, the better...and there are a couple we hope to take care of with this proposal.
Firstly, any time the import system needs to save information about a module we end up with more attributes on module objects that are generally only meaningful to the import system. It would be nice to have a per-module namespace in which to put future import-related information and to pass around within the import system. Secondly, there's an API void between finders and loaders that causes undue complexity when encountered. The PEP 420 (namespace packages) implementation had to work around this. The complexity surfaced again during recent efforts on a separate proposal. [1]
The finder and loader sections above detail the current responsibilities of both. Notably, loaders are not required to provide any of the functionality of their load_module() method through other methods. Thus, though the import-related information about a module is likely available without loading the module, it is not otherwise exposed.
Furthermore, the requirements associated with load_module() are common to all loaders and mostly are implemented in exactly the same way. This means every loader has to duplicate the same boilerplate code. importlib.util provides some tools that help with this, but it would be more helpful if the import system simply took charge of these responsibilities. The trouble is that this would limit the degree of customization that load_module() could easily continue to facilitate.
More importantly, while a finder could provide the information that the loader's load_module() would need, it currently has no consistent way to get it to the loader. This is a gap between finders and loaders which this proposal aims to fill.
Finally, when the import system calls a finder's find_module(), the finder makes use of a variety of information about the module that is useful outside the context of the method. Currently the options are limited for persisting that per-module information past the method call, since it only returns the loader. Popular workarounds for this limitation are to store the information in a module-to-info mapping somewhere on the finder itself, or to store it on the loader.
Unfortunately, loaders are not required to be module-specific. On top of that, some of the useful information finders could provide is common to all finders, so ideally the import system could take care of those details. This is the same gap as before between finders and loaders.
As an example of complexity attributable to this flaw, the implementation of namespace packages in Python 3.3 (see PEP 420) added FileFinder.find_loader() because there was no good way for find_module() to provide the namespace search locations.
The answer to this gap is a ModuleSpec object that contains the per-module information and takes care of the boilerplate functionality involved with loading the module.
Specification
The goal is to address the gap between finders and loaders while changing as little of their semantics as possible. Though some functionality and information is moved to the new ModuleSpec type, their behavior should remain the same. However, for the sake of clarity the finder and loader semantics will be explicitly identified.
Here is a high-level summary of the changes described by this PEP. More detail is available in later sections.
importlib.machinery.ModuleSpec (new)
An encapsulation of a module's import-system-related state during import. See the ModuleSpec section below for a more detailed description.
- ModuleSpec(name, loader, *, origin=None, loader_state=None, is_package=None)
Attributes:
- name - a string for the fully-qualified name of the module.
- loader - the loader to use for loading.
- origin - the name of the place from which the module is loaded, e.g. "built-in" for built-in modules and the filename for modules loaded from source.
- submodule_search_locations - list of strings for where to find submodules, if a package (None otherwise).
- loader_state - a container of extra module-specific data for use during loading.
- cached (property) - a string for where the compiled module should be stored.
- parent (RO-property) - the fully-qualified name of the package to which the module belongs as a submodule (or None).
- has_location (RO-property) - a flag indicating whether or not the module's "origin" attribute refers to a location.
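A minimal sketch of constructing a spec directly (the module name "mypkg.mymod" is invented for illustration; finders would normally use the factory functions described below):

```python
from importlib.machinery import ModuleSpec

# "mypkg.mymod" is a made-up name; loader=None keeps the sketch self-contained.
spec = ModuleSpec("mypkg.mymod", None, origin="built-in", is_package=False)
print(spec.name)          # 'mypkg.mymod'
print(spec.parent)        # 'mypkg' -- derived read-only property
print(spec.has_location)  # False -- "built-in" is an origin, not a location
```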
importlib.util Additions
These are ModuleSpec factory functions, meant as a convenience for finders. See the Factory Functions section below for more detail.
- spec_from_file_location(name, location, *, loader=None, submodule_search_locations=None) - build a spec from file-oriented information and loader APIs.
- spec_from_loader(name, loader, *, origin=None, is_package=None) - build a spec with missing information filled in by using loader APIs.
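For example, spec_from_file_location() fills in a file-oriented spec from just a name and a path (the path here is hypothetical; creating a spec does not touch the file):

```python
import importlib.util

# "/tmp/mymod.py" is a hypothetical path for illustration.
spec = importlib.util.spec_from_file_location("mymod", "/tmp/mymod.py")
print(spec.name)          # 'mymod'
print(spec.origin)        # '/tmp/mymod.py'
print(spec.has_location)  # True
print(spec.loader)        # a source-file loader, inferred from the .py suffix
```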
Other API Additions
- importlib.find_spec(name, path=None, target=None) will work exactly the same as importlib.find_loader() (which it replaces), but return a spec instead of a loader.
For finders:
- importlib.abc.MetaPathFinder.find_spec(name, path, target) and importlib.abc.PathEntryFinder.find_spec(name, target) will return a module spec to use during import.
For loaders:
- importlib.abc.Loader.exec_module(module) will execute a module in its own namespace. It replaces importlib.abc.Loader.load_module(), taking over its module execution functionality.
- importlib.abc.Loader.create_module(spec) (optional) will return the module to use for loading.
For modules:
- Module objects will have a new attribute: __spec__.
API Changes
- InspectLoader.is_package() will become optional.
Deprecations
- importlib.abc.MetaPathFinder.find_module()
- importlib.abc.PathEntryFinder.find_module()
- importlib.abc.PathEntryFinder.find_loader()
- importlib.abc.Loader.load_module()
- importlib.abc.Loader.module_repr()
- importlib.util.set_package()
- importlib.util.set_loader()
- importlib.find_loader()
Removals
These were introduced prior to Python 3.4's release, so they can simply be removed.
- importlib.abc.Loader.init_module_attrs()
- importlib.util.module_to_load()
Other Changes
- The import system implementation in importlib will be changed to make use of ModuleSpec.
- importlib.reload() will make use of ModuleSpec.
- A module's import-related attributes (other than __spec__) will no longer be used directly by the import system during that module's import. However, this does not impact use of those attributes (e.g. __path__) when loading other modules (e.g. submodules).
- Import-related attributes should no longer be added to modules directly, except by the import system.
- The module type's __repr__() will be a thin wrapper around a pure Python implementation which will leverage ModuleSpec.
- The spec for the __main__ module will reflect the appropriate name and origin.
Backward-Compatibility
- If a finder does not define find_spec(), a spec is derived from the loader returned by find_module().
- PathEntryFinder.find_loader() still takes priority over find_module().
- Loader.load_module() is used if exec_module() is not defined.
What Will Not Change?
- The syntax and semantics of the import statement.
- Existing finders and loaders will continue to work normally.
- The import-related module attributes will still be initialized with the same information.
- Finders will still create loaders (now storing them in specs).
- Loader.load_module(), if a module defines it, will have all the same requirements and may still be called directly.
- Loaders will still be responsible for module data APIs.
- importlib.reload() will still overwrite the import-related attributes.
Responsibilities
Here's a quick breakdown of where responsibilities lie after this PEP.
finders:
- create/identify a loader that can load the module.
- create the spec for the module.
loaders:
- create the module (optional).
- execute the module.
ModuleSpec:
- orchestrate module loading
- boilerplate for module loading, including managing sys.modules and setting import-related attributes
- create module if loader doesn't
- call loader.exec_module(), passing in the module in which to exec
- contain all the information the loader needs to exec the module
- provide the repr for modules
What Will Existing Finders and Loaders Have to Do Differently?
Immediately? Nothing. The status quo will be deprecated, but will continue working. However, here are the things that the authors of finders and loaders should change relative to this PEP:
- Implement find_spec() on finders.
- Implement exec_module() on loaders, if possible.
The ModuleSpec factory functions in importlib.util are intended to be helpful for converting existing finders. spec_from_loader() and spec_from_file_location() are both straight-forward utilities in this regard.
For existing loaders, exec_module() should be a relatively direct conversion from the non-boilerplate portion of load_module(). In some uncommon cases the loader should also implement create_module().
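As a sketch of that conversion, here is a hypothetical loader (the StringLoader name and in-memory source are invented for illustration) whose exec_module() contains only the non-boilerplate work, with the surrounding machinery driven by hand:

```python
import importlib.util
import types

class StringLoader:
    """Hypothetical loader that executes a module from an in-memory string."""
    def __init__(self, source):
        self.source = source

    def exec_module(self, module):
        # Only the non-boilerplate part of the old load_module() remains:
        exec(self.source, module.__dict__)

spec = importlib.util.spec_from_loader("demo", StringLoader("answer = 42"))
module = types.ModuleType(spec.name)   # normally done by the import machinery
module.__spec__ = spec
spec.loader.exec_module(module)
print(module.answer)  # 42
```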
ModuleSpec Users
ModuleSpec objects have 3 distinct target audiences: Python itself, import hooks, and normal Python users.
Python will use specs in the import machinery, in interpreter startup, and in various standard library modules. Some modules are import-oriented, like pkgutil, and others are not, like pickle and pydoc. In all cases, the full ModuleSpec API will get used.
Import hooks (finders and loaders) will make use of the spec in specific ways. First of all, finders may use the spec factory functions in importlib.util to create spec objects. They may also directly adjust the spec attributes after the spec is created. Secondly, the finder may bind additional information to the spec (in finder_extras) for the loader to consume during module creation/execution. Finally, loaders will make use of the attributes on a spec when creating and/or executing a module.
Python users will be able to inspect a module's __spec__ to get import-related information about the object. Generally, Python applications and interactive users will not be using the ModuleSpec factory functions nor any of the instance methods.
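For instance, inspecting the spec of a stdlib module (as implemented in Python 3.4+):

```python
import csv

spec = csv.__spec__
print(spec.name)          # 'csv'
print(spec.has_location)  # True -- csv is loaded from a file
print(spec.origin)        # the filesystem path of the csv module
```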
How Loading Will Work
Here is an outline of what the import machinery does during loading, adjusted to take advantage of the module's spec and the new loader API:
module = None
if spec.loader is not None and hasattr(spec.loader, 'create_module'):
    module = spec.loader.create_module(spec)
if module is None:
    module = ModuleType(spec.name)
# The import-related module attributes get set here:
_init_module_attrs(spec, module)

if spec.loader is None and spec.submodule_search_locations is not None:
    # Namespace package
    sys.modules[spec.name] = module
elif not hasattr(spec.loader, 'exec_module'):
    spec.loader.load_module(spec.name)
    # __loader__ and __package__ would be explicitly set here for
    # backwards-compatibility.
else:
    sys.modules[spec.name] = module
    try:
        spec.loader.exec_module(module)
    except BaseException:
        try:
            del sys.modules[spec.name]
        except KeyError:
            pass
        raise
module_to_return = sys.modules[spec.name]
These steps are exactly what Loader.load_module() is already expected to do. Loaders will thus be simplified since they will only need to implement exec_module().
Note that we must return the module from sys.modules. During loading the module may have replaced itself in sys.modules. Since we don't have a post-import hook API to accommodate the use case, we have to deal with it. However, in the replacement case we do not worry about setting the import-related module attributes on the object. The module writer is on their own if they are doing this.
How Reloading Will Work
Here is the corresponding outline for reload():
_RELOADING = {}

def reload(module):
    try:
        name = module.__spec__.name
    except AttributeError:
        name = module.__name__
    spec = find_spec(name, target=module)

    if sys.modules.get(name) is not module:
        raise ImportError
    if name in _RELOADING:
        return _RELOADING[name]
    _RELOADING[name] = module
    try:
        if spec.loader is None:
            # Namespace loader
            _init_module_attrs(spec, module)
            return module
        if spec.parent and spec.parent not in sys.modules:
            raise ImportError

        _init_module_attrs(spec, module)
        # Ignoring backwards-compatibility call to load_module()
        # for simplicity.
        spec.loader.exec_module(module)
        return sys.modules[name]
    finally:
        del _RELOADING[name]
A key point here is that the switch to Loader.exec_module() means loaders will no longer have an easy way to know at execution time whether the module is being loaded or reloaded. Before this proposal, they could simply check whether the module was already in sys.modules. Now, by the time exec_module() is called during a load (not a reload), the import machinery has already placed the module in sys.modules. This is part of the reason why find_spec() has the "target" parameter.
The semantics of reload will remain essentially the same as they exist already [5]. The impact of this PEP on some kinds of lazy loading modules was a point of discussion. [4]
ModuleSpec
Attributes
Each of the following names is an attribute on ModuleSpec objects. A value of None indicates "not set". This contrasts with module objects where the attribute simply doesn't exist. Most of the attributes correspond to the import-related attributes of modules. Here is the mapping. The reverse of this mapping describes how the import machinery sets the module attributes right before calling exec_module().
| On ModuleSpec | On Modules |
|---|---|
| name | __name__ |
| loader | __loader__ |
| parent | __package__ |
| origin | __file__* |
| cached | __cached__*,** |
| submodule_search_locations | __path__** |
| loader_state | - |
| has_location | - |
While parent and has_location are read-only properties, the remaining attributes can be replaced after the module spec is created and even after import is complete. This allows for unusual cases where directly modifying the spec is the best option. However, typical use should not involve changing the state of a module's spec.
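The mapping above can be observed on any imported module. This sketch uses importlib.util.find_spec (from today's importlib; json is just a convenient stdlib package) to compare a spec against the module attributes the import machinery set from it:

```python
import importlib.util
import json  # any file-backed stdlib package works here

spec = importlib.util.find_spec("json")

# Each spec field mirrors the corresponding import-related attribute.
assert spec.name == json.__name__
assert spec.parent == json.__package__
assert spec.origin == json.__file__
```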
origin
"origin" is a string for the name of the place from which the module originates. See origin above. Aside from the informational value, it is also used in the module's repr. In the case of a spec where "has_location" is true, __file__ is set to the value of "origin". For built-in modules "origin" would be set to "built-in".
has_location
As explained in the location section above, many modules are "locatable", meaning there is a corresponding resource from which the module will be loaded and that resource can be described by a string. In contrast, non-locatable modules can't be loaded in this fashion, e.g. builtin modules and modules dynamically created in code. For these, the name is the only way to access them, so they have an "origin" but not a "location".
"has_location" is true if the module is locatable. In that case the spec's origin is used as the location and __file__ is set to spec.origin. If additional location information is required (e.g. zipimport), that information may be stored in spec.loader_state.
"has_location" may be implied from the existence of a get_data() method on the loader.
Incidentally, not all locatable modules will be cache-able, but most will.
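The locatable/non-locatable split is easy to see with two stdlib modules, one loaded from a file and one built into the interpreter (sketched with importlib.util.find_spec):

```python
import importlib.util

# A file-backed package is locatable; a builtin module is not.
file_spec = importlib.util.find_spec("json")
builtin_spec = importlib.util.find_spec("sys")

assert file_spec.has_location                  # origin doubles as a location
assert file_spec.origin.endswith("__init__.py")
assert not builtin_spec.has_location           # accessible by name only
assert builtin_spec.origin == "built-in"
```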
submodule_search_locations
The list of location strings, typically directory paths, in which to search for submodules. If the module is a package this will be set to a list (even an empty one). Otherwise it is None.
The name of the corresponding module attribute, __path__, is relatively ambiguous. Instead of mirroring it, we use a more explicit attribute name that makes the purpose clear.
loader_state
A finder may set loader_state to any value to provide additional data for the loader to use during loading. A value of None is the default and indicates that there is no additional data. Otherwise it can be set to any object, such as a dict, list, or types.SimpleNamespace, containing the relevant extra information.
For example, zipimporter could use it to pass the zip archive name to the loader directly, rather than needing to derive it from origin or create a custom loader for each find operation.
loader_state is meant for use by the finder and corresponding loader. It is not guaranteed to be a stable resource for any other use.
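A minimal sketch of the zipimport-style use case described above; the dict keys here are made up for illustration, not part of any real finder's contract:

```python
from importlib.machinery import ModuleSpec

# Hypothetical: a zip-based finder stashes archive details on the spec
# so its loader can pick them up later without re-deriving them.
spec = ModuleSpec("plugin", loader=None)
spec.loader_state = {"archive": "plugins.zip", "offset": 1024}

assert spec.loader_state["archive"] == "plugins.zip"
```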
Factory Functions
spec_from_file_location(name, location, *, loader=None, submodule_search_locations=None)
Build a spec from file-oriented information and loader APIs.
- "origin" will be set to the location.
- "has_location" will be set to True.
- "cached" will be set to the result of calling cache_from_source().
- "origin" can be deduced from loader.get_filename() (if "location" is not passed in).
- "loader" can be deduced from suffix if the location is a filename.
- "submodule_search_locations" can be deduced from loader.is_package() and from os.path.dirname(location) if location is a filename.
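A quick sketch of spec_from_file_location in action, writing a throwaway module to a temp file first (module_from_spec is a later importlib.util helper, used here only to exercise the resulting spec):

```python
import importlib.util
import os
import tempfile

# Write a tiny module to disk, then build a spec from its path.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("ANSWER = 42\n")
    path = f.name

spec = importlib.util.spec_from_file_location("demo", path)
assert spec.origin == path and spec.has_location
assert spec.cached == importlib.util.cache_from_source(path)

# Load through the spec's loader to confirm it is usable.
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
assert module.ANSWER == 42
os.unlink(path)
```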
spec_from_loader(name, loader, *, origin=None, is_package=None)
Build a spec with missing information filled in by using loader APIs.
- "has_location" can be deduced from loader.get_data.
- "origin" can be deduced from loader.get_filename().
- "submodule_search_locations" can be deduced from loader.is_package() and from os.path.dirname(origin) if origin is a filename.
Backward Compatibility
ModuleSpec doesn't raise any backward-compatibility concerns. This would be a different story if Finder.find_module() were to return a module spec instead of a loader. In that case, specs would have to act like the loader that would have been returned instead. Doing so would be relatively simple, but is an unnecessary complication. It was part of earlier versions of this PEP.
Subclassing
Subclasses of ModuleSpec are allowed, but should not be necessary. Simply setting loader_state or adding functionality to a custom finder or loader will likely be a better fit and should be tried first. However, as long as a subclass still fulfills the requirements of the import system, objects of that type are completely fine as the return value of Finder.find_spec(). The same points apply to duck-typing.
Existing Types
Module Objects
Other than adding __spec__, none of the import-related module attributes will be changed or deprecated, though some of them could be; any such deprecation can wait until Python 4.
A module's spec will not be kept in sync with the corresponding import-related attributes. Though they may differ, in practice they will typically be the same.
One notable exception is the case where a module is run as a script by using the -m flag. In that case module.__spec__.name will reflect the actual module name while module.__name__ will be "__main__".
A module's spec is not guaranteed to be identical between two modules with the same name. Likewise there is no guarantee that successive calls to importlib.find_spec() will return the same object or even an equivalent object, though at least the latter is likely.
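The lack of an identity guarantee can be checked directly; equality of the informational fields is all that callers should rely on (sketched with importlib.util.find_spec):

```python
import importlib.util

# Two lookups for the same name: equivalent specs, but identity is not
# guaranteed (find_spec may or may not return a cached spec object).
a = importlib.util.find_spec("collections")
b = importlib.util.find_spec("collections")

assert a.name == b.name == "collections"
assert a.origin == b.origin
```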
Finders
Finders are still responsible for identifying, and typically creating, the loader that should be used to load a module. That loader will now be stored in the module spec returned by find_spec() rather than returned directly. As is currently the case without the PEP, if a loader would be costly to create, that loader can be designed to defer the cost until later.
MetaPathFinder.find_spec(name, path=None, target=None)
PathEntryFinder.find_spec(name, target=None)
Finders must return ModuleSpec objects when find_spec() is called. This new method replaces find_module() and find_loader() (in the PathEntryFinder case). If a finder does not implement find_spec(), find_module() and find_loader() are used instead, for backward compatibility.
Adding yet another similar method to finders is a case of practicality. find_module() could be changed to return specs instead of loaders. This is tempting because the import APIs have suffered enough, especially considering that PathEntryFinder.find_loader() was just added in Python 3.3. However, the extra complexity and a less-than-explicit method name aren't worth it.
The "target" parameter of find_spec()
A call to find_spec() may optionally include a "target" argument. This is the module object that will be used subsequently as the target of loading. During normal import (and by default) "target" is None, meaning the target module has yet to be created. During reloading the module passed in to reload() is passed through to find_spec() as the target. This argument allows the finder to build the module spec with more information than is otherwise available. Doing so is particularly relevant in identifying the loader to use.
Through find_spec() the finder will always identify the loader it will return in the spec (or return None). At the point the loader is identified, the finder should also decide whether or not the loader supports loading into the target module, in the case that "target" is passed in. This decision may entail consulting with the loader.
If the finder determines that the loader does not support loading into the target module, it should either find another loader or raise ImportError (completely stopping import of the module). This determination is especially important during reload since, as noted in How Reloading Will Work, loaders will no longer be able to trivially identify a reload situation on their own.
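The decision described above can be sketched as a toy finder; the `__reloadable__` tag is purely hypothetical, standing in for whatever check a real finder would make against the target module:

```python
import types

class StrictFinder:
    """Hypothetical finder that refuses to reload into unknown modules."""

    def find_spec(self, name, path=None, target=None):
        if target is not None and not getattr(target, "__reloadable__", False):
            # The loader cannot load into the target: stop the import.
            raise ImportError(f"cannot reload {name!r} in place")
        return None  # otherwise defer to the next finder on sys.meta_path

finder = StrictFinder()
assert finder.find_spec("demo") is None  # normal import: target is None

try:
    finder.find_spec("demo", target=types.ModuleType("demo"))
except ImportError as exc:
    print("refused:", exc)
```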
Two alternatives were presented to the "target" parameter: Loader.supports_reload() and adding "target" to Loader.exec_module() instead of find_spec(). supports_reload() was the initial approach to the reload situation. [6] However, there was some opposition to the loader-specific, reload-centric approach. [7]
As to "target" on exec_module(), the loader may need other information from the target module (or spec) during reload, more than just "does this loader support reloading this module", that is no longer available with the move away from load_module(). A proposal on the table was to add something like "target" to exec_module(). [8] However, putting "target" on find_spec() instead is more in line with the goals of this PEP. Furthermore, it obviates the need for supports_reload().
Namespace Packages
Currently a path entry finder may return (None, portions) from find_loader() to indicate it found part of a possible namespace package. To achieve the same effect, find_spec() must return a spec with "loader" set to None (a.k.a. not set) and with submodule_search_locations set to the same portions as would have been provided by find_loader(). It's up to PathFinder how to handle such specs.
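Such a namespace-portion spec might look like the following sketch (the portion paths are made up; a real path entry finder would supply the directories it actually found):

```python
from importlib.machinery import ModuleSpec

# Loader left unset (None); the portions become the search locations.
portions = ["/path/a/nspkg", "/path/b/nspkg"]
spec = ModuleSpec("nspkg", None)
spec.submodule_search_locations = portions

assert spec.loader is None
assert spec.submodule_search_locations == portions
```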
Loaders
Loader.exec_module(module)
Loaders will have a new method, exec_module(). Its only job is to "exec" the module and consequently populate the module's namespace. It is not responsible for creating or preparing the module object, nor for any cleanup afterward. It has no return value. exec_module() will be used during both loading and reloading.
exec_module() should properly handle the case where it is called more than once. For some kinds of modules this may mean raising ImportError every time after the first time the method is called. This is particularly relevant for reloading, where some kinds of modules do not support in-place reloading.
Loader.create_module(spec)
Loaders may also implement create_module(), which returns a new module to exec. It may return None to indicate that the default module creation code should be used. One use case, though atypical, for create_module() is to provide a module that is a subclass of the builtin module type. Most loaders will not need to implement create_module().
create_module() should properly handle the case where it is called more than once for the same spec/module. This may include returning None or raising ImportError.
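A minimal loader implementing both methods might look like this sketch; StringLoader is a hypothetical example, and module_from_spec is a later importlib.util helper used here to drive it:

```python
import importlib.util

class StringLoader:
    """Hypothetical loader that execs source held in memory."""

    def __init__(self, source):
        self.source = source

    def create_module(self, spec):
        return None  # use the import machinery's default module creation

    def exec_module(self, module):
        # Only populate the namespace; attribute setup and sys.modules
        # bookkeeping stay with the import machinery.
        exec(self.source, module.__dict__)

spec = importlib.util.spec_from_loader("inmem", StringLoader("X = 1"))
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

assert module.X == 1
```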
Note
exec_module() and create_module() should not set any import-related module attributes. The fact that load_module() does is a design flaw that this proposal aims to correct.
Other changes:
PEP 420 introduced the optional module_repr() loader method to limit the amount of special-casing in the module type's __repr__(). Since this method is part of ModuleSpec, it will be deprecated on loaders. However, if it exists on a loader it will be used exclusively.
Loader.init_module_attr() method, added prior to Python 3.4's release, will be removed in favor of the same method on ModuleSpec.
However, InspectLoader.is_package() will not be deprecated even though the same information is found on ModuleSpec. ModuleSpec can use it to populate its own is_package if that information is not otherwise available. Still, it will be made optional.
In addition to executing a module during loading, loaders will still be directly responsible for providing APIs concerning module-related data.
Other Changes
- The various finders and loaders provided by importlib will be updated to comply with this proposal.
- Any other implementations of, or dependencies on, the import-related APIs (particularly finders and loaders) in the stdlib will likewise be adjusted to this PEP. While they should continue to work, any such changes that get missed should be considered bugs for the Python 3.4.x series.
- The spec for the __main__ module will reflect how the interpreter was started. For instance, with -m the spec's name will be that of the module used, while __main__.__name__ will still be "__main__".
- We will add importlib.find_spec() to mirror importlib.find_loader() (which becomes deprecated).
- importlib.reload() is changed to use ModuleSpec.
- importlib.reload() will now make use of the per-module import lock.
Reference Implementation
A reference implementation is available at http://bugs.python.org/issue18864.
Implementation Notes
* The implementation of this PEP needs to be cognizant of its impact on pkgutil (and setuptools). pkgutil has some generic function-based extensions to PEP 302 which may break if importlib starts wrapping loaders without the tools' knowledge.
* Other modules to look at: runpy (and pythonrun.c), pickle, pydoc, inspect.
For instance, pickle should be updated in the __main__ case to look at module.__spec__.name.
Rejected Additions to the PEP
There were a few proposed additions to this proposal that did not fit well enough into its scope.
There is no "PathModuleSpec" subclass of ModuleSpec that separates out has_location, cached, and submodule_search_locations. While that might make the separation cleaner, module objects don't have that distinction. ModuleSpec will support both cases equally well.
While "ModuleSpec.is_package" would be a simple additional attribute (aliasing self.submodule_search_locations is not None), it perpetuates the artificial (and mostly erroneous) distinction between modules and packages.
The module spec Factory Functions could be classmethods on ModuleSpec. However that would expose them on all modules via __spec__, which has the potential to unnecessarily confuse non-advanced Python users. The factory functions have a specific use case, to support finder authors. See ModuleSpec Users.
Likewise, several other methods could be added to ModuleSpec that expose the specific uses of module specs by the import machinery:
- create() - a wrapper around Loader.create_module().
- exec(module) - a wrapper around Loader.exec_module().
- load() - an analogue to the deprecated Loader.load_module().
As with the factory functions, exposing these methods via module.__spec__ is less than desirable. They would end up being an attractive nuisance, even if only exposed as "private" attributes (as they were in previous versions of this PEP). If someone finds a need for these methods later, we can expose them via an appropriate API (separate from ModuleSpec) at that point, perhaps relative to PEP 406 (import engine).
Conceivably, the load() method could optionally take a list of modules with which to interact instead of sys.modules. Also, load() could be leveraged to implement multi-version imports. Both are interesting ideas, but definitely outside the scope of this proposal.
Others left out:
- Add ModuleSpec.submodules (RO-property) - returns possible submodules relative to the spec.
- Add ModuleSpec.loaded (RO-property) - the module in sys.modules, if any.
- Add ModuleSpec.data - a descriptor that wraps the data API of the spec's loader.
- Also see [3].
References
| [1] | http://mail.python.org/pipermail/import-sig/2013-August/000658.html |
| [2] | http://docs.python.org/3/reference/import.html |
| [3] | https://mail.python.org/pipermail/import-sig/2013-September/000735.html |
| [4] | https://mail.python.org/pipermail/python-dev/2013-August/128129.html |
| [5] | http://bugs.python.org/issue19413 |
| [6] | https://mail.python.org/pipermail/python-dev/2013-October/129913.html |
| [7] | https://mail.python.org/pipermail/python-dev/2013-October/129971.html |
| [8] | https://mail.python.org/pipermail/python-dev/2013-October/129933.html |
Copyright
This document has been placed in the public domain.
pep-0452 API for Cryptographic Hash Functions v2.0
| PEP: | 452 |
|---|---|
| Title: | API for Cryptographic Hash Functions v2.0 |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | A.M. Kuchling <amk at amk.ca>, Christian Heimes <christian at python.org> |
| Status: | Draft |
| Type: | Informational |
| Created: | 15-Aug-2013 |
| Post-History: | |
| Replaces: | 247 |
Abstract
There are several different modules available that implement
cryptographic hashing algorithms such as MD5 or SHA. This
document specifies a standard API for such algorithms, to make it
easier to switch between different implementations.
Specification
All hashing modules should present the same interface. Additional
methods or variables can be added, but those described in this
document should always be present.
Hash function modules define one function:
new([string]) (unkeyed hashes)
new(key, [string], [digestmod]) (keyed hashes)
Create a new hashing object and return it. The first form is
for hashes that are unkeyed, such as MD5 or SHA. For keyed
hashes such as HMAC, 'key' is a required parameter containing
a string giving the key to use. In both cases, the optional
'string' parameter, if supplied, will be immediately hashed
into the object's starting state, as if obj.update(string) was
called.
After creating a hashing object, arbitrary bytes can be fed
into the object using its update() method, and the hash value
can be obtained at any time by calling the object's digest()
method.
Although the parameter is called 'string', hashing objects operate
on 8-bit data only. Both 'key' and 'string' must be a bytes-like
object (bytes, bytearray...). A hashing object may support
one-dimensional, contiguous buffers as argument, too. Text
(unicode) is no longer supported in Python 3.x. Python 2.x
implementations may take ASCII-only unicode as argument, but
portable code should not rely on the feature.
Arbitrary additional keyword arguments can be added to this
function, but if they're not supplied, sensible default values
should be used. For example, 'rounds' and 'digest_size'
keywords could be added for a hash function which supports a
variable number of rounds and several different output sizes,
and they should default to values believed to be secure.
Hash function modules define one variable:
digest_size
An integer value; the size of the digest produced by the
hashing objects created by this module, measured in bytes.
You could also obtain this value by creating a sample object
and accessing its 'digest_size' attribute, but it can be
convenient to have this value available from the module.
Hashes with a variable output size will set this variable to
None.
Hashing objects require the following attribute:
digest_size
This attribute is identical to the module-level digest_size
variable, measuring the size of the digest produced by the
hashing object, measured in bytes. If the hash has a variable
output size, this output size must be chosen when the hashing
object is created, and this attribute must contain the
selected size. Therefore None is *not* a legal value for this
attribute.
block_size
An integer value or ``NotImplemented``; the internal block size
of the hash algorithm in bytes. The block size is used by the
HMAC module to pad the secret key to the block size, or to hash
the secret key if it is longer than the block size. If no HMAC
algorithm is standardized for the hash algorithm, return
``NotImplemented`` instead.
name
A text string value; the canonical, lowercase name of the hashing
algorithm. The name should be a suitable parameter for
:func:`hashlib.new`.
Hashing objects require the following methods:
copy()
Return a separate copy of this hashing object. An update to
this copy won't affect the original object.
digest()
Return the hash value of this hashing object as a bytes
containing 8-bit data. The object is not altered in any way
by this function; you can continue updating the object after
calling this function.
hexdigest()
Return the hash value of this hashing object as a string
containing hexadecimal digits. Lowercase letters should be used
for the digits 'a' through 'f'. Like the .digest() method, this
method mustn't alter the object.
update(string)
Hash bytes-like 'string' into the current state of the hashing
object. update() can be called any number of times during a
hashing object's lifetime.
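The streaming semantics specified above (update() accumulating state, digest() and copy() leaving the object untouched) can be exercised against hashlib, whose objects follow this API:

```python
import hashlib

# Feeding data in pieces is equivalent to hashing it all at once.
h = hashlib.sha256()
h.update(b"hello ")
h.update(b"world")
assert h.hexdigest() == hashlib.sha256(b"hello world").hexdigest()

# copy() forks the state; updating the copy leaves the original intact.
c = h.copy()
c.update(b"!")
assert c.hexdigest() != h.hexdigest()

# digest() does not alter the object and honors digest_size.
assert h.digest_size == 32 and len(h.digest()) == 32
```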
Hashing modules can define additional module-level functions or
object methods and still be compliant with this specification.
Here's an example, using a module named 'MD5':
>>> import hashlib
>>> from Crypto.Hash import MD5
>>> m = MD5.new()
>>> isinstance(m, hashlib.CryptoHash)
True
>>> m.name
'md5'
>>> m.digest_size
16
>>> m.block_size
64
>>> m.update(b'abc')
>>> m.digest()
b'\x90\x01P\x98<\xd2O\xb0\xd6\x96?}(\xe1\x7fr'
>>> m.hexdigest()
'900150983cd24fb0d6963f7d28e17f72'
>>> MD5.new(b'abc').digest()
b'\x90\x01P\x98<\xd2O\xb0\xd6\x96?}(\xe1\x7fr'
Rationale
The digest size is measured in bytes, not bits, even though hash
algorithm sizes are usually quoted in bits; MD5 is a 128-bit
algorithm and not a 16-byte one, for example. This is because, in
the sample code I looked at, the length in bytes is often needed
(to seek ahead or behind in a file; to compute the length of an
output string) while the length in bits is rarely used.
Therefore, the burden will fall on the few people actually needing
the size in bits, who will have to multiply digest_size by 8.
It's been suggested that the update() method would be better named
append(). However, that method is really causing the current
state of the hashing object to be updated, and update() is already
used by the md5 and sha modules included with Python, so it seems
simplest to leave the name update() alone.
The order of the constructor's arguments for keyed hashes was a
sticky issue. It wasn't clear whether the key should come first
or second. It's a required parameter, and the usual convention is
to place required parameters first, but that also means that the
'string' parameter moves from the first position to the second.
It would be possible to get confused and pass a single argument to
a keyed hash, thinking that you're passing an initial string to an
unkeyed hash, but it doesn't seem worth making the interface
for keyed hashes more obscure to avoid this potential error.
Changes from Version 1.0 to Version 2.0
Version 2.0 of API for Cryptographic Hash Functions clarifies some
aspects of the API and brings it up to date. It also formalizes
aspects that were already de facto standards and provided by most
implementations.
Version 2.0 introduces the following new attributes:
name
The name property was made mandatory by :issue:`18532`.
block_size
The new version also specifies that the return value
``NotImplemented`` prevents HMAC support.
Version 2.0 takes the separation of binary and text data in Python
3.0 into account. The 'string' argument to new() and update() as
well as the 'key' argument must be bytes-like objects. On Python
2.x a hashing object may also support ASCII-only unicode. The
actual name of the argument is not changed, as it is part of the
public API; code may depend on the fact that the argument is
called 'string'.
Recommended names for common hashing algorithms
| algorithm | variant | recommended name |
|---|---|---|
| MD5 | | md5 |
| RIPEMD-160 | | ripemd160 |
| SHA-1 | | sha1 |
| SHA-2 | SHA-224 | sha224 |
| | SHA-256 | sha256 |
| | SHA-384 | sha384 |
| | SHA-512 | sha512 |
| SHA-3 | SHA-3-224 | sha3_224 |
| | SHA-3-256 | sha3_256 |
| | SHA-3-384 | sha3_384 |
| | SHA-3-512 | sha3_512 |
| WHIRLPOOL | | whirlpool |
Changes
2001-09-17: Renamed clear() to reset(); added digest_size attribute
to objects; added .hexdigest() method.
2001-09-20: Removed reset() method completely.
2001-09-28: Set digest_size to None for variable-size hashes.
2013-08-15: Added block_size and name attributes; clarified that
'string' actually refers to bytes-like objects.
Acknowledgements
Thanks to Aahz, Andrew Archibald, Rich Salz, Itamar
Shtull-Trauring, and the readers of the python-crypto list for
their comments on this PEP.
Copyright
This document has been placed in the public domain.
pep-0453 Explicit bootstrapping of pip in Python installations
| PEP: | 453 |
|---|---|
| Title: | Explicit bootstrapping of pip in Python installations |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Donald Stufft <donald at stufft.io>, Nick Coghlan <ncoghlan at gmail.com> |
| BDFL-Delegate: | Martin von Löwis |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 10-Aug-2013 |
| Post-History: | 30-Aug-2013, 15-Sep-2013, 18-Sep-2013, 19-Sep-2013, 23-Sep-2013, 29-Sep-2013, 13-Oct-2013, 20-Oct-2013 |
| Resolution: | https://mail.python.org/pipermail/python-dev/2013-October/129810.html |
Contents
- Abstract
- PEP Acceptance
- Rationale
- Proposal Overview
- Explicit bootstrapping mechanism
- Security considerations
- Reliability considerations
- Implementation strategy
- Integration timeline
- Proposed CLI
- Proposed module API
- Invocation from the CPython installers
- Installing from source
- Changes to virtual environments
- Documentation
- Bundling CA certificates with CPython
- Automatic installation of setuptools
- Updating the private copy of pip
- Updating the ensurepip module API and CLI
- Uninstallation
- Script Execution on Windows
- Recommendations for Downstream Distributors
- Policies & Governance
- Appendix: Rejected Proposals
- References
- Copyright
Abstract
This PEP proposes that the Installing Python Modules guide in Python 2.7, 3.3 and 3.4 be updated to officially recommend the use of pip as the default installer for Python packages, and that appropriate technical changes be made in Python 3.4 to provide pip by default in support of that recommendation.
PEP Acceptance
This PEP was accepted for inclusion in Python 3.4 by Martin von Löwis on Tuesday 22nd October, 2013.
Issue 19347 has been created to track the implementation of this PEP.
Rationale
There are two related, but distinct rationales for the proposal in this PEP. The first relates to the experience of new users, while the second relates to better enabling the evolution of the broader Python packaging ecosystem.
Improving the new user experience
Currently, on systems without a platform package manager and repository, installing a third-party Python package into a freshly installed Python requires first identifying an appropriate package manager and then installing it.
Even on systems that do have a platform package manager, it is unlikely to include every package that is available on the Python Package Index, and even when a desired third-party package is available, the correct name in the platform package manager may not be clear.
This means that, to work effectively with the Python Package Index ecosystem, users must know which package manager to install, where to get it, and how to install it. The effect of this is that third-party Python projects are currently required to choose from a variety of undesirable alternatives:
- Assume the user already has a suitable cross-platform package manager installed.
- Duplicate the instructions and tell their users how to install the package manager.
- Completely forgo the use of dependencies to ease installation concerns for their users.
All of these available options have significant drawbacks.
If a project simply assumes a user already has the tooling then beginning users may get a confusing error message when the installation command doesn't work. Some operating systems may ease this pain by providing a global hook that looks for commands that don't exist and suggest an OS package they can install to make the command work, but that only works on systems with platform package managers that include a package that provides the relevant cross-platform installer command (such as many major Linux distributions). No such assistance is available for Windows and Mac OS X users, or more conservative Linux distributions. The challenges of dealing with this problem for beginners (who are often also completely new to programming, the use of command line tools and editing system environment variables) are a regular feature of feedback the core Python developers receive from professional educators and others introducing new users to Python.
If a project chooses to duplicate the installation instructions and tell their users how to install the package manager before telling them how to install their own project, then whenever those instructions change, every project that has duplicated them must update its copy. This is particularly problematic when there are multiple competing installation tools available, and different projects recommend different tools.
This specific problem can be partially alleviated by strongly promoting pip as the default installer and recommending that other projects reference pip's own bootstrapping instructions rather than duplicating them. However the user experience created by this approach still isn't particularly good (although there is an effort under way to create a combined Windows installer for pip and its dependencies that should improve matters on that platform, and Mac OS X and *nix platforms generally have wget and hence the ability to easily download and run the bootstrap scripts from the command line).
The projects that have decided to forgo dependencies altogether are forced to either duplicate the efforts of other projects by inventing their own solutions to problems or are required to simply include the other projects in their own source trees. Both of these options present their own problems either in duplicating maintenance work across the ecosystem or potentially leaving users vulnerable to security issues because the included code or duplicated efforts are not automatically updated when upstream releases a new version.
By officially recommending and providing by default a specific cross-platform package manager it will be easier for users trying to install these third-party packages as well as easier for the people distributing them as they should now be able to safely assume that most users will have the appropriate installation tools available (or access to clear instructions on how to obtain them). This is expected to become more important in the future as the Wheel [17] package format (deliberately) does not have a built in "installer" in the form of setup.py so users wishing to install from a wheel file will want an installer even in the simplest cases.
Reducing the burden of actually installing a third-party package should also decrease the pressure to add every useful module to the standard library. This will allow additions to the standard library to focus more on why Python should have a particular tool out of the box, and why it is reasonable for that package to adopt the standard library's 18-24 month feature release cycle, instead of using the general difficulty of installing third-party packages as justification for inclusion.
Providing a standard installation system also helps with bootstrapping alternate build and installer systems, such as zc.buildout, hashdist and conda. So long as pip install <tool> works, then a standard Python-specific installer provides a reasonably secure, cross platform mechanism to get access to these utilities.
Enabling the evolution of the broader Python packaging ecosystem
As no new packaging standard can achieve widespread adoption without a transition strategy that covers the versions of Python that are in widespread current use (rather than merely future versions, like most language features), the change proposed in this PEP is considered a necessary step in the evolution of the Python packaging ecosystem.
The broader community has embraced the Python Package Index as a mechanism for distributing and installing Python software, but the different concerns of language evolution and secure software distribution mean that a faster feature release cycle that encompasses older versions is needed to properly support the latter.
In addition, the core CPython development team have the luxury of dropping support for earlier Python versions well before the rest of the community, as downstream commercial redistributors pick up the task of providing support for those versions to users that still need it, while many third party libraries maintain compatibility with those versions as long as they remain in widespread use.
This means that the current setup.py install based model for package installation poses serious difficulties for the development and adoption of new packaging standards, as, depending on how a project writes their setup.py file, the installation command (along with other operations) may end up invoking the standard library's distutils package.
As an indicator of how this may cause problems for the broader ecosystem, consider that the feature set of distutils in Python 2.6 was frozen in June 2008 (with the release of Python 2.6b1), while the feature set of distutils in Python 2.7 was frozen in April 2010 (with the release of Python 2.7b1).
By contrast, using a separate installer application like pip (which ensures that even setup.py files that invoke distutils directly still support the new packaging standards) makes it possible to support new packaging standards in older versions of Python, just by upgrading pip (which receives new feature releases roughly every 6 months). The situation on older versions of Python is further improved by making it easier for end users to install and upgrade newer build systems like setuptools or improved PyPI upload utilities like twine.
It is not coincidental that this proposed model of using a separate installer program with more metadata heavy and less active distribution formats matches that used by most operating systems (including Windows since the introduction of the installer service and the MSI file format), as well as many other language specific installers.
For Python 2.6, this compatibility issue is largely limited to various enterprise Linux distributions (and their downstream derivatives). These distributions often have even slower update cycles than CPython, so they offer full support for versions of Python that are considered "security fix only" versions upstream (sometimes even to the point where the core development team no longer supports them at all - you can still get commercial support for Python 2.3 if you really need it!).
In practice, the fact that tools like wget and curl are readily available on Linux systems, that most users of Python on Linux are already familiar with the command line, and that most Linux distributions ship with a default configuration that makes running Python scripts easy, means that the existing pip bootstrapping instructions for any *nix system are already quite straightforward. Even if pip isn't provided by the system package manager, using wget or curl to retrieve the bootstrap script from www.pip-installer.org and then running it is just a couple of shell commands that can easily be copied and pasted as necessary.
Accordingly, for any version of Python on any *nix system, the need to bootstrap pip in older versions isn't considered a major barrier to adoption of new packaging standards, since it's just one more small speedbump encountered by users of these long term stable releases. For *nix systems, this PEP's formal endorsement of pip as the preferred default packaging tool is seen as more important than the underlying technical details involved in making pip available by default, since it shifts the nature of the conversation between the developers of pip and downstream repackagers of both pip and CPython.
For Python 2.7, on the other hand, the compatibility issue for adopting new metadata standards is far more widespread, as it affects the python.org binary installers for Windows and Mac OS X, as well as even relatively fast moving *nix platforms.
Firstly, and unlike Python 2.6, Python 2.7 is still a fully supported upstream version, and will remain so until the release of Python 2.7.9 (currently scheduled for May 2015), at which time it is expected to enter the usual "security fix only" mode. That means there are at least another 19 months where Python 2.7 is a deployment target for Python applications that enjoys full upstream support. Even after the core development team switches 2.7 to security release only mode in 2015, Python 2.7 will likely remain a commercially supported legacy target out beyond 2020.
While Python 3 already presents a compelling alternative over Python 2 for new Python applications and deployments without an existing investment in Python 2 and without a dependency on specific Python 2 only third party modules (a set which is getting ever smaller over time), it is going to take longer to create compelling business cases to update existing Python 2.7 based infrastructure to Python 3, especially in situations where the culture of automated testing is weak (or nonexistent), making it difficult to effectively use the available migration utilities.
While this PEP only proposes documentation changes for Python 2.7, once pip has a Windows installer available, a separate PEP will be created and submitted proposing the creation and distribution of aggregate installers for future CPython 2.7 maintenance releases that combine the CPython, pip and Python Launcher for Windows installers into a single download (the separate downloads would still remain available - the aggregate installers would be provided as a convenience, and as a clear indication of the recommended operating environment for Python in Windows systems).
Why pip?
pip has been chosen as the preferred default installer, as it is an already popular tool that addresses several design and user experience issues with its predecessor easy_install (these issues can't readily be fixed in easy_install itself due to backwards compatibility concerns). pip is also well suited to working within the bounds of a single Python runtime installation (including associated virtual environments), which is a desirable feature for a tool bundled with CPython.
Other tools like zc.buildout and conda are more ambitious in their aims (and hence substantially better than pip at handling external binary dependencies), so it makes sense for the Python ecosystem to treat them more like platform package managers to interoperate with rather than as the default cross-platform installation tool. This relationship is similar to that between pip and platform package management systems like apt and yum (which are also designed to handle arbitrary binary dependencies).
Proposal Overview
This PEP proposes that the Installing Python Modules guide be updated to officially recommend the use of pip as the default installer for Python packages, rather than the current approach of recommending the direct invocation of the setup.py install command.
However, to avoid recommending a tool that CPython does not provide, it is further proposed that the pip [18] package manager be made available by default when installing CPython 3.4 or later and when creating virtual environments using the standard library's venv module via the pyvenv command line utility.
To support that end, this PEP proposes the inclusion of an ensurepip bootstrapping module in Python 3.4, as well as automatic invocation of that module from pyvenv and changes to the way Python installed scripts are handled on Windows. Using a bootstrap module rather than providing pip directly helps to clearly demarcate development responsibilities, and to avoid inadvertently downgrading pip when updating CPython.
To provide clear guidance for new users of Python that may not be starting with the latest release, this PEP also proposes that the "Installing Python Modules" guides in Python 2.7 and 3.3 be updated to recommend installing and using pip, rather than invoking distutils directly. It does not propose backporting any of the code changes that are being proposed for Python 3.4.
Finally, the PEP also strongly recommends that CPython redistributors and other Python implementations ensure that pip is available by default, or at the very least, explicitly document the fact that it is not included.
This PEP does not propose making pip (or any dependencies) directly available as part of the standard library. Instead, pip will be a bundled application provided along with CPython for the convenience of Python users, but subject to its own development life cycle and able to be upgraded independently of the core interpreter and standard library.
Explicit bootstrapping mechanism
An additional module called ensurepip will be added to the standard library whose purpose is to install pip and any of its dependencies into the appropriate location (most commonly site-packages). It will expose a callable named bootstrap() as well as offer direct execution via python -m ensurepip.
The bootstrap will not contact PyPI, but instead rely on a private copy of pip stored inside the standard library. Accordingly, only options related to the installation location will be supported (--user, --root, etc).
It is considered desirable that users be strongly encouraged to use the latest available version of pip, in order to take advantage of the ongoing efforts to improve the security of the PyPI based ecosystem, as well as benefiting from the efforts to improve the speed, reliability and flexibility of that ecosystem.
In order to satisfy this goal of providing the most recent version of pip by default, the private copy of pip will be updated in CPython maintenance releases, which should align well with the 6-month cycle used for new pip releases.
Security considerations
The design in this PEP has been deliberately chosen to avoid making any significant changes to the trust model of CPython for end users that do not subsequently run the command pip install --upgrade pip.
The installers will contain all the components of a fully functioning version of Python, including the pip installer. The installation process will not require network access, and will not rely on trusting the security of the network connection established between pip and the Python package index.
Only users that choose to use pip to communicate with PyPI will need to pay attention to the additional security considerations that come with doing so.
However, the core CPython team will still assist with reviewing and resolving at least the certificate update management issue currently affecting the requests project (and hence pip), and may also be able to offer assistance in resolving other identified security concerns [6].
Reliability considerations
By including the bootstrap as part of the standard library (rather than solely as a feature of the binary installers), the correct operation of the bootstrap command can be easily tested using the existing CPython buildbot infrastructure rather than adding significantly to the testing burden for the installers themselves.
Implementation strategy
To ensure there is no need for network access when installing Python or creating virtual environments, the ensurepip module will, as an implementation detail, include a complete private copy of pip and its dependencies which will be used to extract pip and install it into the target environment. It is important to stress that this private copy of pip is only an implementation detail and it should not be relied on or assumed to exist beyond the public capabilities exposed through the ensurepip module (and indirectly through venv).
There is not yet a reference ensurepip implementation. The existing get-pip.py bootstrap script demonstrates an earlier variation of the general concept, but the standard library version would take advantage of the improved distribution capabilities offered by the CPython installers to include private copies of pip and setuptools as wheel files (rather than as embedded base64 encoded data), and would not try to contact PyPI (instead installing directly from the private wheel files).
Rather than including separate code to handle the bootstrapping, the ensurepip module will manipulate sys.path appropriately to allow the wheel files to be used to install themselves, either into the current Python installation or into a virtual environment (as determined by the options passed to the bootstrap command).
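The sys.path manipulation described above works because a wheel is a zip archive laid out so that Python can import from it directly. The following sketch demonstrates just that mechanism with a hypothetical `demo_mod` archive standing in for the bundled pip wheel (the archive name and module are illustrative, not part of the actual ensurepip implementation):

```python
import os
import sys
import tempfile
import zipfile

# Build a minimal wheel-style zip archive containing one module.
tmp_dir = tempfile.mkdtemp()
archive = os.path.join(tmp_dir, "demo_mod-1.0-py3-none-any.whl")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("demo_mod.py", "VERSION = '1.0'\n")

# Putting the archive on sys.path makes its contents importable via
# zipimport - the same trick ensurepip uses so the bundled pip wheel
# can run its own installation machinery.
sys.path.insert(0, archive)
import demo_mod

print(demo_mod.VERSION)  # -> 1.0
```

The real bootstrap applies this to the private pip and setuptools wheels, then invokes the in-archive pip to install those same wheels into the target environment.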
It is proposed that the implementation be carried out in six separate steps (all steps after the first two are independent of each other and can be carried out in any order):
- the first step would update the "Installing Python Modules" documentation to recommend the use of pip and reference the pip team's instructions for downloading and installing it. This change would be applied to Python 2.7, 3.3, and 3.4.
- the ensurepip module and the private copies of the most recently released versions of pip and setuptools would be added to Python 3.4 and the 3.4 "Installing Python Modules" documentation updated accordingly.
- the CPython Windows installer would be updated to offer the new pip installation option for Python 3.4.
- the CPython Mac OS X installer would be updated to offer the new pip installation option for Python 3.4.
- the venv module and pyvenv command would be updated to make use of ensurepip in Python 3.4
- the PATH handling on Windows would be updated for Python 3.4+
Integration timeline
If this PEP is accepted, the proposed time frame for integration of pip into the CPython release is as follows:
- as soon as possible after the release of 3.4.0 alpha 4:
  - Documentation updated and ensurepip implemented based on a pre-release version of pip 1.5.
  - All other proposed functional changes for Python 3.4 implemented, including the installer updates to invoke ensurepip.
- by November 20th (3 days prior to the scheduled date of 3.4.0 beta 1):
  - ensurepip updated to use a pip 1.5 release candidate.
  - PEP 101 updated to cover ensuring the bundled version of pip is up to date.
- by November 24th (scheduled date of 3.4.0 beta 1):
  - As with any other new feature, all proposed functional changes for Python 3.4 must be implemented prior to the beta feature freeze.
- by December 29th (1 week prior to the scheduled date of 3.4.0 beta 2):
  - requests certificate management issue resolved
  - ensurepip updated to the final release of pip 1.5, or a subsequent maintenance release (including a suitably updated vendored copy of requests)
(See PEP 429 for the current official scheduled dates of each release. Dates listed above are accurate as of October 20th, 2013.)
If there is no final or maintenance release of pip 1.5 with a suitable updated version of requests available by one week before the scheduled Python 3.4 beta 2 release, then implementation of this PEP will be deferred to Python 3.5. Note that this scenario is considered unlikely - the tentative date for the pip 1.5 release is currently December 1st.
In future CPython releases, this kind of coordinated scheduling shouldn't be needed: the CPython release manager will be able to just update to the latest released version of pip. However, in this case, some fixes are needed in pip in order to allow the bundling to work correctly, and the certificate update mechanism for requests needs to be improved, so the pip 1.5 release cycle needs to be properly aligned with the CPython 3.4 beta releases.
Proposed CLI
The proposed CLI is based on a subset of the existing pip install options:
Usage:

  python -m ensurepip [options]

General Options:
  -h, --help       Show help.
  -v, --verbose    Give more output. Option is additive, and can be used up to 3 times.
  -V, --version    Show the pip version that would be extracted and exit.
  -q, --quiet      Give less output.

Installation Options:
  -U, --upgrade    Upgrade pip and dependencies, even if already installed.
  --user           Install using the user scheme.
  --root <dir>     Install everything relative to this alternate root directory.
In most cases, end users won't need to use this CLI directly, as pip should have been installed automatically when installing Python or when creating a virtual environment. However, it is formally documented as a public interface to support at least these known use cases:
- Windows and Mac OS X installations where the "Install pip" option was not chosen during installation
- any installation where the user previously ran "pip uninstall pip"
Users that want to retrieve the latest version from PyPI, or otherwise need more flexibility, can then invoke the extracted pip appropriately.
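For the known use cases above, the CLI can also be probed without modifying anything: the --version option simply reports the bundled pip. A minimal sketch, using only the standard library:

```python
import subprocess
import sys

# Ask ensurepip which pip version it would extract. This touches neither
# the network nor the current environment.
output = subprocess.check_output(
    [sys.executable, "-m", "ensurepip", "--version"]
)
print(output.decode().strip())  # e.g. "pip 1.5" for the release discussed here
```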
Proposed module API
The proposed ensurepip module API consists of the following two functions:
def version():
"""
Returns a string specifying the bundled version of pip.
"""
def bootstrap(root=None, upgrade=False, user=False, verbosity=0):
"""
Bootstrap pip into the current Python installation (or the given root
directory).
"""
Invocation from the CPython installers
The CPython Windows and Mac OS X installers will each gain a new option:
- Install pip (the default Python package management utility)?
This option will be checked by default.
If the option is checked, then the installer will invoke the following command with the just installed Python:
python -m ensurepip --upgrade
This ensures that, by default, installing or updating CPython will ensure that the installed version of pip is at least as recent as the one included with that version of CPython. If a newer version of pip has already been installed then python -m ensurepip --upgrade will simply return without doing anything.
Installing from source
Just as the prebuilt binary installers will be updated to run python -m ensurepip by default, a similar change will be made to the make install and make altinstall commands of the source distribution. The directory settings in the sysconfig module should ensure the pip components are automatically installed to the expected locations.
ensurepip itself (including the private copy of pip and its dependencies) will always be installed normally (as it is a regular part of the standard library), but an option will be provided to skip the invocation of ensurepip.
This means that even installing from source will provide pip by default, but redistributors providing pip by other means (or not providing it at all) will still be able to opt out of installing it using ensurepip.
Changes to virtual environments
Python 3.3 included a standard library approach to virtual Python environments through the venv module. Since its release it has become clear that very few users have been willing to use this feature directly, in part due to the lack of an installer present by default inside of the virtual environment. They have instead opted to continue using the virtualenv package which does include pip installed by default.
To make the venv module more useful, it will be modified to run the pip bootstrap by default inside the new environment while creating it. This will give people the same convenience inside the virtual environment as this PEP provides outside of it, as well as bringing the venv module closer to feature parity with the external virtualenv package, making it a more suitable replacement.
To handle cases where a user does not wish to have pip bootstrapped into their virtual environment a --without-pip option will be added.
The venv.EnvBuilder and venv.create APIs will be updated to accept one new parameter: with_pip (defaulting to False).
The new default for the module API is chosen for backwards compatibility with the current behaviour (as it is assumed that most invocations of the venv module happen through third-party tools that likely will not want pip installed without explicitly requesting it), while the default for the command line interface is chosen to try to ensure pip is available in most virtual environments without additional action on the part of the end user.
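A sketch of the proposed module-level API, assuming the with_pip parameter lands as described (with_pip=False preserves the old behaviour, so library callers must opt in explicitly):

```python
import os
import tempfile
import venv

# Create an environment without pip - this matches the current (3.3)
# behaviour and is the proposed module-level default.
target = os.path.join(tempfile.mkdtemp(), "env")
venv.create(target, with_pip=False)
print(os.path.exists(os.path.join(target, "pyvenv.cfg")))

# venv.create(target, with_pip=True) would additionally run ensurepip
# inside the new environment, matching the proposed CLI default
# (which can be suppressed with --without-pip).
```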
As this change will only benefit Python 3.4 and later versions, the third-party virtualenv project will still be needed to obtain a consistent cross-version experience in Python 3.3 and 2.7.
Documentation
The "Installing Python Modules" section of the standard library documentation in Python 2.7, 3.3 and 3.4 will be updated to recommend the use of the pip installer, either provided by default in Python 3.4 or retrieved and installed by the user in Python 2.7 or 3.3. It will give a brief description of the most common commands and options, but delegate to the externally maintained pip documentation for the full details.
In Python 3.4, the pyvenv and venv documentation will also be updated to reference the revised module installation guide.
The existing content of the module installation guide will be retained in all versions, but under a new "Invoking distutils directly" subsection.
Bundling CA certificates with CPython
The ensurepip implementation will include the pip CA bundle along with the rest of pip. This means CPython effectively includes a CA bundle that is used solely by pip after it has been extracted.
This is considered preferable to relying solely on the system certificate stores, as it ensures that pip will behave the same across all supported versions of Python, even those prior to Python 3.4 that cannot access the system certificate store on Windows.
Automatic installation of setuptools
pip currently depends on setuptools to handle metadata generation during the build process, along with some other features. While work is ongoing to reduce or eliminate this dependency, it is not clear if that work will be complete for pip 1.5 (which is the version likely to be current when Python 3.4.0 is released).
This PEP proposes that, if pip still requires it as a dependency, ensurepip will include a private copy of setuptools (in addition to the private copy of pip). python -m ensurepip will then install the private copy in addition to installing pip itself.
However, this behavior is officially considered an implementation detail. Other projects which explicitly require setuptools must still provide an appropriate dependency declaration, rather than assuming setuptools will always be installed alongside pip.
Once pip is able to run pip install --upgrade pip without needing setuptools installed first, then the private copy of setuptools will be removed from ensurepip in subsequent CPython releases.
As long as setuptools is needed, it will be a completely unmodified copy of the latest upstream setuptools release, including the easy_install script if the upstream setuptools continues to include it. The installation of easy_install along with pip isn't considered desirable, but installing a broken setuptools would be worse. This problem will naturally resolve itself once the pip developers have managed to eliminate their dependency on setuptools and the private copy of setuptools can be removed entirely from CPython.
Updating the private copy of pip
In order to keep up with evolutions in packaging, as well as to provide users with as recent a version as possible, the ensurepip module will be regularly updated to the latest versions of everything it bootstraps.
After each new pip release, and again during the preparation for any release of Python (including feature releases), a script, provided as part of the implementation for this PEP, will be run to ensure the private copies stored in the CPython source repository have been updated to the latest versions.
Updating the ensurepip module API and CLI
Like venv and pyvenv, the ensurepip module API and CLI will be governed by the normal rules for the standard library: no new features are permitted in maintenance releases.
However, the embedded components may be updated as noted above, so the extracted pip may offer additional functionality in maintenance releases.
Uninstallation
No changes are proposed to the CPython uninstallation process by this PEP. The bootstrapped pip will be installed the same way as any other pip installed packages, and will be handled in the same way as any other post-install additions to the Python environment.
At least on Windows, that means the bootstrapped files will be left behind after uninstallation, since those files won't be associated with the Python MSI installer.
While the case can be made for the CPython installers clearing out these directories automatically, changing that behaviour is considered outside the scope of this PEP.
Script Execution on Windows
While the Windows installer was updated in Python 3.3 to optionally make python available on the PATH, no such change was made to include the script installation directory returned by sysconfig.get_path("scripts").
Accordingly, in addition to adding the option to extract and install pip during installation, this PEP proposes that the Windows installer in Python 3.4 and later be updated to also add the path returned by sysconfig.get_path("scripts") to the Windows PATH when the PATH modification option is enabled during installation.
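The directory in question can be inspected directly; its value depends on the platform and installation scheme (typically a Scripts subdirectory of the installation prefix on Windows, and a bin directory on POSIX systems):

```python
import sysconfig

# The script installation directory the installer would add to PATH.
scripts_dir = sysconfig.get_path("scripts")
print(scripts_dir)
```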
Note that this change will only be available in Python 3.4 and later.
This means that, for Python 3.3, the most reliable way to invoke pip globally on Windows (without tinkering manually with PATH) will still remain py -m pip (or py -3 -m pip to select the Python 3 version if both Python 2 and 3 are installed) rather than simply calling pip. This works because Python 3.3 provides the Python Launcher for Windows (and the associated py command) by default.
For Python 2.7 and 3.2, the most reliable mechanism will be to install the Python Launcher for Windows using the standalone installer and then use py -m pip as noted above.
Adding the scripts directory to the system PATH will mean that pip works reliably in the "only one Python installation on the system PATH" case, with py -m pip, pipX, or pipX.Y needed only to select a non-default version in the parallel installation case (and outside a virtual environment). This change should also make the pyvenv command substantially easier to invoke on Windows, along with all scripts installed by pip, easy_install and similar tools.
While the script invocations on recent versions of Python will run through the Python launcher for Windows, this shouldn't cause any issues, as long as the Python files in the Scripts directory correctly specify a Python version in their shebang line or have an adjacent Windows executable (as easy_install and pip do).
Recommendations for Downstream Distributors
A common source of Python installations is downstream distributors such as the various Linux distributions [8] [9] [10], Mac OS X package managers [11] [12] [13], and commercial Python redistributors [14] [15] [16]. In order to provide a consistent, user-friendly experience to all users of Python, regardless of how they obtained Python, this PEP recommends and asks that downstream distributors:
- Ensure that whenever Python is installed, pip is either installed or is otherwise made readily available to end users.
- For redistributors using binary installers, this may take the form of optionally executing the ensurepip bootstrap during installation, similar to the CPython installers.
- For redistributors using package management systems, it may take the form of separate packages with dependencies on each other so that installing the Python package installs the pip package and installing the pip package installs the Python package.
- Another reasonable way to implement this is to package pip separately but ensure that there is some sort of global hook that will recommend installing the separate pip package when a user executes pip without it being installed. Systems that choose this option should ensure that the ensurepip module still installs pip directly when invoked inside a virtual environment, but may modify the module in the system Python installation to redirect to the platform provided mechanism when installing pip globally.
- Even if pip is made available globally by other means, do not remove the ensurepip module in Python 3.4 or later.
- ensurepip will be required for automatic installation of pip into virtual environments by the venv module.
- This is similar to the existing virtualenv package for which many downstream distributors have already made exception to the common "debundling" policy.
- This does mean that if pip needs to be updated due to a security issue, so does the private copy in the ensurepip bootstrap module.
- However, altering the private copy of pip to remove the embedded CA certificate bundle and rely on the system CA bundle instead is a reasonable change.
- Ensure that all features of this PEP continue to work with any modifications made to the redistributed version of Python.
- Checking the version of pip that will be bootstrapped using python -m ensurepip --version or ensurepip.version().
- Installation of pip into a global or virtual python environment using python -m ensurepip or ensurepip.bootstrap().
- pip install --upgrade pip in a global installation should not affect any already created virtual environments (but is permitted to affect future virtual environments, even though it will not do so when using the standard implementation of ensurepip).
- pip install --upgrade pip in a virtual environment should not affect the global installation.
- Migrate build systems to utilize pip [18] and Wheel [17] wherever feasible, and avoid directly invoking setup.py.
- This will help ensure a smoother and more timely migration to improved metadata formats as the Python packaging ecosystem continues to evolve.
In the event that a Python redistributor chooses not to follow these recommendations, we request that they explicitly document this fact and provide their users with suitable guidance on translating upstream pip based installation instructions into something appropriate for the platform.
Other Python implementations are also encouraged to follow these guidelines where applicable.
Policies & Governance
The maintainers of the bootstrapped software and the CPython core team will work together in order to address the needs of both. The bootstrapped software will still remain external to CPython and this PEP does not include CPython subsuming the development responsibilities or design decisions of the bootstrapped software. This PEP aims to decrease the burden on end users wanting to use third-party packages and the decisions inside it are pragmatic ones that represent the trust that the Python community has already placed in the Python Packaging Authority as the authors and maintainers of pip, setuptools, PyPI, virtualenv and other related projects.
Backwards Compatibility
The public API and CLI of the ensurepip module itself will fall under the typical backwards compatibility policy of Python for its standard library. The externally developed software that this PEP bundles does not.
Most importantly, this means that the bootstrapped version of pip may gain new features in CPython maintenance releases, and pip continues to operate on its own 6 month release cycle rather than CPython's 18-24 month cycle.
Security Releases
Any security update that affects the ensurepip module will be shared prior to release with the Python Security Response Team (security@python.org). The PSRT will then decide if the reported issue warrants a security release of CPython with an updated private copy of pip.
Licensing
pip is currently licensed as 1 Clause BSD, and it contains code taken from other projects. Additionally, this PEP bundles setuptools until such time as pip no longer requires it. The licenses for these projects appear in the table below.
| Project | License |
|---|---|
| requests | Apache 2.0 |
| six | 1 Clause BSD |
| html5lib | 1 Clause BSD |
| distlib | PSF |
| colorama | 3 Clause BSD |
| Mozilla CA Bundle | LGPL |
| setuptools | PSF |
All of these licenses should be compatible with the PSF license. Additionally it is unclear if a CA Bundle is copyrightable material and thus if it needs or can be licensed at all.
Appendix: Rejected Proposals
Changing the name of the scripts directory on Windows
Earlier versions of this PEP proposed changing the name of the script installation directory on Windows from "Scripts" to "bin" in order to improve the cross-platform consistency of the virtual environments created by pyvenv.
However, Paul Moore determined that this change was likely backwards incompatible with cross-version Windows installers created with previous versions of Python, so the change has been removed from this PEP [7].
Including ensurepip in Python 2.7, and 3.3
Earlier versions of this PEP made the case that the challenges of getting pip bootstrapped for new users posed a significant enough barrier to Python's future growth that it justified adding ensurepip as a new feature in the upcoming Python 2.7 and 3.3 maintenance releases.
While the proposal to provide pip with Python 3.4 was universally popular, this part of the proposal was highly controversial and ultimately rejected by MvL as BDFL-Delegate.
Accordingly, the proposal to backport ensurepip to Python 2.7 and 3.3 has been removed from this PEP in favour of creating a Windows installer for pip and a possible future PEP suggesting creation of an aggregate installer for Python 2.7 that combines CPython 2.7, pip and the Python Launcher for Windows.
Automatically contacting PyPI when bootstrapping pip
Earlier versions of this PEP called the bootstrapping module getpip and defaulted to downloading and installing pip from PyPI, with the private copy used only as a fallback option or when explicitly requested.
This resulted in several complex edge cases, along with difficulties in defining a clean API and CLI for the bootstrap module. It also significantly altered the default trust model for the binary installers published on python.org, as end users would need to explicitly opt-out of trusting the security of the PyPI ecosystem (rather than opting in to it by explicitly invoking pip following installation).
As a result, the PEP was simplified to the current design, where the bootstrapping always uses the private copy of pip. Contacting PyPI is now always an explicit separate step, with direct access to the full pip interface.
Removing the implicit attempt to access PyPI also made it feasible to invoke ensurepip by default when installing from a custom source build.
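As a sketch of the resulting workflow, the public ensurepip API exposes the version of the bundled pip and an offline bootstrap step, while PyPI access stays a separate, explicit pip invocation (shown here using the module as it shipped in Python 3.4+):

```python
import ensurepip

# Version of the private copy of pip bundled with this CPython build.
print(ensurepip.version())

# ensurepip.bootstrap() (or `python -m ensurepip` from the command line)
# installs that private copy without contacting PyPI. Upgrading from PyPI
# afterwards is an explicit, separate step:
#     python -m pip install --upgrade pip
```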
Implicit bootstrap
PEP 439 [19], the predecessor of this PEP, proposed its own solution: shipping a fake pip command that, when executed, would implicitly bootstrap and install pip if it was not already present. This was rejected because it is too "magical": it hides from the end user exactly when the pip command will be installed, or even that it is being installed at all. It also does not provide any recommendations or considerations towards downstream packagers who wish to manage the globally installed pip through the mechanisms typical for their system.
The implicit bootstrap mechanism also ran into possible permissions issues, if a user inadvertently attempted to bootstrap pip without write access to the appropriate installation directories.
Including pip directly in the standard library
Similar to this PEP is the proposal of just including pip in the standard library. This would ensure that Python always includes pip and fixes all of the end-user-facing problems with not having pip present by default. This has been rejected because we've learned, through the inclusion and history of distutils in the standard library, that losing the ability to update the packaging tools independently can leave the tooling in a state of constant limbo, unable to ever reasonably evolve in a time frame that actually affects users, as any new features will not be available to the general population for years.
Allowing the packaging tools to progress separately from the Python release and adoption schedules allows the improvements to be used by all members of the Python community and not just those able to live on the bleeding edge of Python releases.
There have also been issues in the past with the "dual maintenance" problem if a project continues to be maintained externally while also having a fork maintained in the standard library. Since external maintenance of pip will always be needed to support earlier Python versions, the proposed bootstrapping mechanism will become the explicit responsibility of the CPython core developers (assisted by the pip developers), while pip issues reported to the CPython tracker will be migrated to the pip issue tracker. There will no doubt still be some user confusion over which tracker to use, but hopefully less than has been seen historically when including complete public copies of third-party projects in the standard library.
The approach described in this PEP also avoids some technical issues related to handling CPython maintenance updates when pip has been independently updated to a more recent version. The proposed pip-based bootstrapping mechanism handles that automatically, since pip and the system installer never get into a fight about who owns the pip installation (it is always managed through pip, either directly, or indirectly via the ensurepip bootstrap module).
Finally, the separate bootstrapping step means it is also easy to avoid installing pip at all if end users so desire. This is often the case if integrators are using system packages to handle installation of components written in multiple languages using a common set of tools.
Defaulting to --user installation
Some consideration was given to bootstrapping pip into the per-user site-packages directory by default. However, this behavior would be surprising (as it differs from the default behavior of pip itself) and is also not currently considered reliable (there are some edge cases which are not handled correctly when pip is installed into the user site-packages directory rather than the system site-packages).
References
| [1] | Discussion thread 1 (distutils-sig) (https://mail.python.org/pipermail/distutils-sig/2013-August/022529.html) |
| [2] | Discussion thread 2 (distutils-sig) (https://mail.python.org/pipermail/distutils-sig/2013-September/022702.html) |
| [3] | Discussion thread 3 (python-dev) (https://mail.python.org/pipermail/python-dev/2013-September/128723.html) |
| [4] | Discussion thread 4 (python-dev) (https://mail.python.org/pipermail/python-dev/2013-September/128780.html) |
| [5] | Discussion thread 5 (python-dev) (https://mail.python.org/pipermail/python-dev/2013-September/128894.html) |
| [6] | pip/requests certificate management concerns (https://mail.python.org/pipermail/python-dev/2013-October/129755.html) |
| [7] | Windows installer compatibility concerns (https://mail.python.org/pipermail/distutils-sig/2013-October/022855.html) |
| [8] | Ubuntu <http://www.ubuntu.com/> |
| [9] | Debian <http://www.debian.org> |
| [10] | Fedora <https://fedoraproject.org/> |
| [11] | Homebrew <http://brew.sh/> |
| [12] | MacPorts <http://macports.org> |
| [13] | Fink <http://finkproject.org> |
| [14] | Anaconda <https://store.continuum.io/cshop/anaconda/> |
| [15] | ActivePython <http://www.activestate.com/activepython> |
| [16] | Enthought Canopy <https://www.enthought.com/products/canopy/> |
| [17] | (1, 2) http://www.python.org/dev/peps/pep-0427/ |
| [18] | (1, 2) http://www.pip-installer.org |
| [19] | http://www.python.org/dev/peps/pep-0439/ |
Copyright
This document has been placed in the public domain.
pep-0454 Add a new tracemalloc module to trace Python memory allocations
| PEP: | 454 |
|---|---|
| Title: | Add a new tracemalloc module to trace Python memory allocations |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Victor Stinner <victor.stinner at gmail.com> |
| BDFL-Delegate: | Charles-François Natali <cf.natali@gmail.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 3-September-2013 |
| Python-Version: | 3.4 |
| Resolution: | https://mail.python.org/pipermail/python-dev/2013-November/130491.html |
Contents
Abstract
This PEP proposes to add a new tracemalloc module to trace memory blocks allocated by Python.
Rationale
Classic generic tools like Valgrind can get the C traceback where a memory block was allocated. Using such tools to analyze Python memory allocations does not help because most memory blocks are allocated in the same C function, in PyMem_Malloc() for example. Moreover, Python has an allocator for small objects called "pymalloc" which keeps free blocks for efficiency. This is not well handled by these tools.
There are debug tools dedicated to the Python language, like Heapy, Pympler and Meliae, which list all live objects using the garbage collector module (functions like gc.get_objects(), gc.get_referrers() and gc.get_referents()), compute their size (e.g. using sys.getsizeof()) and group objects by type. These tools provide a better estimation of the memory usage of an application. They are useful when most memory leaks are instances of the same type and this type is only instantiated in a few functions. Problems arise when the object type is very common, like str or tuple, and it is hard to identify where these objects are instantiated.
Finding reference cycles is also a difficult problem. There are different tools to draw a diagram of all references. These tools cannot be used on large applications with thousands of objects because the diagram is too huge to be analyzed manually.
Proposal
Using the customized allocation API from PEP 445, it becomes easy to set up a hook on Python memory allocators. A hook can inspect Python internals to retrieve Python tracebacks. The idea of getting the current traceback comes from the faulthandler module. faulthandler dumps the traceback of all Python threads on a crash; here the idea is to get the traceback of the current Python thread when a memory block is allocated by Python.
This PEP proposes to add a new tracemalloc module, a debug tool to trace memory blocks allocated by Python. The module provides the following information:
- Traceback where an object was allocated
- Statistics on allocated memory blocks per filename and per line number: total size, number and average size of allocated memory blocks
- Computed differences between two snapshots to detect memory leaks
The API of the tracemalloc module is similar to the API of the faulthandler module: enable() / start(), disable() / stop() and is_enabled() / is_tracing() functions, an environment variable (PYTHONFAULTHANDLER and PYTHONTRACEMALLOC), and a -X command line option (-X faulthandler and -X tracemalloc). See the documentation of the faulthandler module.
The idea of tracing memory allocations is not new. It was first implemented in the PySizer project in 2005. PySizer was implemented differently: the traceback was stored in frame objects, and for some Python types the trace was linked with the name of the object type. The PySizer patch on CPython added overhead in performance and memory footprint, even when PySizer was not used. tracemalloc attaches a traceback at the underlying layer, to memory blocks, and has no overhead when the module is not tracing memory allocations.
The tracemalloc module has been written for CPython. Other implementations of Python may not be able to provide it.
API
To trace most memory blocks allocated by Python, the module should be started as early as possible: set the PYTHONTRACEMALLOC environment variable to 1, or use the -X tracemalloc command line option. The tracemalloc.start() function can be called at runtime to start tracing Python memory allocations.
By default, a trace of an allocated memory block only stores the most recent frame (1 frame). To store 25 frames at startup: set the PYTHONTRACEMALLOC environment variable to 25, or use the -X tracemalloc=25 command line option. At runtime, the limit can be set with the nframe parameter of the start() function.
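A minimal sketch of starting the module at runtime with a deeper traceback limit (the allocation workload is arbitrary):

```python
import tracemalloc

# Equivalent to PYTHONTRACEMALLOC=25 or `-X tracemalloc=25`, but started
# from within the program, so earlier allocations are not traced.
tracemalloc.start(25)

data = [dict.fromkeys(range(10)) for _ in range(100)]  # arbitrary workload

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics('lineno')[:3]:
    print(stat)

print('traceback limit:', tracemalloc.get_traceback_limit())
tracemalloc.stop()
```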
Functions
clear_traces() function:
Clear traces of memory blocks allocated by Python.
See also stop().
get_object_traceback(obj) function:
Get the traceback where the Python object obj was allocated. Return a Traceback instance, or None if the tracemalloc module is not tracing memory allocations or did not trace the allocation of the object.
See also gc.get_referrers() and sys.getsizeof() functions.
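For example (the allocated object here is arbitrary; the object must be allocated while tracing is active for a traceback to be available):

```python
import tracemalloc

tracemalloc.start()
obj = list(range(1000))  # allocated while tracing is active
tb = tracemalloc.get_object_traceback(obj)
print(tb)  # Traceback instance: most recent frame of the allocation
tracemalloc.stop()
```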
get_traceback_limit() function:
Get the maximum number of frames stored in the traceback of a trace.
The tracemalloc module must be tracing memory allocations to get the limit, otherwise an exception is raised.
The limit is set by the start() function.
get_traced_memory() function:
Get the current size and maximum size of memory blocks traced by the tracemalloc module as a tuple: (size: int, max_size: int).
get_tracemalloc_memory() function:
Get the memory usage in bytes of the tracemalloc module used to store traces of memory blocks. Return an int.
is_tracing() function:
True if the tracemalloc module is tracing Python memory allocations, False otherwise.
See also start() and stop() functions.
start(nframe: int=1) function:
Start tracing Python memory allocations: install hooks on Python memory allocators. Collected tracebacks of traces will be limited to nframe frames. By default, a trace of a memory block only stores the most recent frame: the limit is 1. nframe must be greater than or equal to 1.
Storing more than 1 frame is only useful to compute statistics grouped by 'traceback' or to compute cumulative statistics: see the Snapshot.compare_to() and Snapshot.statistics() methods.
Storing more frames increases the memory and CPU overhead of the tracemalloc module. Use the get_tracemalloc_memory() function to measure how much memory is used by the tracemalloc module.
The PYTHONTRACEMALLOC environment variable (PYTHONTRACEMALLOC=NFRAME) and the -X tracemalloc=NFRAME command line option can be used to start tracing at startup.
See also stop(), is_tracing() and get_traceback_limit() functions.
stop() function:
Stop tracing Python memory allocations: uninstall hooks on Python memory allocators. This also clears the traces of memory blocks allocated by Python.
Call the take_snapshot() function to take a snapshot of traces before clearing them.
See also start() and is_tracing() functions.
take_snapshot() function:
Take a snapshot of traces of memory blocks allocated by Python. Return a new Snapshot instance.
The snapshot does not include memory blocks allocated before the tracemalloc module started to trace memory allocations.
Tracebacks of traces are limited to get_traceback_limit() frames. Use the nframe parameter of the start() function to store more frames.
The tracemalloc module must be tracing memory allocations to take a snapshot; see the start() function.
See also the get_object_traceback() function.
Filter
Filter(inclusive: bool, filename_pattern: str, lineno: int=None, all_frames: bool=False) class:
Filter on traces of memory blocks.
See the fnmatch.fnmatch() function for the syntax of filename_pattern. The '.pyc' and '.pyo' file extensions are replaced with '.py'.
Examples:
- Filter(True, subprocess.__file__) only includes traces of the subprocess module
- Filter(False, tracemalloc.__file__) excludes traces of the tracemalloc module
- Filter(False, "<unknown>") excludes empty tracebacks
inclusive attribute:
If inclusive is True (include), only trace memory blocks allocated in a file with a name matching filename_pattern at line number lineno.
If inclusive is False (exclude), ignore memory blocks allocated in a file with a name matching filename_pattern at line number lineno.
lineno attribute:
Line number (int) of the filter. If lineno is None, the filter matches any line number.
filename_pattern attribute:
Filename pattern of the filter (str).
all_frames attribute:
If all_frames is True, all frames of the traceback are checked. If all_frames is False, only the most recent frame is checked.
This attribute is ignored if the traceback limit is less than 2. See the get_traceback_limit() function and Snapshot.traceback_limit attribute.
Frame
Frame class:
Frame of a traceback.
The Traceback class is a sequence of Frame instances.
filename attribute:
Filename (str).
lineno attribute:
Line number (int).
Snapshot
Snapshot class:
Snapshot of traces of memory blocks allocated by Python.
The take_snapshot() function creates a snapshot instance.
compare_to(old_snapshot: Snapshot, group_by: str, cumulative: bool=False) method:
Compute the differences with an old snapshot. Get statistics as a sorted list of StatisticDiff instances grouped by group_by.
See the statistics() method for group_by and cumulative parameters.
The result is sorted from the biggest to the smallest by: absolute value of StatisticDiff.size_diff, StatisticDiff.size, absolute value of StatisticDiff.count_diff, Statistic.count and then by StatisticDiff.traceback.
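A short sketch of using compare_to() to spot growth between two snapshots (the deliberately retained list stands in for a real leak):

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

leak = [bytes(1000) for _ in range(100)]  # ~100 KB kept alive on purpose

after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Sorted by absolute size_diff: the deliberate allocation comes out on top.
for stat in after.compare_to(before, 'lineno')[:3]:
    print(stat)
```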
dump(filename) method:
Write the snapshot into a file.
Use load() to reload the snapshot.
filter_traces(filters) method:
Create a new Snapshot instance with a filtered traces sequence; filters is a list of Filter instances. If filters is an empty list, return a new Snapshot instance with a copy of the traces.
All inclusive filters are applied at once: a trace is ignored if no inclusive filter matches it. A trace is also ignored if at least one exclusive filter matches it.
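Putting Filter and filter_traces() together (which filters to apply is application-specific; these mirror the examples given for the Filter class):

```python
import tracemalloc

tracemalloc.start()
data = [bytes(100) for _ in range(100)]  # arbitrary workload
snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

# Exclude the tracemalloc module's own allocations and empty tracebacks.
snapshot = snapshot.filter_traces([
    tracemalloc.Filter(False, tracemalloc.__file__),
    tracemalloc.Filter(False, "<unknown>"),
])
stats = snapshot.statistics('filename')
for stat in stats[:3]:
    print(stat)
```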
load(filename) classmethod:
Load a snapshot from a file.
See also dump().
statistics(group_by: str, cumulative: bool=False) method:
Get statistics as a sorted list of Statistic instances grouped by group_by:
| group_by | description |
|---|---|
| 'filename' | filename |
| 'lineno' | filename and line number |
| 'traceback' | traceback |
If cumulative is True, cumulate size and count of memory blocks of all frames of the traceback of a trace, not only the most recent frame. The cumulative mode can only be used with group_by equal to 'filename' or 'lineno', and with traceback_limit greater than 1.
The result is sorted from the biggest to the smallest by: Statistic.size, Statistic.count and then by Statistic.traceback.
traceback_limit attribute:
Maximum number of frames stored in the traceback of traces: the result of get_traceback_limit() when the snapshot was taken.
traces attribute:
Traces of all memory blocks allocated by Python: sequence of Trace instances.
The sequence has an undefined order. Use the Snapshot.statistics() method to get a sorted list of statistics.
Statistic
Statistic class:
Statistic on memory allocations.
Snapshot.statistics() returns a list of Statistic instances.
See also the StatisticDiff class.
count attribute:
Number of memory blocks (int).
size attribute:
Total size of memory blocks in bytes (int).
traceback attribute:
Traceback where the memory block was allocated, Traceback instance.
StatisticDiff
StatisticDiff class:
Statistic difference on memory allocations between an old and a new Snapshot instance.
Snapshot.compare_to() returns a list of StatisticDiff instances. See also the Statistic class.
count attribute:
Number of memory blocks in the new snapshot (int): 0 if the memory blocks have been released in the new snapshot.
count_diff attribute:
Difference of number of memory blocks between the old and the new snapshots (int): 0 if the memory blocks have been allocated in the new snapshot.
size attribute:
Total size of memory blocks in bytes in the new snapshot (int): 0 if the memory blocks have been released in the new snapshot.
size_diff attribute:
Difference of total size of memory blocks in bytes between the old and the new snapshots (int): 0 if the memory blocks have been allocated in the new snapshot.
traceback attribute:
Traceback where the memory blocks were allocated, Traceback instance.
Trace
Trace class:
Trace of a memory block.
The Snapshot.traces attribute is a sequence of Trace instances.
size attribute:
Size of the memory block in bytes (int).
traceback attribute:
Traceback where the memory block was allocated, Traceback instance.
Traceback
Traceback class:
Sequence of Frame instances sorted from the most recent frame to the oldest frame.
A traceback contains at least 1 frame. If the tracemalloc module failed to get a frame, the filename "<unknown>" at line number 0 is used.
When a snapshot is taken, tracebacks of traces are limited to get_traceback_limit() frames. See the take_snapshot() function.
The Trace.traceback attribute is an instance of the Traceback class.
Rejected Alternatives
Log calls to the memory allocator
A different approach is to log calls to the malloc(), realloc() and free() functions. Calls can be logged into a file or sent to another computer through the network. Example of a log entry: the name of the function, the size of the memory block, the address of the memory block, the Python traceback where the allocation occurred, and a timestamp.
Logs cannot be used directly: getting the current status of the memory requires parsing previous logs. For example, it is not possible to directly get the traceback of a Python object, as get_object_traceback(obj) does with traces.
Python uses objects with very short lifetimes and so makes extensive use of memory allocators. It has an allocator optimized for small objects (less than 512 bytes) with a short lifetime. For example, the Python test suite calls malloc(), realloc() or free() 270,000 times per second on average. If the size of a log entry is 32 bytes, logging produces 8.2 MB per second, or 29.0 GB per hour.
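The throughput figures follow directly from the stated call rate and entry size:

```python
calls_per_second = 270_000  # malloc/realloc/free calls per second (from the PEP)
entry_size = 32             # assumed bytes per log entry

mib_per_second = calls_per_second * entry_size / 2**20
gib_per_hour = mib_per_second * 3600 / 2**10
print(round(mib_per_second, 1))  # 8.2
print(round(gib_per_hour, 1))    # 29.0
```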
The alternative was rejected because it is less efficient and has fewer features. Parsing logs in a different process or on a different computer is slower than maintaining traces on allocated memory blocks in the same process.
Prior Work
- Python Memory Validator (2005-2013): commercial Python memory validator developed by Software Verification. It uses the Python Reflection API.
- PySizer: Google Summer of Code 2005 project by Nick Smallbone.
- Heapy (2006-2013): part of the Guppy-PE project written by Sverker Nilsson.
- Draft PEP: Support Tracking Low-Level Memory Usage in CPython (Brett Cannon, 2006)
- Muppy: project developed in 2008 by Robert Schuppenies.
- asizeof: a pure Python module to estimate the size of objects by Jean Brouwers (2008).
- Heapmonitor: It provides facilities to size individual objects and can track all objects of certain classes. It was developed in 2008 by Ludwig Haehne.
- Pympler (2008-2011): project based on asizeof, muppy and HeapMonitor
- objgraph (2008-2012)
- Dozer: WSGI Middleware version of the CherryPy memory leak debugger, written by Marius Gedminas (2008-2013)
- Meliae: Python Memory Usage Analyzer developed by John A Meinel since 2009
- gdb-heap: gdb script written in Python by Dave Malcolm (2010-2011) to analyze the usage of the heap memory
- memory_profiler: written by Fabian Pedregosa (2011-2013)
- caulk: written by Ben Timby in 2012
See also Pympler Related Work.
Links
tracemalloc:
Copyright
This document has been placed in the public domain.
pep-0455 Adding a key-transforming dictionary to collections
| PEP: | 455 |
|---|---|
| Title: | Adding a key-transforming dictionary to collections |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Antoine Pitrou <solipsis at pitrou.net> |
| BDFL-Delegate: | Raymond Hettinger |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 13-Sep-2013 |
| Python-Version: | 3.5 |
| Post-History: |
Contents
Abstract
This PEP proposes a new data structure for the collections module, called "TransformDict" in this PEP. This structure is a mutable mapping which transforms the key using a given function when doing a lookup, but retains the original key when reading.
Rejection
See the rationale at https://mail.python.org/pipermail/python-dev/2015-May/140003.html and, for an earlier partial review, see https://mail.python.org/pipermail/python-dev/2013-October/129937.html .
Rationale
Numerous specialized versions of this pattern exist. The most common is a case-insensitive case-preserving dict, i.e. a dict-like container which matches keys in a case-insensitive fashion but retains the original casing. It is a very common need in network programming, as many protocols feature some arrays of "key / value" properties in their messages, where the keys are textual strings whose case is specified to be ignored on receipt but by either specification or custom is to be preserved or non-trivially canonicalized when retransmitted.
Another common request is an identity dict, where keys are matched according to their respective id()s instead of normal matching.
Both are instances of a more general pattern, where a given transformation function is applied to keys when looking them up: that function being str.lower or str.casefold in the former example and the built-in id function in the latter.
(It could be said that the pattern projects keys from the user-visible set onto the internal lookup set.)
Semantics
TransformDict is a MutableMapping implementation: it faithfully implements the well-known API of mutable mappings, like dict itself and other dict-like classes in the standard library. Therefore, this PEP won't rehash the semantics of most TransformDict methods.
The transformation function needn't be bijective; it can be strictly surjective, as in the case-insensitive example (in other words, different keys can look up the same value):
>>> d = TransformDict(str.casefold)
>>> d['SomeKey'] = 5
>>> d['somekey']
5
>>> d['SOMEKEY']
5
TransformDict retains the first key used when creating an entry:
>>> d = TransformDict(str.casefold)
>>> d['SomeKey'] = 1
>>> d['somekey'] = 2
>>> list(d.items())
[('SomeKey', 2)]
The original keys needn't be hashable, as long as the transformation function returns a hashable one:
>>> d = TransformDict(id)
>>> l = [None]
>>> d[l] = 5
>>> l in d
True
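Since the PEP was rejected, TransformDict never landed in collections. The semantics above can nevertheless be sketched on top of collections.abc.MutableMapping; the internal storage layout below is an assumption for illustration, not the proposed patch:

```python
from collections.abc import MutableMapping

class TransformDict(MutableMapping):
    """Hypothetical sketch: maps transform(key) -> (original key, value)."""

    def __init__(self, transform_func, *args, **kwargs):
        self._transform = transform_func
        self._data = {}
        self.update(*args, **kwargs)

    @property
    def transform_func(self):
        return self._transform

    def __getitem__(self, key):
        return self._data[self._transform(key)][1]

    def __setitem__(self, key, value):
        tkey = self._transform(key)
        if tkey in self._data:
            key = self._data[tkey][0]  # retain the first user-supplied key
        self._data[tkey] = (key, value)

    def __delitem__(self, key):
        del self._data[self._transform(key)]

    def __iter__(self):
        return iter(original for original, _ in self._data.values())

    def __len__(self):
        return len(self._data)

    def getitem(self, key):
        """Return the stored (original key, value) pair, like the PEP's API."""
        return self._data[self._transform(key)]
```

Storing the (original key, value) pair under the transformed key is what makes both first-key retention and getitem() cheap, and it allows unhashable original keys as long as the transformed key is hashable.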
Constructor
As shown in the examples above, creating a TransformDict requires passing the key transformation function as the first argument (much like creating a defaultdict requires passing the factory function as first argument).
The constructor also takes other optional arguments which can be used to initialize the TransformDict with certain key-value pairs. Those optional arguments are the same as in the dict and defaultdict constructors:
>>> d = TransformDict(str.casefold, [('Foo', 1)], Bar=2)
>>> sorted(d.items())
[('Bar', 2), ('Foo', 1)]
Getting the original key
TransformDict also features a lookup method returning the stored key together with the corresponding value:
>>> d = TransformDict(str.casefold, {'Foo': 1})
>>> d.getitem('FOO')
('Foo', 1)
>>> d.getitem('bar')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 'bar'
The method name getitem() follows the standard popitem() method on mutable mappings.
Getting the transformation function
TransformDict has a simple read-only property transform_func which gives back the transformation function.
Alternative proposals and questions
Retaining the last original key
Most python-dev respondents found retaining the first user-supplied key more intuitive than retaining the last. Also, it matches the dict object's own behaviour when using different but equal keys:
>>> d = {}
>>> d[1] = 'hello'
>>> d[1.0] = 'world'
>>> d
{1: 'world'}
Furthermore, explicitly retaining the last key in a first-key-retaining scheme is still possible using the following approach:
d.pop(key, None)
d[key] = value
while the converse (retaining the first key in a last-key-retaining scheme) doesn't look possible without rewriting part of the container's code.
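The pop-then-assign trick can be tried with a plain dict and equal-but-distinct keys (1 and 1.0), since dict is also first-key-retaining:

```python
d = {1: 'hello'}
d[1.0] = 'world'       # value updated, original key retained
print(d)               # {1: 'world'}

d.pop(1.0, None)       # drop the entry first...
d[1.0] = 'world'       # ...then reinsert, so the new key is retained
print(d)               # {1.0: 'world'}
```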
Using an encoder / decoder pair
Using a function pair isn't necessary, since the original key is retained by the container. Moreover, an encoder / decoder pair would require the transformation to be bijective, which prevents important use cases like case-insensitive matching.
Providing a transformation function for values
Dictionary values are not used for lookup, their semantics are totally irrelevant to the container's operation. Therefore, there is no point in having both an "original" and a "transformed" value: the transformed value wouldn't be used for anything.
Providing a specialized container, not generic
It was asked why we would provide the generic TransformDict construct rather than a specialized case-insensitive dict variant. The answer is that it's nearly as cheap (code-wise and performance-wise) to provide the generic construct, and it can fill more use cases.
Even case-insensitive dicts can actually elicit different transformation functions: str.lower, str.casefold or, in some cases, bytes.lower when working with text encoded in an ASCII-compatible encoding.
Other constructor patterns
Two other constructor patterns were proposed by Serhiy Storchaka:
A type factory scheme:
d = TransformDict(str.casefold)(Foo=1)
A subclassing scheme:
class CaseInsensitiveDict(TransformDict):
    __transform__ = str.casefold

d = CaseInsensitiveDict(Foo=1)
While both approaches can be defended, they don't follow established practices in the standard library, and therefore were rejected.
Implementation
A patch for the collections module is tracked on the bug tracker at http://bugs.python.org/issue18986.
Existing work
Case-insensitive dicts are a popular request:
- http://twistedmatrix.com/documents/current/api/twisted.python.util.InsensitiveDict.html
- https://mail.python.org/pipermail/python-list/2013-May/647243.html
- https://mail.python.org/pipermail/python-list/2005-April/296208.html
- https://mail.python.org/pipermail/python-list/2004-June/241748.html
- http://bugs.python.org/msg197376
- http://stackoverflow.com/a/2082169
- http://stackoverflow.com/a/3296782
- http://code.activestate.com/recipes/66315-case-insensitive-dictionary/
- https://gist.github.com/babakness/3901174
- http://www.wikier.org/blog/key-insensitive-dictionary-in-python
- http://en.sharejs.com/python/14534
- http://www.voidspace.org.uk/python/archive.shtml#caseless
Identity dicts have been requested too:
- https://mail.python.org/pipermail/python-ideas/2010-May/007235.html
- http://www.gossamer-threads.com/lists/python/python/209527
Several modules in the standard library use identity lookups for object memoization, for example pickle, json, copy, cProfile, doctest and _threading_local.
Other languages
C# / .Net
.Net has a generic Dictionary class where you can specify a custom IEqualityComparer: http://msdn.microsoft.com/en-us/library/xfhwa508.aspx
Using it is the recommended way to write case-insensitive dictionaries: http://stackoverflow.com/questions/13230414/case-insensitive-access-for-generic-dictionary
Java
Java has a specialized CaseInsensitiveMap: http://commons.apache.org/proper/commons-collections/apidocs/org/apache/commons/collections4/map/CaseInsensitiveMap.html
It also has a separate IdentityHashMap: http://docs.oracle.com/javase/6/docs/api/java/util/IdentityHashMap.html
C++
The C++ Standard Template Library features an unordered_map with customizable hash and equality functions: http://www.cplusplus.com/reference/unordered_map/unordered_map/
Copyright
This document has been placed in the public domain.
pep-0456 Secure and interchangeable hash algorithm
| PEP: | 456 |
|---|---|
| Title: | Secure and interchangeable hash algorithm |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Christian Heimes <christian at python.org> |
| BDFL-Delegate: | Nick Coghlan |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 27-Sep-2013 |
| Python-Version: | 3.4 |
| Post-History: | 06-Oct-2013, 14-Nov-2013, 20-Nov-2013 |
| Resolution: | https://mail.python.org/pipermail/python-dev/2013-November/130400.html |
Contents
- Abstract
- Rationale
- Requirements for a hash function
- Current implementation with modified FNV
- Examined hashing algorithms
- Small string optimization
- C API additions
- Python API addition
- Necessary modifications to C code
- Performance
- Backwards Compatibility
- Alternative counter measures against hash collision DoS
- Discussion
- References
- Copyright
Abstract
This PEP proposes SipHash as default string and bytes hash algorithm to properly fix hash randomization once and for all. It also proposes modifications to Python's C code in order to unify the hash code and to make it easily interchangeable.
Rationale
Despite the last attempt [issue13703], CPython is still vulnerable to hash collision DoS attacks [29c3] [issue14621]. The current hash algorithm and its randomization are not resilient against attacks. Only a proper cryptographic hash function prevents the extraction of secret randomization keys. Although no practical attack against a Python-based service has been seen yet, the weakness has to be fixed. Jean-Philippe Aumasson and Daniel J. Bernstein have already shown how the seed for the current implementation can be recovered [poc].
Furthermore the current hash algorithm is hard-coded and implemented multiple times for bytes and three different Unicode representations UCS1, UCS2 and UCS4. This makes it impossible for embedders to replace it with a different implementation without patching and recompiling large parts of the interpreter. Embedders may want to choose a more suitable hash function.
Finally, the current implementation does not perform well. In the common case it processes only one or two bytes per cycle. On a modern 64-bit processor the code can easily be adjusted to deal with eight bytes at once.
This PEP proposes four major changes to the hash code for strings and bytes:
- SipHash [sip] is introduced as the default hash algorithm. It is fast and small despite its cryptographic properties. Because it was designed by well-known security and crypto experts, it is safe to assume that it is secure for the near future.
- The existing FNV code is kept for platforms without a 64-bit data type. The algorithm is optimized to process larger chunks per cycle.
- Calculation of the hash of strings and bytes is moved into a single API function instead of multiple specialized implementations in Objects/object.c and Objects/unicodeobject.c. The function takes a void pointer plus length and returns the hash for it.
- The algorithm can be selected at compile time. FNV is guaranteed to exist on all platforms. SipHash is available on the majority of modern systems.
Requirements for a hash function
- It MUST be able to hash arbitrarily large blocks of memory from 1 byte up to the maximum ssize_t value.
- It MUST produce at least 32 bits on 32-bit platforms and at least 64 bits on 64-bit platforms. (Note: Larger outputs can be compressed with e.g. v ^ (v >> 32).)
- It MUST support hashing of unaligned memory in order to support hash(memoryview).
- It is highly RECOMMENDED that the length of the input influences the outcome, so that hash(b'\00') != hash(b'\x00\x00').
The internal interface code between the hash function and the tp_hash slots implements special cases for zero-length input and a return value of -1. An input of length 0 is mapped to hash value 0. An output of -1 is mapped to -2.
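The special-case mapping just described can be sketched as a small helper (hypothetical; the real logic lives in the C glue code between the hash function and the tp_hash slots):

```python
def postprocess(raw_hash, length):
    """Apply the tp_hash interface's special cases to a raw hash value."""
    if length == 0:
        return 0    # zero-length input is always mapped to hash value 0
    if raw_hash == -1:
        return -2   # -1 is reserved as the tp_hash error return value
    return raw_hash
```
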
Current implementation with modified FNV
CPython currently uses a variant of the Fowler-Noll-Vo hash function [fnv]. The variant has been modified to reduce the amount and cost of hash collisions for common strings. The first character of the string is added twice, the first time with a bit shift of 7. The length of the input string is XOR-ed into the final value. Both deviations from the original FNV algorithm reduce the amount of hash collisions for short strings.
Recently [issue13703] a random prefix and suffix were added as an attempt to randomize the hash values. In order to protect the hash secret the code still returns 0 for zero length input.
C code:
Py_uhash_t x;
Py_ssize_t len;
Py_ssize_t i;
/* p is either 1, 2 or 4 byte type */
unsigned char *p;
Py_UCS2 *p;
Py_UCS4 *p;

if (len == 0)
    return 0;
x = (Py_uhash_t) _Py_HashSecret.prefix;
x ^= (Py_uhash_t) *p << 7;
for (i = 0; i < len; i++)
    x = (1000003 * x) ^ (Py_uhash_t) *p++;
x ^= (Py_uhash_t) len;
x ^= (Py_uhash_t) _Py_HashSecret.suffix;
return x;
Which roughly translates to Python:
def fnv(p):
    if len(p) == 0:
        return 0
    # bit mask, 2**32-1 or 2**64-1
    mask = 2 * sys.maxsize + 1
    x = hashsecret.prefix
    x = (x ^ (ord(p[0]) << 7)) & mask
    for c in p:
        x = ((1000003 * x) ^ ord(c)) & mask
    x = (x ^ len(p)) & mask
    x = (x ^ hashsecret.suffix) & mask
    if x == -1:
        x = -2
    return x
FNV is a simple multiply and XOR algorithm with no cryptographic properties. The randomization was not part of the initial hash code, but was added as a countermeasure against hash collision attacks as explained in oCERT-2011-003 [ocert]. Because FNV is not a cryptographic hash algorithm and the dict implementation is not fortified against side channel analysis, the randomization secrets can be calculated by a remote attacker. The author of this PEP strongly believes that the nature of a non-cryptographic hash function makes it impossible to conceal the secrets.
Examined hashing algorithms
The author of this PEP has researched several hashing algorithms that are considered modern, fast and state-of-the-art.
SipHash
SipHash [sip] is a cryptographic pseudo random function with a 128-bit seed and 64-bit output. It was designed by Jean-Philippe Aumasson and Daniel J. Bernstein as a fast and secure keyed hash algorithm. It's used by Ruby, Perl, OpenDNS, Rust, Redis, FreeBSD and more. The C reference implementation has been released under CC0 license (public domain).
Quote from SipHash's site:
SipHash is a family of pseudorandom functions (a.k.a. keyed hash functions) optimized for speed on short messages. Target applications include network traffic authentication and defense against hash-flooding DoS attacks.
siphash24 is the recommended variant with the best performance. It uses 2 rounds per message block and 4 finalization rounds. Besides the reference implementation, several other implementations are available. Some are single-shot functions, others use a Merkle–Damgård-like construction with init, update and finalize functions. Marek Majkowski's C implementation csiphash [csiphash] defines the prototype of the function. (Note: k is split up into two uint64_t):
uint64_t siphash24(const void *src, unsigned long src_sz, const char k[16])
SipHash requires a 64-bit data type and is not compatible with pure C89 platforms.
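For illustration, here is a minimal pure-Python sketch of SipHash-2-4 (2 compression rounds per block, 4 finalization rounds), following the published algorithm; this is an educational model, not CPython's C implementation:

```python
def _rotl64(x, b):
    """Rotate a 64-bit value left by b bits."""
    return ((x << b) | (x >> (64 - b))) & 0xFFFFFFFFFFFFFFFF

def siphash24(key, data):
    """SipHash-2-4 sketch: 16-byte key, bytes input, 64-bit output."""
    mask = 0xFFFFFFFFFFFFFFFF
    k0 = int.from_bytes(key[:8], "little")
    k1 = int.from_bytes(key[8:16], "little")
    # initialization constants from the SipHash specification
    v0 = k0 ^ 0x736F6D6570736575
    v1 = k1 ^ 0x646F72616E646F6D
    v2 = k0 ^ 0x6C7967656E657261
    v3 = k1 ^ 0x7465646279746573

    def sipround():
        nonlocal v0, v1, v2, v3
        v0 = (v0 + v1) & mask; v1 = _rotl64(v1, 13); v1 ^= v0; v0 = _rotl64(v0, 32)
        v2 = (v2 + v3) & mask; v3 = _rotl64(v3, 16); v3 ^= v2
        v0 = (v0 + v3) & mask; v3 = _rotl64(v3, 21); v3 ^= v0
        v2 = (v2 + v1) & mask; v1 = _rotl64(v1, 17); v1 ^= v2; v2 = _rotl64(v2, 32)

    n = len(data)
    # compress each full 8-byte block with 2 rounds
    for i in range(0, n - n % 8, 8):
        m = int.from_bytes(data[i:i + 8], "little")
        v3 ^= m
        sipround(); sipround()
        v0 ^= m
    # the final block carries the input length in its most significant byte
    last = int.from_bytes(data[n - n % 8:], "little") | ((n & 0xFF) << 56)
    v3 ^= last
    sipround(); sipround()
    v0 ^= last
    # finalization: 4 rounds
    v2 ^= 0xFF
    for _ in range(4):
        sipround()
    return v0 ^ v1 ^ v2 ^ v3
```

With the reference key `00 01 02 ... 0f` and an empty message, this reproduces the first test vector from the SipHash specification.
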
MurmurHash
MurmurHash [murmur] is a family of non-cryptographic keyed hash functions developed by Austin Appleby. Murmur3 is the latest and fastest variant of MurmurHash. The C++ reference implementation has been released into the public domain. It features 32- or 128-bit output with a 32-bit seed. (Note: The out parameter is a buffer of either 4 or 16 bytes.)
Murmur3's function prototypes are:
void MurmurHash3_x86_32(const void *key, int len, uint32_t seed, void *out)
void MurmurHash3_x86_128(const void *key, int len, uint32_t seed, void *out)
void MurmurHash3_x64_128(const void *key, int len, uint32_t seed, void *out)
The 128-bit variants require a 64-bit data type and are not compatible with pure C89 platforms. The 32-bit variant is fully C89-compatible.
Aumasson, Bernstein and Boßlet have shown [sip] [ocert-2012-001] that Murmur3 is not resilient against hash collision attacks. Therefore Murmur3 can no longer be considered a secure algorithm. It may still be an alternative if hash collision attacks are of no concern.
CityHash
CityHash [city] is a family of non-cryptographic hash functions developed by Geoff Pike and Jyrki Alakuijala for Google. The C++ reference implementation has been released under the MIT license. The algorithm is partly based on MurmurHash and claims to be faster. It supports 64- and 128-bit output with a 128-bit seed as well as 32-bit output without a seed.
The relevant function prototype for 64-bit CityHash with 128-bit seed is:
uint64 CityHash64WithSeeds(const char *buf, size_t len, uint64 seed0,
                           uint64 seed1)
CityHash also offers SSE 4.2 optimizations with CRC32 intrinsic for long inputs. All variants except CityHash32 require 64-bit data types. CityHash32 uses only 32-bit data types but it doesn't support seeding.
As with MurmurHash, Aumasson, Bernstein and Boßlet have shown [sip] a similar weakness in CityHash.
DJBX33A
DJBX33A is a very simple multiplication and addition algorithm by Daniel J. Bernstein. It is fast and has low setup costs but it's not secure against hash collision attacks. Its properties make it a viable choice for small string hashing optimization.
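A minimal sketch of DJBX33A ("hash times 33 plus c", starting from Bernstein's traditional seed 5381, masked to 64 bits here for illustration):

```python
def djbx33a(data, seed=5381):
    """DJBX33A sketch: multiply by 33, add each byte, keep 64 bits."""
    h = seed
    for byte in data:
        h = (h * 33 + byte) & 0xFFFFFFFFFFFFFFFF
    return h
```

The per-byte work is a single multiply and add with no setup or finalization, which is exactly why it is attractive for very short strings and useless against collision attacks.
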
Other
Crypto algorithms such as HMAC, MD5, SHA-1 or SHA-2 are too slow and have high setup and finalization costs. For these reasons they are not considered fit for this purpose. Modern AMD and Intel CPUs have AES-NI (AES instruction set) [aes-ni] to speed up AES encryption. CMAC with AES-NI might be a viable option but it's probably too slow for daily operation. (testing required)
Conclusion
SipHash provides the best combination of speed and security. Developers of other prominent projects have come to the same conclusion.
Small string optimization
Hash functions like SipHash24 have costly initialization and finalization code that can dominate the speed of the algorithm for very short strings. On the other hand, Python calculates the hash value of short strings quite often. A simple and fast function, especially for hashing small strings, can make a measurable impact on performance. For example, these measurements were taken during a run of Python's regression tests. Additional measurements of other code have shown a similar distribution.
| bytes | hash() calls | portion |
|---|---|---|
| 1 | 18709 | 0.2% |
| 2 | 737480 | 9.5% |
| 3 | 636178 | 17.6% |
| 4 | 1518313 | 36.7% |
| 5 | 643022 | 44.9% |
| 6 | 770478 | 54.6% |
| 7 | 525150 | 61.2% |
| 8 | 304873 | 65.1% |
| 9 | 297272 | 68.8% |
| 10 | 68191 | 69.7% |
| 11 | 1388484 | 87.2% |
| 12 | 480786 | 93.3% |
| 13 | 52730 | 93.9% |
| 14 | 65309 | 94.8% |
| 15 | 44245 | 95.3% |
| 16 | 85643 | 96.4% |
| Total | 7921678 | |
However, a fast function like DJBX33A is not as secure as SipHash24. A cutoff at about 5 to 7 bytes should provide a decent safety margin and speedup at the same time. The PEP's reference implementation provides such a cutoff with Py_HASH_CUTOFF. The optimization is disabled by default for several reasons. For one, the security implications are still unclear and should be thoroughly studied before the optimization is enabled by default. Secondly, the performance benefits vary. On a 64-bit Linux system with an Intel Core i7, multiple runs of Python's benchmark suite [pybench] show average speedups between 3% and 5% for benchmarks such as django_v2, mako and etree with a cutoff of 7. Benchmarks with X86 binaries and Windows X86_64 builds on the same machine are a bit slower with small string optimization.
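The cutoff dispatch can be sketched as follows (the cutoff value and helper names are illustrative; the real dispatch happens in C and the zero-length case is handled separately, as described earlier):

```python
HASH_CUTOFF = 7  # illustrative value; Py_HASH_CUTOFF is a compile-time setting

def hash_bytes(data, djbx33a, siphash24):
    """Route tiny inputs to the cheap hash, everything else to SipHash24."""
    if 0 < len(data) < HASH_CUTOFF:
        return djbx33a(data)   # low setup cost wins on tiny inputs
    return siphash24(data)     # secure keyed hash for everything else
```
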
The state of small string optimization will be assessed during the beta phase of Python 3.4. The feature will either be enabled with appropriate values or the code will be removed before beta 2 is released.
C API additions
None of the C API modifications are part of the stable API.
hash secret
The _Py_HashSecret_t type of Python 2.6 to 3.3 has two members with either 32- or 64-bit length each. SipHash requires two 64-bit unsigned integers as keys. The typedef will be changed to a union with a guaranteed size of 24 bytes on all architectures. The union provides a 128-bit random key for SipHash24 and FNV as well as an additional 64-bit value for the optional small string optimization and the pyexpat seed. The additional 64-bit seed ensures that pyexpat or the small string optimization cannot reveal bits of the SipHash24 seed.
memory layout on 64 bit systems:
cccccccc cccccccc cccccccc  uc -- unsigned char[24]
pppppppp ssssssss ........  fnv -- two Py_hash_t
k0k0k0k0 k1k1k1k1 ........  siphash -- two PY_UINT64_T
........ ........ ssssssss  djbx33a -- 16 bytes padding + one Py_hash_t
........ ........ eeeeeeee  pyexpat XML hash salt
memory layout on 32 bit systems:
cccccccc cccccccc cccccccc  uc -- unsigned char[24]
ppppssss ........ ........  fnv -- two Py_hash_t
k0k0k0k0 k1k1k1k1 ........  siphash -- two PY_UINT64_T (if available)
........ ........ ssss....  djbx33a -- 16 bytes padding + one Py_hash_t
........ ........ eeee....  pyexpat XML hash salt
new type definition:
typedef union {
    /* ensure 24 bytes */
    unsigned char uc[24];
    /* two Py_hash_t for FNV */
    struct {
        Py_hash_t prefix;
        Py_hash_t suffix;
    } fnv;
#ifdef PY_UINT64_T
    /* two uint64 for SipHash24 */
    struct {
        PY_UINT64_T k0;
        PY_UINT64_T k1;
    } siphash;
#endif
    /* a different (!) Py_hash_t for small string optimization */
    struct {
        unsigned char padding[16];
        Py_hash_t suffix;
    } djbx33a;
    struct {
        unsigned char padding[16];
        Py_hash_t hashsalt;
    } expat;
} _Py_HashSecret_t;
PyAPI_DATA(_Py_HashSecret_t) _Py_HashSecret;
_Py_HashSecret_t is initialized in Python/random.c:_PyRandom_Init() exactly once at startup.
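The 24-byte layout of the union can be checked with a ctypes sketch (purely illustrative, not CPython code; Py_hash_t is modeled as c_ssize_t and PY_UINT64_T as c_uint64):

```python
import ctypes

class _Fnv(ctypes.Structure):
    _fields_ = [("prefix", ctypes.c_ssize_t), ("suffix", ctypes.c_ssize_t)]

class _SipHash(ctypes.Structure):
    _fields_ = [("k0", ctypes.c_uint64), ("k1", ctypes.c_uint64)]

class _Djbx33a(ctypes.Structure):
    _fields_ = [("padding", ctypes.c_ubyte * 16), ("suffix", ctypes.c_ssize_t)]

class _Expat(ctypes.Structure):
    _fields_ = [("padding", ctypes.c_ubyte * 16), ("hashsalt", ctypes.c_ssize_t)]

class PyHashSecret(ctypes.Union):
    """Model of _Py_HashSecret_t: all members overlay the same 24 bytes."""
    _fields_ = [
        ("uc", ctypes.c_ubyte * 24),  # guarantees 24 bytes overall
        ("fnv", _Fnv),
        ("siphash", _SipHash),
        ("djbx33a", _Djbx33a),
        ("expat", _Expat),
    ]
```

Because djbx33a and expat place their Py_hash_t after 16 bytes of padding, they never overlap the SipHash24 key, matching the memory diagrams above.
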
hash function definition
Implementation:
typedef struct {
    /* function pointer to hash function, e.g. fnv or siphash24 */
    Py_hash_t (*const hash)(const void *, Py_ssize_t);
    const char *name;    /* name of the hash algorithm and variant */
    const int hash_bits; /* internal size of hash value */
    const int seed_bits; /* size of seed input */
} PyHash_FuncDef;
PyAPI_FUNC(PyHash_FuncDef*) PyHash_GetFuncDef(void);
autoconf
A new test is added to the configure script. The test sets HAVE_ALIGNED_REQUIRED when it detects a platform that requires aligned memory access for integers. Most current platforms such as X86, X86_64 and modern ARM don't need aligned data.
A new option --with-hash-algorithm enables the user to select a hash algorithm in the configure step.
hash function selection
The value of the macro Py_HASH_ALGORITHM defines which hash algorithm is used internally. It may be set to any of the three values Py_HASH_SIPHASH24, Py_HASH_FNV or Py_HASH_EXTERNAL. If Py_HASH_ALGORITHM is not defined at all, then the best available algorithm is selected. On platforms that don't require aligned memory access (HAVE_ALIGNED_REQUIRED not defined) and that have an unsigned 64-bit integer type PY_UINT64_T, SipHash24 is used. On strict C89 platforms without a 64-bit data type, or on architectures such as SPARC, FNV is selected as the fallback. A hash algorithm can be selected with an autoconf option, for example ./configure --with-hash-algorithm=fnv.
The value Py_HASH_EXTERNAL allows 3rd parties to provide their own implementation at compile time.
Implementation:
#if Py_HASH_ALGORITHM == Py_HASH_EXTERNAL
extern PyHash_FuncDef PyHash_Func;
#elif Py_HASH_ALGORITHM == Py_HASH_SIPHASH24
static PyHash_FuncDef PyHash_Func = {siphash24, "siphash24", 64, 128};
#elif Py_HASH_ALGORITHM == Py_HASH_FNV
static PyHash_FuncDef PyHash_Func = {fnv, "fnv", 8 * sizeof(Py_hash_t),
                                     16 * sizeof(Py_hash_t)};
#endif
Python API addition
sys module
The sys module already has a hash_info struct sequence. More fields are added to the object to reflect the active hash algorithm and its properties.
sys.hash_info(width=64,
              modulus=2305843009213693951,
              inf=314159,
              nan=0,
              imag=1000003,
              # new fields:
              algorithm='siphash24',
              hash_bits=64,
              seed_bits=128,
              cutoff=0)
Necessary modifications to C code
_Py_HashBytes() (Objects/object.c)
_Py_HashBytes is an internal helper function that provides the hashing code for bytes, memoryview and datetime classes. It currently implements FNV for unsigned char *.
The function is moved to Python/pyhash.c and modified to call the hash function through PyHash_Func.hash(). The function signature is altered to take a const void * as its first argument. _Py_HashBytes also takes care of the special cases: it maps zero-length input to 0 and a return value of -1 to -2.
bytes_hash() (Objects/bytesobject.c)
bytes_hash uses _Py_HashBytes to provide the tp_hash slot function for bytes objects. The function will continue to use _Py_HashBytes but without a type cast.
memory_hash() (Objects/memoryobject.c)
memory_hash provides the tp_hash slot function for read-only memory views if the original object is hashable, too. It's the only function that has to support hashing of unaligned memory segments in the future. The function will continue to use _Py_HashBytes but without a type cast.
unicode_hash() (Objects/unicodeobject.c)
unicode_hash provides the tp_hash slot function for unicode. Right now it implements the FNV algorithm three times, for unsigned char*, Py_UCS2 and Py_UCS4. A reimplementation of the function must take care to use the correct length. Since the macro PyUnicode_GET_LENGTH returns the length of the unicode string and not its size in octets, the length must be multiplied by the size of the internal unicode kind:
if (PyUnicode_READY(u) == -1)
    return -1;
x = _Py_HashBytes(PyUnicode_DATA(u),
                  PyUnicode_GET_LENGTH(u) * PyUnicode_KIND(u));
generic_hash() (Modules/_datetimemodule.c)
generic_hash acts as a wrapper around _Py_HashBytes for the tp_hash slots of the date, time and datetime types. timedelta objects are hashed by their state (days, seconds, microseconds) and tzinfo objects are not hashable. The data members of the date, time and datetime types' structs are not void* aligned. This can easily be fixed by memcpy()ing four to ten bytes to an aligned buffer.
Performance
In general the PEP 456 code with SipHash24 is about as fast as the old code with FNV. SipHash24 seems to make better use of modern compilers, CPUs and large L1 caches. Several benchmarks show a small speed improvement on 64-bit CPUs such as Intel Core i5 and Intel Core i7 processors. 32-bit builds and benchmarks on older CPUs such as an AMD Athlon X2 are slightly slower with SipHash24. The performance differences are so small that they should not affect any application code.
The benchmarks were conducted on CPython default branch revision b08868fd5994 and the PEP repository [pep-456-repos]. All upstream changes were merged into the pep-456 branch. The "performance" CPU governor was configured and almost all programs were stopped so the benchmarks were able to utilize TurboBoost and the CPU caches as much as possible. The raw benchmark results of multiple machines and platforms are made available at [benchmarks].
Hash value distribution
A good distribution of hash values is important for dict and set performance. Both SipHash24 and FNV take the length of the input into account, so that strings made up entirely of NULL bytes don't have the same hash value. The last bytes of the input tend to affect the least significant bits of the hash value, too. That attribute reduces the amount of hash collisions for strings with a common prefix.
Typical length
Serhiy Storchaka has shown in [issue16427] that a modified FNV implementation with 64 bits per cycle is able to process long strings several times faster than the current FNV implementation.
However, according to statistics [issue19183], a typical Python program as well as the Python test suite hash mostly small strings: about 50% of hashed strings are between 1 and 6 bytes long. Only 5% of the strings are larger than 16 bytes.
Grand Unified Python Benchmark Suite
Initial tests with an experimental implementation and the Grand Unified Python Benchmark Suite have shown minimal deviations. The summarized total runtime of the benchmark is within 1% of the runtime of an unmodified Python 3.4 binary. The tests were run on an Intel i7-2860QM machine with a 64-bit Linux installation. The interpreter was compiled with GCC 4.7 for 64- and 32-bit.
More benchmarks will be conducted.
Backwards Compatibility
The modifications don't alter any existing API.
The output of hash() for strings and bytes is going to be different. The hash values for ASCII Unicode and ASCII bytes will stay equal.
Alternative counter measures against hash collision DoS
Three alternative countermeasures against hash collisions were discussed in the past, but are not the subject of this PEP.
- Marc-Andre Lemburg has suggested that dicts shall count hash collisions. In case an insert operation causes too many collisions an exception shall be raised.
- Some applications (e.g. PHP) limit the number of keys for GET and POST HTTP requests. The approach effectively limits the impact of a hash collision attack. (XXX citation needed)
- Hash maps have a worst case of O(n) for insertion and lookup of keys. This results in a quadratic runtime during a hash collision attack. The introduction of a new, additional data structure with O(log n) worst case behavior would eliminate the root cause. Data structures like red-black trees or prefix trees (tries [trie]) would have other benefits, too. Prefix trees with string keys can reduce memory usage as common prefixes are stored within the tree structure.
Discussion
Pluggable
The first draft of this PEP made the hash algorithm pluggable at runtime. It supported multiple hash algorithms in one binary to give the user the possibility to select a hash algorithm at startup. The approach was considered an unnecessary complication by several core committers [pluggable]. Subsequent versions of the PEP aim for compile time configuration.
Non-aligned memory access
The implementation of SipHash24 was criticized because it ignores the issue of non-aligned memory and therefore doesn't work on architectures that require alignment of integer types. The PEP deliberately neglects this special case and doesn't support SipHash24 on such platforms. It's simply not considered worth the trouble until proven otherwise. All major platforms like X86, X86_64 and ARMv6+ can handle unaligned memory with minimal or even no speed impact. [alignmentmyth]
Almost every block is properly aligned anyway. At present bytes' and str's data are always aligned. Only memoryviews can point to unaligned blocks under rare circumstances. The PEP implementation is optimized and simplified for the common case.
ASCII str / bytes hash collision
Since the implementation of [pep-0393] bytes and ASCII text have the same memory layout. Because of this the new hashing API will keep the invariant:
hash("ascii string") == hash(b"ascii string")
for ASCII strings and ASCII bytes. Equal hash values result in a hash collision and therefore cause a minor speed penalty for dicts and sets with mixed keys. The cause of the collision could be removed by e.g. subtracting 2 from the hash value of bytes (-2 because hash(b"") == 0 and -1 is reserved). The PEP doesn't change the hash value.
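The invariant, and the resulting collision-with-distinct-keys behavior, can be observed directly on CPython 3 (both types are hashed with the same secret, so the equality survives hash randomization):

```python
s = "ascii string"
b = b"ascii string"
# ASCII str and bytes share a memory layout since PEP 393, so they hash equal
assert hash(s) == hash(b)

# equal hashes but unequal keys: the dict keeps both, at a small probing cost
d = {s: 1, b: 2}
assert len(d) == 2
```
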
References
- Issue 19183 [issue19183] contains a reference implementation.
| [29c3] | http://events.ccc.de/congress/2012/Fahrplan/events/5152.en.html |
| [fnv] | http://en.wikipedia.org/wiki/Fowler-Noll-Vo_hash_function |
| [sip] | https://131002.net/siphash/ |
| [ocert] | http://www.nruns.com/_downloads/advisory28122011.pdf |
| [ocert-2012-001] | http://www.ocert.org/advisories/ocert-2012-001.html |
| [poc] | https://131002.net/siphash/poc.py |
| [issue13703] | http://bugs.python.org/issue13703 |
| [issue14621] | http://bugs.python.org/issue14621 |
| [issue16427] | http://bugs.python.org/issue16427 |
| [issue19183] | http://bugs.python.org/issue19183 |
| [trie] | http://en.wikipedia.org/wiki/Trie |
| [city] | http://code.google.com/p/cityhash/ |
| [murmur] | http://code.google.com/p/smhasher/ |
| [csiphash] | https://github.com/majek/csiphash/ |
| [pep-0393] | http://www.python.org/dev/peps/pep-0393/ |
| [aes-ni] | http://en.wikipedia.org/wiki/AES_instruction_set |
| [pluggable] | https://mail.python.org/pipermail/python-dev/2013-October/129138.html |
| [alignmentmyth] | http://lemire.me/blog/archives/2012/05/31/data-alignment-for-speed-myth-or-reality/ |
| [pybench] | http://hg.python.org/benchmarks/ |
| [benchmarks] | https://bitbucket.org/tiran/pep-456-benchmarks/src |
| [pep-456-repos] | http://hg.python.org/features/pep-456 |
Copyright
This document has been placed in the public domain.
pep-0457 Syntax For Positional-Only Parameters
| PEP: | 457 |
|---|---|
| Title: | Syntax For Positional-Only Parameters |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Larry Hastings <larry at hastings.org> |
| Discussions-To: | Python-Dev <python-dev at python.org> |
| Status: | Draft |
| Type: | Informational |
| Content-Type: | text/x-rst |
| Created: | 08-Oct-2013 |
Contents
Overview
This PEP proposes a syntax for positional-only parameters in Python. Positional-only parameters are parameters without an externally-usable name; when a function accepting positional-only parameters is called, positional arguments are mapped to these parameters based solely on their position.
Rationale
Python has always supported positional-only parameters. Early versions of Python lacked the concept of specifying parameters by name, so naturally all parameters were positional-only. This changed around Python 1.0, when all parameters suddenly became positional-or-keyword. But, even in current versions of Python, many CPython "builtin" functions still only accept positional-only arguments.
Functions implemented in modern Python can accept an arbitrary number of positional-only arguments, via the variadic *args parameter. However, there is no Python syntax to specify accepting a specific number of positional-only parameters. Put another way, there are many builtin functions whose signatures are simply not expressible with Python syntax.
This PEP proposes a backwards-compatible syntax that should permit implementing any builtin in pure Python code.
Positional-Only Parameter Semantics In Current Python
There are many, many examples of builtins that only accept positional-only parameters. The resulting semantics are easily experienced by the Python programmer--just try calling one, specifying its arguments by name:
>>> pow(x=5, y=3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: pow() takes no keyword arguments
In addition, there are some functions with particularly interesting semantics:
- range(), which accepts an optional parameter to the left of its required parameter. [2]
- dict(), whose mapping/iterator parameter is optional and semantically must be positional-only. Any externally visible name for this parameter would occlude that name going into the **kwargs keyword variadic parameter dict! [1]
Obviously one can simulate any of these in pure Python code by accepting (*args, **kwargs) and parsing the arguments by hand. But this results in a disconnect between the Python function's signature and what it actually accepts, not to mention the work of implementing said argument parsing.
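The `(*args, **kwargs)` simulation described above can be sketched like this (my_pow is a hypothetical stand-in for the builtin pow; note how the signature no longer documents what the function actually accepts):

```python
def my_pow(*args):
    """Simulate pow()'s positional-only signature by hand-parsing *args."""
    if not 2 <= len(args) <= 3:
        raise TypeError("my_pow() expected 2 or 3 arguments, got %d" % len(args))
    x, y, *rest = args
    result = x ** y
    # optional third argument: modulus
    return result % rest[0] if rest else result
```

Calling `my_pow(x=2, y=3)` raises TypeError automatically, because the function accepts no named parameters at all.
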
Motivation
This PEP does not propose we implement positional-only parameters in Python. The goal of this PEP is simply to define the syntax, so that:
- Documentation can clearly, unambiguously, and consistently express exactly how the arguments for a function will be interpreted.
- The syntax is reserved for future use, in case the community decides someday to add positional-only parameters to the language.
- Argument Clinic can use a variant of the syntax as part of its input when defining the arguments for built-in functions.
The Current State Of Documentation For Positional-Only Parameters
The documentation for positional-only parameters is incomplete and inconsistent:
- Some functions denote optional groups of positional-only arguments by enclosing them in nested square brackets. [3]
- Some functions denote optional groups of positional-only arguments by presenting multiple prototypes with varying numbers of arguments. [4]
- Some functions use both of the above approaches. [2] [5]
One more important idea to consider: currently in the documentation there's no way to tell whether a function takes positional-only parameters. open() accepts keyword arguments, ord() does not, but there is no way of telling just by reading the documentation that this is true.
Syntax And Semantics
From the "ten-thousand foot view", and ignoring *args and **kwargs for now, the grammar for a function definition currently looks like this:
def name(positional_or_keyword_parameters, *, keyword_only_parameters):
Building on that perspective, the new syntax for functions would look like this:
def name(positional_only_parameters, /, positional_or_keyword_parameters,
*, keyword_only_parameters):
All parameters before the / are positional-only. If / is not specified in a function signature, that function does not accept any positional-only parameters.
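Python 3.8 eventually adopted the `/` marker (via PEP 570, without the option-group syntax below), so the basic semantics can be demonstrated on such an interpreter:

```python
def divmod_like(x, y, /):
    """x and y are positional-only: they cannot be passed by keyword."""
    return (x // y, x % y)

assert divmod_like(7, 3) == (2, 1)

try:
    divmod_like(x=7, y=3)
except TypeError:
    pass  # positional-only parameters reject keyword arguments
```
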
Positional-only parameters can be optional, but the mechanism is significantly different from positional-or-keyword or keyword-only parameters. Positional-only parameters don't accept default values. Instead, positional-only parameters can be specified in optional "groups". Groups of parameters are surrounded by square brackets, like so:
def addch([y, x,] ch, [attr,] /):
Positional-only parameters that are not in an option group are "required" positional-only parameters. All "required" positional-only parameters must be contiguous.
Parameters in an optional group accept arguments as a group: you must provide arguments either for all of them or for none of them. Using the example of addch() above, you could not call addch() in such a way that x was specified but y was not (and vice versa). The mapping of positional arguments to optional groups is done by fitting the number of arguments to the groups. Based on the above definition, addch() would assign arguments to parameters in the following way:
| Number of arguments | Parameter assignment |
|---|---|
| 0 | raises an exception |
| 1 | ch |
| 2 | ch, attr |
| 3 | y, x, ch |
| 4 | y, x, ch, attr |
| 5 or more | raises an exception |
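This count-to-group mapping can be simulated in pure Python (bind_addch and GROUPS are hypothetical illustrations, not the real curses API):

```python
# argument count -> parameter names, per addch()'s option groups
GROUPS = {
    1: ("ch",),
    2: ("ch", "attr"),
    3: ("y", "x", "ch"),
    4: ("y", "x", "ch", "attr"),
}

def bind_addch(*args):
    """Map positional arguments onto addch()'s parameters by count."""
    try:
        names = GROUPS[len(args)]
    except KeyError:
        raise TypeError("addch() takes 1 to 4 positional arguments") from None
    return dict(zip(names, args))
```
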
More semantics of positional-only parameters:
- Although positional-only parameters technically have names, these names are internal-only; positional-only parameters are never externally addressable by name. (Similarly to *args and **kwargs.)
- It's possible to nest option groups.
- If there are no required parameters, all option groups behave as if they're to the right of the required parameter group.
- For clarity and consistency, the comma for a parameter always comes immediately after the parameter name. It's a syntax error to specify a square bracket between the name of a parameter and the following comma. (This is far more readable than putting the comma outside the square bracket, particularly for nested groups.)
- If there are arguments after the /, then you must specify a comma after the /, just as there is a comma after the * denoting the shift to keyword-only parameters.
- This syntax has no effect on *args or **kwargs.
It's possible to specify a function prototype where the mapping of arguments to parameters is ambiguous. Consider:
def range([start,] stop, [step,] /):
Python disambiguates these situations by preferring optional groups to the left of the required group.
Additional Limitations
Argument Clinic uses a form of this syntax for specifying builtins. It imposes further limitations that are theoretically unnecessary but make the implementation easier. Specifically:
A function that has positional-only parameters currently cannot have any other kind of parameter. (This will probably be relaxed slightly in the near future.)
Multiple option groups on either side of the required positional-only parameters must be nested, with the nesting getting deeper the further away the group is from the required positional-parameter group.
Put another way: all the left-brackets for option groups to the left of the required group must be specified contiguously, and all the right-brackets for option groups to the right of the required group must be specified contiguously.
Notes For A Future Implementor
If we decide to implement positional-only parameters in a future version of Python, we'd have to do some additional work to preserve their semantics. The problem: how do we inform a parameter that no value was passed in for it when the function was called?
The obvious solution: add a new singleton constant to Python that is passed in when a parameter is not mapped to an argument. I propose that the value be called undefined, and be a singleton of a special class called Undefined. If a positional-only parameter did not receive an argument when called, its value would be set to undefined.
But this raises a further problem. How can we tell the difference between "this positional-only parameter did not receive an argument" and "the caller passed in undefined for this parameter"?
It'd be nice to make it illegal to pass undefined in as an argument to a function--to, say, raise an exception. But that would slow Python down, and the "consenting adults" rule appears applicable here. So making it illegal should probably be strongly discouraged but not outright prevented.
However, it should be allowed (and encouraged) for user functions to specify undefined as a default value for parameters.
Unresolved Questions
There are three types of parameters in Python:
- positional-only parameters,
- positional-or-keyword parameters, and
- keyword-only parameters.
Python allows functions to have both 2 and 3. And some builtins (e.g. range) have both 1 and 3. Does it make sense to have functions that have both 1 and 2? Or all of the above?
Thanks
Credit for the use of '/' as the separator between positional-only and positional-or-keyword parameters goes to Guido van Rossum, in a proposal from 2012. [6]
Credit for making left option groups higher precedence goes to Nick Coghlan. (Conversation in person at PyCon US 2013.)
| [1] | http://docs.python.org/3/library/stdtypes.html#dict |
| [2] | http://docs.python.org/3/library/functions.html#func-range |
| [3] | http://docs.python.org/3/library/curses.html#curses.window.border |
| [4] | http://docs.python.org/3/library/os.html#os.sendfile |
| [5] | http://docs.python.org/3/library/curses.html#curses.window.addch |
| [6] | Guido van Rossum, posting to python-ideas, March 2012: http://mail.python.org/pipermail/python-ideas/2012-March/014364.html and http://mail.python.org/pipermail/python-ideas/2012-March/014378.html and http://mail.python.org/pipermail/python-ideas/2012-March/014417.html |
Copyright
This document has been placed in the public domain.
pep-0458 Surviving a Compromise of PyPI
| PEP: | 458 |
|---|---|
| Title: | Surviving a Compromise of PyPI |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Trishank Karthik Kuppusamy <trishank at nyu.edu>, Vladimir Diaz <vladimir.diaz at nyu.edu>, Donald Stufft <donald at stufft.io>, Justin Cappos <jcappos at nyu.edu> |
| BDFL-Delegate: | Richard Jones <r1chardj0n3s@gmail.com> |
| Discussions-To: | DistUtils mailing list <distutils-sig at python.org> |
| Status: | Draft |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 27-Sep-2013 |
Contents
- Abstract
- Motivation
- Threat Model
- Definitions
- Overview of TUF
- Integrating TUF with PyPI
- PyPI and TUF Metadata
- PyPI and Key Requirements
- How Should Metadata be Generated?
- Key Compromise Analysis
- Appendix A: Repository Attacks Prevented by TUF
- Appendix B: Extension to the Minimum Security Model
- Appendix C: PEP 470 and Projects Hosted Externally
- References
- Acknowledgements
- Copyright
Abstract
This PEP proposes how the Python Package Index (PyPI [1]) should be integrated with The Update Framework [2] (TUF). TUF was designed to be a flexible security add-on to a software updater or package manager. The framework integrates best security practices such as separating role responsibilities, adopting the many-man rule for signing packages, keeping signing keys offline, and revocation of expired or compromised signing keys. For example, attackers would have to steal multiple signing keys stored independently to compromise a role responsible for specifying a repository's available files. Another role responsible for indicating the latest snapshot of the repository may have to be similarly compromised, and independent of the first compromised role.
The proposed integration will allow modern package managers such as pip [3] to be more secure against various types of security attacks on PyPI and protect users from such attacks. Specifically, this PEP describes how PyPI processes should be adapted to generate and incorporate TUF metadata (i.e., the minimum security model). The minimum security model supports verification of PyPI distributions that are signed with keys stored on PyPI: distributions uploaded by developers are signed by PyPI, require no action from developers (other than uploading the distribution), and are immediately available for download. The minimum security model also minimizes PyPI administrative responsibilities by automating much of the signing process.
This PEP does not prescribe how package managers such as pip should be adapted to install or update projects from PyPI with TUF metadata. Package managers interested in adopting TUF on the client side may consult TUF's library documentation [27], which exists for this purpose. Support for project distributions that are signed by developers (maximum security model) is also not discussed in this PEP, but is outlined in the appendix as a possible future extension and covered in detail in PEP 480 [26]. The PEP 480 extension focuses on the maximum security model, which requires more PyPI administrative work (none by clients), but it also proposes an easy-to-use key management solution for developers, how to interface with a potential future build farm on PyPI infrastructure, and discusses the feasibility of end-to-end signing.
Motivation
In January 2013, the Python Software Foundation (PSF) announced [4] that the python.org wikis for Python, Jython, and the PSF were subjected to a security breach that caused all of the wiki data to be destroyed on January 5, 2013. Fortunately, the PyPI infrastructure was not affected by this security breach. However, the incident is a reminder that PyPI should take defensive steps to protect users as much as possible in the event of a compromise. Attacks on software repositories happen all the time [5]. The PSF must accept the possibility of security breaches and prepare PyPI accordingly because it is a valuable resource used by thousands, if not millions, of people.
Before the wiki attack, PyPI used MD5 hashes to tell package managers, such as pip, whether or not a package was corrupted in transit. However, the absence of SSL made it hard for package managers to verify transport integrity to PyPI. It was therefore easy to launch a man-in-the-middle attack between pip and PyPI, and change package content arbitrarily. Users could be tricked into installing malicious packages with man-in-the-middle attacks. After the wiki attack, several steps were proposed (some of which were implemented) to deliver a much higher level of security than was previously the case: requiring SSL to communicate with PyPI [6], restricting project names [7], and migrating from MD5 to SHA-2 hashes [8].
These steps, though necessary, are insufficient because attacks are still possible through other avenues. For example, a public mirror is trusted to honestly mirror PyPI, but some mirrors may misbehave due to malice or accident. Package managers such as pip are supposed to use signatures from PyPI to verify packages downloaded from a public mirror [9], but none are known to actually do so [10]. Therefore, it would be wise to add more security measures to detect attacks from public mirrors or content delivery networks [11] (CDNs).
Even though official mirrors are being deprecated on PyPI [12], there remain a wide variety of other attack vectors on package managers [13]. These attacks can crash client systems, cause obsolete packages to be installed, or even allow an attacker to execute arbitrary code. In September 2013 [28], a post was made to the Distutils mailing list showing that the latest version of pip (at the time) was susceptible to such attacks, and how TUF could protect users against them [14]. Specifically, testing was done to see how pip would respond to these attacks with and without TUF. Attacks tested included replay and freeze, arbitrary packages, slow retrieval, and endless data. The post also included a demonstration of how pip would respond if PyPI were compromised.
With the intent to protect PyPI against infrastructure compromises, this PEP proposes integrating PyPI with The Update Framework [2] (TUF). TUF helps secure new or existing software update systems. Software update systems are vulnerable to many known attacks, including those that can result in clients being compromised or crashed. TUF solves these problems by providing a flexible security framework that can be added to software updaters.
Threat Model
The threat model assumes the following:
- Offline keys are safe and securely stored.
- Attackers can compromise at least one of PyPI's trusted keys stored online, and may do so at once or over a period of time.
- Attackers can respond to client requests.
An attacker is considered successful if they can cause a client to install (or leave installed) something other than the most up-to-date version of the software the client is updating. If the attacker is preventing the installation of updates, they want clients to not realize there is anything wrong.
Definitions
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [29].
This PEP focuses on integrating TUF with PyPI; however, the reader is encouraged to read about TUF's design principles [2]. It is also RECOMMENDED that the reader be familiar with the TUF specification [16].
Terms used in this PEP are defined as follows:
- Projects: Projects are software components that are made available for integration. Projects include Python libraries, frameworks, scripts, plugins, applications, collections of data or other resources, and various combinations thereof. Public Python projects are typically registered on the Python Package Index [17].
- Releases: Releases are uniquely identified snapshots of a project [17].
- Distributions: Distributions are the packaged files that are used to publish and distribute a release [17].
- Simple index: The HTML page that contains internal links to the distributions of a project [17].
- Roles: There is one root role in PyPI. There are multiple roles whose responsibilities are delegated to them directly or indirectly by the root role. The term top-level role refers to the root role and any role delegated by the root role. Each role has a single metadata file that it is trusted to provide.
- Metadata: Metadata are signed files that describe roles, other metadata, and target files.
- Repository: A repository is a resource comprised of named metadata and target files. Clients request metadata and target files stored on a repository.
- Consistent snapshot: A set of TUF metadata and PyPI targets that capture the complete state of all projects on PyPI as they existed at some fixed point in time.
- The snapshot (release) role: In order to prevent confusion due to the different meanings of the term "release" used in PEP 426 [17] and the TUF specification [16], the release role is renamed as the snapshot role.
- Developer: Either the owner or maintainer of a project who is allowed to update the TUF metadata as well as distribution metadata and files for the project.
- Online key: A private cryptographic key that MUST be stored on the PyPI server infrastructure. This is usually to allow automated signing with the key. However, an attacker who compromises the PyPI infrastructure will be able to read these keys.
- Offline key: A private cryptographic key that MUST be stored independent of the PyPI server infrastructure. This prevents automated signing with the key. An attacker who compromises the PyPI infrastructure will not be able to immediately read these keys.
- Threshold signature scheme: A role can increase its resilience to key compromises by specifying that at least t out of n keys are REQUIRED to sign its metadata. A compromise of t-1 keys is insufficient to compromise the role itself. Saying that a role requires (t, n) keys denotes the threshold signature property.
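The threshold property can be sketched minimally as follows, assuming signatures have already been cryptographically verified and reduced to the key IDs that produced them (names are illustrative, not TUF's API):

```python
def meets_threshold(signature_keyids, authorized_keyids, t):
    # Only signatures made with distinct keys that are authorized for the
    # role count toward the threshold; t-1 compromised keys therefore can
    # never satisfy a (t, n) role on their own.
    valid = set(signature_keyids) & set(authorized_keyids)
    return len(valid) >= t
```

Note that duplicate signatures from the same key are collapsed by the set intersection, so a single compromised key cannot be counted twice.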
Overview of TUF
At its highest level, TUF provides applications with a secure method of obtaining files and knowing when new versions of files are available. On the surface, this all sounds simple. The basic steps for updating applications are:
- Knowing when an update exists.
- Downloading a correct copy of the latest version of an updated file.
The problem is that updating applications is only simple when there are no malicious activities in the picture. If an attacker is trying to interfere with these seemingly simple steps, there is plenty they can do.
Assume a software updater takes the approach of most systems (at least the ones that try to be secure). It downloads both the file it wants and a cryptographic signature of the file. The software updater already knows which key it trusts to make the signature. It checks that the signature is correct and was made by this trusted key. Unfortunately, the software updater is still at risk in many ways, including:
- An attacker keeps giving the software updater the same update file, so it never realizes there is an update.
- An attacker gives the software updater an older, insecure version of a file that it already has, so it downloads that one and blindly uses it thinking it is newer.
- An attacker gives the software updater a newer version of a file it has but it is not the newest one. The file is newer to the software updater, but it may be insecure and exploitable by the attacker.
- An attacker compromises the key used to sign these files and now the software updater downloads a malicious file that is properly signed.
TUF is designed to address these attacks, and others, by adding signed metadata (text files that describe the repository's files) to the repository and referencing the metadata files during the update procedure. Repository files are verified against the information included in the metadata before they are handed off to the software update system. The framework also provides multi-signature trust, explicit and implicit revocation of cryptographic keys, responsibility separation of the metadata, and minimizes key risk. For a full list and outline of the repository attacks and software updater weaknesses addressed by TUF, see Appendix A.
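The core of verifying a repository file against signed metadata can be sketched as follows; the function name and parameters are illustrative, not TUF's actual library API:

```python
import hashlib


def verify_target(data, trusted_length, trusted_sha256):
    # A downloaded target is handed to the package manager only if both
    # its length and its hash match what the signed metadata promised;
    # anything else (tampered, truncated, or endless data) is rejected.
    if len(data) != trusted_length:
        return False
    return hashlib.sha256(data).hexdigest() == trusted_sha256
```

Checking the length before hashing also lets a client abort an "endless data" download as soon as the promised size is exceeded.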
Integrating TUF with PyPI
A software update system must complete two main tasks to integrate with TUF. First, it must add the framework to the client side of the update system. For example, TUF MAY be integrated with the pip package manager. Second, the repository on the server side MUST be modified to provide signed TUF metadata. This PEP is concerned with the second part of the integration, and the changes required on PyPI to support software updates with TUF.
What Additional Repository Files are Required on PyPI?
In order for package managers like pip to download and verify packages with TUF, a few extra files MUST exist on PyPI. These extra repository files are called TUF metadata. TUF metadata contains information such as which keys are trustable, the cryptographic hashes of files, signatures to the metadata, metadata version numbers, and the date after which the metadata should be considered expired.
When a package manager wants to check for updates, it asks TUF to do the work. That is, a package manager never has to deal with this additional metadata or understand what's going on underneath. If TUF reports back that there are updates available, a package manager can then ask TUF to download these files from PyPI. TUF downloads them and checks them against the TUF metadata that it also downloads from the repository. If the downloaded target files are trustworthy, TUF then hands them over to the package manager.
The Metadata [30] document provides information about each of the required metadata and their expected content. The next section covers the different kinds of metadata RECOMMENDED for PyPI.
PyPI and TUF Metadata
TUF metadata provides information that clients can use to make update decisions. For example, a targets metadata lists the available distributions on PyPI and includes the distribution's signatures, cryptographic hashes, and file sizes. Different metadata files provide different information. The various metadata files are signed by different roles, which are indicated by the root role. The concept of roles allows TUF to delegate responsibilities to multiple roles and minimizes the impact of a compromised role.
TUF requires four top-level roles. These are root, timestamp, snapshot, and targets. The root role specifies the public cryptographic keys of the top-level roles (including its own). The timestamp role references the latest snapshot and can signify when a new snapshot of the repository is available. The snapshot role indicates the latest version of all the TUF metadata files (other than timestamp). The targets role lists the available target files (in our case, it will be all files on PyPI under the /simple and /packages directories). Each top-level role will serve its responsibilities without exception. Figure 1 provides a table of the roles used in TUF.
Figure 1: An overview of the TUF roles.
Signing Metadata and Repository Management
The top-level root role signs for the keys of the top-level timestamp, snapshot, targets, and root roles. The timestamp role signs for every new snapshot of the repository metadata. The snapshot role signs for root, targets, and all delegated roles. The bins roles (delegated roles) sign for all distributions belonging to registered PyPI projects.
Figure 2 provides an overview of the roles available within PyPI, which includes the top-level roles and the roles delegated by targets. The figure also indicates the types of keys used to sign each role and which roles are trusted to sign for files available on PyPI. The next two sections cover the details of signing repository files and the types of keys used for each role.
Figure 2: An overview of the role metadata available on PyPI.
The roles that change most frequently are timestamp, snapshot and delegated roles (bins and its delegated roles). The timestamp and snapshot metadata MUST be updated whenever root, targets or delegated metadata are updated. Observe, though, that root and targets metadata are much less likely to be updated as often as delegated metadata. Therefore, timestamp and snapshot metadata will most likely be updated frequently (possibly every minute) due to delegated metadata being updated frequently in order to support continuous delivery of projects. Continuous delivery is a set of processes that PyPI uses to produce snapshots that can safely coexist and be deleted independent of other snapshots [18].
Every year, PyPI administrators SHOULD sign for root and targets role keys. Automation will continuously sign for a timestamped snapshot of all projects. A repository management [31] tool is available that can sign metadata files, generate cryptographic keys, and manage a TUF repository.
How to Establish Initial Trust in the PyPI Root Keys
Package managers like pip need to ship a file called "root.json" with the installation files that users initially download. This includes information about the keys trusted for certain roles, as well as the root keys themselves. Any new version of "root.json" that clients may download is verified against the root keys that clients initially trust. If a root key is compromised, but a threshold of keys remains secure, the PyPI administrator MUST push a new release that revokes trust in the compromised keys. If a threshold of root keys is compromised, then "root.json" should be updated out-of-band; however, the threshold should be chosen so that this is extremely unlikely. The TUF client library does not require manual intervention if root keys are revoked or added: the update process handles the cases where "root.json" has changed.
To bundle the software, "root.json" MUST be included in the version of pip shipped with CPython (via ensurepip). The TUF client library then loads the root metadata and downloads the rest of the roles, including updating "root.json" if it has changed. An outline of the update process [32] is available.
Minimum Security Model
There are two security models to consider when integrating TUF with PyPI. The one proposed in this PEP is the minimum security model, which supports verification of PyPI distributions that are signed with private cryptographic keys stored on PyPI. Distributions uploaded by developers are signed by PyPI and immediately available for download. A possible future extension to this PEP, discussed in Appendix B, proposes the maximum security model and allows a developer to sign for his/her project. Developer keys are not stored online: therefore, projects are safe from PyPI compromises.
The minimum security model requires no action from a developer and protects against malicious CDNs [19] and public mirrors. To support continuous delivery of uploaded packages, PyPI signs for projects with an online key. This level of security prevents projects from being accidentally or deliberately tampered with by a mirror or a CDN because the mirror or CDN will not have any of the keys required to sign for projects. However, it does not protect projects from attackers who have compromised PyPI, since attackers can manipulate TUF metadata using the keys stored online.
This PEP proposes that the bins role (and its delegated roles) sign for all PyPI projects with an online key. The targets role, which only signs with an offline key, MUST delegate all PyPI projects to the bins role. This means that when a package manager such as pip (i.e., using TUF) downloads a distribution from a project on PyPI, it will consult the bins role about the TUF metadata for the project. If no bin roles delegated by bins specify the project's distribution, then the project is considered to be non-existent on PyPI.
Metadata Expiry Times
The root and targets role metadata SHOULD expire in one year, because these two metadata files are expected to change very rarely.
The timestamp, snapshot, and bins metadata SHOULD expire in one day because a CDN or mirror SHOULD synchronize itself with PyPI every day. Furthermore, this generous time frame also takes into account client clocks that are highly skewed or adrift.
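A client-side expiry check might be sketched as follows; the ISO-like timestamp format and function names are assumptions for illustration, not the TUF wire format:

```python
import datetime


def is_expired(expires_iso, now=None):
    # Clients MUST reject metadata whose "expires" time has passed; short
    # expiry windows for timestamp/snapshot/bins bound how long a freeze
    # attack can go unnoticed.
    if now is None:
        now = datetime.datetime.utcnow()
    expires = datetime.datetime.strptime(expires_iso, "%Y-%m-%dT%H:%M:%SZ")
    return now > expires
```

The one-day expiry for frequently re-signed roles trades off freshness guarantees against tolerance for skewed client clocks, as noted above.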
Metadata Scalability
Due to the growing number of projects and distributions, TUF metadata will also grow correspondingly. For example, consider the bins role. In August 2013, it was found that the size of the bins metadata was about 42MB if the bins role itself signed for about 220K PyPI targets (which are simple indices and distributions). This PEP does not delve into the details, but TUF features a so-called "lazy bin walk [33]" scheme that splits a large targets metadata file into many small ones. This allows a TUF client updater to intelligently download only a small number of TUF metadata files in order to update any project signed for by the bins role. For example, applying this scheme to the previous repository resulted in pip downloading between 1.3KB and 111KB to install or upgrade a PyPI project via TUF.
Based on our findings as of the time of writing, PyPI SHOULD split all targets in the bins role by delegating them to 1024 delegated roles, each of which would sign for PyPI targets whose hashes fall into that "bin" or delegated role (see Figure 2). It was found that 1024 bins would result in the bins metadata, and each of its delegated roles, being about the same size (40-50KB) for about 220K PyPI targets (simple indices and distributions).
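One illustrative way to assign a target to one of 1024 bins is to take the leading bits of the SHA-256 hash of its path. The exact scheme below is an assumption for the sketch, not the PEP's normative layout:

```python
import hashlib

NUM_BINS = 1024  # 2**10: the leading 10 bits of the hash select a bin


def bin_for_target(target_path):
    # Hash the target's repository path and use the top 10 bits as the
    # bin index, yielding a value in 0..1023. Hashing spreads targets
    # roughly evenly, so each delegated role stays small (40-50KB).
    digest = hashlib.sha256(target_path.encode("utf-8")).digest()
    return int.from_bytes(digest[:2], "big") >> 6
```

Because the assignment is deterministic, a client can compute which delegated bin role to download for any given project without consulting every bin.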
It is possible to make TUF metadata more compact by representing it in a binary format as opposed to the JSON text format. Nevertheless, a sufficiently large number of projects and distributions will introduce scalability challenges at some point, and therefore the bins role will still need delegations (as outlined in figure 2) in order to address the problem. Furthermore, the JSON format is an open and well-known standard for data interchange. Due to the large number of delegated metadata, compressed versions of snapshot metadata SHOULD also be made available to clients.
PyPI and Key Requirements
In this section, the kinds of keys required to sign for TUF roles on PyPI are examined. TUF is agnostic with respect to choices of digital signature algorithms. For the purpose of discussion, it is assumed that most digital signatures will be produced with the well-tested and tried RSA algorithm [20]. Nevertheless, we do NOT recommend any particular digital signature algorithm in this PEP because there are a few important constraints: first, cryptography changes over time; second, package managers such as pip may wish to perform signature verification in Python, without resorting to a compiled C library, in order to be able to run on as many systems as Python supports; and third, TUF recommends diversity of keys for certain applications.
Number Of Keys Recommended
The timestamp, snapshot, and bins roles require continuous delivery. Even though their respective keys MUST be online, this PEP requires that the keys be independent of each other. Different keys for online roles allow for each of the keys to be placed on separate servers if need be, and prevents side channel attacks that compromise one key from automatically compromising the rest of the keys. Therefore, each of the timestamp, snapshot, and bins roles MUST require (1, 1) keys.
The bins role MAY delegate targets in an automated manner to a number of roles called "bins", as discussed in the previous section. Each of the "bin" roles SHOULD share the same key as the bins role, due to space efficiency, and because there is no security advantage to requiring separate keys.
The root role key is critical for security and should very rarely be used. It is primarily used for key revocation, and it is the locus of trust for all of PyPI. The root role signs for the keys that are authorized for each of the top-level roles (including its own). Keys belonging to the root role are intended to be very well-protected and used with the least frequency of all keys. It is RECOMMENDED that every PSF board member own a (strong) root key. A majority of them can then constitute a quorum to revoke or endow trust in all top-level keys. Alternatively, the system administrators of PyPI could be given responsibility for signing for the root role. Therefore, the root role SHOULD require (t, n) keys, where n is the number of either all PyPI administrators or all PSF board members, and t > 1 (so that at least two members must sign the root role).
The targets role will be used only to sign for the static delegation of all targets to the bins role. Since these target delegations must be secured against attacks in the event of a compromise, the keys for the targets role MUST be offline and independent of other keys. For simplicity of key management, without sacrificing security, it is RECOMMENDED that the keys of the targets role be permanently discarded as soon as they have been created and used to sign for the role. Therefore, the targets role SHOULD require (1, 1) keys. Again, this is because the keys are going to be permanently discarded and more offline keys will not help resist key recovery attacks [21] unless diversity of keys is maintained.
Online and Offline Keys Recommended for Each Role
In order to support continuous delivery, the timestamp, snapshot, and bins role keys MUST be online.
As explained in the previous section, the root and targets role keys MUST be offline for maximum security: these keys will be offline in the sense that their private keys MUST NOT be stored on PyPI, though some of them MAY be online in the private infrastructure of the project.
How Should Metadata be Generated?
Project developers expect the distributions they upload to PyPI to be immediately available for download. Unfortunately, there will be problems when many readers and writers simultaneously access the same metadata and distributions. That is, there needs to be a way to ensure consistency of metadata and repository files when multiple developers simultaneously change the same metadata or distributions. There are also issues with consistency on PyPI without TUF, but the problem is more severe with signed metadata that MUST keep track of the files available on PyPI in real-time.
Suppose that PyPI generates a snapshot, which indicates the latest version of every metadata except timestamp, at version 1 and a client requests this snapshot from PyPI. While the client is busy downloading this snapshot, PyPI then timestamps a new snapshot at, say, version 2. Without ensuring consistency of metadata, the client would find itself with a copy of snapshot that disagrees with what is available on PyPI, which is indistinguishable from arbitrary metadata injected by an attacker. The problem would also occur for mirrors attempting to sync with PyPI.
Consistent Snapshots
There are problems with consistency on PyPI with or without TUF. TUF requires that its metadata be consistent with the repository files, but how would the metadata be kept consistent with projects that change all the time? As a result, this proposal MUST address the problem of producing a consistent snapshot that captures the state of all known projects at a given time. Each snapshot should safely coexist with any other snapshot, and be able to be deleted independently, without affecting any other snapshot.
The solution presented in this PEP is that every metadata or data file managed by PyPI and written to disk MUST include in its filename the cryptographic hash [34] of the file. How would this help clients that use the TUF protocol to securely and consistently install or update a project from PyPI?
The first step in the TUF protocol requires the client to download the latest timestamp metadata. However, the client would not know in advance the hash of the timestamp associated with the latest snapshot. Therefore, PyPI MUST redirect all HTTP GET requests for timestamp to the timestamp referenced in the latest snapshot. The timestamp role is the root of a tree of cryptographic hashes that points to every other metadata that is meant to exist together (i.e., clients request metadata in timestamp -> snapshot -> root -> targets order). Clients are able to retrieve any file from this snapshot by deterministically including, in the request for the file, the hash of the file in the filename. Assuming infinite disk space and no hash collisions [35], a client may safely read from one snapshot while PyPI produces another snapshot.
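The retrieval order above can be sketched as follows, with a plain dict standing in for HTTP requests to PyPI and hypothetical filenames; real TUF clients also verify signatures and lengths at every step:

```python
import hashlib
import json


def sha256(data):
    return hashlib.sha256(data).hexdigest()


def update(repo):
    # Only timestamp is fetched by its plain name; PyPI redirects that
    # request to the copy inside the latest consistent snapshot.
    timestamp = json.loads(repo["timestamp.json"])
    snapshot_bytes = repo[timestamp["snapshot"] + ".snapshot.json"]
    snapshot = json.loads(snapshot_bytes)
    # Every further request embeds the expected hash in the filename, so
    # the client keeps reading one coherent snapshot even while PyPI is
    # busy producing a newer one.
    root = repo[snapshot["root"] + ".root.json"]
    targets = repo[snapshot["targets"] + ".targets.json"]
    return root, targets
```

If any file named in the snapshot has been garbage-collected, the lookup fails (an HTTP 404 in practice) and the client simply retries from the latest timestamp.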
In this simple but effective manner, PyPI is able to capture a consistent snapshot of all projects and the associated metadata at a given time. The next subsection provides implementation details of this idea.
Note: This PEP does not prohibit using advanced file systems or tools to produce consistent snapshots. There are two important reasons for why this PEP proposes the simple solution. First, the solution does not mandate that PyPI use any particular file system or tool. Second, the generic file-system based approach allows mirrors to use extant file transfer tools such as rsync to efficiently transfer consistent snapshots from PyPI.
Producing Consistent Snapshots
Given a project, PyPI is responsible for updating the bins metadata (roles delegated by the bins role and signed with an online key). Every project MUST upload its release in a single transaction. The uploaded set of files is called the "project transaction". How PyPI MAY validate the files in a project transaction is discussed in a later section. For now, the focus is on how PyPI will respond to a project transaction.
Every metadata and target file MUST include in its filename the hex digest [36] of its SHA-256 [37] hash. For this PEP, it is RECOMMENDED that PyPI adopt a simple convention of the form: digest.filename, where filename is the original filename without a copy of the hash, and digest is the hex digest of the hash.
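The recommended convention is trivial to compute; a minimal sketch:

```python
import hashlib


def consistent_filename(filename, content):
    # RECOMMENDED convention from the text: digest.filename, where digest
    # is the hex SHA-256 of the file's content and filename is the
    # original name without a copy of the hash.
    return hashlib.sha256(content).hexdigest() + "." + filename
```

Because the digest is derived from the content, two snapshots that contain different versions of the same file never collide on disk.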
When a project uploads a new transaction, the project transaction process MUST add all new targets and relevant delegated bins metadata. (It is shown later in this section why the bins role will delegate targets to a number of delegated bins roles.) Finally, the project transaction process MUST inform the snapshot process about new delegated bins metadata.
Project transaction processes SHOULD be automated and MUST also be applied atomically: either all metadata and targets -- or none of them -- are added. The project transaction and snapshot processes SHOULD work concurrently. Finally, project transaction processes SHOULD keep in memory the latest bins metadata so that they will be correctly updated in new consistent snapshots.
All project transactions MAY be placed in a single queue and processed serially. Alternatively, the queue MAY be processed concurrently in order of appearance, provided that the following rules are observed:
- No two project transaction processes may concurrently work on the same project.
- No two project transaction processes may concurrently work on projects that belong to the same delegated bins role.
These rules MUST be observed so that metadata is not read from or written to inconsistently.
Snapshot Process
The snapshot process is fairly simple and SHOULD be automated. The snapshot process MUST keep in memory the latest working set of root, targets, and delegated roles. Every minute or so, the snapshot process will sign for this latest working set. (Recall that project transaction processes continuously inform the snapshot process about the latest delegated metadata in a concurrency-safe manner. The snapshot process will actually sign for a copy of the latest working set while the latest working set in memory will be updated with information that is continuously communicated by the project transaction processes.) The snapshot process MUST generate and sign new timestamp metadata that will vouch for the metadata (root, targets, and delegated roles) generated in the previous step. Finally, the snapshot process MUST make available to clients the new timestamp and snapshot metadata representing the latest snapshot.
A few implementation notes are now in order. So far, we have seen only that new metadata and targets are added, but not that old metadata and targets are removed. Practical constraints are such that eventually PyPI will run out of disk space to produce a new consistent snapshot. In that case, PyPI MAY then use something like a "mark-and-sweep" algorithm to delete sufficiently old consistent snapshots: in order to preserve the latest consistent snapshot, PyPI would walk objects beginning from the root (timestamp) of the latest consistent snapshot, mark all visited objects, and delete all unmarked objects. The last few consistent snapshots may be preserved in a similar fashion. Deleting a consistent snapshot will cause clients to see nothing except HTTP 404 responses to any request for a file within that consistent snapshot. Clients SHOULD then retry (as before) their requests with the latest consistent snapshot.
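The mark-and-sweep idea can be illustrated with a minimal sketch. The data representation here (snapshot ids mapped to the sets of files they reference) is an assumption made for illustration.

```python
def mark_and_sweep(retained_snapshots, references, all_files):
    """Return the set of files that are safe to delete.

    retained_snapshots: iterable of consistent-snapshot ids to preserve
    references: dict mapping snapshot id -> set of filenames reachable
                from that snapshot's root (timestamp) metadata
    all_files: set of every filename currently on disk
    """
    marked = set()
    for snapshot_id in retained_snapshots:   # mark phase: walk each root
        marked |= references[snapshot_id]
    return all_files - marked                # sweep phase: unmarked files
```

Any file deleted this way will subsequently produce HTTP 404 responses, prompting clients to retry against the latest consistent snapshot as described above.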
All clients, such as pip using the TUF protocol, MUST be modified to download every metadata and target file (except for timestamp metadata) by including, in the request for the file, the cryptographic hash of the file in the filename. Following the filename convention recommended earlier, a request for the file at filename.ext will be transformed to the equivalent request for the file at digest.filename.
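A minimal sketch of the digest.filename transformation, assuming SHA-256 as the digest algorithm (the function name is illustrative):

```python
import hashlib

def consistent_path(filename, content):
    """Prefix a target's filename with the hex digest of its content,
    following the digest.filename convention described above."""
    digest = hashlib.sha256(content).hexdigest()
    return f"{digest}.{filename}"
```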
Finally, PyPI SHOULD use a transaction log [38] to record project transaction processes and queues so that it will be easier to recover from errors after a server failure.
Key Compromise Analysis
This PEP has covered the minimum security model, the TUF roles that should be added to support continuous delivery of distributions, and how to generate and sign the metadata of each role. The remaining sections discuss how PyPI SHOULD audit repository metadata, and the methods PyPI can use to detect and recover from a PyPI compromise.
Table 1 summarizes a few of the attacks possible when a threshold number of private cryptographic keys (belonging to any of the PyPI roles) are compromised. The leftmost column lists the roles (or a combination of roles) that have been compromised, and the columns to its right show whether compromising those roles leaves clients susceptible to malicious updates, a freeze attack, or metadata inconsistency attacks.
| Role Compromise | Malicious Updates | Freeze Attack | Metadata Inconsistency Attacks |
|---|---|---|---|
| timestamp | NO (snapshot and targets or any of the bins need to cooperate) | YES (limited by earliest root, targets, or bin metadata expiry time) | NO (snapshot needs to cooperate) |
| snapshot | NO (timestamp and targets or any of the bins need to cooperate) | NO (timestamp needs to cooperate) | NO (timestamp needs to cooperate) |
| timestamp AND snapshot | NO (targets or any of the bins need to cooperate) | YES (limited by earliest root, targets, or bin metadata expiry time) | YES (limited by earliest root, targets, or bin metadata expiry time) |
| targets OR bin | NO (timestamp and snapshot need to cooperate) | NOT APPLICABLE (need timestamp and snapshot) | NOT APPLICABLE (need timestamp and snapshot) |
| timestamp AND snapshot AND bin | YES | YES (limited by earliest root, targets, or bin metadata expiry time) | YES (limited by earliest root, targets, or bin metadata expiry time) |
| root | YES | YES | YES |
Table 1: Attacks possible by compromising certain combinations of role keys. In September 2013 [28], it was shown how the latest version (at the time) of pip was susceptible to these attacks and how TUF could protect users against them [14].
Note that compromising targets or any delegated role (except for project targets metadata) does not immediately allow an attacker to serve malicious updates. The attacker must also compromise the timestamp and snapshot roles (which are both online and therefore more likely to be compromised). This means that in order to launch any attack, one must not only be able to act as a man-in-the-middle but also compromise the timestamp key (or compromise the root keys and sign a new timestamp key). To launch any attack other than a freeze attack, one must also compromise the snapshot key.
Finally, a compromise of the PyPI infrastructure MAY introduce malicious updates to bins projects because the keys for these roles are online. The maximum security model discussed in the appendix addresses this issue. PEP 480 also covers the maximum security model and goes into more detail on generating developer keys and signing uploaded distributions.
In the Event of a Key Compromise
A key compromise means that a threshold of keys (belonging to the metadata roles on PyPI), as well as the PyPI infrastructure, have been compromised and used to sign new metadata on PyPI.
If a threshold number of timestamp, snapshot, or bins keys have been compromised, then PyPI MUST take the following steps:
- Revoke the timestamp, snapshot and targets role keys from the root role. This is done by replacing the compromised timestamp, snapshot and targets keys with newly issued keys.
- Revoke the bins keys from the targets role by replacing their keys with newly issued keys. Sign the new targets role metadata and discard the new keys (because, as explained earlier, this increases the security of targets metadata).
- All targets of the bins roles SHOULD be compared with the last known good consistent snapshot where none of the timestamp, snapshot, or bins keys were known to have been compromised. Added, updated or deleted targets in the compromised consistent snapshot that do not match the last known good consistent snapshot MAY be restored to their previous versions. After ensuring the integrity of all bins targets, the bins metadata MUST be regenerated.
- The bins metadata MUST have their version numbers incremented, expiry times suitably extended, and signatures renewed.
- A new timestamped consistent snapshot MUST be issued.
Following these steps would preemptively protect all of these roles even though only one of them may have been compromised.
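The recovery steps above might be sketched as follows. The `repo` object and every method on it are hypothetical stand-ins for TUF's repository management tools; none of these names are a real API.

```python
def recover_from_online_key_compromise(repo, known_good_snapshot):
    """Sketch of the recovery steps for a timestamp/snapshot/bins
    key compromise, against a hypothetical repository API."""
    # Step 1: replace the compromised timestamp, snapshot, and targets
    # keys from the root role.
    for role in ("timestamp", "snapshot", "targets"):
        repo.replace_keys(role)
    # Step 2: re-key every bin from the targets role (the new targets
    # metadata is then signed and the ephemeral bin keys discarded).
    for bin_role in repo.bins():
        repo.replace_keys(bin_role)
    # Step 3: restore any target that diverged from the last known good
    # consistent snapshot.
    for target in repo.targets():
        if target.hash != known_good_snapshot.hash_for(target.path):
            repo.restore(target.path, known_good_snapshot)
    # Steps 4-5: bump version numbers, extend expiry times, renew
    # signatures, and issue a new timestamped consistent snapshot.
    repo.bump_and_sign_all()
    repo.publish_consistent_snapshot()
```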
If a threshold number of root keys have been compromised, then PyPI MUST take the steps taken when the targets role has been compromised. All of the root keys must also be replaced.
It is also RECOMMENDED that PyPI sufficiently document compromises with security bulletins. These security bulletins will be most informative when users of pip-with-TUF are unable to install or update a project because the keys for the timestamp, snapshot or root roles are no longer valid. They could then visit the PyPI web site to consult security bulletins that would help to explain why they are no longer able to install or update, and then take action accordingly. When a threshold number of root keys have not been revoked due to a compromise, then new root metadata may be safely updated because a threshold number of existing root keys will be used to sign for the integrity of the new root metadata. TUF clients will be able to verify the integrity of the new root metadata with a threshold number of previously known root keys. This will be the common case. Otherwise, in the worst case, where a threshold number of root keys have been revoked due to a compromise, an end-user may choose to update new root metadata with out-of-band [39] mechanisms.
Auditing Snapshots
If a malicious party compromises PyPI, they can sign arbitrary files with any of the online keys. The roles with offline keys (i.e., root and targets) are still protected. To safely recover from a repository compromise, snapshots should be audited to ensure files are only restored to trusted versions.
When a repository compromise has been detected, the integrity of three types of information must be validated:
- If the online keys of the repository have been compromised, they can be revoked by having the targets role sign new metadata delegating to a new key.
- If the role metadata on the repository has been changed, this would impact the metadata that is signed by online keys. Any role information created since the last period should be discarded. As a result, developers of new projects will need to re-register their projects.
- If the packages themselves may have been tampered with, they can be validated using the stored hash information for packages that existed at the time of the last period.
In order to safely restore snapshots in the event of a compromise, PyPI SHOULD maintain a small number of its own mirrors to copy PyPI snapshots according to some schedule. The mirroring protocol can be used immediately for this purpose. The mirrors must be secured and isolated such that they are responsible only for mirroring PyPI. The mirrors can be checked against one another to detect accidental or malicious failures.
Another approach is to generate the cryptographic hash of snapshot periodically and tweet it. Perhaps a user comes forward with the actual metadata and the repository maintainers can verify the metadata's cryptographic hash. Alternatively, PyPI may periodically archive its own versions of snapshot rather than rely on externally provided metadata. In this case, PyPI SHOULD take the cryptographic hash of every package on the repository and store this data on an offline device. If any package hash has changed, this indicates an attack.
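The offline hash comparison could look like the following sketch. Storage and transport of the offline record are out of scope; the function names are illustrative.

```python
import hashlib

def record_package_hashes(package_contents):
    """Hash every package (name -> bytes) for storage on an offline device."""
    return {name: hashlib.sha256(data).hexdigest()
            for name, data in package_contents.items()}

def changed_packages(offline_record, current_contents):
    """Return packages whose hash differs from the offline record.

    A difference here indicates tampering (or a legitimate change that
    must be investigated), per the detection scheme described above.
    """
    current = record_package_hashes(current_contents)
    return {name for name, digest in current.items()
            if offline_record.get(name) != digest}
```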
As for attacks that serve different versions of metadata, or freeze a version of a package at a specific version, they can be handled by TUF with techniques like implicit key revocation and metadata mismatch detection [81].
Appendix A: Repository Attacks Prevented by TUF
- Arbitrary software installation: An attacker installs anything they want on the client system. That is, an attacker can provide arbitrary files in response to download requests and the files will not be detected as illegitimate.
- Rollback attacks: An attacker presents a software update system with older files than those the client has already seen, causing the client to use files older than those the client knows about.
- Indefinite freeze attacks: An attacker continues to present a software update system with the same files the client has already seen. The result is that the client does not know that new files are available.
- Endless data attacks: An attacker responds to a file download request with an endless stream of data, causing harm to clients (e.g., a disk partition filling up or memory exhaustion).
- Slow retrieval attacks: An attacker responds to clients with a very slow stream of data that essentially results in the client never continuing the update process.
- Extraneous dependencies attacks: An attacker indicates to clients that in order to install the software they wanted, they also need to install unrelated software. This unrelated software can be from a trusted source but may have known vulnerabilities that are exploitable by the attacker.
- Mix-and-match attacks: An attacker presents clients with a view of a repository that includes files that never existed together on the repository at the same time. This can result in, for example, outdated versions of dependencies being installed.
- Wrong software installation: An attacker provides a client with a trusted file that is not the one the client wanted.
- Malicious mirrors preventing updates: An attacker in control of one repository mirror is able to prevent users from obtaining updates from other, good mirrors.
- Vulnerability to key compromises: An attacker who is able to compromise a single key or less than a given threshold of keys can compromise clients. This includes relying on a single online key (such as only being protected by SSL) or a single offline key (such as most software update systems use to sign files).
Appendix B: Extension to the Minimum Security Model
The maximum security model and end-to-end signing have been intentionally excluded from this PEP. Although both improve PyPI's ability to survive a repository compromise and allow developers to sign their distributions, they have been postponed for review as a potential future extension to PEP 458. PEP 480 [26], which discusses the extension in detail, is available for review to those developers interested in the end-to-end signing option. The maximum security model and end-to-end signing are briefly covered in subsections that follow.
There are several reasons for not initially supporting the features discussed in this section:
A build farm (distribution wheels on supported platforms are generated for each project on PyPI infrastructure) may possibly complicate matters. PyPI wants to support a build farm in the future. Unfortunately, if wheels are auto-generated externally, developer signatures for these wheels are unlikely. However, there might still be a benefit to generating wheels from source distributions that are signed by developers (provided that reproducible wheels are possible). Another possibility is to optionally delegate trust of these wheels to an online role.
An easy-to-use key management solution is needed for developers. miniLock [40] is one likely candidate for management and generation of keys. Although developer signatures can remain optional, this approach may be inadequate due to the great number of potentially unsigned dependencies each distribution may have. If any one of these dependencies is unsigned, it negates any benefit the project gains from signing its own distribution (i.e., attackers would only need to compromise one of the unsigned dependencies to attack end-users). Requiring developers to manually sign distributions and manage keys is expected to render key signing an unused feature.
A two-phase approach, where the minimum security model is implemented first followed by the maximum security model, can simplify matters and give PyPI administrators time to review the feasibility of end-to-end signing.
Maximum Security Model
The maximum security model relies on developers signing their projects and uploading signed metadata to PyPI. If the PyPI infrastructure were to be compromised, attackers would be unable to serve malicious versions of claimed projects without access to the project's developer key. Figure 3 depicts the changes made to figure 2, namely that developer roles are now supported and that three new delegated roles exist: claimed, recently-claimed, and unclaimed. The bins role has been renamed unclaimed and can contain any projects that have not been added to claimed. The strength of this model (over the minimum security model) is in the offline keys provided by developers. Although the minimum security model supports continuous delivery, all of the projects are signed by an online key. An attacker can corrupt packages in the minimum security model, but not in the maximum model without also compromising a developer's key.
Figure 3: An overview of the metadata layout in the maximum security model. The maximum security model supports continuous delivery and survivable key compromise.
End-to-End Signing
End-to-End signing allows both PyPI and developers to sign for the metadata downloaded by clients. PyPI is trusted to make uploaded projects available to clients (they sign the metadata for this part of the process), and developers can sign the distributions that they upload.
PEP 480 [26] discusses the tools available to developers who sign the distributions that they upload to PyPI. To summarize PEP 480, developers generate cryptographic keys and sign metadata in some automated fashion, where the metadata includes the information required to verify the authenticity of the distribution. The metadata is then uploaded to PyPI by the client, where it will be available for download by package managers such as pip (i.e., package managers that support TUF metadata). The entire process is transparent to clients (using a package manager that supports TUF) who download distributions from PyPI.
Appendix C: PEP 470 and Projects Hosted Externally
How should TUF handle distributions that are not hosted on PyPI? According to PEP 470 [41], projects may opt to host their distributions externally and are only required to provide PyPI a link to its external index, which package managers like pip can use to find the project's distributions. PEP 470 does not mention whether externally hosted projects are considered unverified by default, as projects that use this option are not required to submit any information about their distributions (e.g., file size and cryptographic hash) when the project is registered, nor include a cryptographic hash of the file in download links.
Potential approaches that PyPI administrators MAY consider for handling externally hosted projects:
- Download external distributions but do not verify them. The targets metadata will not include information for externally hosted projects.
- PyPI will periodically download information from the external index. PyPI will gather the external distribution's file size and hashes and generate appropriate TUF metadata.
- External projects MUST submit to PyPI the file size and cryptographic hash for a distribution.
- External projects MUST upload to PyPI a developer public key for the index. The distribution MUST create TUF metadata that is stored at the index, and signed with the developer's corresponding private key. The client will fetch the external TUF metadata as part of the package update process.
- External projects MUST upload to PyPI signed TUF metadata (as allowed by the maximum security model) about the distributions that they host externally, and a developer public key. Package managers verify distributions by consulting the signed metadata uploaded to PyPI.
Only one of the options listed above should be implemented on PyPI. Option (4) or (5) is RECOMMENDED because external distributions are signed by developers. External distributions that are forged (due to a compromised PyPI account or external host) may be detected if external developers are required to sign metadata, although this requirement is likely only practical if an easy-to-use key management solution and developer scripts are provided by PyPI.
References
| [1] | https://pypi.python.org |
| [2] | (1, 2, 3) https://isis.poly.edu/~jcappos/papers/samuel_tuf_ccs_2010.pdf |
| [3] | http://www.pip-installer.org |
| [4] | https://wiki.python.org/moin/WikiAttack2013 |
| [5] | https://github.com/theupdateframework/pip/wiki/Attacks-on-software-repositories |
| [6] | https://mail.python.org/pipermail/distutils-sig/2013-April/020596.html |
| [7] | https://mail.python.org/pipermail/distutils-sig/2013-May/020701.html |
| [8] | https://mail.python.org/pipermail/distutils-sig/2013-July/022008.html |
| [9] | PEP 381, Mirroring infrastructure for PyPI, Ziadé, Löwis http://www.python.org/dev/peps/pep-0381/ |
| [10] | https://mail.python.org/pipermail/distutils-sig/2013-September/022773.html |
| [11] | https://mail.python.org/pipermail/distutils-sig/2013-May/020848.html |
| [12] | PEP 449, Removal of the PyPI Mirror Auto Discovery and Naming Scheme, Stufft http://www.python.org/dev/peps/pep-0449/ |
| [13] | https://isis.poly.edu/~jcappos/papers/cappos_mirror_ccs_08.pdf |
| [14] | (1, 2) https://mail.python.org/pipermail/distutils-sig/2013-September/022755.html |
| [15] | https://pypi.python.org/security |
| [16] | (1, 2) https://github.com/theupdateframework/tuf/blob/develop/docs/tuf-spec.txt |
| [17] | (1, 2, 3, 4, 5) PEP 426, Metadata for Python Software Packages 2.0, Coghlan, Holth, Stufft http://www.python.org/dev/peps/pep-0426/ |
| [18] | https://en.wikipedia.org/wiki/Continuous_delivery |
| [19] | https://mail.python.org/pipermail/distutils-sig/2013-August/022154.html |
| [20] | https://en.wikipedia.org/wiki/RSA_%28algorithm%29 |
| [21] | https://en.wikipedia.org/wiki/Key-recovery_attack |
| [22] | http://csrc.nist.gov/publications/nistpubs/800-57/SP800-57-Part1.pdf |
| [23] | https://www.openssl.org/ |
| [24] | https://pypi.python.org/pypi/pycrypto |
| [25] | http://ed25519.cr.yp.to/ |
| [26] | (1, 2, 3) https://www.python.org/dev/peps/pep-0480/ |
| [27] | https://github.com/theupdateframework/tuf/tree/develop/tuf/client#updaterpy |
| [28] | (1, 2) https://mail.python.org/pipermail/distutils-sig/2013-September/022755.html |
| [29] | http://www.ietf.org/rfc/rfc2119.txt |
| [30] | https://github.com/theupdateframework/tuf/blob/develop/METADATA.md |
| [31] | https://github.com/theupdateframework/tuf/tree/develop/tuf#repository-management |
| [32] | https://github.com/theupdateframework/tuf/tree/develop/tuf/client#overview-of-the-update-process |
| [33] | https://github.com/theupdateframework/tuf/issues/39 |
| [34] | https://en.wikipedia.org/wiki/Cryptographic_hash_function |
| [35] | https://en.wikipedia.org/wiki/Collision_(computer_science) |
| [36] | http://docs.python.org/2/library/hashlib.html#hashlib.hash.hexdigest |
| [37] | https://en.wikipedia.org/wiki/SHA-2 |
| [38] | https://en.wikipedia.org/wiki/Transaction_log |
| [39] | https://en.wikipedia.org/wiki/Out-of-band#Authentication |
| [40] | https://minilock.io/ |
| [41] | http://www.python.org/dev/peps/pep-0470/ |
Acknowledgements
This material is based upon work supported by the National Science Foundation under Grants No. CNS-1345049 and CNS-0959138. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
We thank Nick Coghlan, Daniel Holth and the distutils-sig community in general for helping us to think about how to usably and efficiently integrate TUF with PyPI.
Roger Dingledine, Sebastian Hahn, Nick Mathewson, Martin Peck and Justin Samuel helped us to design TUF from its predecessor Thandy of the Tor project.
We appreciate the efforts of Konstantin Andrianov, Geremy Condra, Zane Fisher, Justin Samuel, Tian Tian, Santiago Torres, John Ward, and Yuyu Zheng to develop TUF.
Vladimir Diaz, Monzur Muhammad and Sai Teja Peddinti helped us to review this PEP.
Zane Fisher helped us to review and transcribe this PEP.
Copyright
This document has been placed in the public domain.
pep-0459 Standard Metadata Extensions for Python Software Packages
| PEP: | 459 |
|---|---|
| Title: | Standard Metadata Extensions for Python Software Packages |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Nick Coghlan <ncoghlan at gmail.com> |
| BDFL-Delegate: | Nick Coghlan <ncoghlan@gmail.com> |
| Discussions-To: | Distutils SIG <distutils-sig at python.org> |
| Status: | Draft |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Requires: | 426 |
| Created: | 11 Nov 2013 |
| Post-History: | 21 Dec 2013 |
Contents
Abstract
This PEP describes several standard extensions to the Python metadata.
Like all metadata extensions, each standard extension format is independently versioned. Changing any of the formats requires an update to this PEP, but does not require an update to the core packaging metadata.
Note
These extensions may eventually be separated out into their own PEPs, but we're already suffering from PEP overload in the packaging metadata space.
This PEP was initially created by slicing out large sections of earlier drafts of PEP 426 and making them extensions, so some of the specifics may still be rough in the new context.
Standard Extension Namespace
The python project on the Python Package Index refers to the CPython reference interpreter. This namespace is used as the namespace for the standard metadata extensions.
The currently defined standard extensions are:
- python.details
- python.project
- python.integrator
- python.exports
- python.commands
- python.constraints
All standard extensions are currently at version 1.0, and thus the extension_metadata field may be omitted without losing access to any functionality.
The python.details extension
The python.details extension allows for more information to be provided regarding the software distribution.
The python.details extension contains four custom subfields:
- license: the copyright license for the distribution
- keywords: package index keywords for the distribution
- classifiers: package index Trove classifiers for the distribution
- document_names: the names of additional metadata files
All of these fields are optional. Automated tools MUST operate correctly if a distribution does not provide them, including failing cleanly when an operation depending on one of these fields is requested.
License
A short string summarising the license used for this distribution.
Note that distributions that provide this field should still specify any applicable license Trove classifiers in the Classifiers field. Even when an appropriate Trove classifier is available, the license summary can be a good way to specify a particular version of that license, or to indicate any variations or exceptions to the license.
This field SHOULD contain fewer than 512 characters and MUST contain fewer than 2048.
This field SHOULD NOT contain any line breaks.
The full license text SHOULD be included as a separate file in the source archive for the distribution. See Document names for details.
Example:
"license": "GPL version 3, excluding DRM provisions"
Keywords
A list of additional keywords to be used to assist searching for the distribution in a larger catalog.
Example:
"keywords": ["comfy", "chair", "cushions", "too silly", "monty python"]
Classifiers
A list of strings, with each giving a single classification value for the distribution. Classifiers are described in PEP 301 [2].
Example:
"classifiers": [ "Development Status :: 4 - Beta", "Environment :: Console (Text Based)", "License :: OSI Approved :: GNU General Public License v3 (GPLv3)" ]
Document names
Filenames for supporting documents included in the distribution's dist-info metadata directory.
The following supporting documents can be named:
- description: a file containing a long description of the distribution
- license: a file with the full text of the distribution's license
- changelog: a file describing changes made to the distribution
Supporting documents MUST be included directly in the dist-info directory. Directory separators are NOT permitted in document names.
The markup format (if any) for the file is indicated by the file extension. This allows index servers and other automated tools to render included text documents correctly and provide feedback on rendering errors, rather than having to guess the intended format.
If the filename has no extension, or the extension is not recognised, the default rendering format MUST be plain text.
The following markup renderers SHOULD be used for the specified file extensions:
- Plain text: .txt, no extension, unknown extension
- reStructured Text: .rst
- Markdown: .md
- AsciiDoc: .adoc, .asc, .asciidoc
- HTML: .html, .htm
Automated tools MAY render one or more of the specified formats as plain text and MAY render other markup formats beyond those listed.
Automated tools SHOULD NOT make any assumptions regarding the maximum length of supporting document content, except as necessary to protect the integrity of a service.
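The renderer selection rule can be sketched as follows. The renderer identifier strings are illustrative assumptions; only the extension-to-format mapping comes from the list above.

```python
import os

# File extensions listed above, mapped to illustrative renderer names.
RENDERERS = {
    ".txt": "plain",
    ".rst": "rest",
    ".md": "markdown",
    ".adoc": "asciidoc", ".asc": "asciidoc", ".asciidoc": "asciidoc",
    ".html": "html", ".htm": "html",
}

def renderer_for(document_name):
    """Select a renderer by file extension. Missing or unrecognised
    extensions fall back to plain text, as the specification requires."""
    ext = os.path.splitext(document_name)[1].lower()
    return RENDERERS.get(ext, "plain")
```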
Example:
"document_names": {
"description": "README.rst",
"license": "LICENSE.rst",
"changelog": "NEWS"
}
The python.project extension
The python.project extension allows for more information to be provided regarding the creation and maintenance of the distribution.
The python.project extension contains three custom subfields:
- contacts: key contact points for the distribution
- contributors: other contributors to the distribution
- project_urls: relevant URLs for the distribution
Contact information
Details on individuals and organisations are recorded as mappings with the following subfields:
- name: the name of an individual or group
- email: an email address (this may be a mailing list)
- url: a URL (such as a profile page on a source code hosting service)
- role: one of "author", "maintainer" or "contributor"
The name subfield is required; the other subfields are optional.
If no specific role is stated, the default is contributor.
Email addresses must be in the form local-part@domain where the local-part may be up to 64 characters long and the entire email address contains no more than 254 characters. The formal specification of the format is in RFC 5322 (sections 3.2.3 and 3.4.1) and RFC 5321, with a more readable form given in the informational RFC 3696 and the associated errata.
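The length limits above can be checked structurally, as in the following sketch. This is not full RFC 5322 validation, and the function name is illustrative.

```python
def check_email_lengths(address):
    """Structural checks from the paragraph above: the address must
    contain an '@', the local-part may be up to 64 characters, and the
    whole address no more than 254 characters."""
    if len(address) > 254:
        return False
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        return False
    return len(local) <= 64
```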
The defined contributor roles are as follows:
- author: the original creator of a distribution
- maintainer: the current lead contributor for a distribution, when they are not the original creator
- contributor: any other individuals or organizations involved in the creation of the distribution
Contact and contributor metadata is optional. Automated tools MUST operate correctly if a distribution does not provide it, including failing cleanly when an operation depending on one of these fields is requested.
Contacts
A list of contributor entries giving the recommended contact points for getting more information about the project.
The example below would be suitable for a project that was in the process of handing over from the original author to a new lead maintainer, while operating as part of a larger development group.
Example:
"contacts": [
{
"name": "Python Packaging Authority/Distutils-SIG",
"email": "distutils-sig@python.org",
"url": "https://bitbucket.org/pypa/"
},
{
"name": "Samantha C.",
"role": "maintainer",
"email": "dontblameme@example.org"
},
{
"name": "Charlotte C.",
"role": "author",
"email": "iambecomingasketchcomedian@example.com"
}
]
Contributors
A list of contributor entries for other contributors not already listed as current project points of contact. The subfields within the list elements are the same as those for the main contact field.
Example:
"contributors": [
{"name": "John C."},
{"name": "Erik I."},
{"name": "Terry G."},
{"name": "Mike P."},
{"name": "Graeme C."},
{"name": "Terry J."}
]
Project URLs
A mapping of arbitrary text labels to additional URLs relevant to the project.
While projects are free to choose their own labels and specific URLs, it is RECOMMENDED that home page, source control, issue tracker and documentation links be provided using the labels in the example below.
URL labels MUST be treated as case insensitive by automated tools, but they are not required to be valid Python identifiers. Any legal JSON string is permitted as a URL label.
Example:
"project_urls": {
"Documentation": "https://distlib.readthedocs.org",
"Home": "https://bitbucket.org/pypa/distlib",
"Repository": "https://bitbucket.org/pypa/distlib/src",
"Tracker": "https://bitbucket.org/pypa/distlib/issues"
}
The python.integrator extension
Structurally, this extension is largely identical to the python.project extension (the extension name is the only difference).
However, where the project metadata refers to the upstream creators of the software, the integrator metadata refers to the downstream redistributor of a modified version.
If the software is being redistributed unmodified, then typically this extension will not be used. However, if the software has been patched (for example, backporting compatible fixes from a later version, or addressing a platform compatibility issue), then this extension SHOULD be used, and a local version label added to the distribution's version identifier.
If there are multiple redistributors in the chain, each one just overwrites this extension with their particular metadata.
The python.exports extension
Most Python distributions expose packages and modules for import through the Python module namespace. Distributions may also expose other interfaces when installed.
The python.exports extension contains three custom subfields:
- modules: modules exported by the distribution
- namespaces: namespace packages that the distribution contributes to
- exports: other Python interfaces exported by the distribution
Export specifiers
An export specifier is a string consisting of a fully qualified name, as well as an optional extra name enclosed in square brackets. This gives the following four possible forms for an export specifier:
module
module:name
module[requires_extra]
module:name[requires_extra]
Note
The jsonschema file currently restricts qualified names using the Python 2 ASCII identifier rules. This may need to be reconsidered given the more relaxed identifier rules in Python 3.
The meaning of the subfields is as follows:
- module: the module providing the export
- name: if applicable, the qualified name of the export within the module
- requires_extra: indicates the export will only work correctly if the additional dependencies named in the given extra are available in the installed environment
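The four specifier forms above can be parsed mechanically. The following is a minimal sketch of such a parser; the function name and the regular expression are illustrative, not part of the specification:

```python
import re

# Matches the four export specifier forms:
#   module, module:name, module[requires_extra], module:name[requires_extra]
_EXPORT_RE = re.compile(
    r"^(?P<module>[\w.]+)"            # dotted module path
    r"(?::(?P<name>[\w.]+))?"         # optional qualified name within the module
    r"(?:\[(?P<extra>[^\]]+)\])?$"    # optional requires_extra
)

def parse_export_specifier(spec):
    """Return (module, name, requires_extra); name/extra may be None."""
    m = _EXPORT_RE.match(spec)
    if m is None:
        raise ValueError("invalid export specifier: %r" % spec)
    return m.group("module"), m.group("name"), m.group("extra")
```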
Note
I tried this as a mapping with subfields, and it made the examples below unreadable. While this PEP is mostly for tool use, readability still matters to some degree for debugging purposes, and because I expect snippets of the format to be reused elsewhere.
Modules
A list of qualified names of modules and packages that the distribution provides for import.
Note
The jsonschema file currently restricts qualified names using the Python 2 ASCII identifier rules. This may need to be reconsidered given the more relaxed identifier rules in Python 3.
For names that contain dots, the portion of the name before the final dot MUST appear either in the installed module list or in the namespace package list.
To help avoid name conflicts, it is RECOMMENDED that distributions provide a single top level module or package that matches the distribution name (or a lower case equivalent). This requires that the distribution name also meet the requirements of a Python identifier (which are stricter than those for distribution names). This practice will also make it easier to find authoritative sources for modules.
Index servers SHOULD allow multiple distributions to publish the same modules, but MAY notify distribution authors of potential conflicts.
Installation tools SHOULD report an error when asked to install a distribution that provides a module that is also provided by a different, previously installed, distribution.
Note that attempting to import some declared modules may result in an exception if the appropriate extras are not installed.
Example:
"modules": ["chair", "chair.cushions", "python_sketches.nobody_expects"]
Note
Making this a list of export specifiers instead would allow a distribution to declare when a particular module requires a particular extra in order to run correctly. On the other hand, there's an argument to be made that that is the point where it starts to become worthwhile to split out a separate distribution rather than using extras.
Namespaces
A list of qualified names of namespace packages that the distribution contributes modules to.
Note
The jsonschema file currently restricts qualified names using the Python 2 ASCII identifier rules. This may need to be reconsidered given the more relaxed identifier rules in Python 3.
On versions of Python prior to Python 3.3 (which provides native namespace package support), installation tools SHOULD emit a suitable __init__.py file to properly initialise the namespace rather than using a distribution provided file.
Installation tools SHOULD emit a warning and MAY emit an error if a distribution declares a namespace package that conflicts with the name of an already installed module or vice-versa.
Example:
"namespaces": ["python_sketches"]
Exports
The exports field is a mapping containing prefixed names as keys. Each key identifies an export group containing one or more exports published by the distribution.
Export group names are defined by distributions that will then make use of the published export information in some way. The primary use case is for distributions that support a plugin model: defining an export group allows other distributions to indicate which plugins they provide, how they can be imported and accessed, and which additional dependencies (if any) are needed for the plugin to work correctly.
To reduce the chance of name conflicts, export group names SHOULD use a prefix that corresponds to a module name in the distribution that defines the meaning of the export group. This practice will also make it easier to find authoritative documentation for export groups.
Each individual export group is then a mapping of arbitrary non-empty string keys to export specifiers. The meaning of export names within an export group is up to the distribution that defines the export group. Creating an appropriate definition for the export name format can allow the importing distribution to determine whether or not an export is relevant without needing to import every exporting module.
Example:
"exports": {
"nose.plugins.0.10": {
"chairtest": "chair:NosePlugin"
}
}
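A consumer of an export group ultimately has to resolve a specifier to a live object. One plausible shape for that resolution step (ignoring requires_extra handling for brevity; the function name is illustrative) is:

```python
import importlib

def load_export(specifier):
    """Resolve an export specifier of the form module[:name] to an object.

    The module part is imported; the optional name part is then looked up
    attribute by attribute within it.
    """
    module_part, _, name = specifier.partition(":")
    obj = importlib.import_module(module_part)
    for attr in (name.split(".") if name else []):
        obj = getattr(obj, attr)
    return obj
```

For the example above, a nose-like plugin host would iterate its export group and call something like load_export("chair:NosePlugin") for each entry.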
The python.commands extension
The python.commands extension contains three custom subfields:
- wrap_console: console wrapper scripts to be generated by the installer
- wrap_gui: GUI wrapper scripts to be generated by the installer
- prebuilt: scripts created by the distribution's build process and installed directly to the configured scripts directory
wrap_console and wrap_gui are both mappings of script names to export specifiers. The script names must follow the same naming rules as distribution names.
The export specifiers for wrapper scripts must refer to either a package with a __main__ submodule (if no name subfield is given in the export specifier) or else to a callable inside the named module.
Installation tools should generate appropriate wrappers as part of the installation process.
Note
Still needs more detail on what "appropriate wrappers" means. For now, refer to what setuptools and zc.buildout generate as wrapper scripts.
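In the absence of a precise definition, one plausible shape for a generated console wrapper is a small script that imports the target and calls it. This sketch generates such a script's source from an export specifier; the fallback convention for a bare module specifier (importing main from its __main__ submodule) is an assumption, not something the PEP mandates:

```python
def make_console_wrapper(export_specifier):
    """Sketch: generate source text for a console wrapper script."""
    module, _, name = export_specifier.partition(":")
    if name:
        target = "from %s import %s as main" % (module, name)
    else:
        # Assumed convention: a package export runs its __main__ submodule
        target = "from %s.__main__ import main" % module
    return "\n".join([
        "#!/usr/bin/env python",
        "import sys",
        target,
        "if __name__ == '__main__':",
        "    sys.exit(main())",
        "",
    ])
```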
prebuilt is a list of script paths, relative to the scripts directory in a wheel file or following installation. They are provided for informational purposes only - installing them is handled through the normal processes for files created when building a distribution.
Build tools SHOULD mark this extension as requiring handling by installers.
Index servers SHOULD allow multiple distributions to publish the same commands, but MAY notify distribution authors of potential conflicts.
Installation tools SHOULD report an error when asked to install a distribution that provides a command that is also provided by a different, previously installed, distribution.
Example:
"python.commands": {
"installer_must_handle": true,
"wrap_console": [{"chair": "chair:run_cli"}],
"wrap_gui": [{"chair-gui": "chair:run_gui"}],
"prebuilt": ["reduniforms"]
}
The python.constraints extension
The python.constraints extension contains two custom subfields:
- environments: supported installation environments
- extension_metadata: required exact matches in extension metadata fields published by other installed components
Build tools SHOULD mark this extension as requiring handling by installers.
Index servers SHOULD allow distributions to be uploaded with constraints that cannot be satisfied using that index, but MAY notify distribution authors of any such potential compatibility issues.
Installation tools SHOULD report an error if constraints are specified by the distribution and the target installation environment fails to satisfy them, MUST at least emit a warning, and MAY allow the user to force the installation to proceed regardless.
Example:
"python.constraints": {
"installer_must_handle": true,
"environments": ["python_version >= 2.6"],
"extension_metadata": {
"fortranlib": {
"fortranlib.compatibility": {
"fortran_abi": "openblas-g77"
}
}
}
}
Supported Environments
The environments subfield is a list of strings specifying the environments that the distribution explicitly supports. An environment is considered supported if it matches at least one of the environment markers given.
If this field is not given in the metadata, it is assumed that the distribution supports any platform supported by Python.
Individual entries are environment markers, as described in PEP 426.
The two main uses of this field are to declare which versions of Python and which underlying operating systems are supported.
Examples indicating supported Python versions:
# Supports Python 2.6+
"environments": ["python_version >= '2.6'"]
# Supports Python 2.6+ (for 2.x) or 3.3+ (for 3.x)
"environments": ["python_version >= '3.3'",
"'3.0' > python_version >= '2.6'"]
Examples indicating supported operating systems:
# Windows only
"environments": ["sys_platform == 'win32'"]
# Anything except Windows
"environments": ["sys_platform != 'win32'"]
# Linux or BSD only
"environments": ["'linux' in sys_platform",
"'bsd' in sys_platform"]
Example where the supported Python version varies by platform:
# The standard library's os module has long supported atomic renaming
# on POSIX systems, but only gained atomic renaming on Windows in Python
# 3.3. A distribution that needs atomic renaming support for reliable
# operation might declare the following supported environments.
"environment": ["python_version >= '2.6' and sys_platform != 'win32'",
"python_version >= '3.3' and sys_platform == 'win32'"]
Extension metadata constraints
The extension_metadata subfield is a mapping from distribution names to extension metadata snippets that are expected to exactly match the metadata of the named distribution in the target installation environment.
Each submapping then consists of a mapping from metadata extension names to the exact expected values of a subset of fields.
For example, a distribution called fortranlib may publish a different FORTRAN ABI depending on how it is built, and any related projects that are installed into the same runtime environment should use matching build options. This can be handled by having the base distribution publish a custom extension that indicates the build option that was used to create the binary extensions:
"extensions": {
"fortranlib.compatibility": {
"fortran_abi": "openblas-g77"
}
}
Other distributions that contain binary extensions that need to be compatible with the base distribution would then define a suitable constraint in their own metadata:
"python.constraints": {
"installer_must_handle": true,
"extension_metadata": {
"fortranlib": {
"fortranlib.compatibility": {
"fortran_abi": "openblas-g77"
}
}
}
}
This constraint specifies that:
- fortranlib must be installed (this should also be expressed as a normal dependency so that installers ensure it is satisfied)
- The installed version of fortranlib must include the custom fortranlib.compatibility extension in its published metadata
- The fortran_abi subfield of that extension must have the exact value openblas-g77.
If all of these conditions are met (the distribution is installed, the specified extension is included in the metadata, the specified subfields have the exact specified value), then the constraint is considered to be satisfied.
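The three conditions reduce to a nested exact-match check. A minimal sketch (hypothetical function name; installed metadata modelled as a mapping from distribution name to its parsed metadata) is:

```python
def constraint_satisfied(extension_metadata, installed):
    """Check an extension_metadata constraint against installed metadata.

    Every listed subfield of every listed extension must be present, with
    exactly the given value, in the named distribution's metadata.
    """
    for dist, extensions in extension_metadata.items():
        dist_meta = installed.get(dist)
        if dist_meta is None:
            return False  # distribution not installed
        published = dist_meta.get("extensions", {})
        for ext_name, fields in extensions.items():
            ext = published.get(ext_name)
            if ext is None:
                return False  # required extension not published
            for field, expected in fields.items():
                if ext.get(field) != expected:
                    return False  # subfield missing or value mismatch
    return True
```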
Note
The primary intended use case here is allowing C extensions with additional ABI compatibility requirements to declare those in a way that any installation tool can enforce without needing to understand the details. In particular, many NumPy based scientific libraries need to be built using a consistent set of FORTRAN libraries, hence the "fortranlib" example.
This is the reason there's no support for pattern matching or boolean logic: even the "simple" version of this extension is relatively complex, and there's currently no compelling rationale for making it more complicated than it already is.
Copyright
This document has been placed in the public domain.
pep-0460 Add binary interpolation and formatting
| PEP: | 460 |
|---|---|
| Title: | Add binary interpolation and formatting |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Antoine Pitrou <solipsis at pitrou.net> |
| Status: | Withdrawn |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 6-Jan-2014 |
| Python-Version: | 3.5 |
Contents
Abstract
This PEP proposes to add minimal formatting operations to bytes and bytearray objects. The proposed additions are:
- bytes % ... and bytearray % ... for percent-formatting, similar in syntax to percent-formatting on str objects (accepting a single object, a tuple or a dict).
- bytes.format(...) and bytearray.format(...) for a formatting similar in syntax to str.format() (accepting positional as well as keyword arguments).
- bytes.format_map(...) and bytearray.format_map(...) for an API similar to str.format_map(...), with the same formatting syntax and semantics as bytes.format() and bytearray.format().
Rationale
In Python 2, str % args and str.format(args) allow the formatting and interpolation of bytestrings. This feature has commonly been used for the assembling of protocol messages when protocols are known to use a fixed encoding.
Python 3 generally mandates that text be stored and manipulated as unicode (i.e. str objects, not bytes). In some cases, though, it makes sense to manipulate bytes objects directly. Typical usage is binary network protocols, where one may want to interpolate and assemble several bytes objects (some of them literals, some of them computed) to produce complete protocol messages. For example, protocols such as HTTP or SIP have headers with ASCII names and opaque "textual" values using a varying and/or sometimes ill-defined encoding. Moreover, those headers can be followed by a binary body... which can be chunked and decorated with ASCII headers and trailers!
While there are reasonably efficient ways to accumulate binary data (such as using a bytearray object, the bytes.join method or even io.BytesIO), none of them leads to the kind of readable and intuitive code that is produced by a %-formatted or {}-formatted template and a formatting operation.
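For illustration, the accumulation approaches mentioned above look something like the following (the request-line values are made up for the example):

```python
method = b"GET"
path = b"/index.html"

# bytes.join on a list of fragments:
request_line = b" ".join([method, path, b"HTTP/1.1"]) + b"\r\n"

# or accumulation in a mutable bytearray:
buf = bytearray()
buf += method
buf += b" " + path + b" HTTP/1.1\r\n"
```

Both work, but neither reads as directly as a single formatted template would.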
Binary formatting features
Supported features
In this proposal, percent-formatting for bytes and bytearray supports the following features:
- Looking up formatting arguments by position as well as by name (i.e., %s as well as %(name)s).
- %s will try to get a Py_buffer on the given value, and fallback on calling __bytes__. The resulting binary data is inserted at the given point in the string. This is expected to work with bytes, bytearray and memoryview objects (as well as a couple others such as pathlib's path objects).
- %c will accept an integer between 0 and 255, and insert a byte of the given value.
Braces-formatting for bytes and bytearray supports the following features:
- All the kinds of argument lookup supported by str.format() (explicit positional lookup, auto-incremented positional lookup, keyword lookup, attribute lookup, etc.)
- Insertion of binary data when no modifier or layout is specified (e.g. {}, {0}, {name}). This has the same semantics as %s for percent-formatting (see above).
- The c modifier will accept an integer between 0 and 255, and insert a byte of the given value (same as %c above).
Unsupported features
All other features present in formatting of str objects (either through the percent operator or the str.format() method) are unsupported. Those features imply treating the recipient of the operator or method as text, which runs counter to the text / bytes separation (for example, accepting %d as a format code would imply that the bytes object really is an ASCII-compatible text string).
Amongst those unsupported features are not only most type-specific format codes, but also the various layout specifiers such as padding or alignment. Besides, str objects are not acceptable as arguments to the formatting operations, even when using e.g. the %s format code.
__format__ isn't called.
Criticisms
- The development cost and maintenance cost.
- In 3.3 encoding to ASCII or latin-1 is as fast as memcpy (but it still creates a separate object).
- Developers will have to work around the lack of binary formatting anyway, if they want to support Python 3.4 and earlier.
- bytes.join() is consistently faster than format to join bytes strings (XXX is it?).
- Formatting functions could be implemented in a third party module, rather than added to builtin types.
Other proposals
A new type datatype
It was proposed to create a new datatype specialized for "network programming". The authors of this PEP believe this is counter-productive. Python 3 already has several major types dedicated to manipulation of binary data: bytes, bytearray, memoryview, io.BytesIO.
Adding yet another type would make things more confusing for users, and interoperability between libraries more painful (also potentially sub-optimal, due to the necessary conversions).
Moreover, not one type would be needed, but two: one immutable type (to allow for hashing), and one mutable type (as efficient accumulation is often necessary when working with network messages).
Resolution
This PEP is made obsolete by the acceptance of PEP 461, which introduces a more extended formatting language for bytes objects in conjunction with the modulo operator.
Copyright
This document has been placed in the public domain.
pep-0461 Adding % formatting to bytes and bytearray
| PEP: | 461 |
|---|---|
| Title: | Adding % formatting to bytes and bytearray |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Ethan Furman <ethan at stoneleaf.us> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 2014-01-13 |
| Python-Version: | 3.5 |
| Post-History: | 2014-01-14, 2014-01-15, 2014-01-17, 2014-02-22, 2014-03-25, 2014-03-27 |
| Resolution: | http://mail.python.org/pipermail/python-dev/2014-March/133621.html |
Contents
Abstract
This PEP proposes adding % formatting operations similar to Python 2's str type to bytes and bytearray [1] [2].
Rationale
While interpolation is usually thought of as a string operation, there are cases where interpolation on bytes or bytearrays make sense, and the work needed to make up for this missing functionality detracts from the overall readability of the code.
Motivation
With Python 3 and the split between str and bytes, one small but important area of programming became slightly more difficult, and much more painful -- wire format protocols [3].
This area of programming is characterized by a mixture of binary data and ASCII compatible segments of text (aka ASCII-encoded text). Bringing back a restricted %-interpolation for bytes and bytearray will aid both in writing new wire format code, and in porting Python 2 wire format code.
Common use-cases include dbf and pdf file formats, email formats, and FTP and HTTP communications, among many others.
Proposed semantics for bytes and bytearray formatting
%-interpolation
All the numeric formatting codes (d, i, o, u, x, X, e, E, f, F, g, G, and any that are subsequently added to Python 3) will be supported, and will work as they do for str, including the padding, justification and other related modifiers (currently #, 0, -, ' ' (space), and + (plus any added to Python 3)). The only non-numeric codes allowed are c, b, a, and s (which is a synonym for b).
For the numeric codes, the only difference between str and bytes (or bytearray) interpolation is that the results from these codes will be ASCII-encoded text, not unicode. In other words, for any numeric formatting code %x:
b"%x" % val
is equivalent to:
("%x" % val).encode("ascii")
Examples:
>>> b'%4x' % 10
b'   a'
>>> b'%#4x' % 10
b' 0xa'
>>> b'%04X' % 10
b'000A'
%c will insert a single byte, either from an int in range(256), or from a bytes argument of length 1, not from a str.
Examples:
>>> b'%c' % 48
b'0'
>>> b'%c' % b'a'
b'a'
%b will insert a series of bytes. These bytes are collected in one of two ways:
- input type supports Py_buffer [4]? use it to collect the necessary bytes
- input type is something else? use its __bytes__ method [5]; if there isn't one, raise a TypeError
In particular, %b will not accept numbers nor str. str is rejected as the string to bytes conversion requires an encoding, and we are refusing to guess; numbers are rejected because:
- what makes a number is fuzzy (float? Decimal? Fraction? some user type?)
- allowing numbers would lead to ambiguity between numbers and textual representations of numbers (3.14 vs '3.14')
- given the nature of wire formats, explicit is definitely better than implicit
%s is included as a synonym for %b for the sole purpose of making 2/3 code bases easier to maintain. Python 3 only code should use %b.
Examples:
>>> b'%b' % b'abc'
b'abc'
>>> b'%b' % 'some string'.encode('utf8')
b'some string'
>>> b'%b' % 3.14
Traceback (most recent call last):
...
TypeError: b'%b' does not accept 'float'
>>> b'%b' % 'hello world!'
Traceback (most recent call last):
...
TypeError: b'%b' does not accept 'str'
%a will give the equivalent of repr(some_obj).encode('ascii', 'backslashreplace') on the interpolated value. Use cases include developing a new protocol and writing landmarks into the stream; debugging data going into an existing protocol to see if the problem is the protocol itself or bad data; a fall-back for a serialization format; or any situation where defining __bytes__ would not be appropriate but a readable/informative representation is needed [6].
%r is included as a synonym for %a for the sole purpose of making 2/3 code bases easier to maintain. Python 3 only code should use %a [7].
Examples:
>>> b'%a' % 3.14
b'3.14'
>>> b'%a' % b'abc'
b"b'abc'"
>>> b'%a' % 'def'
b"'def'"
Compatibility with Python 2
As noted above, %s and %r are being included solely to help ease migration from, and/or have a single code base with, Python 2. This is important as there are modules both in the wild and behind closed doors that currently use the Python 2 str type as a bytes container, and hence are using %s as a bytes interpolator.
However, %b and %a should be used in new, Python 3 only code, so %s and %r will immediately be deprecated, but not removed from the 3.x series [7].
Proposed variations
It has been proposed to automatically use .encode('ascii','strict') for str arguments to %b.
- Rejected as this would lead to intermittent failures. Better to have the operation always fail so the trouble-spot can be correctly fixed.
It has been proposed to have %b return the ascii-encoded repr when the value is a str (b'%b' % 'abc' --> b"'abc'").
- Rejected as this would lead to hard to debug failures far from the problem site. Better to have the operation always fail so the trouble-spot can be easily fixed.
Originally this PEP also proposed adding format-style formatting, but it was decided that format and its related machinery were all strictly text (aka str) based, and it was dropped.
Various new special methods were proposed, such as __ascii__, __format_bytes__, etc.; such methods are not needed at this time, but can be visited again later if real-world use shows deficiencies with this solution.
A competing PEP, PEP 460 Add binary interpolation and formatting [8], also exists.
Objections
The objections raised against this PEP were mainly variations on two themes:
- the bytes and bytearray types are for pure binary data, with no assumptions about encodings
- offering %-interpolation that assumes an ASCII encoding will be an attractive nuisance and lead us back to the problems of the Python 2 str/unicode text model
As was seen during the discussion, bytes and bytearray are also used for mixed binary data and ASCII-compatible segments: file formats such as dbf and pdf, network protocols such as ftp and email, etc.
bytes and bytearray already have several methods which assume an ASCII compatible encoding. upper(), isalpha(), and expandtabs() to name just a few. %-interpolation, with its very restricted mini-language, will not be any more of a nuisance than the already existing methods.
Some have objected to allowing the full range of numeric formatting codes with the claim that decimal alone would be sufficient. However, at least two formats (dbf and pdf) make use of non-decimal numbers.
Footnotes
| [1] | http://docs.python.org/2/library/stdtypes.html#string-formatting |
| [2] | neither string.Template, format, nor str.format are under consideration |
| [3] | https://mail.python.org/pipermail/python-dev/2014-January/131518.html |
| [4] | http://docs.python.org/3/c-api/buffer.html examples: memoryview, array.array, bytearray, bytes |
| [5] | http://docs.python.org/3/reference/datamodel.html#object.__bytes__ |
| [6] | https://mail.python.org/pipermail/python-dev/2014-February/132750.html |
| [7] | (1, 2) http://bugs.python.org/issue23467 -- originally %r was not allowed, but was added for consistency during the 3.5 alpha stage. |
| [8] | http://python.org/dev/peps/pep-0460/ |
Copyright
This document has been placed in the public domain.
pep-0462 Core development workflow automation for CPython
| PEP: | 462 |
|---|---|
| Title: | Core development workflow automation for CPython |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Nick Coghlan <ncoghlan at gmail.com> |
| Status: | Deferred |
| Type: | Process |
| Content-Type: | text/x-rst |
| Requires: | 474 |
| Created: | 23-Jan-2014 |
| Post-History: | 25-Jan-2014, 27-Jan-2014, 01-Feb-2015 |
Contents
Abstract
This PEP proposes investing in automation of several of the tedious, time consuming activities that are currently required for the core development team to incorporate changes into CPython. This proposal is intended to allow core developers to make more effective use of the time they have available to contribute to CPython, which should also result in an improved experience for other contributors that are reliant on the core team to get their changes incorporated.
PEP Deferral
This PEP is currently deferred pending acceptance or rejection of the Kallithea-based forge.python.org proposal in PEP 474.
Rationale for changes to the core development workflow
The current core developer workflow to merge a new feature into CPython on a POSIX system "works" as follows:
- If applying a change submitted to bugs.python.org by another user, first check they have signed the PSF Contributor Licensing Agreement. If not, request that they sign one before continuing with merging the change.
- Apply the change locally to a current checkout of the main CPython repository (the change will typically have been discussed and reviewed as a patch on bugs.python.org first, but this step is not currently considered mandatory for changes originating directly from core developers).
- Run the test suite locally, at least make test or ./python -m test (depending on system specs, this takes a few minutes in the default configuration, but substantially longer if all optional resources, like external network access, are enabled).
- Run make patchcheck to fix any whitespace issues and as a reminder of other changes that may be needed (such as updating Misc/ACKS or adding an entry to Misc/NEWS)
- Commit the change and push it to the main repository. If hg indicates this would create a new head in the remote repository, run hg pull --rebase (or an equivalent). Theoretically, you should rerun the tests at this point, but it's very tempting to skip that step.
- After pushing, monitor the stable buildbots for any new failures introduced by your change. In particular, developers on POSIX systems will often break the Windows buildbots, and vice-versa. Less commonly, developers on Linux or Mac OS X may break other POSIX systems.
The steps required on Windows are similar, but the exact commands used will be different.
Rather than being simpler, the workflow for a bug fix is more complicated than that for a new feature! New features have the advantage of only being applied to the default branch, while bug fixes also need to be considered for inclusion in maintenance branches.
- If a bug fix is applicable to Python 2.7, then it is also separately applied to the 2.7 branch, which is maintained as an independent head in Mercurial
- If a bug fix is applicable to the current 3.x maintenance release, then it is first applied to the maintenance branch and then merged forward to the default branch. Both branches are pushed to hg.python.org at the same time.
Documentation patches are simpler than functional patches, but not hugely so - the main benefit is only needing to check the docs build successfully rather than running the test suite.
I would estimate that even when everything goes smoothly, it would still take me at least 20-30 minutes to commit a bug fix patch that applies cleanly. Given that it should be possible to automate several of these tasks, I do not believe our current practices are making effective use of scarce core developer resources.
There are many, many frustrations involved with this current workflow, and they lead directly to some undesirable development practices.
- Much of this overhead is incurred on a per-patch applied basis. This encourages large commits, rather than small isolated changes. The time required to commit a 500 line feature is essentially the same as that needed to commit a 1 line bug fix - the additional time needed for the larger change appears in any preceding review rather than as part of the commit process.
- The additional overhead of working on applying bug fixes creates an additional incentive to work on new features instead, and new features are already inherently more interesting to work on - they don't need workflow difficulties giving them a helping hand!
- Getting a preceding review on bugs.python.org is additional work, creating an incentive to commit changes directly, increasing the reliance on post-review on the python-checkins mailing list.
- Patches on the tracker that are complete, correct and ready to merge may still languish for extended periods awaiting a core developer with the time to devote to getting them merged.
- The risk of push races (especially when pushing a merged bug fix) creates a temptation to skip doing full local test runs (especially after a push race has already been encountered once), increasing the chance of breaking the buildbots.
- The buildbots are sometimes red for extended periods, introducing errors into local test runs, and also meaning that they sometimes fail to serve as a reliable indicator of whether or not a patch has introduced cross platform issues.
- Post-conference development sprints are a nightmare, as they collapse into a mire of push races. It's tempting to just leave patches on the tracker until after the sprint is over and then try to clean them up afterwards.
There are also many, many opportunities for core developers to make mistakes that inconvenience others, both in managing the Mercurial branches and in breaking the buildbots without being in a position to fix them promptly. This both makes the existing core development team cautious in granting new developers commit access, as well as making those new developers cautious about actually making use of their increased level of access.
There are also some incidental annoyances (like keeping the NEWS file up to date) that will also be necessarily addressed as part of this proposal.
One of the most critical resources of a volunteer-driven open source project is the emotional energy of its contributors. The current approach to change incorporation doesn't score well on that front for anyone:
- For core developers, the branch wrangling for bug fixes is delicate and easy to get wrong. Conflicts on the NEWS file and push races when attempting to upload changes add to the irritation of something most of us aren't being paid to spend time on (and for those that are, contributing to CPython is likely to be only one of our responsibilities). The time we spend actually getting a change merged is time we're not spending coding additional changes, writing or updating documentation or reviewing contributions from others.
- Red buildbots make life difficult for other developers (since a local test failure may not be due to anything that developer did), release managers (since they may need to enlist assistance cleaning up test failures prior to a release) and for the developers themselves (since it creates significant pressure to fix any failures we inadvertently introduce right now, rather than at a more convenient time, as well as potentially making hg bisect more difficult to use if hg annotate isn't sufficient to identify the source of a new failure).
- For other contributors, a core developer spending time actually getting changes merged is a developer that isn't reviewing and discussing patches on the issue tracker or otherwise helping others to contribute effectively. It is especially frustrating for contributors that are accustomed to the simplicity of a developer just being able to hit "Merge" on a pull request that has already been automatically tested in the project's CI system (which is a common workflow on sites like GitHub and BitBucket), or where the post-review part of the merge process is fully automated (as is the case for OpenStack).
Current Tools
The following tools are currently used to manage various parts of the CPython core development workflow.
- Mercurial (hg.python.org) for version control
- Roundup (bugs.python.org) for issue tracking
- Rietveld (also hosted on bugs.python.org) for code review
- Buildbot (buildbot.python.org) for automated testing
This proposal suggests replacing the use of Rietveld for code review with the more full-featured Kallithea-based forge.python.org service proposed in PEP 474. Guido has indicated that the original Rietveld implementation was primarily intended as a public demonstration application for Google App Engine, and switching to Kallithea will address some of the issues with identifying intended target branches that arise when working with patch files on Roundup and the associated reviews in the integrated Rietveld instance.
It also suggests the addition of new tools in order to automate additional parts of the workflow, as well as a critical review of the remaining tools to see which, if any, may be candidates for replacement.
Proposal
The essence of this proposal is that CPython aim to adopt a "core reviewer" development model, similar to that used by the OpenStack project.
The workflow problems experienced by the CPython core development team are not unique. The OpenStack infrastructure team have developed a well-designed automated workflow that ensures:
- once a patch has been reviewed, further developer involvement is needed only if the automated tests fail prior to merging
- patches never get merged without being tested relative to the current state of the branch
- the main development branch always stays green. Patches that do not pass the automated tests do not get merged
If a core developer wants to tweak a patch prior to merging, they download it from the review tool, modify it, and upload it back to the review tool, rather than pushing it directly to the source code repository.
The core of this workflow is implemented using a tool called Zuul [1], a Python web service created specifically for the OpenStack project, but deliberately designed with a plugin based trigger and action system to make it easier to adapt to alternate code review systems, issue trackers and CI systems. James Blair of the OpenStack infrastructure team provided an excellent overview of Zuul at linux.conf.au 2014.
While Zuul handles several workflows for OpenStack, the specific one of interest for this PEP is the "merge gating" workflow.
For this workflow, Zuul is configured to monitor the Gerrit code review system for patches which have been marked as "Approved". Once it sees such a patch, Zuul takes it, and combines it into a queue of "candidate merges". It then creates a pipeline of test runs that execute in parallel in Jenkins (in order to allow more than 24 commits a day when a full test run takes the better part of an hour), and are merged as they pass (and as all the candidate merges ahead of them in the queue pass). If a patch fails the tests, Zuul takes it out of the queue, cancels any test runs after that patch in the queue, and rebuilds the queue without the failing patch.
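The queue behaviour just described can be modelled as a simple serial simulation. This is a hypothetical sketch of the gating logic, not Zuul's actual implementation (which tests candidates speculatively in parallel); the function and parameter names are invented for illustration:

```python
# Hypothetical model of merge gating: approved patches are queued,
# each candidate is tested against the branch plus everything merged
# ahead of it, and a failing patch is evicted while the rest of the
# queue is retried without it.

def gate_merges(branch, queue, run_tests):
    """Merge each queued patch that passes tests; drop the ones that fail.

    `run_tests(candidate_state)` returns True if the candidate is green.
    Returns (new_branch_state, merged, rejected).
    """
    merged, rejected = [], []
    pending = list(queue)
    while pending:
        patch = pending.pop(0)
        candidate = branch + [patch]
        if run_tests(candidate):
            branch = candidate          # the patch merges; move on
            merged.append(patch)
        else:
            rejected.append(patch)      # evict only the failing patch;
                                        # later patches are retried
                                        # against the rebuilt queue
    return branch, merged, rejected
```

With this model, a failing patch in the middle of the queue never blocks the patches behind it; they are simply re-tested without it, which is the property that keeps the main branch green.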
If a developer looks at a test which failed on merge and determines that it was due to an intermittent failure, they can then resubmit the patch for another attempt at merging.
To adapt this process to CPython, it should be feasible to have Zuul monitor Kallithea for approved pull requests (which may require a feature addition in Kallithea), submit them to Buildbot for testing on the stable buildbots, and then merge the changes appropriately in Mercurial. This idea poses a few technical challenges, which have their own section below.
For CPython, I don't believe we will need to take advantage of Zuul's ability to execute tests in parallel (certainly not in the initial iteration - if we get to a point where serial testing of patches by the merge gating system is our primary bottleneck rather than having the people we need in order to be able to review and approve patches, then that will be a very good day).
However, the merge queue itself is a very powerful concept that should directly address several of the issues described in the Rationale above.
Deferred Proposals
The OpenStack team also use Zuul to coordinate several other activities:
- Running preliminary "check" tests against patches posted to Gerrit.
- Creation of updated release artefacts and republishing documentation when changes are merged
- The Elastic recheck [2] feature that uses ElasticSearch in conjunction with a spam filter to monitor test output and suggest the specific intermittent failure that may have caused a test to fail, rather than requiring users to search logs manually
While these are possibilities worth exploring in the future (and one of the possible benefits I see to seeking closer coordination with the OpenStack Infrastructure team), I don't see them as offering quite the same kind of fundamental workflow improvement that merge gating appears to provide.
However, if we find we are having too many problems with intermittent test failures in the gate, then introducing the "Elastic recheck" feature may need to be considered as part of the initial deployment.
Suggested Variants
Terry Reedy has suggested doing an initial filter which specifically looks for approved documentation-only patches (~700 of the 4000+ open CPython issues are pure documentation updates). This approach would avoid several of the issues related to flaky tests and cross-platform testing, while still allowing the rest of the automation flows to be worked out (such as how to push a patch into the merge queue).
The key downside to this approach is that Zuul wouldn't have complete control of the merge process as it usually expects, so there would potentially be additional coordination needed around that.
It may be worth keeping this approach as a fallback option if the initial deployment proves to have more trouble with test reliability than is anticipated.
It would also be possible to tweak the merge gating criteria such that it doesn't run the test suite if it detects that the patch hasn't modified any files outside the "Docs" tree, and instead only checks that the documentation builds without errors.
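That detection step could be as simple as inspecting the list of paths a patch touches. This is a sketch under stated assumptions: the file list would really come from the version control tool, and the docs tree name (`Doc` here) would need to match the actual repository layout:

```python
def is_docs_only(changed_files, docs_root="Doc"):
    """Return True if every changed path lives under the docs tree.

    `docs_root` is the top-level directory name of the documentation
    tree (an assumption here; adjust to the repository layout).
    """
    return bool(changed_files) and all(
        path.split("/", 1)[0] == docs_root for path in changed_files
    )
```

A gate configured with such a check could skip the full test suite for docs-only changes and run only a documentation build instead.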
As yet another alternative, it may be reasonable to move some parts of the documentation (such as the tutorial and the HOWTO guides) out of the main source repository and manage them using the simpler pull request based model described in PEP 474.
Perceived Benefits
The benefits of this proposal accrue most directly to the core development team. First and foremost, it means that once we mark a patch as "Approved" in the updated code review system, we're usually done. The extra 20-30 minutes (or more) of actually applying the patch, running the tests and merging it into Mercurial would all be orchestrated by Zuul. Push races would also be a thing of the past - if lots of core developers are approving patches at a sprint, then that just means the queue gets deeper in Zuul, rather than developers getting frustrated trying to merge changes and failing. Test failures would still happen, but they would result in the affected patch being removed from the merge queue, rather than breaking the code in the main repository.
With the bulk of the time investment moved to the review process, this also encourages "development for reviewability" - smaller, easier to review patches, since the overhead of running the tests multiple times will be incurred by Zuul rather than by the core developers.
However, removing this time sink from the core development team should also improve the experience of CPython development for other contributors, as it eliminates several of the opportunities for patches to get "dropped on the floor", as well as increasing the time core developers are likely to have available for reviewing contributed patches.
Another example of benefits to other contributors is that when a sprint aimed primarily at new contributors is running with just a single core developer present (such as the sprints at PyCon AU for the last few years), the merge queue would allow that developer to focus more of their time on reviewing patches and helping the other contributors at the sprint, since accepting a patch for inclusion would now be a single click in the Kallithea UI, rather than the relatively time consuming process that it is currently. Even when multiple core developers are present, it is better to enable them to spend their time and effort on interacting with the other sprint participants than it is on things that are sufficiently mechanical that a computer can (and should) handle them.
With most of the ways to make a mistake when committing a change automated out of existence, there are also substantially fewer new things to learn when a contributor is nominated to become a core developer. This should have a dual benefit, both in making the existing core developers more comfortable with granting that additional level of responsibility, and in making new contributors more comfortable with exercising it.
Finally, a more stable default branch in CPython makes it easier for other Python projects to conduct continuous integration directly against the main repo, rather than having to wait until we get into the release candidate phase of a new release. At the moment, setting up such a system isn't particularly attractive, as it would need to include an additional mechanism to wait until CPython's own Buildbot fleet indicated that the build was in a usable state. With the proposed merge gating system, the trunk always remains usable.
Technical Challenges
Adapting Zuul from the OpenStack infrastructure to the CPython infrastructure will at least require the development of additional Zuul trigger and action plugins, and may require additional development in some of our existing tools.
Kallithea vs Gerrit
Kallithea does not currently include a voting/approval feature that is equivalent to Gerrit's. For CPython, we wouldn't need anything as sophisticated as Gerrit's voting system - a simple core-developer-only "Approved" marker to trigger action from Zuul should suffice. The core-developer-or-not flag is available in Roundup, as is the flag indicating whether or not the uploader of a patch has signed a PSF Contributor Licensing Agreement, so further development may be needed to link contributor accounts between the Kallithea instance and Roundup.
Some of the existing Zuul triggers work by monitoring for particular comments (in particular, recheck/reverify comments to ask Zuul to try merging a change again if it was previously rejected due to an unrelated intermittent failure). We will likely also want similar explicit triggers for Kallithea.
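A comment-based trigger of this kind typically boils down to a pattern match on new review comments. The following is a hypothetical sketch (the exact comment forms and the optional bug reference are assumptions modelled on the recheck/reverify convention mentioned above, not Zuul's actual configuration):

```python
import re

# Hypothetical trigger patterns: "recheck" or "reverify", optionally
# followed by a bug reference tying the retry to a known intermittent
# failure.
RECHECK = re.compile(r"^(recheck|reverify)( bug \d+)?$")

def is_retry_request(comment):
    """Return True if a review comment should re-enqueue the change."""
    return RECHECK.match(comment.strip().lower()) is not None
```

A Kallithea trigger plugin would watch for comments matching such a pattern and resubmit the associated pull request to the merge queue.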
The current Zuul plugins for Gerrit work by monitoring the Gerrit activity stream for particular events. If Kallithea has no equivalent, we will need to add something suitable for the events we would like to trigger on.
There would also be development effort needed to create a Zuul plugin that monitors Kallithea activity rather than Gerrit.
Mercurial vs Gerrit/git
Gerrit uses git as the actual storage mechanism for patches, and automatically handles merging of approved patches. By contrast, Kallithea uses the RhodeCode-created vcs library (https://pythonhosted.org/vcs/) as an abstraction layer over specific DVCS implementations (with Mercurial and git backends currently available).
Zuul is also directly integrated with git for patch manipulation - as far as I am aware, this part of the design currently isn't pluggable. However, at PyCon US 2014, the Mercurial core developers at the sprints expressed some interest in collaborating with the core development team and the Zuul developers on enabling the use of Zuul with Mercurial in addition to git. As Zuul is itself a Python application, migrating it to use the same DVCS abstraction library as RhodeCode and Kallithea may be a viable path towards achieving that.
Buildbot vs Jenkins
Zuul's interaction with the CI system is also pluggable, using Gearman as the preferred interface. Accordingly, adapting the CI jobs to run in Buildbot rather than Jenkins should just be a matter of writing a Gearman client that can process the requests from Zuul and pass them on to the Buildbot master. Zuul uses the pure Python gear client library to communicate with Gearman, and this library should also be useful to handle the Buildbot side of things.
Note that, in the initial iteration, I am proposing that we do not attempt to pipeline test execution. This means Zuul would be running in a very simple mode where only the patch at the head of the merge queue is being tested on the Buildbot fleet, rather than potentially testing several patches in parallel. I am picturing something equivalent to requesting a forced build from the Buildbot master, and then waiting for the result to come back before moving on to the second patch in the queue.
If we ultimately decide that this is not sufficient, and we need to start using the CI pipelining features of Zuul, then we may need to look at moving the test execution to dynamically provisioned cloud images, rather than relying on volunteer maintained statically provisioned systems as we do currently. The OpenStack CI infrastructure team are exploring the idea of replacing their current use of Jenkins masters with a simpler pure Python test runner, so if we find that we can't get Buildbot to effectively support the pipelined testing model, we'd likely participate in that effort rather than setting up a Jenkins instance for CPython.
In this case, the main technical risk would be a matter of ensuring we support testing on platforms other than Linux (as our stable buildbots currently cover Windows, Mac OS X, FreeBSD and OpenIndiana in addition to a couple of different Linux variants).
In such a scenario, the Buildbot fleet would still have a place in doing "check" runs against the master repository (either periodically or for every commit), even if it did not play a part in the merge gating process. More unusual configurations (such as building without threads, or without SSL/TLS support) would likely still be handled that way rather than being included in the gate criteria (at least initially, anyway).
Handling of maintenance branches
The OpenStack project largely leaves the question of maintenance branches to downstream vendors, rather than handling it directly. This means there are questions to be answered regarding how we adapt Zuul to handle our maintenance branches.
Python 2.7 can be handled easily enough by treating it as a separate patch queue. This would be handled natively in Kallithea by submitting separate pull requests in order to update the Python 2.7 maintenance branch.
The Python 3.x maintenance branches are potentially more complicated. My current recommendation is to simply stop using Mercurial merges to manage them, and instead treat them as independent heads, similar to the Python 2.7 branch. Separate pull requests would need to be submitted for the active Python 3 maintenance branch and the default development branch. The downside of this approach is that it increases the risk that a fix is merged only to the maintenance branch without also being submitted to the default branch, so we may want to design some additional tooling that ensures that every maintenance branch pull request either has a corresponding default branch pull request prior to being merged, or else has an explicit disclaimer indicating that it is only applicable to that branch and doesn't need to be ported forward to later branches.
Such an approach has the benefit of adjusting relatively cleanly to the intermittent periods where we have two active Python 3 maintenance branches.
This issue does suggest some potential user interface ideas for Kallithea, where it may be desirable to be able to clone a pull request in order to be able to apply it to a second branch.
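The additional tooling suggested above could start as a simple consistency check over open pull requests. This is a sketch with an invented data model (dicts keyed by tracker issue number, with a `branch_only` disclaimer flag); a real implementation would query the code review system instead:

```python
def missing_forward_ports(pull_requests):
    """Return maintenance-branch PRs lacking both a corresponding
    default-branch PR and an explicit branch-only disclaimer.

    Correspondence is matched by tracker issue number here (an
    assumption). Each PR is a dict like:
      {"issue": 12345, "branch": "3.4", "branch_only": False}
    """
    default_issues = {
        pr["issue"] for pr in pull_requests if pr["branch"] == "default"
    }
    return [
        pr for pr in pull_requests
        if pr["branch"] != "default"
        and not pr.get("branch_only", False)
        and pr["issue"] not in default_issues
    ]
```

A merge gate could refuse to process any maintenance-branch pull request that this check flags, prompting the submitter to either open the default-branch counterpart or mark the change as branch-only.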
Handling of security branches
For simplicity's sake, I would suggest leaving the handling of security-fix only branches alone: the release managers for those branches would continue to backport specific changes manually. The only change is that they would be able to use the Kallithea pull request workflow to do the backports if they would like others to review the updates prior to merging them.
Handling of NEWS file updates
Our current approach to handling NEWS file updates regularly results in spurious conflicts when merging bug fixes forward from an active maintenance branch to a later branch.
Issue #18967 discusses some possible improvements in that area, which would be beneficial regardless of whether or not we adopt Zuul as a workflow automation tool.
Stability of "stable" Buildbot slaves
Instability of the nominally stable buildbots has a substantially larger impact under this proposal. We would need to ensure we're genuinely happy with each of those systems gating merges to the development branches, or else move them to "unstable" status.
Intermittent test failures
Some tests, especially timing tests, exhibit intermittent failures on the existing Buildbot fleet. In particular, test systems running as VMs may sometimes exhibit timing failures when the VM host is under higher than normal load.
The OpenStack CI infrastructure includes a number of additional features to help deal with intermittent failures, the most basic of which is simply allowing developers to request that merging a patch be tried again when the original failure appears to be due to a known intermittent failure (whether that intermittent failure is in OpenStack itself or just in a flaky test).
The more sophisticated Elastic recheck [2] feature may be worth considering, especially since the output of the CPython test suite is substantially simpler than that from OpenStack's more complex multi-service testing, and hence likely even more amenable to automated analysis.
Custom Mercurial client workflow support
One useful part of the OpenStack workflow is the "git review" plugin, which makes it relatively easy to push a branch from a local git clone up to Gerrit for review.
PEP 474 mentions a draft custom Mercurial extension that automates some aspects of the existing CPython core development workflow.
As part of this proposal, that custom extension would be extended to work with the new Kallithea based review workflow in addition to the legacy Roundup/Rietveld based review workflow.
Social Challenges
The primary social challenge here is getting the core development team to change their practices. However, the tedious-but-necessary steps that are automated by the proposal should create a strong incentive for the existing developers to go along with the idea.
I believe three specific features may be needed to assure existing developers that there are no downsides to the automation of this workflow:
- Only requiring approval from a single core developer to incorporate a patch. This could be revisited in the future, but we should preserve the status quo for the initial rollout.
- Explicitly stating that core developers remain free to approve their own patches, except during the release candidate phase of a release. This could be revisited in the future, but we should preserve the status quo for the initial rollout.
- Ensuring that at least release managers have a "merge it now" capability that allows them to force a particular patch to the head of the merge queue. Using a separate clone for release preparation may be sufficient for this purpose. Longer term, automatic merge gating may also allow for more automated preparation of release artefacts as well.
Practical Challenges
The PSF runs its own directly and indirectly sponsored workflow infrastructure primarily due to past experience with unacceptably poor performance and inflexibility of infrastructure provided for free to the general public. CPython development was originally hosted on SourceForge, with source control moved to self hosting when SF was both slow to offer Subversion support and suffering from CVS performance issues (see PEP 347), while issue tracking later moved to the open source Roundup issue tracker on dedicated sponsored hosting (from Upfront Systems), due to a combination of both SF performance issues and general usability issues with the SF tracker at the time (the outcome and process for the new tracker selection were captured on the python.org wiki rather than in a PEP).
Accordingly, proposals that involve setting ourselves up for "SourceForge usability and reliability issues, round two" will face significant opposition from at least some members of the CPython core development team (including the author of this PEP). This proposal respects that history by recommending only tools that are available for self-hosting as sponsored or PSF funded infrastructure, and are also open source Python projects that can be customised to meet the needs of the CPython core development team.
However, for this proposal to be a success (if it is accepted), we need to understand how we are going to carry out the necessary configuration, customisation, integration and deployment work.
The last attempt at adding a new piece to the CPython support infrastructure (speed.python.org) has unfortunately foundered due to the lack of time to drive the project from the core developers and PSF board members involved, and the difficulties of trying to bring someone else up to speed to lead the activity (the hardware donated to that project by HP is currently in use to support PyPy instead, but the situation highlights some of the challenges of relying on volunteer labour with many other higher priority demands on their time to steer projects to completion).
Even ultimately successful past projects, such as the source control migrations from CVS to Subversion and from Subversion to Mercurial, the issue tracker migration from SourceForge to Roundup, the code review integration between Roundup and Rietveld and the introduction of the Buildbot continuous integration fleet, have taken an extended period of time as volunteers worked their way through the many technical and social challenges involved.
Fortunately, as several aspects of this proposal and PEP 474 align with various workflow improvements under consideration for Red Hat's Beaker open source hardware integration testing system and other work-related projects, I have arranged to be able to devote ~1 day a week to working on CPython infrastructure projects.
Together with Rackspace's existing contributions to maintaining the pypi.python.org infrastructure, I personally believe this arrangement is indicative of a more general recognition amongst CPython redistributors and major users of the merit in helping to sustain upstream infrastructure through direct contributions of developer time, rather than expecting volunteer contributors to maintain that infrastructure entirely in their spare time or funding it indirectly through the PSF (with the additional management overhead that would entail). I consider this a positive trend, and one that I will continue to encourage as best I can.
Open Questions
Pretty much everything in the PEP. Do we want to adopt merge gating and Zuul? How do we want to address the various technical challenges? Are the Kallithea and Zuul development communities open to the kind of collaboration that would be needed to make this effort a success?
While I've arranged to spend some of my own work time on this, do we want to approach the OpenStack Foundation for additional assistance, since we're a key dependency of OpenStack itself, Zuul is a creation of the OpenStack infrastructure team, and the available development resources for OpenStack currently dwarf those for CPython?
Are other interested folks working for Python redistributors and major users also in a position to make a business case to their superiors for investing developer time in supporting this effort?
Next Steps
If pursued, this will be a follow-on project to the Kallithea-based forge.python.org proposal in PEP 474. Refer to that PEP for more details on the discussion, review and proof-of-concept pilot process currently under way.
Acknowledgements
Thanks to Jesse Noller, Alex Gaynor and James Blair for providing valuable feedback on a preliminary draft of this proposal, and to James and Monty Taylor for additional technical feedback following publication of the initial draft.
Thanks to Bradley Kuhn, Mads Kiellerich and other Kallithea developers for the discussions around PEP 474 that led to a significant revision of this proposal to be based on using Kallithea for the review component rather than the existing Rietveld installation.
Copyright
This document has been placed in the public domain.
pep-0463 Exception-catching expressions
| PEP: | 463 |
|---|---|
| Title: | Exception-catching expressions |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Chris Angelico <rosuav at gmail.com> |
| Status: | Draft |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 15-Feb-2014 |
| Python-Version: | 3.5 |
| Post-History: | 20-Feb-2014, 16-Feb-2014 |
Contents
Abstract
Just as PEP 308 introduced a means of value-based conditions in an expression, this system allows exception-based conditions to be used as part of an expression.
Motivation
A number of functions and methods have parameters which will cause them to return a specified value instead of raising an exception. The current system is ad-hoc and inconsistent, and requires that each function be individually written to have this functionality; not all support this.
- dict.get(key, default) - second positional argument in place of KeyError
- next(iter, default) - second positional argument in place of StopIteration
- list.pop() - no way to return a default
- seq[index] - no way to handle a bounds error
- min(sequence, default=default) - keyword argument in place of ValueError
- statistics.mean(data) - no way to handle an empty iterator
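The inconsistency in the list above is easy to demonstrate with runnable code:

```python
import statistics

# Three different spellings of "give me a default instead of raising":
d = {}
assert d.get("missing", "default") == "default"   # avoids KeyError
assert next(iter([]), "default") == "default"     # avoids StopIteration
assert min([], default="default") == "default"    # avoids ValueError

# By contrast, these operations offer no default-returning form at all:
try:
    [].pop()
except IndexError:
    pass

try:
    statistics.mean([])
except statistics.StatisticsError:
    pass
```

Each API that does support a default invented its own spelling, and the ones that don't force the caller back to a full try/except statement.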
Had this facility existed early in Python's history, there would have been no need to create dict.get() and related methods; the one obvious way to handle an absent key would be to respond to the exception. One method is written which signals the absence in one way, and one consistent technique is used to respond to the absence. Instead, we have dict.get(), and as of Python 3.4, we also have min(... default=default), and myriad others. We have a LBYL syntax for testing inside an expression, but there is currently no EAFP notation; compare the following:
# LBYL:
if key in dic:
process(dic[key])
else:
process(None)
# As an expression:
process(dic[key] if key in dic else None)
# EAFP:
try:
process(dic[key])
except KeyError:
process(None)
# As an expression:
process(dic[key] except KeyError: None)
Python generally recommends the EAFP policy, but must then proliferate utility functions like dic.get(key,None) to enable this.
Rationale
The current system requires that a function author predict the need for a default, and implement support for it. If this is not done, a full try/except block is needed.
Since try/except is a statement, it is impossible to catch exceptions in the middle of an expression. Just as if/else does for conditionals and lambda does for function definitions, so does this allow exception catching in an expression context.
This provides a clean and consistent way for a function to provide a default: it simply raises an appropriate exception, and the caller catches it.
In some situations, an LBYL technique can be used (checking if a sequence has enough length before indexing into it, for instance). This is not safe in all cases, but as it is often convenient, programmers will be tempted to sacrifice the safety of EAFP in favour of the notational brevity of LBYL. Additionally, some LBYL techniques (e.g. involving getattr with three arguments) warp the code into looking like literal strings rather than attribute lookup, which can impact readability. A convenient EAFP notation solves all of this.
There's no convenient way to write a helper function to do this; the nearest is something ugly using either lambda:
def except_(expression, exception_list, default):
try:
return expression()
except exception_list:
return default()
value = except_(lambda: 1/x, ZeroDivisionError, lambda: float("nan"))
which is clunky, and unable to handle multiple exception clauses; or eval:
def except_(expression, exception_list, default):
try:
return eval(expression, globals_of_caller(), locals_of_caller())
except exception_list as exc:
l = locals_of_caller().copy()
l['exc'] = exc
return eval(default, globals_of_caller(), l)
import sys
def globals_of_caller():
return sys._getframe(2).f_globals
def locals_of_caller():
return sys._getframe(2).f_locals
value = except_("""1/x""",ZeroDivisionError,""" "Can't divide by zero" """)
which is even clunkier, and relies on implementation-dependent hacks. (Writing globals_of_caller() and locals_of_caller() for interpreters other than CPython is left as an exercise for the reader.)
Raymond Hettinger expresses [1] a desire for such a consistent API. Something similar has been requested [2] multiple [3] times [4] in the past.
Proposal
Just as the 'or' operator and the three part 'if-else' expression give short circuiting methods of catching a falsy value and replacing it, this syntax gives a short-circuiting method of catching an exception and replacing it.
This currently works:
lst = [1, 2, None, 3]
value = lst[2] or "No value"
The proposal adds this:
lst = [1, 2]
value = (lst[2] except IndexError: "No value")
Specifically, the syntax proposed is:
(expr except exception_list: default)
where expr, exception_list, and default are all expressions. First, expr is evaluated. If no exception is raised, its value is the value of the overall expression. If any exception is raised, exception_list is evaluated, and should result in either a type or a tuple, just as with the statement form of try/except. Any matching exception will result in the corresponding default expression being evaluated and becoming the value of the expression. As with the statement form of try/except, non-matching exceptions will propagate upward.
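Although the proposed syntax cannot be used in current Python, the evaluation order just described can be modelled with zero-argument callables. This is a sketch of the semantics only (the helper name and calling convention are invented for illustration):

```python
def except_expr(expr, exception_list, default):
    """Model of the proposed (expr except exception_list: default).

    All three parts are zero-argument callables so that, as in the
    proposal, exception_list and default are only evaluated if expr
    actually raises; non-matching exceptions propagate unchanged.
    """
    try:
        return expr()
    except BaseException as exc:
        if isinstance(exc, exception_list()):
            return default()
        raise

value = except_expr(lambda: {}["k"], lambda: KeyError, lambda: "No value")
```

Here `value` becomes "No value" because the KeyError matches, while an expression raising, say, ZeroDivisionError against a KeyError list would propagate the exception, just as the statement form of try/except would.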
Parentheses are required around the entire expression, unless they would be completely redundant, according to the same rules as generator expressions follow. This guarantees correct interpretation of nested except-expressions, and allows for future expansion of the syntax - see below on multiple except clauses.
Note that the current proposal does not allow the exception object to be captured. Where this is needed, the statement form must be used. (See below for discussion and elaboration on this.)
This ternary operator would be between lambda and if/else in precedence.
Consider this example of a two-level cache:
for key in sequence:
x = (lvl1[key] except KeyError: (lvl2[key] except KeyError: f(key)))
# do something with x
This cannot be rewritten as:
x = lvl1.get(key, lvl2.get(key, f(key)))
which, despite being shorter, defeats the purpose of the cache, as it must calculate a default value to pass to get(). The .get() version calculates backwards; the exception-testing version calculates forwards, as would be expected. The nearest useful equivalent would be:
x = lvl1.get(key) or lvl2.get(key) or f(key)
which depends on the values being nonzero, as well as depending on the cache object supporting this functionality.
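In current Python, the forward-calculating version of this cache requires the statement form (the function wrapper here is just for illustration):

```python
# Statement-form equivalent of the nested except-expressions above:
# each level is consulted (and f called) only if the previous lookup
# actually raises KeyError.
def lookup(lvl1, lvl2, f, key):
    try:
        return lvl1[key]
    except KeyError:
        try:
            return lvl2[key]
        except KeyError:
            return f(key)
```

Unlike the .get() chaining, f(key) is never evaluated when the key is already cached, which is exactly the lazy, forward-calculating behaviour the except-expression would provide inline.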
Alternative Proposals
Discussion on python-ideas brought up the following syntax suggestions:
value = expr except default if Exception [as e]
value = expr except default for Exception [as e]
value = expr except default from Exception [as e]
value = expr except Exception [as e] return default
value = expr except (Exception [as e]: default)
value = expr except Exception [as e] try default
value = expr except Exception [as e] continue with default
value = default except Exception [as e] else expr
value = try expr except Exception [as e]: default
value = expr except default            # Catches anything
value = expr except(Exception) default # Catches only the named type(s)
value = default if expr raise Exception
value = expr or else default if Exception
value = expr except Exception [as e] -> default
value = expr except Exception [as e] pass default
It has also been suggested that a new keyword be created, rather than reusing an existing one. Such proposals fall into the same structure as the last form, but with a different keyword in place of 'pass'. Suggestions include 'then', 'when', and 'use'. Also, in the context of the "default if expr raise Exception" proposal, it was suggested that a new keyword "raises" be used.
All forms involving the 'as' capturing clause have been deferred from this proposal in the interests of simplicity, but are preserved in the table above as an accurate record of suggestions.
The four forms most supported by this proposal are, in order:
value = (expr except Exception: default)
value = (expr except Exception -> default)
value = (expr except Exception pass default)
value = (expr except Exception then default)
All four maintain left-to-right evaluation order: first the base expression, then the exception list, and lastly the default. This is important, as the expressions are evaluated lazily. By comparison, several of the ad-hoc alternatives listed above must (by the nature of functions) evaluate their default values eagerly. The preferred form, using the colon, parallels try/except by using "except exception_list:", and parallels lambda by having "keyword name_list: subexpression"; it also can be read as mapping Exception to the default value, dict-style. Using the arrow introduces a token many programmers will not be familiar with, and which currently has no similar meaning, but is otherwise quite readable. The English word "pass" has a vaguely similar meaning (consider the common usage "pass by value/reference" for function arguments), and "pass" is already a keyword, but as its meaning is distinctly unrelated, this may cause confusion. Using "then" makes sense in English, but this introduces a new keyword to the language - albeit one not in common use, but a new keyword all the same.
Left to right evaluation order is extremely important to readability, as it parallels the order most expressions are evaluated. Alternatives such as:
value = (expr except default if Exception)
break this, by first evaluating the two ends, and then coming to the middle; while this may not seem terrible (as the exception list will usually be a constant), it does add to the confusion when multiple clauses meet, either with multiple except/if or with the existing if/else, or a combination. Using the preferred order, subexpressions will always be evaluated from left to right, no matter how the syntax is nested.
Keeping the existing notation, but shifting the mandatory parentheses, we have the following suggestion:
value = expr except (Exception: default)
value = expr except(Exception: default)
This is reminiscent of a function call, or a dict initializer. The colon cannot be confused with introducing a suite, but on the other hand, the new syntax guarantees lazy evaluation, which a dict does not. The potential to reduce confusion is considered unjustified by the corresponding potential to increase it.
Example usage
For each example, an approximately-equivalent statement form is given, to show how the expression will be parsed. These are not always strictly equivalent, but will accomplish the same purpose. It is NOT safe for the interpreter to translate one into the other.
A number of these examples are taken directly from the Python standard library, with file names and line numbers correct as of early Feb 2014. Many of these patterns are extremely common.
Retrieve an argument, defaulting to None:
cond = (args[1] except IndexError: None)
# Lib/pdb.py:803:
try:
cond = args[1]
except IndexError:
cond = None
Fetch information from the system if available:
pwd = (os.getcwd() except OSError: None)
# Lib/tkinter/filedialog.py:210:
try:
pwd = os.getcwd()
except OSError:
pwd = None
Attempt a translation, falling back on the original:
e.widget = (self._nametowidget(W) except KeyError: W)
# Lib/tkinter/__init__.py:1222:
try:
e.widget = self._nametowidget(W)
except KeyError:
e.widget = W
Read from an iterator, continuing with blank lines once it's exhausted:
line = (readline() except StopIteration: '')
# Lib/lib2to3/pgen2/tokenize.py:370:
try:
line = readline()
except StopIteration:
line = ''
Retrieve platform-specific information (note the DRY improvement); this particular example could be taken further, turning a series of separate assignments into a single large dict initialization:
# sys.abiflags may not be defined on all platforms.
_CONFIG_VARS['abiflags'] = (sys.abiflags except AttributeError: '')
# Lib/sysconfig.py:529:
try:
_CONFIG_VARS['abiflags'] = sys.abiflags
except AttributeError:
# sys.abiflags may not be defined on all platforms.
_CONFIG_VARS['abiflags'] = ''
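For the attribute-lookup cases specifically, current Python already offers a one-expression equivalent via the three-argument form of getattr, as this runnable sketch illustrates (the `Flagless` class is purely illustrative):

```python
import sys

# getattr's optional third argument supplies a default instead of
# raising AttributeError -- the narrow, attribute-only analogue of
# the proposed except-expression.  This works whether or not the
# platform defines sys.abiflags.
abiflags = getattr(sys, 'abiflags', '')

class Flagless:
    """A stand-in object with no abiflags attribute."""

demo = getattr(Flagless(), 'abiflags', '')
```

The except-expression generalizes this pattern beyond attributes to any exception-raising operation.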
Retrieve an indexed item, defaulting to None (similar to dict.get):
def getNamedItem(self, name):
return (self._attrs[name] except KeyError: None)
# Lib/xml/dom/minidom.py:573:
def getNamedItem(self, name):
try:
return self._attrs[name]
except KeyError:
return None
Translate numbers to names, falling back on the numbers:
g = (grp.getgrnam(tarinfo.gname)[2] except KeyError: tarinfo.gid)
u = (pwd.getpwnam(tarinfo.uname)[2] except KeyError: tarinfo.uid)
# Lib/tarfile.py:2198:
try:
g = grp.getgrnam(tarinfo.gname)[2]
except KeyError:
g = tarinfo.gid
try:
u = pwd.getpwnam(tarinfo.uname)[2]
except KeyError:
u = tarinfo.uid
Look up an attribute, falling back on a default:
mode = (f.mode except AttributeError: 'rb')
# Lib/aifc.py:882:
if hasattr(f, 'mode'):
mode = f.mode
else:
mode = 'rb'
return (sys._getframe(1) except AttributeError: None)
# Lib/inspect.py:1350:
return sys._getframe(1) if hasattr(sys, "_getframe") else None
Perform some lengthy calculations in EAFP mode, handling division by zero as a sort of sticky NaN:
value = (calculate(x) except ZeroDivisionError: float("nan"))
try:
value = calculate(x)
except ZeroDivisionError:
value = float("nan")
Calculate the mean of a series of numbers, falling back on zero:
value = (statistics.mean(lst) except statistics.StatisticsError: 0)
try:
value = statistics.mean(lst)
except statistics.StatisticsError:
value = 0
Looking up objects in a sparse list of overrides:
(overrides[x] or default except IndexError: default).ping()
try:
(overrides[x] or default).ping()
except IndexError:
default.ping()
Narrowing of exception-catching scope
The following examples, taken directly from Python's standard library, demonstrate how the scope of the try/except can be conveniently narrowed. To do this with the statement form of try/except would require a temporary variable, but it's far cleaner as an expression.
Lib/ipaddress.py:343:
try:
ips.append(ip.ip)
except AttributeError:
ips.append(ip.network_address)
Becomes:
ips.append(ip.ip except AttributeError: ip.network_address)
The expression form is nearly equivalent to this:
try:
_ = ip.ip
except AttributeError:
_ = ip.network_address
ips.append(_)
Lib/tempfile.py:130:
try:
dirlist.append(_os.getcwd())
except (AttributeError, OSError):
dirlist.append(_os.curdir)
Becomes:
dirlist.append(_os.getcwd() except (AttributeError, OSError): _os.curdir)
Lib/asyncore.py:264:
try:
status.append('%s:%d' % self.addr)
except TypeError:
status.append(repr(self.addr))
Becomes:
status.append('%s:%d' % self.addr except TypeError: repr(self.addr))
In each case, the narrowed scope of the try/except ensures that an unexpected exception (for instance, AttributeError if "append" were misspelled) does not get caught by the same handler. Such a bug is sufficiently unlikely that it alone would not justify breaking the call out onto a separate line (as in the five-line example above), but guarding against it is a small benefit gained as a side effect of the conversion.
Comparisons with other languages
(With thanks to Andrew Barnert for compiling this section. Note that the examples given here do not reflect the current version of the proposal, and need to be edited.)
Ruby's [5] "begin…rescue…rescue…else…ensure…end" is an expression (potentially with statements inside it). It has the equivalent of an "as" clause, and the equivalent of bare except. And it uses no punctuation or keyword between the bare except/exception class/exception class with as clause and the value. (And yes, it's ambiguous unless you understand Ruby's statement/expression rules.)
x = begin computation() rescue MyException => e default(e) end
x = begin computation() rescue MyException default() end
x = begin computation() rescue default() end
x = begin computation() rescue MyException default() rescue OtherException other() end
In terms of this PEP:
x = computation() except MyException as e default(e)
x = computation() except MyException default()
x = computation() except default()
x = computation() except MyException default() except OtherException other()
Erlang [6] has a try expression that looks like this
x = try computation() catch MyException:e -> default(e) end;
x = try computation() catch MyException:e -> default(e); OtherException:e -> other(e) end;
The class and "as" name are mandatory, but you can use "_" for either. There's also an optional "when" guard on each, and a "throw" clause that you can catch, which I won't get into. To handle multiple exceptions, you just separate the clauses with semicolons, which I guess would map to commas in Python. So:
x = try computation() except MyException as e -> default(e)
x = try computation() except MyException as e -> default(e), OtherException as e -> other_default(e)
Erlang also has a "catch" expression, which, despite using the same keyword, is completely different, and you don't want to know about it.
The ML family has two different ways of dealing with this, "handle" and "try"; the difference between the two is that "try" pattern-matches the exception, which gives you the effect of multiple except clauses and as clauses. In either form, the handler clause is punctuated by "=>" in some dialects, "->" in others.
To avoid confusion, I'll write the function calls in Python style.
let x = computation() handle MyException => default();;
let x = try computation() with MyException explanation -> default(explanation);;
let x = try computation() with
MyException(e) -> default(e)
| MyOtherException() -> other_default()
| (e) -> fallback(e);;
In terms of this PEP, these would be something like:
x = computation() except MyException => default()
x = try computation() except MyException e -> default()
x = (try computation()
except MyException as e -> default(e)
except MyOtherException -> other_default()
except BaseException as e -> fallback(e))
Many ML-inspired but not-directly-related languages from academia mix things up, usually using more keywords and fewer symbols. So the Oz [9] equivalent would map to Python as
x = try computation() catch MyException as e then default(e)
Many Lisp-derived languages, like Clojure, [10] implement try/catch as special forms (if you don't know what that means, think function-like macros), so you write, effectively
try(computation(), catch(MyException, explanation, default(explanation)))
try(computation(),
catch(MyException, explanation, default(explanation)),
catch(MyOtherException, explanation, other_default(explanation)))
In Common Lisp, this is done with a slightly clunkier "handler-case" macro, [11] but the basic idea is the same.
The Lisp style is, surprisingly, used by some languages that don't have macros, like Lua, where xpcall [12] takes functions. Writing lambdas Python-style instead of Lua-style
x = xpcall(lambda: expression(), lambda e: default(e))
This actually returns (true, expression()) or (false, default(e)), but I think we can ignore that part.
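Lua's protected-call convention maps naturally onto a small Python helper returning a (success, value) pair; this sketch (the name `xpcall` is borrowed from Lua for illustration, not an existing Python API) shows the shape of that convention:

```python
def xpcall(func, handler):
    """Call func(); on any exception, call handler(exc) instead.

    Returns (True, result) on success and (False, handler_result)
    on failure, mirroring Lua's xpcall convention.
    """
    try:
        return True, func()
    except Exception as e:
        return False, handler(e)
```

Callers then unpack the flag rather than wrapping the call site in try/except.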
Haskell is actually similar to Lua here (except that it's all done with monads, of course):
x = do catch(lambda: expression(), lambda e: default(e))
You can write a pattern matching expression within the function to decide what to do with it; catching and re-raising exceptions you don't want is cheap enough to be idiomatic.
But Haskell infixing makes this nicer:
x = do expression() `catch` lambda: default()
x = do expression() `catch` lambda e: default(e)
And that makes the parallel between the lambda colon and the except colon in the proposal much more obvious:
x = expression() except Exception: default()
x = expression() except Exception as e: default(e)
Tcl [13] has the other half of Lua's xpcall; catch is a function which returns true if an exception was caught, false otherwise, and you get the value out in other ways. And it's all built around the implicit quote-and-exec that everything in Tcl is based on, making it even harder to describe in Python terms than Lisp macros, but something like
if {[ catch("computation()") "explanation"]} { default(explanation) }
Smalltalk [14] is also somewhat hard to map to Python. The basic version would be
x := computation() on:MyException do:default()
... but that's basically Smalltalk's passing-arguments-with-colons syntax, not its exception-handling syntax.
Deferred sub-proposals
Multiple except clauses
An examination of use-cases shows that this is not needed as often as it would be with the statement form, and as its syntax is a point on which consensus has not been reached, the entire feature is deferred.
Multiple 'except' keywords could be used, and they will all catch exceptions raised in the original expression (only):
# Will catch any of the listed exceptions thrown by expr;
# any exception thrown by a default expression will propagate.
value = (expr
except Exception1: default1
except Exception2: default2
# ... except ExceptionN: defaultN
)
Currently, one of the following forms must be used:
# Will catch an Exception2 thrown by either expr or default1
value = (
(expr except Exception1: default1)
except Exception2: default2
)
# Will catch an Exception2 thrown by default1 only
value = (expr except Exception1:
(default1 except Exception2: default2)
)
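The semantic difference between the two parenthesizations can be checked with statement-form equivalents. In this runnable sketch (function names are illustrative, using KeyError and ValueError as stand-ins for Exception1 and Exception2), the base expression raises the second exception type, and only the first form's outer handler catches it:

```python
def outer_nested(expr, default1, default2):
    """((expr except KeyError: default1) except ValueError: default2)
    -- a ValueError from either expr or default1 is caught."""
    try:
        try:
            return expr()
        except KeyError:
            return default1()
    except ValueError:
        return default2()

def inner_nested(expr, default1, default2):
    """(expr except KeyError: (default1 except ValueError: default2))
    -- only a ValueError raised by default1 is caught."""
    try:
        return expr()
    except KeyError:
        try:
            return default1()
        except ValueError:
            return default2()

def boom():
    raise ValueError("raised by the base expression")
```

This is why unparenthesized chaining is reserved: the two readings are genuinely different programs.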
Listing multiple exception clauses without parentheses is a syntax error (see above), and so a future version of Python is free to add this feature without breaking any existing code.
Capturing the exception object
In a try/except block, the use of 'as' to capture the exception object creates a local name binding, and implicitly deletes that binding (to avoid creating a reference loop) in a finally clause. In an expression context, this makes little sense, and a proper sub-scope would be required to safely capture the exception object - something akin to the way a list comprehension is handled. However, CPython currently implements a comprehension's subscope with a nested function call, which has consequences in some contexts such as class definitions, and is therefore unsuitable for this proposal. Should there be, in future, a way to create a true subscope (which could simplify comprehensions, except expressions, with blocks, and possibly more), then this proposal could be revived; until then, its loss is not a great one, as the simple exception handling that is well suited to the expression notation used here is generally concerned only with the type of the exception, and not its value - further analysis below.
This syntax would, admittedly, allow a convenient way to capture exceptions in interactive Python; returned values are captured by "_", but exceptions currently are not. This could be spelled:
>>> (expr except Exception as e: e)
An examination of the Python standard library shows that, while the use of 'as' is fairly common (occurring in roughly one except clause in five), it is extremely uncommon in the cases which could logically be converted into the expression form. Its few uses can simply be left unchanged. Consequently, in the interests of simplicity, the 'as' clause is not included in this proposal. A subsequent Python version can add this without breaking any existing code, as 'as' is already a keyword.
One example where this could possibly be useful is Lib/imaplib.py:568:
try: typ, dat = self._simple_command('LOGOUT')
except: typ, dat = 'NO', ['%s: %s' % sys.exc_info()[:2]]
This could become:
typ, dat = (self._simple_command('LOGOUT')
except BaseException as e: ('NO', '%s: %s' % (type(e), e)))
Or perhaps some other variation. This is hardly the most compelling use-case, but an intelligent look at this code could tidy it up significantly. In the absence of further examples showing any need of the exception object, I have opted to defer indefinitely the recommendation.
Rejected sub-proposals
finally clause
The statement form try... finally or try... except... finally has no logical corresponding expression form. Therefore the finally keyword is not a part of this proposal, in any way.
Bare except having different meaning
With several of the proposed syntaxes, omitting the exception type name would be easy and concise, and would be tempting. For convenience's sake, it might be advantageous to have a bare 'except' clause mean something more useful than "except BaseException". Proposals included having it catch Exception, or some specific set of "common exceptions" (subclasses of a new type called ExpressionError), or have it look for a tuple named ExpressionError in the current scope, with a built-in default such as (ValueError, UnicodeError, AttributeError, EOFError, IOError, OSError, LookupError, NameError, ZeroDivisionError). All of these were rejected, for several reasons.
- First and foremost, consistency with the statement form of try/except would be broken. Just as a list comprehension or ternary if expression can be explained by "breaking it out" into its vertical statement form, an expression-except should be able to be explained by a relatively mechanical translation into a near-equivalent statement. Any form of syntax common to both should therefore have the same semantics in each, and above all should not have the subtle difference of catching more in one than the other, as it will tend to attract unnoticed bugs.
- Secondly, the set of appropriate exceptions to catch would itself be a huge point of contention. It would be impossible to predict exactly which exceptions would "make sense" to be caught; why bless some of them with convenient syntax and not others?
- And finally (this partly because the recommendation was that a bare except should be actively encouraged, once it was reduced to a "reasonable" set of exceptions), any situation where you catch an exception you don't expect to catch is an unnecessary bug magnet.
Consequently, the use of a bare 'except' is down to two possibilities: either it is syntactically forbidden in the expression form, or it is permitted with the exact same semantics as in the statement form (namely, that it catch BaseException and be unable to capture it with 'as').
Bare except clauses
PEP 8 rightly advises against the use of a bare 'except'. While it is syntactically legal in a statement, and for backward compatibility must remain so, there is little value in encouraging its use. In an expression except clause, "except:" is a SyntaxError; use the equivalent long-hand form "except BaseException:" instead. A future version of Python MAY choose to reinstate this, which can be done without breaking compatibility.
Parentheses around the except clauses
Should it be legal to parenthesize the except clauses, separately from the expression that could raise? Example:
value = expr (
except Exception1 [as e]: default1
except Exception2 [as e]: default2
# ... except ExceptionN [as e]: defaultN
)
This is more compelling when one or both of the deferred sub-proposals of multiple except clauses and/or exception capturing is included. In their absence, the parentheses would be thus:
value = expr except ExceptionType: default
value = expr (except ExceptionType: default)
The advantage is minimal, and the potential to confuse a reader into thinking the except clause is separate from the expression, or into thinking this is a function call, makes this non-compelling. The expression can, of course, be parenthesized if desired, as can the default:
value = (expr) except ExceptionType: (default)
As the entire expression is now required to be in parentheses (which had not been decided at the time when this was debated), there is less need to delineate this section, and in many cases it would be redundant.
Short-hand for "except: pass"
The following has been suggested as a similar short-hand, though not technically an expression:
statement except Exception: pass
try:
statement
except Exception:
pass
For instance, a common use-case is attempting the removal of a file:
os.unlink(some_file) except OSError: pass
There is an equivalent already in Python 3.4, however, in contextlib:
from contextlib import suppress

with suppress(OSError):
    os.unlink(some_file)
As this is already a single line (or two with a break after the colon), there is little need of new syntax and a confusion of statement vs expression to achieve this.
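The suppress equivalent is already runnable today; here it is exercised on a missing file (using a temp-directory path so the example is self-contained, with a deliberately nonexistent file name chosen for illustration):

```python
import os
import tempfile
from contextlib import suppress

# Attempt the removal; if the file is missing, the OSError is
# silently suppressed -- the same effect as the rejected
# "os.unlink(path) except OSError: pass" short-hand.
path = os.path.join(tempfile.gettempdir(), 'pep463-demo-nonexistent')
with suppress(OSError):
    os.unlink(path)
```

Either way the operation completes without a handler block, which is why no new syntax is needed for this case.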
Common objections
Colons always introduce suites
While it is true that many of Python's syntactic elements use the colon to introduce a statement suite (if, while, with, for, etcetera), this is not by any means the sole use of the colon. Currently, Python syntax includes four cases where a colon introduces a subexpression:
- dict display - { ... key:value ... }
- slice notation - [start:stop:step]
- function definition - parameter : annotation
- lambda - arg list: return value
This proposal simply adds a fifth:
- except-expression - exception list: result
Style guides and PEP 8 should recommend not having the colon at the end of a wrapped line, which could potentially look like the introduction of a suite, but instead advocate wrapping before the exception list, keeping the colon clearly between two expressions.
References
| [1] | https://mail.python.org/pipermail/python-ideas/2014-February/025443.html |
| [2] | https://mail.python.org/pipermail/python-ideas/2013-March/019760.html |
| [3] | https://mail.python.org/pipermail/python-ideas/2009-August/005441.html |
| [4] | https://mail.python.org/pipermail/python-ideas/2008-August/001801.html |
| [5] | http://www.skorks.com/2009/09/ruby-exceptions-and-exception-handling/ |
| [6] | http://erlang.org/doc/reference_manual/expressions.html#id79284 |
| [7] | http://www.cs.cmu.edu/~rwh/introsml/core/exceptions.htm |
| [8] | http://www2.lib.uchicago.edu/keith/ocaml-class/exceptions.html |
| [9] | http://mozart.github.io/mozart-v1/doc-1.4.0/tutorial/node5.html |
| [10] | http://clojure.org/special_forms#Special%20Forms--(try%20expr*%20catch-clause*%20finally-clause?) |
| [11] | http://clhs.lisp.se/Body/m_hand_1.htm |
| [12] | http://www.gammon.com.au/scripts/doc.php?lua=xpcall |
| [13] | http://wiki.tcl.tk/902 |
| [14] | http://smalltalk.gnu.org/wiki/exceptions |
Copyright
This document has been placed in the public domain.
pep-0464 Removal of the PyPI Mirror Authenticity API
| PEP: | 464 |
|---|---|
| Title: | Removal of the PyPI Mirror Authenticity API |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Donald Stufft <donald at stufft.io> |
| BDFL-Delegate: | Richard Jones <richard@python.org> |
| Discussions-To: | distutils-sig at python.org |
| Status: | Accepted |
| Type: | Process |
| Content-Type: | text/x-rst |
| Created: | 02-Mar-2014 |
| Post-History: | 04-Mar-2014 |
| Replaces: | 381 |
| Resolution: | https://mail.python.org/pipermail/distutils-sig/2014-March/024027.html |
Abstract
This PEP proposes the deprecation and removal of the PyPI Mirror Authenticity API, including the /serverkey URL and all of the URLs under /serversig.
Rationale
The PyPI mirroring infrastructure (defined in PEP 381) provides a means to mirror the content of PyPI used by the automatic installers, and as a component of that, it provides a method for verifying the authenticity of the mirrored content.
This PEP proposes the removal of this API due to:
- There are no known implementations that utilize this API; this includes pip and setuptools.
- Because this API uses DSA it is vulnerable to leaking the private key if there is any bias in the random nonce.
- This API solves one small corner of the trust problem; the problem itself is much larger, and it would be better to have a fully fledged system, such as The Update Framework, instead.
Due to these issues and the lack of use, it is the opinion of this PEP that the API does not provide enough practical benefit to justify the additional complexity.
Plan for Deprecation & Removal
Immediately upon the acceptance of this PEP the Mirror Authenticity API will be considered deprecated and mirroring agents and installation tools should stop accessing it.
Instead of actually removing it from the current code base (PyPI 1.0) the current work to replace PyPI 1.0 with a new code base (PyPI 2.0) will simply not implement this API. This would cause the API to be "removed" when the switch from 1.0 to 2.0 occurs.
If PyPI 2.0 has not been deployed in place of PyPI 1.0 by Sept 01 2014 then this PEP will be implemented in the PyPI 1.0 code base instead (by removing the associated code).
No changes will be required in the installers; however, PEP 381-compliant mirroring clients, such as bandersnatch and pep381client, will need to be updated to no longer attempt to mirror the /serversig URLs.
Copyright
This document has been placed in the public domain.
pep-0465 A dedicated infix operator for matrix multiplication
| PEP: | 465 |
|---|---|
| Title: | A dedicated infix operator for matrix multiplication |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Nathaniel J. Smith <njs at pobox.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 20-Feb-2014 |
| Python-Version: | 3.5 |
| Post-History: | 13-Mar-2014 |
Contents
- Abstract
- Specification
- Motivation
- Executive summary
- Background: What's wrong with the status quo?
- Why should matrix multiplication be infix?
- Transparent syntax is especially crucial for non-expert programmers
- But isn't matrix multiplication a pretty niche requirement?
- So @ is good for matrix formulas, but how common are those really?
- But isn't it weird to add an operator with no stdlib uses?
- Compatibility considerations
- Intended usage details
- Implementation details
- Rationale for specification details
- Rejected alternatives to adding a new operator
- Discussions of this PEP
- References
- Copyright
Abstract
This PEP proposes a new binary operator to be used for matrix multiplication, called @. (Mnemonic: @ is * for mATrices.)
Specification
A new binary operator is added to the Python language, together with the corresponding in-place version:
| Op | Precedence/associativity | Methods |
|---|---|---|
| @ | Same as * | __matmul__, __rmatmul__ |
| @= | n/a | __imatmul__ |
No implementations of these methods are added to the builtin or standard library types. However, a number of projects have reached consensus on the recommended semantics for these operations; see Intended usage details below for details.
For details on how this operator will be implemented in CPython, see Implementation details.
Motivation
Executive summary
In numerical code, there are two important operations which compete for use of Python's * operator: elementwise multiplication, and matrix multiplication. In the nearly twenty years since the Numeric library was first proposed, there have been many attempts to resolve this tension [13]; none have been really satisfactory. Currently, most numerical Python code uses * for elementwise multiplication, and function/method syntax for matrix multiplication; however, this leads to ugly and unreadable code in common circumstances. The problem is bad enough that significant amounts of code continue to use the opposite convention (which has the virtue of producing ugly and unreadable code in different circumstances), and this API fragmentation across codebases then creates yet more problems. There does not seem to be any good solution to the problem of designing a numerical API within current Python syntax -- only a landscape of options that are bad in different ways. The minimal change to Python syntax which is sufficient to resolve these problems is the addition of a single new infix operator for matrix multiplication.
Matrix multiplication has a singular combination of features which distinguish it from other binary operations, which together provide a uniquely compelling case for the addition of a dedicated infix operator:
- Just as for the existing numerical operators, there exists a vast body of prior art supporting the use of infix notation for matrix multiplication across all fields of mathematics, science, and engineering; @ harmoniously fills a hole in Python's existing operator system.
- @ greatly clarifies real-world code.
- @ provides a smoother onramp for less experienced users, who are particularly harmed by hard-to-read code and API fragmentation.
- @ benefits a substantial and growing portion of the Python user community.
- @ will be used frequently -- in fact, evidence suggests it may be used more frequently than // or the bitwise operators.
- @ allows the Python numerical community to reduce fragmentation, and finally standardize on a single consensus duck type for all numerical array objects.
Background: What's wrong with the status quo?
When we crunch numbers on a computer, we usually have lots and lots of numbers to deal with. Trying to deal with them one at a time is cumbersome and slow -- especially when using an interpreted language. Instead, we want the ability to write down simple operations that apply to large collections of numbers all at once. The n-dimensional array is the basic object that all popular numeric computing environments use to make this possible. Python has several libraries that provide such arrays, with numpy being at present the most prominent.
When working with n-dimensional arrays, there are two different ways we might want to define multiplication. One is elementwise multiplication:
[[1, 2],     [[11, 12],     [[1 * 11, 2 * 12],
 [3, 4]]  x   [13, 14]]  =   [3 * 13, 4 * 14]]
and the other is matrix multiplication [19]:
[[1, 2],     [[11, 12],     [[1 * 11 + 2 * 13, 1 * 12 + 2 * 14],
 [3, 4]]  x   [13, 14]]  =   [3 * 11 + 4 * 13, 3 * 12 + 4 * 14]]
Elementwise multiplication is useful because it lets us easily and quickly perform many multiplications on a large collection of values, without writing a slow and cumbersome for loop. And this works as part of a very general schema: when using the array objects provided by numpy or other numerical libraries, all Python operators work elementwise on arrays of all dimensionalities. The result is that one can write functions using straightforward code like a * b + c / d, treating the variables as if they were simple values, but then immediately use this function to efficiently perform this calculation on large collections of values, while keeping them organized using whatever arbitrarily complex array layout works best for the problem at hand.
Matrix multiplication is more of a special case. It's only defined on 2d arrays (also known as "matrices"), and multiplication is the only operation that has an important "matrix" version -- "matrix addition" is the same as elementwise addition; there is no such thing as "matrix bitwise-or" or "matrix floordiv"; "matrix division" and "matrix to-the-power-of" can be defined but are not very useful, etc. However, matrix multiplication is still used very heavily across all numerical application areas; mathematically, it's one of the most fundamental operations there is.
Because Python syntax currently allows for only a single multiplication operator *, libraries providing array-like objects must decide: either use * for elementwise multiplication, or use * for matrix multiplication. And, unfortunately, it turns out that when doing general-purpose number crunching, both operations are used frequently, and there are major advantages to using infix rather than function call syntax in both cases. Thus it is not at all clear which convention is optimal, or even acceptable; often it varies on a case-by-case basis.
Nonetheless, network effects mean that it is very important that we pick just one convention. In numpy, for example, it is technically possible to switch between the conventions, because numpy provides two different types with different __mul__ methods. For numpy.ndarray objects, * performs elementwise multiplication, and matrix multiplication must use a function call (numpy.dot). For numpy.matrix objects, * performs matrix multiplication, and elementwise multiplication requires function syntax. Writing code using numpy.ndarray works fine. Writing code using numpy.matrix also works fine. But trouble begins as soon as we try to integrate these two pieces of code together. Code that expects an ndarray and gets a matrix, or vice-versa, may crash or return incorrect results. Keeping track of which functions expect which types as inputs, and return which types as outputs, and then converting back and forth all the time, is incredibly cumbersome and impossible to get right at any scale. Functions that defensively try to handle both types as input and DTRT, find themselves floundering into a swamp of isinstance and if statements.
PEP 238 split / into two operators: / and //. Imagine the chaos that would have resulted if it had instead split int into two types: classic_int, whose __div__ implemented floor division, and new_int, whose __div__ implemented true division. This, in a more limited way, is the situation that Python number-crunchers currently find themselves in.
In practice, the vast majority of projects have settled on the convention of using * for elementwise multiplication, and function call syntax for matrix multiplication (e.g., using numpy.ndarray instead of numpy.matrix). This reduces the problems caused by API fragmentation, but it doesn't eliminate them. The strong desire to use infix notation for matrix multiplication has caused a number of specialized array libraries to continue to use the opposing convention (e.g., scipy.sparse, pyoperators, pyviennacl) despite the problems this causes, and numpy.matrix itself still gets used in introductory programming courses, often appears in StackOverflow answers, and so forth. Well-written libraries thus must continue to be prepared to deal with both types of objects, and, of course, are also stuck using unpleasant funcall syntax for matrix multiplication. After nearly two decades of trying, the numerical community has still not found any way to resolve these problems within the constraints of current Python syntax (see Rejected alternatives to adding a new operator below).
This PEP proposes the minimum effective change to Python syntax that will allow us to drain this swamp. It splits * into two operators, just as was done for /: * for elementwise multiplication, and @ for matrix multiplication. (Why not the reverse? Because this way is compatible with the existing consensus, and because it gives us a consistent rule that all the built-in numeric operators also apply in an elementwise manner to arrays; the reverse convention would lead to more special cases.)
So that's why matrix multiplication doesn't and can't just use *. Now, in the rest of this section, we'll explain why it nonetheless meets the high bar for adding a new operator.
Why should matrix multiplication be infix?
Right now, most numerical code in Python uses syntax like numpy.dot(a, b) or a.dot(b) to perform matrix multiplication. This obviously works, so why do people make such a fuss about it, even to the point of creating API fragmentation and compatibility swamps?
Matrix multiplication shares two features with ordinary arithmetic operations like addition and multiplication on numbers: (a) it is used very heavily in numerical programs -- often multiple times per line of code -- and (b) it has an ancient and universally adopted tradition of being written using infix syntax. This is because, for typical formulas, this notation is dramatically more readable than any function call syntax. Here's an example to demonstrate:
One of the most useful tools for testing a statistical hypothesis is the linear hypothesis test for OLS regression models. It doesn't really matter what all those words I just said mean; if we find ourselves having to implement this thing, what we'll do is look up some textbook or paper on it, and encounter many mathematical formulas that look like:
S = (Hβ − r)ᵀ (HVHᵀ)⁻¹ (Hβ − r)

Here the various variables are all vectors or matrices (details for the curious: [5]).
Now we need to write code to perform this calculation. In current numpy, matrix multiplication can be performed using either the function or method call syntax. Neither provides a particularly readable translation of the formula:
import numpy as np
from numpy.linalg import inv, solve
# Using dot function:
S = np.dot((np.dot(H, beta) - r).T,
np.dot(inv(np.dot(np.dot(H, V), H.T)), np.dot(H, beta) - r))
# Using dot method:
S = (H.dot(beta) - r).T.dot(inv(H.dot(V).dot(H.T))).dot(H.dot(beta) - r)
With the @ operator, the direct translation of the above formula becomes:
S = (H @ beta - r).T @ inv(H @ V @ H.T) @ (H @ beta - r)
Notice that there is now a transparent, 1-to-1 mapping between the symbols in the original formula and the code that implements it.
Of course, an experienced programmer will probably notice that this is not the best way to compute this expression. The repeated computation of Hβ − r should perhaps be factored out; and, expressions of the form dot(inv(A), B) should almost always be replaced by the more numerically stable solve(A, B). When using @, performing these two refactorings gives us:
# Version 1 (as above)
S = (H @ beta - r).T @ inv(H @ V @ H.T) @ (H @ beta - r)

# Version 2
trans_coef = H @ beta - r
S = trans_coef.T @ inv(H @ V @ H.T) @ trans_coef

# Version 3
S = trans_coef.T @ solve(H @ V @ H.T, trans_coef)
Notice that when comparing between each pair of steps, it's very easy to see exactly what was changed. If we apply the equivalent transformations to the code using the .dot method, then the changes are much harder to read out or verify for correctness:
# Version 1 (as above)
S = (H.dot(beta) - r).T.dot(inv(H.dot(V).dot(H.T))).dot(H.dot(beta) - r)

# Version 2
trans_coef = H.dot(beta) - r
S = trans_coef.T.dot(inv(H.dot(V).dot(H.T))).dot(trans_coef)

# Version 3
S = trans_coef.T.dot(solve(H.dot(V).dot(H.T)), trans_coef)
Readability counts! The statements using @ are shorter, contain more whitespace, can be directly and easily compared both to each other and to the textbook formula, and contain only meaningful parentheses. This last point is particularly important for readability: when using function-call syntax, the required parentheses on every operation create visual clutter that makes it very difficult to parse out the overall structure of the formula by eye, even for a relatively simple formula like this one. Eyes are terrible at parsing non-regular languages. I made and caught many errors while trying to write out the 'dot' formulas above. I know they still contain at least one error, maybe more. (Exercise: find it. Or them.) The @ examples, by contrast, are not only correct, they're obviously correct at a glance.
If we are even more sophisticated programmers, and writing code that we expect to be reused, then considerations of speed or numerical accuracy might lead us to prefer some particular order of evaluation. Because @ makes it possible to omit irrelevant parentheses, we can be certain that if we do write something like (H @ V) @ H.T, then our readers will know that the parentheses must have been added intentionally to accomplish some meaningful purpose. In the dot examples, it's impossible to know which nesting decisions are important, and which are arbitrary.
Infix @ dramatically improves matrix code usability at all stages of programmer interaction.
Transparent syntax is especially crucial for non-expert programmers
A large proportion of scientific code is written by people who are experts in their domain, but are not experts in programming. And there are many university courses run each year with titles like "Data analysis for social scientists" which assume no programming background, and teach some combination of mathematical techniques, introduction to programming, and the use of programming to implement these mathematical techniques, all within a 10-15 week period. These courses are more and more often being taught in Python rather than special-purpose languages like R or Matlab.
For these kinds of users, whose programming knowledge is fragile, the existence of a transparent mapping between formulas and code often means the difference between succeeding and failing to write that code at all. This is so important that such classes often use the numpy.matrix type which defines * to mean matrix multiplication, even though this type is buggy and heavily disrecommended by the rest of the numpy community for the fragmentation that it causes. This pedagogical use case is, in fact, the only reason numpy.matrix remains a supported part of numpy. Adding @ will benefit both beginning and advanced users with better syntax; and furthermore, it will allow both groups to standardize on the same notation from the start, providing a smoother on-ramp to expertise.
But isn't matrix multiplication a pretty niche requirement?
The world is full of continuous data, and computers are increasingly called upon to work with it in sophisticated ways. Arrays are the lingua franca of finance, machine learning, 3d graphics, computer vision, robotics, operations research, econometrics, meteorology, computational linguistics, recommendation systems, neuroscience, astronomy, bioinformatics (including genetics, cancer research, drug discovery, etc.), physics engines, quantum mechanics, geophysics, network analysis, and many other application areas. In most or all of these areas, Python is rapidly becoming a dominant player, in large part because of its ability to elegantly mix traditional discrete data structures (hash tables, strings, etc.) on an equal footing with modern numerical data types and algorithms.
We all live in our own little sub-communities, so some Python users may be surprised to realize the sheer extent to which Python is used for number crunching -- especially since much of this particular sub-community's activity occurs outside of traditional Python/FOSS channels. So, to give some rough idea of just how many numerical Python programmers are actually out there, here are two numbers: In 2013, there were 7 international conferences organized specifically on numerical Python [3] [4]. At PyCon 2014, ~20% of the tutorials appear to involve the use of matrices [6].
To quantify this further, we used Github's "search" function to look at what modules are actually imported across a wide range of real-world code (i.e., all the code on Github). We checked for imports of several popular stdlib modules, a variety of numerically oriented modules, and various other extremely high-profile modules like django and lxml (the latter of which is the #1 most downloaded package on PyPI). Starred lines indicate packages which export array- or matrix-like objects which will adopt @ if this PEP is approved:
Count of Python source files on Github matching given search terms
(as of 2014-04-10, ~21:00 UTC)
================ ========== =============== ======= ===========
module "import X" "from X import" total total/numpy
================ ========== =============== ======= ===========
sys 2374638 63301 2437939 5.85
os 1971515 37571 2009086 4.82
re 1294651 8358 1303009 3.12
numpy ************** 337916 ********** 79065 * 416981 ******* 1.00
warnings 298195 73150 371345 0.89
subprocess 281290 63644 344934 0.83
django 62795 219302 282097 0.68
math 200084 81903 281987 0.68
threading 212302 45423 257725 0.62
pickle+cPickle 215349 22672 238021 0.57
matplotlib 119054 27859 146913 0.35
sqlalchemy 29842 82850 112692 0.27
pylab *************** 36754 ********** 41063 ** 77817 ******* 0.19
scipy *************** 40829 ********** 28263 ** 69092 ******* 0.17
lxml 19026 38061 57087 0.14
zlib 40486 6623 47109 0.11
multiprocessing 25247 19850 45097 0.11
requests 30896 560 31456 0.08
jinja2 8057 24047 32104 0.08
twisted 13858 6404 20262 0.05
gevent 11309 8529 19838 0.05
pandas ************** 14923 *********** 4005 ** 18928 ******* 0.05
sympy 2779 9537 12316 0.03
theano *************** 3654 *********** 1828 *** 5482 ******* 0.01
================ ========== =============== ======= ===========
These numbers should be taken with several grains of salt (see footnote for discussion: [12]), but, to the extent they can be trusted, they suggest that numpy might be the single most-imported non-stdlib module in the entire Pythonverse; it's even more-imported than such stdlib stalwarts as subprocess, math, pickle, and threading. And numpy users represent only a subset of the broader numerical community that will benefit from the @ operator. Matrices may once have been a niche data type restricted to Fortran programs running in university labs and military clusters, but those days are long gone. Number crunching is a mainstream part of modern Python usage.
In addition, there is some precedent for adding an infix operator to handle a more-specialized arithmetic operation: the floor division operator //, like the bitwise operators, is very useful under certain circumstances when performing exact calculations on discrete values. But it seems likely that there are many Python programmers who have never had reason to use // (or, for that matter, the bitwise operators). @ is no more niche than //.
So @ is good for matrix formulas, but how common are those really?
We've seen that @ makes matrix formulas dramatically easier to work with for both experts and non-experts, that matrix formulas appear in many important applications, and that numerical libraries like numpy are used by a substantial proportion of Python's user base. But numerical libraries aren't just about matrix formulas, and being important doesn't necessarily mean taking up a lot of code: if matrix formulas only occurred in one or two places in the average numerically-oriented project, then it still wouldn't be worth adding a new operator. So how common is matrix multiplication, really?
When the going gets tough, the tough get empirical. To get a rough estimate of how useful the @ operator will be, the table below shows the rate at which different Python operators are actually used in the stdlib, and also in two high-profile numerical packages -- the scikit-learn machine learning library, and the nipy neuroimaging library -- normalized by source lines of code (SLOC). Rows are sorted by the 'combined' column, which pools all three code bases together. The combined column is thus strongly weighted towards the stdlib, which is much larger than both projects put together (stdlib: 411575 SLOC, scikit-learn: 50924 SLOC, nipy: 37078 SLOC). [7]
The dot row (marked ******) counts how common matrix multiply operations are in each codebase.
====  ======  ============  ====  ========
  op  stdlib  scikit-learn  nipy  combined
====  ======  ============  ====  ========
   =    2969          5536  4932      3376 / 10,000 SLOC
   -     218           444   496       261
   +     224           201   348       231
  ==     177           248   334       196
   *     156           284   465       192
   %     121           114   107       119
  **      59           111   118        68
  !=      40            56    74        44
   /      18           121   183        41
   >      29            70   110        39
  +=      34            61    67        39
   <      32            62    76        38
  >=      19            17    17        18
  <=      18            27    12        18
 dot ***** 0 ********** 99 ** 74 ****** 16
   |      18             1     2        15
   &      14             0     6        12
  <<      10             1     1         8
  //       9             9     1         8
  -=       5            21    14         8
  *=       2            19    22         5
  /=       0            23    16         4
  >>       4             0     0         3
   ^       3             0     0         3
   ~       2             4     5         2
  |=       3             0     0         2
  &=       1             0     0         1
 //=       1             0     0         1
  ^=       1             0     0         0
 **=       0             2     0         0
  %=       0             0     0         0
 <<=       0             0     0         0
 >>=       0             0     0         0
====  ======  ============  ====  ========
These two numerical packages alone contain ~780 uses of matrix multiplication. Within these packages, matrix multiplication is used more heavily than most comparison operators (< != <= >=). Even when we dilute these counts by including the stdlib into our comparisons, matrix multiplication is still used more often in total than any of the bitwise operators, and 2x as often as //. This is true even though the stdlib, which contains a fair amount of integer arithmetic and no matrix operations, makes up more than 80% of the combined code base.
By coincidence, the numeric libraries make up approximately the same proportion of the 'combined' codebase as numeric tutorials make up of PyCon 2014's tutorial schedule, which suggests that the 'combined' column may not be wildly unrepresentative of new Python code in general. While it's impossible to know for certain, from this data it seems entirely possible that across all Python code currently being written, matrix multiplication is already used more often than // and the bitwise operations.
But isn't it weird to add an operator with no stdlib uses?
It's certainly unusual (though extended slicing existed for some time before builtin types gained support for it, Ellipsis is still unused within the stdlib, etc.). But the important thing is whether a change will benefit users, not where the software is being downloaded from. It's clear from the above that @ will be used, and used heavily. And this PEP provides the critical piece that will allow the Python numerical community to finally reach consensus on a standard duck type for all array-like objects, which is a necessary precondition to ever adding a numerical array type to the stdlib.
Compatibility considerations
Currently, the only legal use of the @ token in Python code is at statement beginning in decorators. The new operators are both infix; the one place they can never occur is at statement beginning. Therefore, no existing code will be broken by the addition of these operators, and there is no possible parsing ambiguity between decorator-@ and the new operators.
Another important kind of compatibility is the mental cost paid by users to update their understanding of the Python language after this change, particularly for users who do not work with matrices and thus do not benefit. Here again, @ has minimal impact: even comprehensive tutorials and references will only need to add a sentence or two to fully document this PEP's changes for a non-numerical audience.
Intended usage details
This section is informative, rather than normative -- it documents the consensus of a number of libraries that provide array- or matrix-like objects on how @ will be implemented.
This section uses the numpy terminology for describing arbitrary multidimensional arrays of data, because it is a superset of all other commonly used models. In this model, the shape of any array is represented by a tuple of integers. Because matrices are two-dimensional, they have len(shape) == 2, while 1d vectors have len(shape) == 1, and scalars have shape == (), i.e., they are "0 dimensional". Any array contains prod(shape) total entries. Notice that prod(()) == 1 [20] (for the same reason that sum(()) == 0); scalars are just an ordinary kind of array, not a special case. Notice also that we distinguish between a single scalar value (shape == (), analogous to 1), a vector containing only a single entry (shape == (1,), analogous to [1]), a matrix containing only a single entry (shape == (1, 1), analogous to [[1]]), etc., so the dimensionality of any array is always well-defined. Other libraries with more restricted representations (e.g., those that support 2d arrays only) might implement only a subset of the functionality described here.
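The shape model described here can be checked interactively; the distinction between a scalar, a length-1 vector, and a 1x1 matrix is visible in numpy's shapes:

```python
import numpy as np

scalar = np.array(5.0)       # shape == (), 0-dimensional
vector = np.array([5.0])     # shape == (1,), 1-dimensional
matrix = np.array([[5.0]])   # shape == (1, 1), 2-dimensional

assert scalar.shape == ()
assert vector.shape == (1,)
assert matrix.shape == (1, 1)

# Total entries == prod(shape); the empty product is 1, so a scalar holds
# exactly one entry and is just an ordinary array, not a special case.
assert int(np.prod(())) == 1
assert scalar.size == vector.size == matrix.size == 1
```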
Semantics
The recommended semantics for @ for different inputs are:
2d inputs are conventional matrices, and so the semantics are obvious: we apply conventional matrix multiplication. If we write arr(2, 3) to represent an arbitrary 2x3 array, then arr(2, 3) @ arr(3, 4) returns an array with shape (2, 4).
1d vector inputs are promoted to 2d by prepending or appending a '1' to the shape, the operation is performed, and then the added dimension is removed from the output. The 1 is always added on the "outside" of the shape: prepended for left arguments, and appended for right arguments. The result is that matrix @ vector and vector @ matrix are both legal (assuming compatible shapes), and both return 1d vectors; vector @ vector returns a scalar. This is clearer with examples.
- arr(2, 3) @ arr(3, 1) is a regular matrix product, and returns an array with shape (2, 1), i.e., a column vector.
- arr(2, 3) @ arr(3) performs the same computation as the previous (i.e., treats the 1d vector as a matrix containing a single column, shape = (3, 1)), but returns the result with shape (2,), i.e., a 1d vector.
- arr(1, 3) @ arr(3, 2) is a regular matrix product, and returns an array with shape (1, 2), i.e., a row vector.
- arr(3) @ arr(3, 2) performs the same computation as the previous (i.e., treats the 1d vector as a matrix containing a single row, shape = (1, 3)), but returns the result with shape (2,), i.e., a 1d vector.
- arr(1, 3) @ arr(3, 1) is a regular matrix product, and returns an array with shape (1, 1), i.e., a single value in matrix form.
- arr(3) @ arr(3) performs the same computation as the previous, but returns the result with shape (), i.e., a single scalar value, not in matrix form. So this is the standard inner product on vectors.
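The shape rules in the list above can be verified against numpy.matmul, which implements these semantics (the small `arr` helper below is just illustration, mirroring the `arr(2, 3)` notation used in the text):

```python
import numpy as np

def arr(*shape):
    # Convenience helper: an array of ones with the given shape.
    return np.ones(shape)

assert np.matmul(arr(2, 3), arr(3, 1)).shape == (2, 1)  # column vector result
assert np.matmul(arr(2, 3), arr(3)).shape == (2,)       # 1d right argument
assert np.matmul(arr(1, 3), arr(3, 2)).shape == (1, 2)  # row vector result
assert np.matmul(arr(3), arr(3, 2)).shape == (2,)       # 1d left argument
assert np.matmul(arr(1, 3), arr(3, 1)).shape == (1, 1)  # 1x1 matrix
assert np.matmul(arr(3), arr(3)).shape == ()            # inner product: scalar
```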
An infelicity of this definition for 1d vectors is that it makes @ non-associative in some cases ((Mat1 @ vec) @ Mat2 != Mat1 @ (vec @ Mat2)). But this seems to be a case where practicality beats purity: non-associativity only arises for strange expressions that would never be written in practice; if they are written anyway then there is a consistent rule for understanding what will happen (Mat1 @ vec @ Mat2 is parsed as (Mat1 @ vec) @ Mat2, just like a - b - c); and, not supporting 1d vectors would rule out many important use cases that do arise very commonly in practice. No-one wants to explain to new users why to solve the simplest linear system in the obvious way, they have to type (inv(A) @ b[:, np.newaxis]).flatten() instead of inv(A) @ b, or perform an ordinary least-squares regression by typing solve(X.T @ X, X @ y[:, np.newaxis]).flatten() instead of solve(X.T @ X, X @ y). No-one wants to type (a[np.newaxis, :] @ b[:, np.newaxis])[0, 0] instead of a @ b every time they compute an inner product, or (a[np.newaxis, :] @ Mat @ b[:, np.newaxis])[0, 0] for general quadratic forms instead of a @ Mat @ b. In addition, sage and sympy (see below) use these non-associative semantics with an infix matrix multiplication operator (they use *), and they report that they haven't experienced any problems caused by it.
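The non-associativity mentioned above is easy to exhibit with numpy.matmul: both groupings are legal and both yield 1d vectors, but the values differ (example matrices chosen arbitrarily for illustration):

```python
import numpy as np

Mat1 = np.array([[1.0, 0.0], [0.0, 2.0]])
Mat2 = np.array([[0.0, 1.0], [1.0, 0.0]])
vec = np.array([1.0, 1.0])

left = np.matmul(np.matmul(Mat1, vec), Mat2)   # (Mat1 @ vec) @ Mat2 -> [2., 1.]
right = np.matmul(Mat1, np.matmul(vec, Mat2))  # Mat1 @ (vec @ Mat2) -> [1., 2.]

# Both are 1d vectors of the same shape, but not equal:
assert left.shape == right.shape == (2,)
assert not np.array_equal(left, right)
```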
For inputs with more than 2 dimensions, we treat the last two dimensions as being the dimensions of the matrices to multiply, and 'broadcast' across the other dimensions. This provides a convenient way to quickly compute many matrix products in a single operation. For example, arr(10, 2, 3) @ arr(10, 3, 4) performs 10 separate matrix multiplies, each of which multiplies a 2x3 and a 3x4 matrix to produce a 2x4 matrix, and then returns the 10 resulting matrices together in an array with shape (10, 2, 4). The intuition here is that we treat these 3d arrays of numbers as if they were 1d arrays of matrices, and then apply matrix multiplication in an elementwise manner, where now each 'element' is a whole matrix. Note that broadcasting is not limited to perfectly aligned arrays; in more complicated cases, it allows several simple but powerful tricks for controlling how arrays are aligned with each other; see [10] for details. (In particular, it turns out that when broadcasting is taken into account, the standard scalar * matrix product is a special case of the elementwise multiplication operator *.)
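The "stack of matrices" behavior described here is what numpy.matmul does; each slot of the result is one ordinary matrix product:

```python
import numpy as np

# A stack of 10 (2x3) matrices times a stack of 10 (3x4) matrices:
# matmul multiplies matching pairs, yielding 10 (2x4) results.
a = np.random.rand(10, 2, 3)
b = np.random.rand(10, 3, 4)
stacked = np.matmul(a, b)
assert stacked.shape == (10, 2, 4)

# Each "element" of the result is one ordinary 2d matrix product:
assert np.allclose(stacked[0], np.dot(a[0], b[0]))
```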
If one operand is >2d, and another operand is 1d, then the above rules apply unchanged, with 1d->2d promotion performed before broadcasting. E.g., arr(10, 2, 3) @ arr(3) first promotes to arr(10, 2, 3) @ arr(3, 1), then broadcasts the right argument to create the aligned operation arr(10, 2, 3) @ arr(10, 3, 1), multiplies to get an array with shape (10, 2, 1), and finally removes the added dimension, returning an array with shape (10, 2). Similarly, arr(2) @ arr(10, 2, 3) produces an intermediate array with shape (10, 1, 3), and a final array with shape (10, 3).
0d (scalar) inputs raise an error. Scalar * matrix multiplication is a mathematically and algorithmically distinct operation from matrix @ matrix multiplication, and is already covered by the elementwise * operator. Allowing scalar @ matrix would thus both require an unnecessary special case, and violate TOOWTDI.
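Both of the preceding rules, 1d promotion combined with broadcasting, and the rejection of 0d operands, can be confirmed with numpy.matmul (the exact exception type for the scalar case is numpy's choice, so the sketch below catches broadly):

```python
import numpy as np

# 1d operands are promoted, broadcast against the stacked operand,
# multiplied, and then the added dimension is squeezed back out:
assert np.matmul(np.ones((10, 2, 3)), np.ones(3)).shape == (10, 2)
assert np.matmul(np.ones(2), np.ones((10, 2, 3))).shape == (10, 3)

# 0d (scalar) operands are rejected; scalar scaling already has `*`:
try:
    np.matmul(2, np.ones((2, 2)))
except (TypeError, ValueError):
    pass  # numpy raises rather than guessing a meaning for scalar @ matrix
else:
    raise AssertionError("scalar @ matrix should be an error")
```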
Adoption
We group existing Python projects which provide array- or matrix-like types based on what API they currently use for elementwise and matrix multiplication.
Projects which currently use * for elementwise multiplication, and function/method calls for matrix multiplication:
The developers of the following projects have expressed an intention to implement @ on their array-like types using the above semantics:
- numpy
- pandas
- blaze
- theano
The following projects have been alerted to the existence of the PEP, but it's not yet known what they plan to do if it's accepted. We don't anticipate that they'll have any objections, though, since everything proposed here is consistent with how they already do things:
- pycuda
- panda3d
Projects which currently use * for matrix multiplication, and function/method calls for elementwise multiplication:
The following projects have expressed an intention, if this PEP is accepted, to migrate from their current API to the elementwise-*, matmul-@ convention (i.e., this is a list of projects whose API fragmentation will probably be eliminated if this PEP is accepted):
- numpy (numpy.matrix)
- scipy.sparse
- pyoperators
- pyviennacl
The following projects have been alerted to the existence of the PEP, but it's not known what they plan to do if it's accepted (i.e., this is a list of projects whose API fragmentation may or may not be eliminated if this PEP is accepted):
- cvxopt
Projects which currently use * for matrix multiplication, and which don't really care about elementwise multiplication of matrices:
There are several projects which implement matrix types, but from a very different perspective than the numerical libraries discussed above. These projects focus on computational methods for analyzing matrices in the sense of abstract mathematical objects (i.e., linear maps over free modules over rings), rather than as big bags full of numbers that need crunching. And it turns out that from the abstract math point of view, there isn't much use for elementwise operations in the first place; as discussed in the Background section above, elementwise operations are motivated by the bag-of-numbers approach. So these projects don't encounter the basic problem that this PEP exists to address, making it mostly irrelevant to them; while they appear superficially similar to projects like numpy, they're actually doing something quite different. They use * for matrix multiplication (and for group actions, and so forth), and if this PEP is accepted, their expressed intention is to continue doing so, while perhaps adding @ as an alias. These projects include:
- sympy
- sage
Implementation details
New functions operator.matmul and operator.__matmul__ are added to the standard library, with the usual semantics.
A corresponding function PyObject* PyObject_MatrixMultiply(PyObject *o1, PyObject *o2) is added to the C API.
A new AST node is added named MatMult, along with a new token ATEQUAL and new bytecode opcodes BINARY_MATRIX_MULTIPLY and INPLACE_MATRIX_MULTIPLY.
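On Python versions that implement this PEP, the new pieces are observable from Python code: the parser produces the MatMult node, and operator.matmul dispatches to __matmul__ like the other operator helpers (the MiniMat class below is a toy type invented purely for illustration):

```python
import ast
import operator

# The parser produces the new MatMult AST node for infix `a @ b`:
tree = ast.parse("a @ b", mode="eval")
assert isinstance(tree.body.op, ast.MatMult)

# operator.matmul dispatches to __matmul__, like operator.mul does to __mul__:
class MiniMat:
    def __init__(self, value):
        self.value = value
    def __matmul__(self, other):
        # Toy "product" standing in for a real matrix multiply.
        return MiniMat(self.value * other.value)

result = operator.matmul(MiniMat(6), MiniMat(7))
assert result.value == 42
```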
Two new type slots are added; whether this is to PyNumberMethods or a new PyMatrixMethods struct remains to be determined.
Rationale for specification details
Choice of operator
Why @ instead of some other spelling? There isn't any consensus across other programming languages about how this operator should be named [11]; here we discuss the various options.
Restricting ourselves only to symbols present on US English keyboards, the punctuation characters that don't already have a meaning in Python expression context are: @, backtick, $, !, and ?. Of these options, @ is clearly the best; ! and ? are already heavily freighted with inapplicable meanings in the programming context, backtick has been banned from Python by BDFL pronouncement (see PEP 3099), and $ is uglier, even more dissimilar to * and ⋅, and has Perl/PHP baggage. $ is probably the second-best option of these, though.
Symbols which are not present on US English keyboards start at a significant disadvantage (having to spend 5 minutes at the beginning of every numeric Python tutorial just going over keyboard layouts is not a hassle anyone really wants). Plus, even if we somehow overcame the typing problem, it's not clear there are any that are actually better than @. Some options that have been suggested include:
- U+00D7 MULTIPLICATION SIGN: A × B
- U+22C5 DOT OPERATOR: A ⋅ B
- U+2297 CIRCLED TIMES: A ⊗ B
- U+00B0 DEGREE: A ° B
What we need, though, is an operator that means "matrix multiplication, as opposed to scalar/elementwise multiplication". There is no conventional symbol with this meaning in either programming or mathematics, where these operations are usually distinguished by context. (And U+2297 CIRCLED TIMES is actually used conventionally to mean exactly the wrong things: elementwise multiplication -- the "Hadamard product" -- or outer product, rather than matrix/inner product like our operator). @ at least has the virtue that it looks like a funny non-commutative operator; a naive user who knows maths but not programming couldn't look at A * B versus A × B, or A * B versus A ⋅ B, or A * B versus A ° B and guess which one is the usual multiplication, and which one is the special case.
Finally, there is the option of using multi-character tokens. Some options:
- Matlab and Julia use a .* operator. Aside from being visually confusable with *, this would be a terrible choice for us because in Matlab and Julia, * means matrix multiplication and .* means elementwise multiplication, so using .* for matrix multiplication would make us exactly backwards from what Matlab and Julia users expect.
- APL apparently used +.×, which by combining a multi-character token, confusing attribute-access-like . syntax, and a unicode character, ranks somewhere below U+2603 SNOWMAN on our candidate list. If we like the idea of combining addition and multiplication operators as being evocative of how matrix multiplication actually works, then something like +* could be used -- though this may be too easy to confuse with *+, which is just multiplication combined with the unary + operator.
- PEP 211 suggested ~*. This has the downside that it sort of suggests that there is a unary * operator that is being combined with unary ~, but it could work.
- R uses %*% for matrix multiplication. In R this forms part of a general extensible infix system in which all tokens of the form %foo% are user-defined binary operators. We could steal the token without stealing the system.
- Some other plausible candidates that have been suggested: >< (= ascii drawing of the multiplication sign ×); the footnote operator [*] or |*| (but when used in context, the use of vertical grouping symbols tends to recreate the nested parentheses visual clutter that was noted as one of the major downsides of the function syntax we're trying to get away from); ^*.
So, it doesn't matter much, but @ seems as good or better than any of the alternatives:
- It's a friendly character that Pythoneers are already used to typing in decorators, but the decorator usage and the math expression usage are sufficiently dissimilar that it would be hard to confuse them in practice.
- It's widely accessible across keyboard layouts (and thanks to its use in email addresses, this is true even of weird keyboards like those in phones).
- It's round like * and ⋅.
- The mATrices mnemonic is cute.
- The swirly shape is reminiscent of the simultaneous sweeps over rows and columns that define matrix multiplication.
- Its asymmetry is evocative of its non-commutative nature.
- Whatever, we have to pick something.
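For concreteness, here is a minimal sketch (a toy class, not numpy) of how a type can give * elementwise semantics and @ matrix semantics via the __mul__ and __matmul__ hooks this PEP adds:

```python
class Mat:
    """Toy 2x2 matrix distinguishing * (elementwise) from @ (matmul)."""
    def __init__(self, rows):
        self.rows = rows

    def __mul__(self, other):
        # Elementwise (Hadamard) product, matching numpy's convention for *.
        return Mat([[a * b for a, b in zip(r1, r2)]
                    for r1, r2 in zip(self.rows, other.rows)])

    def __matmul__(self, other):
        # Matrix (inner) product: rows of self against columns of other.
        cols = list(zip(*other.rows))
        return Mat([[sum(a * b for a, b in zip(row, col)) for col in cols]
                    for row in self.rows])

A = Mat([[1, 2], [3, 4]])
B = Mat([[5, 6], [7, 8]])
print((A * B).rows)   # [[5, 12], [21, 32]]
print((A @ B).rows)   # [[19, 22], [43, 50]]
```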
Precedence and associativity
There was a long discussion [15] about whether @ should be right- or left-associative (or even something more exotic [18]). Almost all Python operators are left-associative, so following this convention would be the simplest approach, but there were two arguments that suggested matrix multiplication might be worth making right-associative as a special case:
First, matrix multiplication has a tight conceptual association with function application/composition, so many mathematically sophisticated users have an intuition that an expression like RSx proceeds from right-to-left, with first S transforming the vector x, and then R transforming the result. This isn't universally agreed (and not all number-crunchers are steeped in the pure-math conceptual framework that motivates this intuition [16]), but at the least this intuition is more common than for other operations like 2⋅3⋅4 which everyone reads as going from left-to-right.
Second, if expressions like Mat @ Mat @ vec appear often in code, then programs will run faster (and efficiency-minded programmers will be able to use fewer parentheses) if this is evaluated as Mat @ (Mat @ vec) than if it is evaluated as (Mat @ Mat) @ vec.
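To make the asymmetry concrete, here is a back-of-envelope count of scalar multiplies for an n-by-n matrix and a length-n vector (additions ignored); the groupings differ by roughly a factor of n:

```python
def multiplies_left(n):
    # (Mat @ Mat) @ vec: one n x n matmul (n**3), then one matvec (n**2).
    return n**3 + n**2

def multiplies_right(n):
    # Mat @ (Mat @ vec): two matrix-vector products.
    return 2 * n**2

n = 1000
print(multiplies_left(n), multiplies_right(n))  # ~1e9 vs. ~2e6
```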
However, weighing against these arguments are the following:
Regarding the efficiency argument, empirically, we were unable to find any evidence that Mat @ Mat @ vec type expressions actually dominate in real-life code. Parsing a number of large projects that use numpy, we found that when forced by numpy's current funcall syntax to choose an order of operations for nested calls to dot, people actually use left-associative nesting slightly more often than right-associative nesting [17]. And anyway, writing parentheses isn't so bad -- if an efficiency-minded programmer is going to take the trouble to think through the best way to evaluate some expression, they probably should write down the parentheses regardless of whether they're needed, just to make it obvious to the next reader that the order of operations matters.
In addition, it turns out that other languages, including those with much more of a focus on linear algebra, overwhelmingly make their matmul operators left-associative. Specifically, the @ equivalent is left-associative in R, Matlab, Julia, IDL, and Gauss. The only exceptions we found are Mathematica, in which a @ b @ c would be parsed non-associatively as dot(a, b, c), and APL, in which all operators are right-associative. There do not seem to exist any languages that make @ right-associative and * left-associative. And these decisions don't seem to be controversial -- I've never seen anyone complaining about this particular aspect of any of these other languages, and the left-associativity of * doesn't seem to bother users of the existing Python libraries that use * for matrix multiplication. So, at the least we can conclude from this that making @ left-associative will certainly not cause any disasters. Making @ right-associative, OTOH, would be exploring new and uncertain ground.
And another advantage of left-associativity is that it is much easier to learn and remember that @ acts like *, than it is to remember first that @ is unlike other Python operators by being right-associative, and then on top of this, also have to remember whether it is more tightly or more loosely binding than *. (Right-associativity forces us to choose a precedence, and intuitions were about equally split on which precedence made more sense. So this suggests that no matter which choice we made, no-one would be able to guess or remember it.)
On net, therefore, the general consensus of the numerical community is that while matrix multiplication is something of a special case, it's not special enough to break the rules, and @ should parse like * does.
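One can check the chosen parse directly: a toy type whose __matmul__ just records grouping shows that a @ b @ c is evaluated left-associatively, as (a @ b) @ c:

```python
class Tracer:
    def __init__(self, name):
        self.name = name

    def __matmul__(self, other):
        # Record the grouping instead of doing any arithmetic.
        return Tracer("({} @ {})".format(self.name, other.name))

a, b, c = Tracer("a"), Tracer("b"), Tracer("c")
print((a @ b @ c).name)  # ((a @ b) @ c)
```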
(Non)-Definitions for built-in types
No __matmul__ or __matpow__ are defined for builtin numeric types (float, int, etc.) or for the numbers.Number hierarchy, because these types represent scalars, and the consensus semantics for @ are that it should raise an error on scalars.
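Accordingly, applying @ to plain numbers fails immediately rather than silently guessing a meaning:

```python
# int defines no __matmul__, so @ on scalars raises TypeError.
try:
    2 @ 3
except TypeError as exc:
    print("as expected:", exc)  # unsupported operand type(s) for @
```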
We do not -- for now -- define a __matmul__ method on the standard memoryview or array.array objects, for several reasons. Of course this could be added if someone wants it, but these types would require quite a bit of additional work beyond __matmul__ before they could be used for numeric work -- e.g., they have no way to do addition or scalar multiplication either! -- and adding such functionality is beyond the scope of this PEP. In addition, providing a quality implementation of matrix multiplication is highly non-trivial. Naive nested loop implementations are very slow and shipping such an implementation in CPython would just create a trap for users. But the alternative -- providing a modern, competitive matrix multiply -- would require that CPython link to a BLAS library, which brings a set of new complications. In particular, several popular BLAS libraries (including the one that ships by default on OS X) currently break the use of multiprocessing [8]. Together, these considerations mean that the cost/benefit of adding __matmul__ to these types just isn't there, so for now we'll continue to delegate these problems to numpy and friends, and defer a more systematic solution to a future proposal.
There are also non-numeric Python builtins which define __mul__ (str, list, ...). We do not define __matmul__ for these types either, because why would we even do that.
Non-definition of matrix power
Earlier versions of this PEP also proposed a matrix power operator, @@, analogous to **. But on further consideration, it was decided that the utility of this was sufficiently unclear that it would be better to leave it out for now, and only revisit the issue if -- once we have more experience with @ -- it turns out that @@ is truly missed. [14]
Rejected alternatives to adding a new operator
Over the past few decades, the Python numeric community has explored a variety of ways to resolve the tension between matrix and elementwise multiplication operations. PEP 211 and PEP 225, both proposed in 2000 and last seriously discussed in 2008 [9], were early attempts to add new operators to solve this problem, but suffered from serious flaws; in particular, at that time the Python numerical community had not yet reached consensus on the proper API for array objects, or on what operators might be needed or useful (e.g., PEP 225 proposes 6 new operators with unspecified semantics). Experience since then has now led to consensus that the best solution, for both numeric Python and core Python, is to add a single infix operator for matrix multiply (together with the in-place variant @= that this implies).
We review some of the rejected alternatives here.
Use a second type that defines __mul__ as matrix multiplication: As discussed above (Background: What's wrong with the status quo?), this has been tried for many years via the numpy.matrix type (and its predecessors in Numeric and numarray). The result is a strong consensus among both numpy developers and developers of downstream packages that numpy.matrix should essentially never be used, because of the problems caused by having conflicting duck types for arrays. (Of course one could then argue we should only define __mul__ to be matrix multiplication, but then we'd have the same problem with elementwise multiplication.) There have been several pushes to remove numpy.matrix entirely; the only counter-arguments have come from educators who find that its problems are outweighed by the need to provide a simple and clear mapping between mathematical notation and code for novices (see Transparent syntax is especially crucial for non-expert programmers). But, of course, starting out newbies with a dispreferred syntax and then expecting them to transition later causes its own problems. The two-type cure is worse than the disease.
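The duck-typing conflict can be illustrated with a toy sketch (these are hypothetical classes, not numpy's actual types): two array-like types with the same interface but different meanings for *, so generic code silently changes behaviour depending on which one it is handed.

```python
class ElementwiseArray:
    """Array-like type where * means elementwise multiplication."""
    def __init__(self, data):
        self.data = data

    def __mul__(self, other):
        return ElementwiseArray([a * b for a, b in zip(self.data, other.data)])

class MatrixLike:
    """Same interface, but * means an inner (matrix-style) product."""
    def __init__(self, data):
        self.data = data

    def __mul__(self, other):
        return MatrixLike([sum(a * b for a, b in zip(self.data, other.data))])

def scale_pairwise(x, y):
    # Generic code that assumes * is elementwise: it quietly does
    # something completely different for the matrix-flavoured type.
    return (x * y).data

print(scale_pairwise(ElementwiseArray([1, 2]), ElementwiseArray([3, 4])))  # [3, 8]
print(scale_pairwise(MatrixLike([1, 2]), MatrixLike([3, 4])))              # [11]
```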
Add lots of new operators, or add a new generic syntax for defining infix operators: In addition to being generally un-Pythonic and repeatedly rejected by BDFL fiat, this would be using a sledgehammer to smash a fly. The scientific python community has consensus that adding one operator for matrix multiplication is enough to fix the one otherwise unfixable pain point. (In retrospect, we all think PEP 225 was a bad idea too -- or at least far more complex than it needed to be.)
Add a new @ (or whatever) operator that has some other meaning in general Python, and then overload it in numeric code: This was the approach taken by PEP 211, which proposed defining @ to be the equivalent of itertools.product. The problem with this is that when taken on its own terms, it's pretty clear that itertools.product doesn't actually need a dedicated operator. It hasn't even been deemed worthy of a builtin. (During discussions of this PEP, a similar suggestion was made to define @ as a general purpose function composition operator, and this suffers from the same problem; functools.compose isn't even useful enough to exist.) Matrix multiplication has a uniquely strong rationale for inclusion as an infix operator. There almost certainly don't exist any other binary operations that will ever justify adding any other infix operators to Python.
Add a .dot method to array types so as to allow "pseudo-infix" A.dot(B) syntax: This has been in numpy for some years, and in many cases it's better than dot(A, B). But it's still much less readable than real infix notation, and in particular still suffers from an extreme overabundance of parentheses. See Why should matrix multiplication be infix? above.
Use a 'with' block to toggle the meaning of * within a single code block: E.g., numpy could define a special context object so that we'd have:
c = a * b # element-wise multiplication
with numpy.mul_as_dot:
    c = a * b # matrix multiplication
However, this has two serious problems: first, it requires that every array-like type's __mul__ method know how to check some global state (numpy.mul_is_currently_dot or whatever). This is fine if a and b are numpy objects, but the world contains many non-numpy array-like objects. So this either requires non-local coupling -- every numpy competitor library has to import numpy and then check numpy.mul_is_currently_dot on every operation -- or else it breaks duck-typing, with the above code doing radically different things depending on whether a and b are numpy objects or some other sort of object. Second, and worse, with blocks are dynamically scoped, not lexically scoped; i.e., any function that gets called inside the with block will suddenly find itself executing inside the mul_as_dot world, and crash and burn horribly -- if you're lucky. So this is a construct that could only be used safely in rather limited cases (no function calls), and which would make it very easy to shoot yourself in the foot without warning.
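The dynamic-scoping hazard can be demonstrated with a few lines (mul_as_dot and library_helper are hypothetical names for illustration): any global toggle flipped by a with block is visible to every function called inside it, whether or not that function opted in.

```python
import contextlib

MUL_AS_DOT = False  # stand-in for numpy.mul_is_currently_dot

@contextlib.contextmanager
def mul_as_dot():
    global MUL_AS_DOT
    MUL_AS_DOT = True
    try:
        yield
    finally:
        MUL_AS_DOT = False

def library_helper():
    # This function never asked for dot semantics, but a with block
    # anywhere up the call stack changes what it sees.
    return MUL_AS_DOT

print(library_helper())       # False
with mul_as_dot():
    print(library_helper())   # True -- the toggle leaks into every callee
```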
Use a language preprocessor that adds extra numerically-oriented operators and perhaps other syntax: (As per recent BDFL suggestion: [1]) This suggestion seems based on the idea that numerical code needs a wide variety of syntax additions. In fact, given @, most numerical users don't need any other operators or syntax; it solves the one really painful problem that cannot be solved by other means, and that causes painful reverberations through the larger ecosystem. Defining a new language (presumably with its own parser which would have to be kept in sync with Python's, etc.), just to support a single binary operator, is neither practical nor desirable. In the numerical context, Python's competition is special-purpose numerical languages (Matlab, R, IDL, etc.). Compared to these, Python's killer feature is exactly that one can mix specialized numerical code with code for XML parsing, web page generation, database access, network programming, GUI libraries, and so forth, and we also gain major benefits from the huge variety of tutorials, reference material, introductory classes, etc., which use Python. Fragmenting "numerical Python" from "real Python" would be a major source of confusion. A major motivation for this PEP is to reduce fragmentation. Having to set up a preprocessor would be an especially prohibitive complication for unsophisticated users. And we use Python because we like Python! We don't want almost-but-not-quite-Python.
Use overloading hacks to define a "new infix operator" like *dot*, as in a well-known Python recipe: (See: [2]) Beautiful is better than ugly. This is... not beautiful. And not Pythonic. And especially unfriendly to beginners, who are just trying to wrap their heads around the idea that there's a coherent underlying system behind these magic incantations that they're learning, when along comes an evil hack like this that violates that system, creates bizarre error messages when accidentally misused, and whose underlying mechanisms can't be understood without deep knowledge of how object oriented systems work.
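For readers who have not seen it, the recipe works roughly along these lines (a condensed sketch, not the exact recipe code): a *dot* b parses as (a * dot) * b, and the Infix object intercepts both multiplications.

```python
class Infix:
    """The 'infix operator' hack: a *dot* b ends up calling fn(a, b)."""
    def __init__(self, fn):
        self.fn = fn

    def __rmul__(self, left):
        # Handles "a * dot": capture the left operand in a closure.
        return Infix(lambda right: self.fn(left, right))

    def __mul__(self, right):
        # Handles "... * b": apply the captured function.
        return self.fn(right)

dot = Infix(lambda a, b: sum(x * y for x, y in zip(a, b)))
print([1, 2, 3] *dot* [4, 5, 6])  # 32
```

Note how the mechanism only works because list.__mul__ returns NotImplemented for non-integer operands, silently falling back to __rmul__; misuse (e.g., writing a *dot b) produces errors that are baffling unless you already understand this machinery.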
Use a special "facade" type to support syntax like arr.M * arr: This is very similar to the previous proposal, in that the .M attribute would basically return the same object as arr *dot* would, and thus suffers the same objections about 'magicalness'. This approach also has some non-obvious complexities: for example, while arr.M * arr must return an array, arr.M * arr.M and arr * arr.M must return facade objects, or else arr.M * arr.M * arr and arr * arr.M * arr will not work. But this means that facade objects must be able to recognize both other array objects and other facade objects (which creates additional complexity for writing interoperating array types from different libraries, which must now recognize both each other's array types and their facade types). It also creates pitfalls for users who may easily type arr * arr.M or arr.M * arr.M and expect to get back an array object; instead, they will get a mysterious object that throws errors when they attempt to use it. Basically with this approach users must be careful to think of .M* as an indivisible unit that acts as an infix operator -- and as infix-operator-like token strings go, at least *dot* is prettier looking (look at its cute little ears!).
Discussions of this PEP
Collected here for reference:
- Github pull request containing much of the original discussion and drafting: https://github.com/numpy/numpy/pull/4351
- sympy mailing list discussions of an early draft:
- sage-devel mailing list discussions of an early draft: https://groups.google.com/forum/#!topic/sage-devel/YxEktGu8DeM
- 13-Mar-2014 python-ideas thread: https://mail.python.org/pipermail/python-ideas/2014-March/027053.html
- numpy-discussion thread on whether to keep @@: http://mail.scipy.org/pipermail/numpy-discussion/2014-March/069448.html
- numpy-discussion threads on precedence/associativity of @:
  - http://mail.scipy.org/pipermail/numpy-discussion/2014-March/069444.html
  - http://mail.scipy.org/pipermail/numpy-discussion/2014-March/069605.html
References
| [1] | From a comment by GvR on a G+ post by GvR; the comment itself does not seem to be directly linkable: https://plus.google.com/115212051037621986145/posts/hZVVtJ9bK3u |
| [2] | http://code.activestate.com/recipes/384122-infix-operators/ http://www.sagemath.org/doc/reference/misc/sage/misc/decorators.html#sage.misc.decorators.infix_operator |
| [3] | http://conference.scipy.org/past.html |
| [4] | http://pydata.org/events/ |
| [5] | In this formula, β is a vector or matrix of regression coefficients, V is the estimated variance/covariance matrix for these coefficients, and we want to test the null hypothesis that Hβ = r; a large S then indicates that this hypothesis is unlikely to be true. For example, in an analysis of human height, the vector β might contain one value which was the average height of the measured men, and another value which was the average height of the measured women, and then setting H = [1, − 1], r = 0 would let us test whether men and women are the same height on average. Compare to eq. 2.139 in http://sfb649.wiwi.hu-berlin.de/fedc_homepage/xplore/tutorials/xegbohtmlnode17.html Example code is adapted from https://github.com/rerpy/rerpy/blob/0d274f85e14c3b1625acb22aed1efa85d122ecb7/rerpy/incremental_ls.py#L202 |
| [6] | Out of the 36 tutorials scheduled for PyCon 2014 (https://us.pycon.org/2014/schedule/tutorials/), we guess that the 8 below will almost certainly deal with matrices:
In addition, the following tutorials could easily involve matrices:
This gives an estimated range of 8 to 12 / 36 = 22% to 33% of tutorials dealing with matrices; saying ~20% then gives us some wiggle room in case our estimates are high. |
| [7] | SLOCs were defined as physical lines which contain at least one token that is not a COMMENT, NEWLINE, ENCODING, INDENT, or DEDENT. Counts were made by using tokenize module from Python 3.2.3 to examine the tokens in all files ending .py underneath some directory. Only tokens which occur at least once in the source trees are included in the table. The counting script is available in the PEP repository. Matrix multiply counts were estimated by counting how often certain tokens which are used as matrix multiply function names occurred in each package. This creates a small number of false positives for scikit-learn, because we also count instances of the wrappers around dot that this package uses, and so there are a few dozen tokens which actually occur in import or def statements. All counts were made using the latest development version of each project as of 21 Feb 2014. 'stdlib' is the contents of the Lib/ directory in commit d6aa3fa646e2 to the cpython hg repository, and treats the following tokens as indicating matrix multiply: n/a. 'scikit-learn' is the contents of the sklearn/ directory in commit 69b71623273ccfc1181ea83d8fb9e05ae96f57c7 to the scikit-learn repository (https://github.com/scikit-learn/scikit-learn), and treats the following tokens as indicating matrix multiply: dot, fast_dot, safe_sparse_dot. 'nipy' is the contents of the nipy/ directory in commit 5419911e99546401b5a13bd8ccc3ad97f0d31037 to the nipy repository (https://github.com/nipy/nipy/), and treats the following tokens as indicating matrix multiply: dot. |
| [8] | BLAS libraries have a habit of secretly spawning threads, even when used from single-threaded programs. And threads play very poorly with fork(); the usual symptom is that attempting to perform linear algebra in a child process causes an immediate deadlock. |
| [9] | http://fperez.org/py4science/numpy-pep225/numpy-pep225.html |
| [10] | http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html |
| [11] | http://mail.scipy.org/pipermail/scipy-user/2014-February/035499.html |
| [12] | Counts were produced by manually entering the string "import foo" or "from foo import" (with quotes) into the Github code search page, e.g.: https://github.com/search?q=%22import+numpy%22&ref=simplesearch&type=Code on 2014-04-10 at ~21:00 UTC. The reported values are the numbers given in the "Languages" box on the lower-left corner, next to "Python". This also causes some undercounting (e.g., leaving out Cython code, and possibly one should also count HTML docs and so forth), but these effects are negligible (e.g., only ~1% of numpy usage appears to occur in Cython code, and probably even less for the other modules listed). The use of this box is crucial, however, because these counts appear to be stable, while the "overall" counts listed at the top of the page ("We've found ___ code results") are highly variable even for a single search -- simply reloading the page can cause this number to vary by a factor of 2 (!!). (They do seem to settle down if one reloads the page repeatedly, but nonetheless this is spooky enough that it seemed better to avoid these numbers.) These numbers should of course be taken with multiple grains of salt; it's not clear how representative Github is of Python code in general, and limitations of the search tool make it impossible to get precise counts. AFAIK this is the best data set currently available, but it'd be nice if it were better. In particular:
Also, it's possible there exist other non-stdlib modules we didn't think to test that are even more-imported than numpy -- though we tried quite a few of the obvious suspects. If you find one, let us know! The modules tested here were chosen based on a combination of intuition and the top-100 list at pypi-ranking.info. Fortunately, it doesn't really matter if it turns out that numpy is, say, merely the third most-imported non-stdlib module, since the point is just that numeric programming is a common and mainstream activity. Finally, we should point out the obvious: whether a package is import**ed** is rather different from whether it's import**ant**. No-one's claiming numpy is "the most important package" or anything like that. Certainly more packages depend on distutils, e.g., than depend on numpy -- and far fewer source files import distutils than import numpy. But this is fine for our present purposes. Most source files don't import distutils because most source files don't care how they're distributed, so long as they are; these source files thus don't care about details of how distutils' API works. This PEP is in some sense about changing how numpy's and related packages' APIs work, so the relevant metric is to look at source files that are choosing to directly interact with that API, which is sort of like what we get by looking at import statements. |
| [13] | The first such proposal occurs in Jim Hugunin's very first email to the matrix SIG in 1995, which lays out the first draft of what became Numeric. He suggests using * for elementwise multiplication, and % for matrix multiplication: https://mail.python.org/pipermail/matrix-sig/1995-August/000002.html |
| [14] | http://mail.scipy.org/pipermail/numpy-discussion/2014-March/069502.html |
| [15] | http://mail.scipy.org/pipermail/numpy-discussion/2014-March/069444.html http://mail.scipy.org/pipermail/numpy-discussion/2014-March/069605.html |
| [16] | http://mail.scipy.org/pipermail/numpy-discussion/2014-March/069610.html |
| [17] | http://mail.scipy.org/pipermail/numpy-discussion/2014-March/069578.html |
| [18] | http://mail.scipy.org/pipermail/numpy-discussion/2014-March/069530.html |
| [19] | https://en.wikipedia.org/wiki/Matrix_multiplication |
| [20] | https://en.wikipedia.org/wiki/Empty_product |
Copyright
This document has been placed in the public domain.
pep-0466 Network Security Enhancements for Python 2.7.x
| PEP: | 466 |
|---|---|
| Title: | Network Security Enhancements for Python 2.7.x |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Nick Coghlan <ncoghlan at gmail.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 23-Mar-2014 |
| Python-Version: | 2.7.9 |
| Post-History: | 23-Mar-2014, 24-Mar-2014, 25-Mar-2014, 26-Mar-2014, 16-Apr-2014 |
| Resolution: | https://mail.python.org/pipermail/python-dev/2014-April/134163.html |
Contents
- Abstract
- New security related features in Python 2.7 maintenance releases
- Implementation status
- Backwards compatibility considerations
- Other Considerations
- Motivation and Rationale
- Why these particular changes?
- Rejected alternative: just advise developers to migrate to Python 3
- Rejected alternative: create and release Python 2.8
- Rejected alternative: distribute the security enhancements via PyPI
- Rejected variant: provide a "legacy SSL infrastructure" branch
- Rejected variant: synchronise particular modules entirely with Python 3
- Rejected variant: open ended backport policy
- Disclosure of Interest
- Acknowledgements
- References
- Copyright
Abstract
Most CPython tracker issues are classified as errors in behaviour or proposed enhancements. Most patches to fix behavioural errors are applied to all active maintenance branches. Enhancement patches are restricted to the default branch that becomes the next Python version.
This cadence works reasonably well during Python's normal 18-24 month feature release cycle, which is still applicable to the Python 3 series. However, the age of the standard library in Python 2 has now reached a point where it is sufficiently far behind the state of the art in network security protocols for it to be causing real problems in use cases where upgrading to Python 3 in the near term may not be feasible.
In recognition of the additional practical considerations that have arisen during the 4+ year maintenance cycle for Python 2.7, this PEP allows a critical set of network security related features to be backported from Python 3.4 to upcoming Python 2.7.x maintenance releases.
While this PEP does not make any changes to the core development team's handling of security-fix-only branches that are no longer in active maintenance, it does recommend that commercial redistributors providing extended support periods for the Python standard library either backport these features to their supported versions, or else explicitly disclaim support for the use of older versions in roles that involve connecting directly to the public internet.
Implementation status
This PEP originally proposed adding all listed features to the Python 2.7.7 maintenance release. That approach proved to be too ambitious given the limited time frame between the original creation and acceptance of the PEP and the release of Python 2.7.7rc1. Instead, the progress of each individual accepted feature backport is being tracked as an independent enhancement targeting Python 2.7.
Implemented for Python 2.7.7:
- Issue #21306 [9]: backport hmac.compare_digest
- Issue #21462 [10]: upgrade OpenSSL in the Python 2.7 Windows installers
Implemented for Python 2.7.8:
- Issue #21304 [11]: backport hashlib.pbkdf2
Implemented for Python 2.7.9 (in development):
- Issue #21308 [12]: backport specified ssl module features
- Issue #21307 [13]: backport remaining specified hashlib module features
- Issue #21305 [14]: backport os.urandom shared file descriptor change
Backwards compatibility considerations
As in the Python 3 series, the backported ssl.create_default_context() API is granted a backwards compatibility exemption that permits the protocol, options, cipher and other settings of the created SSL context to be updated in maintenance releases to use higher default security settings. This allows them to appropriately balance compatibility and security at the time of the maintenance release, rather than at the time of the original feature release.
This PEP does not grant any other exemptions to the usual backwards compatibility policy for maintenance releases. Instead, by explicitly encouraging the use of feature based checks, it is designed to make it easier to write more secure cross-version compatible Python software, while still limiting the risk of breaking currently working software when upgrading to a new Python 2.7 maintenance release.
In all cases where this proposal allows new features to be backported to the Python 2.7 release series, it is possible to write cross-version compatible code that operates by "feature detection" (for example, checking for particular attributes in a module), without needing to explicitly check the Python version.
It is then up to library and framework code to provide an appropriate warning and fallback behaviour if a desired feature is found to be missing. While some especially security sensitive software MAY fail outright if a desired security feature is unavailable, most software SHOULD instead emit a warning and continue operating using a slightly degraded security configuration.
The backported APIs allow library and application code to perform the following actions after detecting the presence of a relevant network security related feature:
- explicitly opt in to more secure settings (to allow the use of enhanced security features in older maintenance releases of Python with less secure default behaviour)
- explicitly opt in to less secure settings (to allow the use of newer Python feature releases in lower security environments)
- determine the default setting for the feature (this MAY require explicit Python version checks to determine the Python feature release, but DOES NOT require checking for a specific maintenance release)
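The feature-based check described above can be sketched as follows (secure_context is a hypothetical helper name, and the fallback policy shown is just one example of "warn and degrade"):

```python
import ssl
import warnings

def secure_context():
    # Feature detection: test for the backported API itself rather
    # than comparing Python version numbers.
    if hasattr(ssl, "create_default_context"):
        return ssl.create_default_context()
    # Degraded fallback for older releases without the backport:
    # emit a warning and continue with a less secure configuration,
    # rather than failing outright.
    warnings.warn("ssl.create_default_context unavailable; "
                  "using a less secure default configuration")
    return ssl.SSLContext(ssl.PROTOCOL_SSLv23)
```

Especially security sensitive software might instead raise an exception in the fallback branch.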
Security related changes to other modules (such as higher level networking libraries and data format processing libraries) will continue to be made available as backports and new modules on the Python Package Index, as independent distribution remains the preferred approach to handling software that must continue to evolve to handle changing development requirements independently of the Python 2 standard library. Refer to the Motivation and Rationale section for a review of the characteristics that make the secure networking infrastructure worthy of special consideration.
OpenSSL compatibility
Under this proposal, OpenSSL may be upgraded to more recent feature releases in Python 2.7 maintenance releases. On Linux and most other POSIX systems, the specific version of OpenSSL used already varies, as CPython dynamically links to the system provided OpenSSL library by default.
For the Windows binary installers, the _ssl and _hashlib modules are statically linked with OpenSSL and the associated symbols are not exported. Marc-Andre Lemburg indicates that updating to newer OpenSSL releases in the egenix-pyopenssl binaries has not resulted in any reported compatibility issues [3].
The Mac OS X binary installers historically followed the same policy as other POSIX installations and dynamically linked to the Apple provided OpenSSL libraries. However, Apple has now ceased updating these cross-platform libraries, instead requiring that even cross-platform developers adopt Mac OS X specific interfaces to access up to date security infrastructure on their platform. Accordingly, and independently of this PEP, the Mac OS X binary installers were already going to be switched to statically linking newer versions of OpenSSL [4].
Other Considerations
Maintainability
A number of developers, including Alex Gaynor and Donald Stufft, have expressed interest in carrying out the feature backports covered by this policy, and assisting with any additional maintenance burdens that arise in the Python 2 series as a result.
Steve Dower and Brian Curtin have offered to help with the creation of the Windows installers, allowing Martin von Löwis the opportunity to step back from the task of maintaining the 2.7 Windows installer.
This PEP is primarily about establishing the consensus needed to allow them to carry out this work. For other core developers, this policy change shouldn't impose any additional effort beyond potentially reviewing the resulting patches for those developers specifically interested in the affected modules.
Security releases
This PEP does not propose any changes to the handling of security releases - those will continue to be source only releases that include only critical security fixes.
However, the recommendations for library and application developers are deliberately designed to accommodate commercial redistributors that choose to apply these changes to additional Python release series that are either in security fix only mode, or have been declared "end of life" by the core development team.
Whether or not redistributors choose to exercise that option will be up to the individual redistributor.
Integration testing
Third party integration testing services should offer users the ability to test against multiple Python 2.7 maintenance releases (at least 2.7.6 and 2.7.7+), to ensure that libraries, frameworks and applications can still test their handling of the legacy security infrastructure correctly (either failing or degrading gracefully, depending on the security sensitivity of the software), even after the features covered in this proposal have been backported to the Python 2.7 series.
Handling lower security environments with low risk tolerance
For better or for worse (mostly worse), there are some environments where the risk of latent security defects is more tolerated than even a slightly increased risk of regressions in maintenance releases. This proposal largely excludes these environments from consideration where the modules covered by the exemption are concerned - this approach is entirely inappropriate for software connected to the public internet, and defence in depth security principles suggest that it is not appropriate for most private networks either.
Downstream redistributors may still choose to cater to such environments, but they will need to handle the process of downgrading the security related modules and doing the associated regression testing themselves. The main CPython continuous integration infrastructure will not cover this scenario.
Motivation and Rationale
The creation of this PEP was prompted primarily by the aging SSL support in the Python 2 series. As of March 2014, the Python 2.7 SSL module is approaching four years of age, and the SSL support in the still popular Python 2.6 release had its feature set locked six years ago.
These are simply too old to provide a foundation that can be recommended in good conscience for secure networking software that operates over the public internet, especially in an era where it is becoming quite clearly evident that advanced persistent security threats are even more widespread and more indiscriminate in their targeting than had previously been understood. While they represented reasonable security infrastructure in their time, the state of the art has moved on, and we need to investigate mechanisms for effectively providing more up to date network security infrastructure for users that, for whatever reason, are not currently in a position to migrate to Python 3.
While the use of the system OpenSSL installation addresses many of these concerns on Linux platforms, it doesn't address all of them (in particular, it is still difficult for software to explicitly require some higher level security settings). The standard library support can be bypassed by using a third party library like PyOpenSSL or Pycurl, but this still results in a security problem, as these can be difficult dependencies to deploy, and many users will remain unaware that they might want them. Rather than explaining to potentially naive users how to obtain and use these libraries, it seems better to just fix the included batteries.
In the case of the binary installers for Windows and Mac OS X that are published on python.org, the version of OpenSSL used is entirely within the control of the Python core development team, but is currently limited to OpenSSL maintenance releases for the version initially shipped with the corresponding Python feature release.
With increased popularity comes increased responsibility, and this proposal aims to acknowledge the fact that Python's popularity and adoption is at a sufficiently high level that some of our design and policy decisions have significant implications beyond the Python development community.
As one example, the Python 2 ssl module does not support the Server Name Indication standard. While it is possible to obtain SNI support by using the third party requests client library, actually doing so currently requires using not only requests and its embedded dependencies, but also half a dozen or more additional libraries. The lack of support in the Python 2 series thus serves as an impediment to making effective use of SNI on servers, as Python 2 clients will frequently fail to handle it correctly.
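For reference, the modern configuration entry point provided by the ssl module this PEP proposes to backport can be exercised without any network access. The sketch below is written in Python 3 syntax; ssl.create_default_context() is part of the Python 3.4 feature set under discussion:

```python
import ssl

# create_default_context() returns an SSLContext with certificate
# verification and hostname checking already enabled.
context = ssl.create_default_context()
assert context.check_hostname
assert context.verify_mode == ssl.CERT_REQUIRED

# Passing server_hostname to context.wrap_socket(sock, ...) is what
# sends the SNI extension during the TLS handshake; no connection is
# attempted in this sketch.
```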
Another more critical example is the lack of SSL hostname matching in the Python 2 standard library - it is currently necessary to rely on a third party library, such as requests or backports.ssl_match_hostname to obtain that functionality in Python 2.
The Python 2 series also remains more vulnerable to remote timing attacks on security sensitive comparisons than the Python 3 series, as it lacks a standard library equivalent to the timing attack resistant hmac.compare_digest() function. While appropriate secure comparison functions can be implemented in third party extensions, many users don't even consider the issue and use ordinary equality comparisons instead - while a standard library solution doesn't automatically fix that problem, it does make the barrier to resolution much lower once the problem is pointed out.
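The difference is straightforward to demonstrate with the Python 3 function this PEP proposes to backport:

```python
import hmac

# An ordinary '==' comparison can return as soon as the first
# differing byte is found, leaking timing information about how much
# of a secret an attacker has guessed correctly. hmac.compare_digest
# takes time independent of where the inputs differ.
expected = b"0123456789abcdef"

assert hmac.compare_digest(expected, b"0123456789abcdef")
assert not hmac.compare_digest(expected, b"0123456789abcdee")
```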
Python 2.7 represents the only long term maintenance release the core development team has provided, and it is natural that there will be things that worked over a historically shorter maintenance lifespan that don't work over this longer support period. In the specific case of the problem described in this PEP, the simplest available solution is to acknowledge that long term maintenance of network security related modules requires the ability to add new features, even while retaining backwards compatibility for existing interfaces.
For those familiar with it, it is worth comparing the approach described in this PEP with Red Hat's handling of its long term open source support commitments: it isn't the RHEL 6.0 release itself that receives 10 years worth of support, but the overall RHEL 6 series. The individual RHEL 6.x point releases within the series then receive a wide variety of new features, including security enhancements, all while meeting strict backwards compatibility guarantees for existing software. The proposal covered in this PEP brings our approach to long term maintenance more into line with this precedent - we retain our strict backwards compatibility requirements, but make an exception to the restriction against adding new features.
To date, downstream redistributors have respected our upstream policy of "no new features in Python maintenance releases". This PEP explicitly accepts that a more nuanced policy is appropriate in the case of network security related features, and the specific change it describes is deliberately designed such that it is potentially suitable for Red Hat Enterprise Linux and its downstream derivatives.
Why these particular changes?
The key requirement for a feature to be considered for inclusion in this proposal was that it must have security implications beyond the specific application that is written in Python and the system that application is running on. Thus the focus on network security protocols, password storage and related cryptographic infrastructure - Python is a popular choice for the development of web services and clients, and thus the capabilities of widely used Python versions have implications for the security design of other services that may themselves be using newer versions of Python or other development languages, but need to interoperate with clients or servers written using older versions of Python.
The intent behind this requirement was to minimise any impact that the introduction of this policy may have on the stability and compatibility of maintenance releases, while still addressing some key security concerns relating to the particular aspects of Python 2.7. It would be thoroughly counterproductive if end users became as cautious about updating to new Python 2.7 maintenance releases as they are about updating to new feature releases within the same release series.
The ssl module changes are included in this proposal to bring the Python 2 series up to date with the past 4 years of evolution in network security standards, and make it easier for those standards to be broadly adopted in both servers and clients. Similarly the hash algorithm availability indicators in hashlib are included to make it easier for applications to detect and employ appropriate hash definitions across both Python 2 and 3.
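The indicators in question are the hashlib.algorithms_guaranteed and hashlib.algorithms_available sets (present since Python 3.2, and part of what this PEP backports). A sketch of cross-version feature detection built on them; the best_hash helper is illustrative, not part of any API:

```python
import hashlib

# algorithms_guaranteed: always present on every platform.
# algorithms_available: everything usable in this interpreter,
# including algorithms exposed by the linked OpenSSL build.
assert hashlib.algorithms_guaranteed <= hashlib.algorithms_available
assert "sha256" in hashlib.algorithms_guaranteed

# Feature detection then becomes a simple set lookup:
def best_hash(preferred=("sha512", "sha256", "sha1")):
    for name in preferred:
        if name in hashlib.algorithms_available:
            return hashlib.new(name)
    raise RuntimeError("no acceptable hash algorithm available")

assert best_hash().name in ("sha512", "sha256", "sha1")
```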
The hmac.compare_digest() and hashlib.pbkdf2_hmac() functions are included to help lower the barriers to secure password storage and checking in Python 2 server applications.
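A sketch of the password storage pattern these two functions enable. The helper names and parameter choices (SHA-256, a 16-byte salt, 100000 rounds) are illustrative, not recommendations from the PEP:

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None, rounds=100000):
    # Derive a slow, salted digest from the password; os.urandom
    # supplies the salt from the operating system's CSPRNG.
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, rounds)
    return salt, digest

def check_password(password, salt, expected, rounds=100000):
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, rounds)
    # Constant-time comparison avoids leaking timing information.
    return hmac.compare_digest(digest, expected)

salt, digest = hash_password("hunter2")
assert check_password("hunter2", salt, digest)
assert not check_password("hunter3", salt, digest)
```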
The os.urandom() change has been included in this proposal to further encourage users to leave the task of providing high quality random numbers for cryptographic use cases to operating system vendors. The use of insufficiently random numbers has the potential to compromise any cryptographic system, and operating system developers have more tools available to address that problem adequately than the typical Python application runtime.
Rejected alternative: just advise developers to migrate to Python 3
This alternative represents the status quo. Unfortunately, it has proven to be unworkable in practice, as the backwards compatibility implications mean that this is a non-trivial migration process for large applications and integration projects. While the tools for migration have evolved to a point where it is possible to migrate even large applications opportunistically and incrementally (rather than all at once) by updating code to run in the large common subset of Python 2 and Python 3, using the most recent technology often isn't a priority in commercial environments.
Previously, this was considered an acceptable harm, as while it was an unfortunate problem for the affected developers to have to face, it was seen as an issue between them and their management chain to make the case for infrastructure modernisation, and this case would become naturally more compelling as the Python 3 series evolved.
However, now that we're fully aware of the impact the limitations of the Python 2 standard library may be having on the evolution of internet security standards, I no longer believe that it is reasonable to expect platform and application developers to resolve all of the latent defects in an application's Unicode correctness solely in order to gain access to the network security enhancements already available in Python 3.
While Ubuntu (and to some extent Debian as well) are committed to porting all default system services and scripts to Python 3, and to removing Python 2 from its default distribution images (but not from its archives), this is a mammoth task and won't be completed for the Ubuntu 14.04 LTS release (at least for the desktop image - it may be achieved for the mobile and server images).
Fedora has even more work to do to migrate, and it will take a non-trivial amount of time to migrate the relevant infrastructure components. While Red Hat are also actively working to make it easier for users to use more recent versions of Python on our stable platforms, it's going to take time for those efforts to start having an impact on end users' choice of version, and any such changes also don't benefit the core platform infrastructure that runs in the integrated system Python by necessity.
The OpenStack migration to Python 3 is also still in its infancy, and even though that's a project with an extensive and relatively robust automated test suite, it's still large enough that it is going to take quite some time to migrate fully to a Python 2/3 compatible code base.
And that's just three of the highest profile open source projects that make heavy use of Python. Given the likely existence of large amounts of legacy code that lacks the kind of automated regression test suite needed to help support a migration from Python 2 to Python 3, there are likely to be many cases where reimplementation (perhaps even in Python 3) proves easier than migration. The key point of this PEP is that those situations affect more people than just the developers and users of the affected application: the existence of clients and servers with outdated network security infrastructure becomes something that developers of secure networked services need to take into account as part of their security design, and that's a problem that inhibits the adoption of better security standards.
As Terry Reedy noted, if we try to persist with the status quo, the likely outcome is that commercial redistributors will attempt to do something like this on behalf of their customers anyway, but in a potentially inconsistent and ad hoc manner. By drawing the scope definition process into the upstream project we are in a better position to influence the approach taken to address the situation and to help ensure some consistency across redistributors.
The problem is real, so something needs to change, and this PEP describes my preferred approach to addressing the situation.
Rejected alternative: create and release Python 2.8
With sufficient corporate support, it likely would be possible to create and release Python 2.8 (it's highly unlikely such a project would garner enough interest to be achievable with only volunteers). However, this wouldn't actually solve the problem, as the aim is to provide a relatively low impact way to incorporate enhanced security features into integrated products and deployments that make use of Python 2.
Upgrading to a new Python feature release would mean both more work for the core development team, as well as a more disruptive update that most potential end users would likely just skip entirely.
Attempting to create a Python 2.8 release would also bring in suggestions to backport many additional features from Python 3 (such as tracemalloc and the improved coroutine support), making the migration from Python 2.7 to this hypothetical 2.8 release even riskier and more disruptive.
This is not a recommended approach, as it would involve substantial additional work for a result that is actually less effective in achieving the original aim (which is to eliminate the current widespread use of the aging network security infrastructure in the Python 2 series).
Furthermore, while I can't make any commitments to actually addressing this issue on Red Hat platforms, I can categorically rule out the idea of a Python 2.8 being of any use to me in even attempting to get it addressed.
Rejected alternative: distribute the security enhancements via PyPI
While this initially appears to be an attractive and easier to manage approach, it actually suffers from several significant problems.
Firstly, this is complex, low level, cross-platform code that integrates with the underlying operating system across a variety of POSIX platforms (including Mac OS X) and Windows. The CPython BuildBot fleet is already set up to handle continuous integration in that context, but most of the freely available continuous integration services just offer Linux, and perhaps paid access to Windows. Those services work reasonably well for software that largely runs on the abstraction layers offered by Python and other dynamic languages, as well as the more comprehensive abstraction offered by the JVM, but won't suffice for the kind of code involved here.
The OpenSSL dependency for the network security support also qualifies as the kind of "complex binary dependency" that isn't yet handled well by the pip based software distribution ecosystem. Relying on a third party binary dependency also creates potential compatibility problems for pip when running on other interpreters like PyPy.
Another practical problem with the idea is the fact that pip itself relies on the ssl support in the standard library (with some additional support from a bundled copy of requests, which in turn bundles backports.ssl_match_hostname), and hence would require any replacement module to also be bundled within pip. This wouldn't pose any insurmountable difficulties (it's just another dependency to vendor), but it would mean yet another copy of OpenSSL to keep up to date.
This approach also has the same flaw as all other "improve security by renaming things" approaches: they completely miss the users who most need help, and raise significant barriers against being able to encourage users to do the right thing when their infrastructure supports it (since "use this other module" is a much higher impact change than "turn on this higher security setting"). Deprecating the aging SSL infrastructure in the standard library in favour of an external module would be even more user hostile than accepting the slightly increased risk of regressions associated with upgrading it in place.
Last, but certainly not least, this approach suffers from the same problem as the idea of doing a Python 2.8 release: likely not solving the actual problem. Commercial redistributors of Python are set up to redistribute Python, and a pre-existing set of additional packages. Getting new packages added to the pre-existing set can be done, but means approaching each and every redistributor and asking them to update their repackaging process accordingly. By contrast, the approach described in this PEP would require redistributors to deliberately opt out of the security enhancements by deliberately downgrading the provided network security infrastructure, which most of them are unlikely to do.
Rejected variant: provide a "legacy SSL infrastructure" branch
Earlier versions of this PEP included the concept of a 2.7-legacy-ssl branch that preserved the exact feature set of the Python 2.7.6 network security infrastructure.
In my opinion, anyone that actually wants this is almost certainly making a mistake, and if they insist they really do want it in their specific situation, they're welcome to either make it themselves or arrange for a downstream redistributor to make it for them.
If they are made publicly available, any such rebuilds should be referred to as "Python 2.7 with Legacy SSL" to clearly distinguish them from the official Python 2.7 releases that include more up to date network security infrastructure.
After the first Python 2.7 maintenance release that implements this PEP, it would also be appropriate to refer to Python 2.7.6 and earlier releases as "Python 2.7 with Legacy SSL".
Rejected variant: synchronise particular modules entirely with Python 3
Earlier versions of this PEP suggested synchronising the hmac, hashlib and ssl modules entirely with their Python 3 counterparts.
This approach proved too vague to build a compelling case for the exception, and has thus been replaced by the current more explicit proposal.
Rejected variant: open ended backport policy
Earlier versions of this PEP suggested a general policy change related to future Python 3 enhancements that impact the general security of the internet.
That approach created unnecessary uncertainty, so it has been simplified to propose backporting a specific concrete set of changes. Future feature backport proposals can refer back to this PEP as precedent, but it will still be necessary to make a specific case for each feature addition to the Python 2.7 long term support release.
Disclosure of Interest
The author of this PEP currently works for Red Hat on test automation tools. If this proposal is accepted, I will be strongly encouraging Red Hat to take advantage of the resulting opportunity to help improve the overall security of the Python ecosystem. However, I do not speak for Red Hat in this matter, and cannot make any commitments on Red Hat's behalf.
Acknowledgements
Thanks to Christian Heimes and others for their efforts in greatly improving Python's SSL support in the Python 3 series, and a variety of members of the Python community for helping me to better understand the implications of the default settings we provide in our SSL modules, and the impact that tolerating the use of SSL infrastructure that was defined in 2010 (Python 2.7) or even 2008 (Python 2.6) potentially has for the security of the web as a whole.
Thanks to Donald Stufft and Alex Gaynor for identifying a more limited set of essential security features that allowed the proposal to be made more fine-grained than backporting entire modules from Python 3.4 ([7], [8]).
Christian and Donald also provided valuable feedback on a preliminary draft of this proposal.
Thanks also to participants in the python-dev mailing list threads ([1], [2], [5], [6]), as well as the various folks I discussed this issue with at PyCon 2014 in Montreal.
References
| [1] | PEP 466 discussion (round 1) (https://mail.python.org/pipermail/python-dev/2014-March/133334.html) |
| [2] | PEP 466 discussion (round 2) (https://mail.python.org/pipermail/python-dev/2014-March/133389.html) |
| [3] | Marc-Andre Lemburg's OpenSSL feedback for Windows (https://mail.python.org/pipermail/python-dev/2014-March/133438.html) |
| [4] | Ned Deily's OpenSSL feedback for Mac OS X (https://mail.python.org/pipermail/python-dev/2014-March/133347.html) |
| [5] | PEP 466 discussion (round 3) (https://mail.python.org/pipermail/python-dev/2014-March/133442.html) |
| [6] | PEP 466 discussion (round 4) (https://mail.python.org/pipermail/python-dev/2014-March/133472.html) |
| [7] | Donald Stufft's recommended set of backported features (https://mail.python.org/pipermail/python-dev/2014-March/133500.html) |
| [8] | Alex Gaynor's recommended set of backported features (https://mail.python.org/pipermail/python-dev/2014-March/133503.html) |
| [9] | http://bugs.python.org/issue21306 |
| [10] | http://bugs.python.org/issue21462 |
| [11] | http://bugs.python.org/issue21304 |
| [12] | http://bugs.python.org/issue21308 |
| [13] | http://bugs.python.org/issue21307 |
| [14] | http://bugs.python.org/issue21305 |
Copyright
This document has been placed in the public domain.
pep-0467 Minor API improvements for binary sequences
| PEP: | 467 |
|---|---|
| Title: | Minor API improvements for binary sequences |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Nick Coghlan <ncoghlan at gmail.com> |
| Status: | Draft |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 2014-03-30 |
| Python-Version: | 3.5 |
| Post-History: | 2014-03-30 2014-08-15 2014-08-16 |
Abstract
During the initial development of the Python 3 language specification, the core bytes type for arbitrary binary data started as the mutable type that is now referred to as bytearray. Other aspects of operating in the binary domain in Python have also evolved over the course of the Python 3 series.
This PEP proposes four small adjustments to the APIs of the bytes, bytearray and memoryview types to make it easier to operate entirely in the binary domain:
- Deprecate passing single integer values to bytes and bytearray
- Add bytes.zeros and bytearray.zeros alternative constructors
- Add bytes.byte and bytearray.byte alternative constructors
- Add bytes.iterbytes, bytearray.iterbytes and memoryview.iterbytes alternative iterators
Proposals
Deprecation of current "zero-initialised sequence" behaviour
Currently, the bytes and bytearray constructors accept an integer argument and interpret it as meaning to create a zero-initialised sequence of the given size:
>>> bytes(3)
b'\x00\x00\x00'
>>> bytearray(3)
bytearray(b'\x00\x00\x00')
This PEP proposes to deprecate that behaviour in Python 3.5, and remove it entirely in Python 3.6.
No other changes are proposed to the existing constructors.
Addition of explicit "zero-initialised sequence" constructors
To replace the deprecated behaviour, this PEP proposes the addition of an explicit zeros alternative constructor as a class method on both bytes and bytearray:
>>> bytes.zeros(3)
b'\x00\x00\x00'
>>> bytearray.zeros(3)
bytearray(b'\x00\x00\x00')
It will behave just as the current constructors behave when passed a single integer.
The specific choice of zeros as the alternative constructor name is taken from the corresponding initialisation function in NumPy (although, as these are 1-dimensional sequence types rather than N-dimensional matrices, the constructors take a length as input rather than a shape tuple).
Addition of explicit "single byte" constructors
As binary counterparts to the text chr function, this PEP proposes the addition of an explicit byte alternative constructor as a class method on both bytes and bytearray:
>>> bytes.byte(3)
b'\x03'
>>> bytearray.byte(3)
bytearray(b'\x03')
These methods will only accept integers in the range 0 to 255 (inclusive):
>>> bytes.byte(512)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: bytes must be in range(0, 256)
>>> bytes.byte(1.0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'float' object cannot be interpreted as an integer
The documentation of the ord builtin will be updated to explicitly note that bytes.byte is the inverse operation for binary data, while chr is the inverse operation for text data.
Behaviourally, bytes.byte(x) will be equivalent to the current bytes([x]) (and similarly for bytearray). The new spelling is expected to be easier to discover and easier to read (especially when used in conjunction with indexing operations on binary sequence types).
As a separate method, the new spelling will also work better with higher order functions like map.
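Since bytes.byte does not exist yet, the equivalence described above can be checked with the current bytes([x]) spelling; the lambda below is the workaround that a named class method would remove:

```python
# bytes([x]) is the existing spelling that the proposed bytes.byte(x)
# would replace; both produce a length-1 binary sequence.
assert bytes([3]) == b"\x03"
assert bytearray([3]) == bytearray(b"\x03")

# As a named class method, the proposed constructor would slot directly
# into map(); with the current spelling a lambda is needed instead:
data = [72, 105]
assert b"".join(map(lambda x: bytes([x]), data)) == b"Hi"
```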
Addition of optimised iterator methods that produce bytes objects
This PEP proposes that bytes, bytearray and memoryview gain an optimised iterbytes method that produces length 1 bytes objects rather than integers:
for x in data.iterbytes():
# x is a length 1 ``bytes`` object, rather than an integer
The method can be used with arbitrary buffer exporting objects by wrapping them in a memoryview instance first:
for x in memoryview(data).iterbytes():
# x is a length 1 ``bytes`` object, rather than an integer
For memoryview, the semantics of iterbytes() are defined such that:
memview.tobytes() == b''.join(memview.iterbytes())
This allows the raw bytes of the memory view to be iterated over without needing to make a copy, regardless of the defined shape and format.
The main advantage this method offers over the map(bytes.byte, data) approach is that it is guaranteed not to fail midstream with a ValueError or TypeError. By contrast, when using the map based approach, the type and value of the individual items in the iterable are only checked as they are retrieved and passed through the bytes.byte constructor.
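Pending the addition of the method, the defined semantics can be emulated on existing Python 3 versions. This sketch is illustrative only (it is not the proposed optimised C implementation) and relies on memoryview.cast to flatten arbitrary buffer shapes:

```python
def iterbytes(data):
    # Yield length-1 bytes objects rather than integers, matching the
    # semantics proposed for the iterbytes method.
    view = memoryview(data).cast("B")  # flatten to a 1-D byte view
    for i in range(len(view)):
        yield view[i:i + 1].tobytes()

assert list(iterbytes(b"abc")) == [b"a", b"b", b"c"]
assert b"".join(iterbytes(memoryview(b"xyz"))) == b"xyz"
assert list(iterbytes(bytearray(b"\x00\x01"))) == [b"\x00", b"\x01"]
```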
Design discussion
Why not rely on sequence repetition to create zero-initialised sequences?
Zero-initialised sequences can be created via sequence repetition:
>>> b'\x00' * 3
b'\x00\x00\x00'
>>> bytearray(b'\x00') * 3
bytearray(b'\x00\x00\x00')
However, this was also the case when the bytearray type was originally designed, and the decision was made to add explicit support for it in the type constructor. The immutable bytes type then inherited that feature when it was introduced in PEP 3137.
This PEP isn't revisiting that original design decision, just changing the spelling as users sometimes find the current behaviour of the binary sequence constructors surprising. In particular, there's a reasonable case to be made that bytes(x) (where x is an integer) should behave like the bytes.byte(x) proposal in this PEP. Providing both behaviours as separate class methods avoids that ambiguity.
References
| [1] | Initial March 2014 discussion thread on python-ideas (https://mail.python.org/pipermail/python-ideas/2014-March/027295.html) |
| [2] | Guido's initial feedback in that thread (https://mail.python.org/pipermail/python-ideas/2014-March/027376.html) |
| [3] | Issue proposing moving zero-initialised sequences to a dedicated API (http://bugs.python.org/issue20895) |
| [4] | Issue proposing to use calloc() for zero-initialised binary sequences (http://bugs.python.org/issue21644) |
| [5] | August 2014 discussion thread on python-dev (https://mail.python.org/pipermail/python-ideas/2014-March/027295.html) |
Copyright
This document has been placed in the public domain.
pep-0468 Preserving the order of **kwargs in a function.
| PEP: | 468 |
|---|---|
| Title: | Preserving the order of **kwargs in a function. |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Eric Snow <ericsnowcurrently at gmail.com> |
| Discussions-To: | python-ideas at python.org |
| Status: | Draft |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 5-Apr-2014 |
| Python-Version: | 3.5 |
| Post-History: | 5-Apr-2014 |
| Resolution: |
Contents
Abstract
The **kwargs syntax in a function definition indicates that the interpreter should collect all keyword arguments that do not correspond to other named parameters. However, Python does not preserve the order in which those collected keyword arguments were passed to the function. In some contexts the order matters. This PEP introduces a mechanism by which the passed order of collected keyword arguments will now be preserved.
Motivation
Python's **kwargs syntax in function definitions provides a powerful means of dynamically handling keyword arguments. In some applications of the syntax (see Use Cases), the semantics applied to the collected keyword arguments requires that order be preserved. Unsurprisingly, this is similar to how OrderedDict is related to dict.
Currently, to preserve the order you have to do so manually and separately from the actual function call. This involves building an ordered mapping, whether an OrderedDict or an iterable of 2-tuples, which is then passed as a single argument to the function. [1]
With the capability described in this PEP, that boilerplate would no longer be required.
For comparison, currently:
kwargs = OrderedDict()
kwargs['eggs'] = ...
...
def spam(a, kwargs):
...
and with this proposal:
def spam(a, **kwargs):
...
Nick Coghlan, speaking of some of the use cases, summed it up well [2]:
These *can* all be done today, but *not* by using keyword arguments. In my view, the problem to be addressed is that keyword arguments *look* like they should work for these cases, because they have a definite order in the source code. The only reason they don't work is because the interpreter throws that ordering information away. It's a textbook case of a language feature becoming an attractive nuisance in some circumstances: the simple and obvious solution for the above use cases *doesn't actually work* for reasons that aren't obviously clear if you don't have a firm grasp of Python's admittedly complicated argument handling.
This observation is supported by the appearance of this proposal over the years and the numerous times that people have been confused by the constructor for OrderedDict. [3] [4] [5]
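That confusion is easy to reproduce: only the tuple-based spelling of the OrderedDict constructor was guaranteed to preserve order before **kwargs itself became ordered. A minimal illustration:

```python
from collections import OrderedDict

# The reliable, order-preserving spelling: an iterable of 2-tuples.
od = OrderedDict([("eggs", 1), ("spam", 2), ("ham", 3)])
assert list(od) == ["eggs", "spam", "ham"]

# The keyword spelling *looks* equivalent, which is the attractive
# nuisance described above: on interpreters where the collected
# arguments arrive in a plain dict, the source-code order is lost.
od2 = OrderedDict(eggs=1, spam=2, ham=3)
assert set(od2) == {"eggs", "spam", "ham"}
```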
Use Cases
As Nick noted, the current behavior of **kwargs is unintuitive in cases where one would expect order to matter. Aside from more specific cases outlined below, in general "anything else where you want to control the iteration order and set field names and values in a single call will potentially benefit." [6] That matters in the case of factories (e.g. __init__()) for ordered types.
Serialization
Obviously OrderedDict would benefit (both __init__() and update()) from ordered kwargs. However, the benefit also extends to serialization APIs [2]:
In the context of serialisation, one key lesson we have learned is that arbitrary ordering is a problem when you want to minimise spurious diffs, and sorting isn't a simple solution. Tools like doctest don't tolerate spurious diffs at all, but are often amenable to a sorting based answer. The cases where it would be highly desirable to be able use keyword arguments to control the order of display of a collection of key value pairs are ones like:
- printing out key:value pairs in CLI output
- mapping semantic names to column order in a CSV
- serialising attributes and elements in particular orders in XML
- serialising map keys in particular orders in human readable formats like JSON and YAML (particularly when they're going to be placed under source control)
Debugging
In the words of Raymond Hettinger [7]:
It makes it easier to debug if the arguments show-up in the order they were created. AFAICT, no purpose is served by scrambling them.
Other Use Cases
- Mock objects. [8]
- Controlling object presentation.
- Alternate namedtuple() where defaults can be specified.
- Specifying argument priority by order.
Concerns
Performance
As already noted, the idea of ordered keyword arguments has come up on a number of occasions. Each time it has been met with the same response, namely that preserving keyword arg order would have a sufficiently adverse effect on function call performance that it's not worth doing. However, Guido noted the following [9]:
Making **kwds ordered is still open, but requires careful design and implementation to avoid slowing down function calls that don't benefit.
As will be noted below, there are ways to work around this at the expense of increased complication. Ultimately the simplest approach is the one that makes the most sense: pack collected keyword arguments into an OrderedDict. However, without a C implementation of OrderedDict there isn't much to discuss. That should change in Python 3.5. [10]
In some cases the difference in performance between dict and OrderedDict may be a concern, for instance when the collected kwargs have an extended lifetime outside the originating function or the number of collected kwargs is massive. However, the actual difference in performance (both CPU and memory) in those cases should not be significant. Furthermore, the performance of the C OrderedDict implementation is essentially identical to that of dict for the non-mutating API. A concrete measurement of the difference in performance will be a part of this proposal before its resolution.
Other Python Implementations
Another important issue to consider is that proposals for new features must be cognizant of the multiple Python implementations. At some point each of them would be expected to implement ordered kwargs. In this regard there doesn't seem to be an issue with the idea. [11] Each of the major Python implementations will be consulted regarding this proposal before its resolution.
Specification
Starting in version 3.5 Python will preserve the order of keyword arguments as passed to a function. To accomplish this the collected kwargs will now be an OrderedDict rather than a dict.
This will apply only to functions for which the definition uses the **kwargs syntax for collecting otherwise unspecified keyword arguments. Only the order of those keyword arguments will be preserved.
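A minimal sketch of the specified behaviour. (On current CPython, which preserves keyword argument order, this runs exactly as shown; at the time of the proposal it required the change described above.)

```python
def func(**kwargs):
    # Under this proposal, the collected kwargs mapping preserves the
    # order in which the keyword arguments were passed at the call site.
    return list(kwargs)

print(func(b=2, a=1, c=3))  # call-site order, not sorted order
```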
Relationship to **-unpacking syntax
The ** unpacking syntax in function calls has no special connection with this proposal. Keyword arguments provided by unpacking will be treated in exactly the same way as they are now: ones that match defined parameters are gathered there and the remainder will be collected into the ordered kwargs (just like any other unmatched keyword argument).
Note that unpacking a mapping with undefined order, such as dict, will preserve its iteration order like normal. It's just that the order will remain undefined. The OrderedDict into which the unpacked key-value pairs will then be packed will not be able to provide any alternate ordering. This should not be surprising.
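For example, unpacking an ordered mapping such as OrderedDict passes its iteration order straight through to the collected kwargs, while a plain dict is passed through in whatever order it happens to iterate:

```python
from collections import OrderedDict

def func(**kwargs):
    return list(kwargs)

# The unpacked key-value pairs are repacked into kwargs in the
# mapping's own iteration order.
od = OrderedDict([('b', 2), ('a', 1)])
print(func(**od))
```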
There have been brief discussions of simply passing these mappings through to the function's kwargs without unpacking and repacking them, but that is both outside the scope of this proposal and probably a bad idea regardless. (There is a reason those discussions were brief.)
Relationship to inspect.Signature
Signature objects should need no changes. The kwargs parameter of inspect.BoundArguments (returned by Signature.bind() and Signature.bind_partial()) will change from a dict to an OrderedDict.
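The effect on inspect.BoundArguments can be seen directly; on current CPython the kwargs mapping already preserves the order in which the extra keyword arguments were supplied to bind():

```python
import inspect

def func(a, **kwargs):
    pass

sig = inspect.signature(func)
bound = sig.bind(1, x=10, y=20)
# bound.kwargs collects the otherwise-unmatched keyword arguments,
# in the order they were supplied to bind().
print(bound.kwargs)
```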
C-API
TBD
Syntax
No syntax is added or changed by this proposal.
Backward-Compatibility
The following will change:
- type(kwargs)
- iteration order of kwargs will now be consistent (except of course in the case described above)
- as already noted, performance will be marginally different
None of these should be an issue. However, each will be carefully considered while this proposal is under discussion.
Alternate Approaches
Opt-out Decorator
This is identical to the current proposal with the exception that Python would also provide a decorator in functools that would cause collected keyword arguments to be packed into a normal dict instead of an OrderedDict.
Prognosis:
This would only be necessary if performance is determined to be significantly different in some uncommon cases or that there are other backward-compatibility concerns that cannot be resolved otherwise.
Opt-in Decorator
The status quo would be unchanged. Instead Python would provide a decorator in functools that would register or mark the decorated function as one that should get ordered keyword arguments. The performance overhead to check the function at call time would be marginal.
Prognosis:
The only real down-side is in the case of function wrappers and wrapper factories (e.g. functools.partial and many decorators) that aim to perfectly preserve keyword arguments by using kwargs in the wrapper definition and kwargs unpacking in the call to the wrapped function. Each wrapper would have to be updated separately, though having functools.wraps() do this automatically would help.
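An API-shape sketch of such an opt-in marker. The name ordered_kwargs and the flag attribute are illustrative assumptions only; no such functools API exists, and the actual ordering work would happen in the interpreter, not in the decorator:

```python
def ordered_kwargs(func):
    # Hypothetical marker: sets a flag that the interpreter (not this
    # decorator) would check at call time before packing kwargs into
    # an OrderedDict instead of a plain dict.
    func.__ordered_kwargs__ = True
    return func

@ordered_kwargs
def configure(**kwargs):
    return list(kwargs)

print(configure.__ordered_kwargs__)
```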
__kworder__
The order of keyword arguments would be stored separately in a list at call time. The list would be bound to __kworder__ in the function locals.
Prognosis:
This likewise complicates the wrapper case.
Compact dict with faster iteration
Raymond Hettinger has introduced the idea of a dict implementation that would result in preserving insertion order on dicts (until the first deletion). This would be a perfect fit for kwargs. [5]
Prognosis:
The idea is still uncertain in both viability and timeframe.
***kwargs
This would add a new form to a function's signature as a mutually exclusive parallel to **kwargs. The new syntax, ***kwargs (note that there are three asterisks), would indicate that kwargs should preserve the order of keyword arguments.
Prognosis:
New syntax is only added to Python under the most dire circumstances. With other available solutions, new syntax is not justifiable. Furthermore, like all opt-in solutions, the new syntax would complicate the pass-through case.
annotations
This is a variation on the decorator approach. Instead of using a decorator to mark the function, you would use a function annotation on **kwargs.
Prognosis:
In addition to the pass-through complication, annotations have been actively discouraged in Python core development. Use of annotations to opt-in to order preservation runs the risk of interfering with other application-level use of annotations.
dict.__order__
dict objects would have a new attribute, __order__, that would default to None and that, in the kwargs case, the interpreter would use in the same way as described above for __kworder__.
Prognosis:
It would mean zero impact on kwargs performance but the change would be pretty intrusive (Python uses dict a lot). Also, for the wrapper case the interpreter would have to be careful to preserve __order__.
KWArgsDict.__order__
This is the same as the dict.__order__ idea, but kwargs would be an instance of a new minimal dict subclass that provides the __order__ attribute. dict would instead be unchanged.
Prognosis:
Simply switching to OrderedDict is a less complicated and more intuitive change.
Acknowledgements
Thanks to Andrew Barnert for helpful feedback and to the participants of all the past email threads.
Footnotes
| [1] | Alternately, you could also replace ** in your function definition with * and then pass in key/value 2-tuples. This has the advantage of not requiring the keys to be valid identifier strings. See https://mail.python.org/pipermail/python-ideas/2014-April/027491.html. |
References
| [2] | (1, 2) https://mail.python.org/pipermail/python-ideas/2014-April/027512.html |
| [4] | https://mail.python.org/pipermail/python-dev/2007-February/071310.html |
| [6] | https://mail.python.org/pipermail/python-dev/2012-December/123105.html |
| [7] | https://mail.python.org/pipermail/python-dev/2013-May/126327.html |
| [9] | https://mail.python.org/pipermail/python-dev/2013-May/126404.html |
| [10] | http://bugs.python.org/issue16991 |
| [11] | https://mail.python.org/pipermail/python-dev/2012-December/123100.html |
Copyright
This document has been placed in the public domain.
pep-0469 Migration of dict iteration code to Python 3
| PEP: | 469 |
|---|---|
| Title: | Migration of dict iteration code to Python 3 |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Nick Coghlan <ncoghlan at gmail.com> |
| Status: | Withdrawn |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 2014-04-18 |
| Python-Version: | 3.5 |
| Post-History: | 2014-04-18, 2014-04-21 |
Contents
Abstract
For Python 3, PEP 3106 changed the design of the dict builtin and the mapping API in general to replace the separate list based and iterator based APIs in Python 2 with a merged, memory efficient set and multiset view based API. This new style of dict iteration was also added to the Python 2.7 dict type as a new set of iteration methods.
This means that there are now 3 different kinds of dict iteration that may need to be migrated to Python 3 when an application makes the transition:
- Lists as mutable snapshots: d.items() -> list(d.items())
- Iterator objects: d.iteritems() -> iter(d.items())
- Set based dynamic views: d.viewitems() -> d.items()
There is currently no widely agreed best practice on how to reliably convert all Python 2 dict iteration code to the common subset of Python 2 and 3, especially when test coverage of the ported code is limited. This PEP reviews the various ways the Python 2 iteration APIs may be accessed, and looks at the available options for migrating that code to Python 3 by way of the common subset of Python 2.6+ and Python 3.0+.
The PEP also considers the question of whether or not there are any additions that may be worth making to Python 3.5 that may ease the transition process for application code that doesn't need to worry about supporting earlier versions when eventually making the leap to Python 3.
PEP Withdrawal
In writing the second draft of this PEP, I came to the conclusion that the readability of hybrid Python 2/3 mapping code can actually be best enhanced by better helper functions rather than by making changes to Python 3.5+. The main value I now see in this PEP is as a clear record of the recommended approaches to migrating mapping iteration code from Python 2 to Python 3, as well as suggesting ways to keep things readable and maintainable when writing hybrid code that supports both versions.
Notably, I recommend that hybrid code avoid calling mapping iteration methods directly, and instead rely on builtin functions where possible, and some additional helper functions for cases that would be a simple combination of a builtin and a mapping method in pure Python 3 code, but need to be handled slightly differently to get the exact same semantics in Python 2.
Static code checkers like pylint could potentially be extended with an optional warning regarding direct use of the mapping iteration methods in a hybrid code base.
Mapping iteration models
Python 2.7 provides three different sets of methods to extract the keys, values and items from a dict instance, accounting for 9 out of the 18 public methods of the dict type.
In Python 3, this has been rationalised to just 3 out of 11 public methods (as the has_key method has also been removed).
Lists as mutable snapshots
This is the oldest of the three styles of dict iteration, and hence the one implemented by the d.keys(), d.values() and d.items() methods in Python 2.
These methods all return lists that are snapshots of the state of the mapping at the time the method was called. This has a few consequences:
- the original object can be mutated freely without affecting iteration over the snapshot
- the snapshot can be modified independently of the original object
- the snapshot consumes memory proportional to the size of the original mapping
The semantic equivalents of these operations in Python 3 are list(d.keys()), list(d.values()) and list(d.items()).
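The snapshot semantics can be demonstrated with the Python 3 spellings:

```python
d = {'a': 1, 'b': 2}
snapshot = list(d.items())  # mutable snapshot, as in Python 2 d.items()
d['c'] = 3                  # mutating d afterwards...
print(len(snapshot))        # ...does not affect the snapshot
```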
Iterator objects
In Python 2.2, dict objects gained support for the then-new iterator protocol, allowing direct iteration over the keys stored in the dictionary, thus avoiding the need to build a list just to iterate over the dictionary contents one entry at a time. iter(d) provides direct access to the iterator object for the keys.
Python 2 also provides a d.iterkeys() method that is essentially synonymous with iter(d), along with d.itervalues() and d.iteritems() methods.
These iterators provide live views of the underlying object, and hence may fail if the set of keys in the underlying object is changed during iteration:
>>> d = dict(a=1)
>>> for k in d:
...     del d[k]
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: dictionary changed size during iteration
As iterators, iteration over these objects is also a one-time operation: once the iterator is exhausted, you have to go back to the original mapping in order to iterate again.
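Both properties, failure on concurrent modification and one-shot exhaustion, are easy to observe:

```python
d = dict(a=1, b=2)

it = iter(d)                  # iterator over the keys
first_pass = sorted(it)       # consumes the iterator
second_pass = list(it)        # exhausted: a second pass yields nothing

msg = ''
try:
    for k in d:
        del d[k]              # mutating during iteration is an error
except RuntimeError as exc:
    msg = str(exc)
print(msg)
```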
In Python 3, direct iteration over mappings works the same way as it does in Python 2. There are no method based equivalents - the semantic equivalents of d.itervalues() and d.iteritems() in Python 3 are iter(d.values()) and iter(d.items()).
The six and future.utils compatibility modules also both provide iterkeys(), itervalues() and iteritems() helper functions that provide efficient iterator semantics in both Python 2 and 3.
Set based dynamic views
The model that is provided in Python 3 as a method based API is that of set based dynamic views (technically multisets in the case of the values() view).
In Python 3, the objects returned by d.keys(), d.values() and d.items() provide a live view of the current state of the underlying object, rather than taking a full snapshot of the current state as they did in Python 2. This change is safe in many circumstances, but does mean that, as with the direct iteration API, it is necessary to avoid adding or removing keys during iteration, in order to avoid encountering the following error:
>>> d = dict(a=1)
>>> for k, v in d.items():
...     del d[k]
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: dictionary changed size during iteration
Unlike the iteration API, these objects are iterables, rather than iterators: you can iterate over them multiple times, and each time they will iterate over the entire underlying mapping.
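A short demonstration of both view properties:

```python
d = {'a': 1}
items = d.items()     # live view, not a snapshot
d['b'] = 2            # the view reflects later mutations
print(len(items))

# Unlike an iterator, a view can be iterated over repeatedly:
first = sorted(items)
second = sorted(items)
print(first == second)
```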
These semantics are also available in Python 2.7 as the d.viewkeys(), d.viewvalues() and d.viewitems() methods.
The future.utils compatibility module also provides viewkeys(), viewvalues() and viewitems() helper functions when running on Python 2.7 or Python 3.x.
Migrating directly to Python 3
The 2to3 migration tool handles direct migrations to Python 3 in accordance with the semantic equivalents described above:
- d.keys() -> list(d.keys())
- d.values() -> list(d.values())
- d.items() -> list(d.items())
- d.iterkeys() -> iter(d.keys())
- d.itervalues() -> iter(d.values())
- d.iteritems() -> iter(d.items())
- d.viewkeys() -> d.keys()
- d.viewvalues() -> d.values()
- d.viewitems() -> d.items()
Rather than 9 distinct mapping methods for iteration, there are now only the 3 view methods, which combine in straightforward ways with the two relevant builtin functions to cover all of the behaviours that are available as dict methods in Python 2.7.
Note that in many cases d.keys() can be replaced by just d, but the 2to3 migration tool doesn't attempt that replacement.
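For reference, the common cases where bare d suffices:

```python
d = {'a': 1, 'b': 2}

# Iteration and membership tests operate on the keys directly,
# making an explicit d.keys() call redundant:
assert 'a' in d                  # same as: 'a' in d.keys()
assert sorted(d) == ['a', 'b']   # same as: sorted(d.keys())
assert list(d) == list(d.keys())
```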
The 2to3 migration tool also does not provide any automatic assistance for migrating references to these objects as bound or unbound methods - it only automates conversions where the API is called immediately.
Migrating to the common subset of Python 2 and 3
When migrating to the common subset of Python 2 and 3, the above transformations are not generally appropriate, as they all either result in the creation of a redundant list in Python 2, have unexpectedly different semantics in at least some cases, or both.
Since most code running in the common subset of Python 2 and 3 supports at least as far back as Python 2.6, the currently recommended approach to conversion of mapping iteration operations depends on two helper functions for efficient iteration over mapping values and mapping item tuples:
- d.keys() -> list(d)
- d.values() -> list(itervalues(d))
- d.items() -> list(iteritems(d))
- d.iterkeys() -> iter(d)
- d.itervalues() -> itervalues(d)
- d.iteritems() -> iteritems(d)
Both six and future.utils provide appropriate definitions of itervalues() and iteritems() (along with essentially redundant definitions of iterkeys()). Creating your own definitions of these functions in a custom compatibility module is also relatively straightforward:
try:
dict.iteritems
except AttributeError:
# Python 3
def itervalues(d):
return iter(d.values())
def iteritems(d):
return iter(d.items())
else:
# Python 2
def itervalues(d):
return d.itervalues()
def iteritems(d):
return d.iteritems()
The greatest loss of readability currently arises when converting code that actually needs the list based snapshots that were the default in Python 2. This readability loss could likely be mitigated by also providing listvalues and listitems helper functions, allowing the affected conversions to be simplified to:
- d.values() -> listvalues(d)
- d.items() -> listitems(d)
The corresponding compatibility function definitions are as straightforward as their iterator counterparts:
try:
dict.iteritems
except AttributeError:
# Python 3
def listvalues(d):
return list(d.values())
def listitems(d):
return list(d.items())
else:
# Python 2
def listvalues(d):
return d.values()
def listitems(d):
return d.items()
With that expanded set of compatibility functions, Python 2 code would then be converted to "idiomatic" hybrid 2/3 code as:
- d.keys() -> list(d)
- d.values() -> listvalues(d)
- d.items() -> listitems(d)
- d.iterkeys() -> iter(d)
- d.itervalues() -> itervalues(d)
- d.iteritems() -> iteritems(d)
This compares well for readability with the idiomatic pure Python 3 code that uses the mapping methods and builtins directly:
- d.keys() -> list(d)
- d.values() -> list(d.values())
- d.items() -> list(d.items())
- d.iterkeys() -> iter(d)
- d.itervalues() -> iter(d.values())
- d.iteritems() -> iter(d.items())
It's also notable that when using this approach, hybrid code would never invoke the mapping methods directly: it would always invoke either a builtin or helper function instead, in order to ensure the exact same semantics on both Python 2 and 3.
Migrating from Python 3 to the common subset with Python 2.7
While the majority of migrations are currently from Python 2 either directly to Python 3 or to the common subset of Python 2 and Python 3, there are also some migrations of newer projects that start in Python 3 and then later add Python 2 support, either due to user demand, or to gain access to Python 2 libraries that are not yet available in Python 3 (and porting them to Python 3 or creating a Python 3 compatible replacement is not a trivial exercise).
In these cases, Python 2.7 compatibility is often sufficient, and the 2.7+ only view based helper functions provided by future.utils allow the bare accesses to the Python 3 mapping view methods to be replaced with code that is compatible with both Python 2.7 and Python 3 (note, this is the only migration chart in the PEP that has Python 3 code on the left of the conversion):
- d.keys() -> viewkeys(d)
- d.values() -> viewvalues(d)
- d.items() -> viewitems(d)
- list(d.keys()) -> list(d)
- list(d.values()) -> listvalues(d)
- list(d.items()) -> listitems(d)
- iter(d.keys()) -> iter(d)
- iter(d.values()) -> itervalues(d)
- iter(d.items()) -> iteritems(d)
As with migrations from Python 2 to the common subset, note that the hybrid code ends up never invoking the mapping methods directly - it only calls builtins and helper methods, with the latter addressing the semantic differences between Python 2 and Python 3.
Possible changes to Python 3.5+
The main proposal put forward to potentially aid migration of existing Python 2 code to Python 3 is the restoration of some or all of the alternate iteration APIs to the Python 3 mapping API. In particular, the initial draft of this PEP proposed making the following conversions possible when migrating to the common subset of Python 2 and Python 3.5+:
- d.keys() -> list(d)
- d.values() -> list(d.itervalues())
- d.items() -> list(d.iteritems())
- d.iterkeys() -> d.iterkeys()
- d.itervalues() -> d.itervalues()
- d.iteritems() -> d.iteritems()
Possible mitigations of the additional language complexity in Python 3 created by restoring these methods included immediately deprecating them, as well as potentially hiding them from the dir() function (or perhaps even defining a way to make pydoc aware of function deprecations).
However, in the case where the list output is actually desired, the end result of that proposal is actually less readable than an appropriately defined helper function, and the function and method forms of the iterator versions are pretty much equivalent from a readability perspective.
So unless I've missed something critical, readily available listvalues() and listitems() helper functions look like they will improve the readability of hybrid code more than anything we could add back to the Python 3.5+ mapping API, and won't have any long term impact on the complexity of Python 3 itself.
Discussion
The fact that 5 years in to the Python 3 migration we still have users considering the dict API changes a significant barrier to migration suggests that there are problems with previously recommended approaches. This PEP attempts to explore those issues and tries to isolate those cases where previous advice (such as it was) could prove problematic.
My assessment (largely based on feedback from Twisted devs) is that problems are most likely to arise when attempting to use d.keys(), d.values(), and d.items() in hybrid code. While superficially it seems as though there should be cases where it is safe to ignore the semantic differences, in practice, the change from "mutable snapshot" to "dynamic view" is significant enough that it is likely better to just force the use of either list or iterator semantics for hybrid code, and leave the use of the view semantics to pure Python 3 code.
This approach also creates rules that are simple enough and safe enough that it should be possible to automate them in code modernisation scripts that target the common subset of Python 2 and Python 3, just as 2to3 converts them automatically when targeting pure Python 3 code.
Acknowledgements
Thanks to the folks at the Twisted sprint table at PyCon for a very vigorous discussion of this idea (and several other topics), and especially to Hynek Schlawack for acting as a moderator when things got a little too heated :)
Thanks also to JP Calderone and Itamar Turner-Trauring for their email feedback, as well to the participants in the python-dev review of the initial version of the PEP.
Copyright
This document has been placed in the public domain.
pep-0470 Using Multi Repository Support for External to PyPI Package File Hosting
| PEP: | 470 |
|---|---|
| Title: | Using Multi Repository Support for External to PyPI Package File Hosting |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Donald Stufft <donald at stufft.io>, |
| BDFL-Delegate: | Richard Jones <richard@python.org> |
| Discussions-To: | distutils-sig at python.org |
| Status: | Draft |
| Type: | Process |
| Content-Type: | text/x-rst |
| Created: | 12-May-2014 |
| Post-History: | 14-May-2014, 05-Jun-2014, 03-Oct-2014, 13-Oct-2014 |
| Replaces: | 438 |
Contents
Abstract
This PEP proposes a mechanism for project authors to register with PyPI an external repository where their project's downloads can be located. This information can then be included as part of the simple API so that installers can use it to tell users where the item they are attempting to install is located and what they need to do to enable this additional repository. In addition to adding discovery information to make explicit multiple repositories easy to use, this PEP also deprecates and removes the implicit multiple repository support which currently functions through directly or indirectly linking off site via the simple API. Finally, this PEP also proposes deprecating and removing the functionality added by PEP 438, particularly the additional rel information and the meta tag to indicate the API version.
This PEP does not propose mandating that all authors upload their projects to PyPI in order to exist in the index nor does it propose any change to the human facing elements of PyPI.
Rationale
Historically PyPI did not have any method of hosting files nor any method of automatically retrieving installables; it was instead focused on providing a central registry of names, to prevent naming collisions, and as a means of discovery for finding projects to use. In the course of time setuptools began to scrape these human facing pages, as well as pages linked from those pages, looking for things it could automatically download and install. Eventually this became the "Simple" API, which used a similar URL structure but eliminated the extraneous links and information to make the API more efficient. Additionally, PyPI grew the ability for a project to upload release files directly to PyPI, enabling PyPI to act as a repository in addition to an index.
This gives PyPI two equally important roles in the Python ecosystem: that of an index, enabling easy discovery of Python projects, and that of a central repository, enabling easy hosting, download, and installation of Python projects. Due to the history behind PyPI and the very organic growth it has experienced, the lines between these two roles are blurry. This blurring has caused confusion for the end users of both roles, and in turn ire between people attempting to use PyPI in different capacities, most often when end users want to use PyPI as a repository but the author wants to use PyPI solely as an index.
This confusion comes down to end users of projects not realizing whether a project is hosted on PyPI or relies on an external service. This often manifests itself when the external service is down but PyPI is not. People will see that PyPI works, and that other projects work, but this one specific one does not. They often do not realize whom they need to contact in order to get this fixed or what their remediation steps are.
By moving to using explicit multiple repositories we can make the lines between these two roles much more explicit and remove the "hidden" surprises caused by the current implementation of handling people who do not want to use PyPI as a repository. However simply moving to explicit multiple repositories is a regression in discoverability, and for that reason this PEP adds an extension to the current simple API which will enable easy discovery of the specific repository that a project can be found in.
PEP 438 attempted to solve this issue by allowing projects to explicitly declare if they were using the repository features or not, and if they were not, it had the installers classify the links it found as either "internal", "verifiable external" or "unverifiable external". PEP 438 was accepted and implemented in pip 1.4 (released on Jul 23, 2013) with the final transition implemented in pip 1.5 (released on Jan 2, 2014).
PEP 438 was successful in bringing more people to utilize PyPI's repository features, an altogether good thing given that the global CDN powering PyPI provides speed-ups for a lot of people. However, it did so by introducing a new point of confusion and pain for both end users and authors.
Key User Experience Expectations
- Easily allow external hosting to "just work" when appropriately configured at the system, user or virtual environment level.
- Easily allow package authors to tell PyPI "my releases are hosted <here>" and have that advertised in such a way that tools can clearly communicate it to users, without silently introducing unexpected dependencies on third party services.
- Eliminate any and all references to the confusing "verifiable external" and "unverifiable external" distinction from the user experience (both when installing and when releasing packages).
- The repository aspects of PyPI should become just the default package hosting location (i.e. the only one that is treated as opt-out rather than opt-in by most client tools in their default configuration). Aside from that aspect, hosting on PyPI should not otherwise provide an enhanced user experience over hosting your own package repository.
- Do all of the above while providing default behaviour that is secure against most attackers below the nation state adversary level.
Why Additional Repositories?
The two common installer tools, pip and easy_install/setuptools, both support the concept of additional locations to search for files to satisfy the installation requirements and have done so for many years. This means that there is no need to "phase" in a new flag or concept and the solution to installing a project from a repository other than PyPI will function regardless of how old (within reason) the end user's installer is. Not only has this concept existed in the Python tooling for some time, but it is a concept that exists across languages and even extending to the OS level with OS package tools almost universally using multiple repository support making it extremely likely that someone is already familiar with the concept.
Additionally, the multiple repository approach is a concept that is useful outside of the narrow scope of allowing projects which wish to be included on the index portion of PyPI but do not wish to utilize the repository portion of PyPI. This includes places where a company may wish to host a repository that contains their internal packages or where a project may wish to have multiple "channels" of releases, such as alpha, beta, release candidate, and final release. This could also be used for projects wishing to host files which cannot be uploaded to PyPI, such as multi-gigabyte data files or, currently at least, Linux Wheels.
Why Not PEP 438 or Similar?
While the additional search location support has existed in pip and setuptools for quite some time, support for PEP 438 has only existed in pip since the 1.4 version, and has yet to be implemented in setuptools. The design of PEP 438 did mean that users of projects which did not require external files still benefited even with older installers; however, for projects which did require external files, users were still silently being given either potentially unreliable or, even worse, unsafe files to download. This system is also unique to Python, as it arises out of the history of PyPI; this means that the concept will almost certainly be foreign to most, if not all, users until they encounter it while attempting to use the Python toolchain.
Additionally, the classification system proposed by PEP 438 has, in practice, turned out to be extremely confusing to end users, so much so that it is a position of this PEP that the situation as it stands is completely untenable. The common pattern for a user with this system is to attempt to install a project and possibly get an error message (or maybe not, if the project ever uploaded something to PyPI but later switched without removing old files), see that the error message suggests --allow-external, reissue the command adding that flag and most likely get another error message, see that this time the error message suggests also adding --allow-unverified, and issue the command a third time, finally getting the thing they wish to install.
This UX failure exists for several reasons.
If pip can locate any files at all for a project on the Simple API, it will simply use those instead of attempting to locate more. This is generally the right thing to do, as attempting to locate more would erase a large part of the benefit of PEP 438. It also means that if a project ever uploaded a file that matches what the user has requested for install, that file will be used regardless of how old it is.
PEP 438 makes an implicit assumption that most projects would either upload themselves to PyPI or would update themselves to link directly to release files. While a large number of projects did ultimately decide to upload to PyPI, some of them did so only because the UX around what PEP 438 required was so bad that they felt forced to do so. More concerning, however, is the fact that very few projects have opted to directly and safely link to files; instead they still simply link to pages which must be scraped in order to find the actual files, thus rendering the safe variant (--allow-external) largely useless.
Even if an author wishes to directly link to their files, doing so safely is non-obvious. It requires including an MD5 hash (for historical reasons) in the URL fragment. If they do not include this, their files will be considered "unverified".
PEP 438 takes a security-centric view and disallows any form of global opt-in for unverified projects. While this is generally a good thing, it creates extremely verbose and repetitive command invocations such as:
$ pip install --allow-external myproject --allow-unverified myproject myproject
$ pip install --allow-all-external --allow-unverified myproject myproject
Multiple Repository/Index Support
Installers SHOULD implement, or continue to offer, the ability to point the installer at multiple URL locations. The exact mechanism for a user to indicate that they wish to use an additional location is left up to each individual implementation.
Additionally, the mechanism for discovering an installation candidate when multiple repositories are being used is also up to each individual implementation. However, once configured, an implementation should not discourage, warn about, or otherwise cast a negative light upon the use of a repository simply because it is not the default repository.
Currently both pip and setuptools implement multiple repository support by using the best installation candidate they can find from any configured repository, essentially treating them as if they were one large repository.
Installers SHOULD also implement some mechanism for removing or otherwise disabling use of the default repository. The exact specifics of how that is achieved are up to each individual implementation.
Installers SHOULD also implement some mechanism for whitelisting and blacklisting which projects a user wishes to install from a particular repository. The exact specifics of how that is achieved are up to each individual implementation.
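As one concrete illustration, these three capabilities map onto options that pip already exposes; the repository URLs below are placeholders, and the same settings can be made persistent in pip's configuration file:

```ini
# Add an extra repository alongside the default:
#   pip install --extra-index-url https://index.example.com/simple/ myproject
# Replace the default repository entirely:
#   pip install --index-url https://index.example.com/simple/ myproject
# Skip remote indexes and use a flat "find links" page instead:
#   pip install --no-index --find-links https://links.example.com/ myproject

# Equivalent persistent configuration (e.g. ~/.config/pip/pip.conf):
[global]
index-url = https://index.example.com/simple/
extra-index-url = https://extra.example.com/simple/
```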
External Index Discovery
One of the problems with using an additional index is discovery. Users will not generally be aware that an additional index is required at all, much less where that index can be found. Projects can attempt to convey this information in their description on the PyPI page; however, that excludes people who discover the project organically through pip search.
To support projects that wish to externally host their files, and to enable users to easily discover what additional indexes are required, PyPI will gain the ability for projects to register external index URLs along with an associated comment for each. These URLs will be made available on the simple page; however, they will not be linked, nor provided in a form that would cause older installers to automatically search them.
This ability will take the form of a <meta> tag. The name of this tag must be set to repository or find-link and the content will be a link to the location of the repository. An optional data-description attribute will convey any comments or description that the author has provided.
An example would look something like:
<meta name="repository" content="https://index.example.com/" data-description="Primary Repository">
<meta name="repository" content="https://index.example.com/Ubuntu-14.04/" data-description="Wheels built for Ubuntu 14.04">
<meta name="find-link" content="https://links.example.com/find-links/" data-description="A flat index for find links">
When an installer fetches the simple page for a project and finds this additional metadata, it should use this data to tell the user how to add one or more of the additional URLs to search. This message should include any comments that the project has included, enabling the project to communicate with the user and provide hints as to which URL they might want (e.g. if some are only useful or compatible with certain platforms or situations). Once an installer has implemented the auto-discovery mechanisms, it should also deprecate the mechanisms added for PEP 438 (such as --allow-external) for removal at the end of the deprecation period proposed by this PEP.
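An installer could pull these tags out of the fetched simple page with nothing more than the standard library. The following is a minimal sketch; the class and variable names are invented for illustration:

```python
from html.parser import HTMLParser

class RepositoryMetaParser(HTMLParser):
    """Collect the proposed repository/find-link <meta> tags."""

    def __init__(self):
        super().__init__()
        self.repositories = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if attrs.get("name") in ("repository", "find-link"):
            self.repositories.append(
                (attrs["name"],
                 attrs.get("content"),
                 attrs.get("data-description", ""))
            )

# The same example tags shown above:
page = (
    '<meta name="repository" content="https://index.example.com/" '
    'data-description="Primary Repository">'
    '<meta name="find-link" content="https://links.example.com/find-links/" '
    'data-description="A flat index for find links">'
)
parser = RepositoryMetaParser()
parser.feed(page)
for kind, url, description in parser.repositories:
    print(f"{kind}: {url} ({description})")
```

An installer would then surface each URL and its description to the user, rather than searching the URLs automatically.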
In addition to the API for programmatic access to the registered external repositories, PyPI will also present these URLs in the UI so that users with an installer that does not implement the discovery mechanism can still easily discover what repository the project is using to host itself.
This feature MUST be added to PyPI and be contained in a released version of pip prior to starting the deprecation and removal process for the implicit offsite hosting functionality.
Deprecation and Removal of Link Spidering
Important
The deprecation specified in this section MUST NOT start until after the discovery mechanisms have been implemented and released in pip.
The only exception to this is the addition of the pypi-only mode and the defaulting of new projects to it, without the ability to switch to a different mode.
A new hosting mode will be added to PyPI. This hosting mode will be called pypi-only and will be in addition to the three that PEP 438 has already given us: pypi-explicit, pypi-scrape, and pypi-scrape-crawl. This new hosting mode will modify a project's simple API page so that it only lists the files which are directly hosted on PyPI and will not link to anything else.
Upon acceptance of this PEP and the addition of the pypi-only mode, all new projects will default to the pypi-only mode, and they will be locked to it, unable to change this particular setting. pypi-only projects will still be able to register external index URLs as described above; "pypi-only" refers only to the download links that are published directly on PyPI.
An email will then be sent to all of the projects which are hosted only on PyPI, informing them that in one month their project will be automatically converted to the pypi-only mode. A month after these emails have been sent, any of the emailed projects which are still hosted only on PyPI will have their mode set to pypi-only.
After that switch, an email will be sent to projects which rely on hosting external to PyPI. This email will warn these projects that externally hosted files have been deprecated on PyPI and that, six months from the time of that email, all external links will be removed from the installer APIs. This email MUST include instructions for converting their projects to be hosted on PyPI and MUST include links to a script or package that will enable them to enter their PyPI credentials and package name and have it automatically download and re-host all of their files on PyPI. This email MUST also include instructions for setting up their own index page and registering that with PyPI, including the fact that they can use pythonhosted.org as a host for an index page without being required to host any additional infrastructure or purchase a TLS certificate. This email must also contain a link to the Terms of Service for PyPI, as many users may have signed up a long time ago and may not recall what those terms are. Finally, this email must also contain a list of the links registered with PyPI where we were able to detect that an installable file was located.
Five months after the initial email, another email must be sent to any projects still relying on external hosting. This email will include all of the same information that the first email contained, except that the removal date will be one month away instead of six.
Finally, a month later, all projects will be switched to the pypi-only mode and PyPI will be modified to remove the externally linked files functionality. When switching these projects to the pypi-only mode, any links which are able to be used for discovering other projects will automatically be moved to the new external repository metadata.
Summary of Changes
Repository side
- Implement simple API changes to allow the addition of an external repository.
- (Optional, Mandatory on PyPI) Deprecate and remove the hosting modes as defined by PEP 438.
- (Optional, Mandatory on PyPI) Restrict simple API to only list the files that are contained within the repository and the external repository metadata.
Client side
- Implement multiple repository support.
- Implement some mechanism for removing/disabling the default repository.
- Implement the discovery mechanism.
- (Optional) Deprecate / remove the PEP 438 mechanisms.
Impact
The largest impact of this PEP will be that users of older installation clients will not get a discovery mechanism built into the install command. This will require them to browse to the PyPI web UI and discover the repository there. Since any URLs required to install a project will be automatically migrated to the new format, the biggest change for users will be requiring a new option to install these projects.
Looking at the numbers, the actual impact should be quite low, affecting just the 3.8% of projects which host any files only externally, or the 2.2% which have their latest version hosted only externally.
6674 unique IP addresses accessed the Simple API for these 3.8% of projects in a single day (2014-09-30). Of those, 99.5% installed something which could not be verified, and thus were open to Remote Code Execution via a Man-In-The-Middle attack, while 7.9% installed something which could be verified and only 0.4% installed exclusively things which could be verified.
This means that 99.5% of users of these features, both new and old, are doing something unsafe, and anyone using an older copy of pip, or using setuptools at all, is silently unsafe.
Projects Which Rely on Externally Hosted files
This is determined by crawling the simple index and looking for installable files, using a detection method similar to the one pip and setuptools use. The "latest" version is determined using the pkg_resources.parse_version sort order, and it is used to show whether the latest version is hosted externally or only old versions are.
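The pkg_resources.parse_version ordering mentioned above differs from plain string comparison, which matters when deciding which release is "latest". A minimal sketch (the version strings are made up):

```python
from pkg_resources import parse_version

# parse_version gives version-aware ordering, not lexicographic order:
versions = ["1.2", "1.10", "1.9b1", "1.9"]
print(max(versions))                        # "1.9b1" -- wrong, string order
print(max(versions, key=parse_version))     # "1.10"  -- correct
print(sorted(versions, key=parse_version))  # ['1.2', '1.9b1', '1.9', '1.10']
```

Note that the pre-release "1.9b1" correctly sorts before the final "1.9", and "1.10" sorts after "1.9" rather than after "1.1".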
| | PyPI | External (old) | External (latest) | Total |
|---|---|---|---|---|
| Safe | 43313 | 16 | 39 | 43368 |
| Unsafe | 0 | 756 | 1092 | 1848 |
| Total | 43313 | 772 | 1131 | 45216 |
Top Externally Hosted Projects by Requests
This is determined by looking at the number of requests the /simple/<project>/ page had gotten in a single day. The total number of requests during that day was 10,623,831.
| Project | Requests |
|---|---|
| PIL | 63869 |
| Pygame | 2681 |
| mysql-connector-python | 1562 |
| pyodbc | 724 |
| elementtree | 635 |
| salesforce-python-toolkit | 316 |
| wxPython | 295 |
| PyXML | 251 |
| RBTools | 235 |
| python-graph-core | 123 |
| cElementTree | 121 |
Top Externally Hosted Projects by Unique IPs
This is determined by looking at the IP addresses of requests the /simple/<project>/ page had gotten in a single day. The total number of unique IP addresses during that day was 124,604.
| Project | Unique IPs |
|---|---|
| PIL | 4553 |
| mysql-connector-python | 462 |
| Pygame | 202 |
| pyodbc | 181 |
| elementtree | 166 |
| wxPython | 126 |
| RBTools | 114 |
| PyXML | 87 |
| salesforce-python-toolkit | 76 |
| pyDes | 76 |
Rejected Proposals
Keep the current classification system but adjust the options
This PEP rejects several related proposals which attempt to fix some of the usability problems with the current system while still keeping the general gist of PEP 438.
This includes:
- Default to allowing safely externally hosted files, but disallow unsafely hosted.
- Default to disallowing safely externally hosted files with only a global flag to enable them, but disallow unsafely hosted.
- Continue on the suggested path of PEP 438 and remove the option to unsafely host externally but continue to allow the option to safely host externally.
These proposals are rejected because:
- The classification system introduced in PEP 438 is an entirely unique concept to PyPI, not generically applicable even in the context of Python packaging. Adding additional concepts comes at a cost.
- The classification system itself is non-obvious to explain, and pre-determining what classification of link a project will require entails inspecting the project's /simple/<project>/ page and possibly any URLs linked from that page.
- The ability to host externally while still being linked for automatic discovery is mostly a historic relic which causes a fair amount of pain and complexity for little reward.
- The installer's ability to optimize or clean up the user interface is limited by the nature of the implicit link scraping which would need to be done. This extends to the --allow-* options, as well as the inability to determine whether a link is expected to fail or not.
- The mechanism paints with a very broad brush when enabling an option, and while PEP 438 attempts to limit this with per-package options, a project that has existed for an extended period of time may often have several different URLs listed in its simple index. It is not unusual for at least one of these to no longer be under the control of the project. While an unregistered domain will sit there relatively harmless most of the time, pip will continue to attempt to install from it on every discovery phase. This means that an attacker simply needs to look at projects which rely on unsafe external URLs and register expired domains to attack users.
Implement this PEP, but Do Not Remove the Existing Links
This is essentially the backwards compatible version of this PEP. It attempts to allow people using older clients, or clients which do not implement this PEP to continue on as if nothing had changed. This proposal is rejected because the vast bulk of those scenarios are unsafe uses of the deprecated features. It is the opinion of this PEP that silently allowing unsafe actions to take place on behalf of end users is simply not an acceptable solution.
Copyright
This document has been placed in the public domain.
pep-0471 os.scandir() function -- a better and faster directory iterator
| PEP: | 471 |
|---|---|
| Title: | os.scandir() function -- a better and faster directory iterator |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Ben Hoyt <benhoyt at gmail.com> |
| BDFL-Delegate: | Victor Stinner <victor.stinner@gmail.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 30-May-2014 |
| Python-Version: | 3.5 |
| Post-History: | 27-Jun-2014, 8-Jul-2014, 14-Jul-2014 |
Contents
- Abstract
- Rationale
- Implementation
- Specifics of proposal
- Examples
- Support
- Use in the wild
- Rejected ideas
- Naming
- Wildcard support
- Methods not following symlinks by default
- DirEntry attributes being properties
- DirEntry fields being "static" attribute-only objects
- DirEntry fields being static with an ensure_lstat option
- Return values being (name, stat_result) two-tuples
- Return values being overloaded stat_result objects
- Return values being pathlib.Path objects
- Possible improvements
- Previous discussion
- References
- Copyright
Abstract
This PEP proposes including a new directory iteration function, os.scandir(), in the standard library. This new function adds useful functionality and increases the speed of os.walk() by 2-20 times (depending on the platform and file system) by avoiding calls to os.stat() in most cases.
Rationale
Python's built-in os.walk() is significantly slower than it needs to be, because -- in addition to calling os.listdir() on each directory -- it executes the stat() system call or GetFileAttributes() on each file to determine whether the entry is a directory or not.
But the underlying system calls -- FindFirstFile / FindNextFile on Windows and readdir on POSIX systems -- already tell you whether the files returned are directories or not, so no further system calls are needed. Further, the Windows system calls return all the information for a stat_result object on the directory entry, such as file size and last modification time.
In short, you can reduce the number of system calls required for a tree function like os.walk() from approximately 2N to N, where N is the total number of files and directories in the tree. (And because directory trees are usually wider than they are deep, it's often much better than this.)
In practice, removing all those extra system calls makes os.walk() about 8-9 times as fast on Windows, and about 2-3 times as fast on POSIX systems. So we're not talking about micro- optimizations. See more benchmarks here [1].
Somewhat relatedly, many people (see Python Issue 11406 [2]) are also keen on a version of os.listdir() that yields filenames as it iterates instead of returning them as one big list. This improves memory efficiency for iterating very large directories.
So, as well as providing a scandir() iterator function for calling directly, Python's existing os.walk() function can be sped up a huge amount.
Implementation
The implementation of this proposal was written by Ben Hoyt (initial version) and Tim Golden (who helped a lot with the C extension module). It lives on GitHub at benhoyt/scandir [3]. (The implementation may lag behind the updates to this PEP a little.)
Note that this module has been used and tested (see "Use in the wild" section in this PEP), so it's more than a proof-of-concept. However, it is marked as beta software and is not extensively battle-tested. It will need some cleanup and more thorough testing before going into the standard library, as well as integration into posixmodule.c.
Specifics of proposal
os.scandir()
Specifically, this PEP proposes adding a single function to the os module in the standard library, scandir, that takes a single, optional string as its argument:
scandir(path='.') -> generator of DirEntry objects
Like listdir, scandir calls the operating system's directory iteration system calls to get the names of the files in the given path, but it's different from listdir in two ways:
- Instead of returning bare filename strings, it returns lightweight DirEntry objects that hold the filename string and provide simple methods that allow access to the additional data the operating system may have returned.
- It returns a generator instead of a list, so that scandir acts as a true iterator instead of returning the full list immediately.
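The second point can be seen directly. In this sketch (stdlib only, using a throwaway temporary directory), only the entries actually requested are fetched:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    for name in ("a.txt", "b.txt", "c.txt"):
        open(os.path.join(d, name), "w").close()

    it = os.scandir(d)   # no full directory listing has been built
    first = next(it)     # pulls a single entry, in system-dependent order
    print(type(first).__name__, first.name)
    it.close()           # close explicitly if the iterator isn't exhausted
```

This is what makes scandir() usable on very large directories without building a list of every name up front.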
scandir() yields a DirEntry object for each file and sub-directory in path. Just like listdir, the '.' and '..' pseudo-directories are skipped, and the entries are yielded in system-dependent order. Each DirEntry object has the following attributes and methods:
- name: the entry's filename, relative to the scandir path argument (corresponds to the return values of os.listdir)
- path: the entry's full path name (not necessarily an absolute path) -- the equivalent of os.path.join(scandir_path, entry.name)
- is_dir(*, follow_symlinks=True): similar to pathlib.Path.is_dir(), but the return value is cached on the DirEntry object; doesn't require a system call in most cases; don't follow symbolic links if follow_symlinks is False
- is_file(*, follow_symlinks=True): similar to pathlib.Path.is_file(), but the return value is cached on the DirEntry object; doesn't require a system call in most cases; don't follow symbolic links if follow_symlinks is False
- is_symlink(): similar to pathlib.Path.is_symlink(), but the return value is cached on the DirEntry object; doesn't require a system call in most cases
- stat(*, follow_symlinks=True): like os.stat(), but the return value is cached on the DirEntry object; does not require a system call on Windows (except for symlinks); don't follow symbolic links (like os.lstat()) if follow_symlinks is False
All methods may perform system calls in some cases and therefore possibly raise OSError -- see the "Notes on exception handling" section for more details.
The DirEntry attribute and method names were chosen to be the same as those in the new pathlib module where possible, for consistency. The only difference in functionality is that the DirEntry methods cache their values on the entry object after the first call.
Like the other functions in the os module, scandir() accepts either a bytes or str object for the path parameter, and returns the DirEntry.name and DirEntry.path attributes with the same type as path. However, it is strongly recommended to use the str type, as this ensures cross-platform support for Unicode filenames. (On Windows, bytes filenames have been deprecated since Python 3.3).
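The type propagation can be demonstrated with a throwaway directory; this is a sketch, with os.fsencode used only to produce a bytes path:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "data.txt"), "w").close()

    # str path in -> str names out (the recommended usage)
    with os.scandir(d) as it:
        entry = next(it)
    print(type(entry.name))

    # bytes path in -> bytes names out (works, but discouraged)
    with os.scandir(os.fsencode(d)) as it:
        entry = next(it)
    print(type(entry.name))
```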
os.walk()
As part of this proposal, os.walk() will also be modified to use scandir() rather than listdir() and os.path.isdir(). This will increase the speed of os.walk() very significantly (as mentioned above, by 2-20 times, depending on the system).
Examples
First, a very simple example of scandir() showing use of the DirEntry.name attribute and the DirEntry.is_dir() method:
def subdirs(path):
"""Yield directory names not starting with '.' under given path."""
for entry in os.scandir(path):
if not entry.name.startswith('.') and entry.is_dir():
yield entry.name
This subdirs() function will be significantly faster with scandir than os.listdir() and os.path.isdir() on both Windows and POSIX systems, especially on medium-sized or large directories.
Or, for getting the total size of files in a directory tree, showing use of the DirEntry.stat() method and DirEntry.path attribute:
def get_tree_size(path):
"""Return total size of files in given path and subdirs."""
total = 0
for entry in os.scandir(path):
if entry.is_dir(follow_symlinks=False):
total += get_tree_size(entry.path)
else:
total += entry.stat(follow_symlinks=False).st_size
return total
This also shows the use of the follow_symlinks parameter to is_dir() -- in a recursive function like this, we probably don't want to follow links. (To properly follow links in a recursive function like this we'd want special handling for the case where following a symlink leads to a recursive loop.)
Note that get_tree_size() will get a huge speed boost on Windows, because no extra stat calls are needed, but on POSIX systems the size information is not returned by the directory iteration functions, so this function won't gain anything there.
Notes on caching
The DirEntry objects are relatively dumb -- the name and path attributes are obviously always cached, and the is_X and stat methods cache their values (immediately on Windows via FindNextFile, and on first use on POSIX systems via a stat system call) and never refetch from the system.
For this reason, DirEntry objects are intended to be used and thrown away after iteration, not stored in long-lived data structures with their methods called again and again.
If developers want "refresh" behaviour (for example, for watching a file's size change), they can simply use pathlib.Path objects, or call the regular os.stat() or os.path.getsize() functions which get fresh data from the operating system every call.
Notes on exception handling
DirEntry.is_X() and DirEntry.stat() are explicitly methods rather than attributes or properties, to make it clear that they may not be cheap operations (although they often are), and they may do a system call. As a result, these methods may raise OSError.
For example, DirEntry.stat() will always make a system call on POSIX-based systems, and the DirEntry.is_X() methods will make a stat() system call on such systems if readdir() does not support d_type or returns a d_type with a value of DT_UNKNOWN, which can occur under certain conditions or on certain file systems.
Often this does not matter -- for example, os.walk() as defined in the standard library only catches errors around the listdir() calls.
Also, because the exception-raising behaviour of the DirEntry.is_X methods matches that of pathlib -- which only raises OSError in the case of permissions or other fatal errors, but returns False if the path doesn't exist or is a broken symlink -- it's often not necessary to catch errors around the is_X() calls.
However, when a user requires fine-grained error handling, it may be desirable to catch OSError around all method calls and handle as appropriate.
For example, below is a version of the get_tree_size() example shown above, but with fine-grained error handling added:
def get_tree_size(path):
"""Return total size of files in path and subdirs. If
is_dir() or stat() fails, print an error message to stderr
and assume zero size (for example, file has been deleted).
"""
total = 0
for entry in os.scandir(path):
try:
is_dir = entry.is_dir(follow_symlinks=False)
except OSError as error:
print('Error calling is_dir():', error, file=sys.stderr)
continue
if is_dir:
total += get_tree_size(entry.path)
else:
try:
total += entry.stat(follow_symlinks=False).st_size
except OSError as error:
print('Error calling stat():', error, file=sys.stderr)
return total
Support
The scandir module on GitHub has been forked and used quite a bit (see "Use in the wild" in this PEP), but there's also been a fair bit of direct support for a scandir-like function from core developers and others on the python-dev and python-ideas mailing lists. A sampling:
- python-dev: a good number of +1's and very few negatives for scandir and PEP 471 on this June 2014 python-dev thread
- Nick Coghlan, a core Python developer: "I've had the local Red Hat release engineering team express their displeasure at having to stat every file in a network mounted directory tree for info that is present in the dirent structure, so a definite +1 to os.scandir from me, so long as it makes that info available." [source1]
- Tim Golden, a core Python developer, supports scandir enough to have spent time refactoring and significantly improving scandir's C extension module. [source2]
- Christian Heimes, a core Python developer: "+1 for something like yielddir()" [source3] and "Indeed! I'd like to see the feature in 3.4 so I can remove my own hack from our code base." [source4]
- Gregory P. Smith, a core Python developer: "As 3.4beta1 happens tonight, this isn't going to make 3.4 so i'm bumping this to 3.5. I really like the proposed design outlined above." [source5]
- Guido van Rossum on the possibility of adding scandir to Python 3.5 (as it was too late for 3.4): "The ship has likewise sailed for adding scandir() (whether to os or pathlib). By all means experiment and get it ready for consideration for 3.5, but I don't want to add it to 3.4." [source6]
Support for this PEP itself (meta-support?) was given by Nick Coghlan on python-dev: "A PEP reviewing all this for 3.5 and proposing a specific os.scandir API would be a good thing." [source7]
Use in the wild
To date, the scandir implementation is definitely useful, but has been clearly marked "beta", so it's uncertain how much use of it there is in the wild. Ben Hoyt has had several reports from people using it. For example:
- Chris F: "I am processing some pretty large directories and was half expecting to have to modify getdents. So thanks for saving me the effort." [via personal email]
- bschollnick: "I wanted to let you know about this, since I am using Scandir as a building block for this code. Here's a good example of scandir making a radical performance improvement over os.listdir." [source8]
- Avram L: "I'm testing our scandir for a project I'm working on. Seems pretty solid, so first thing, just want to say nice work!" [via personal email]
- Matt Z: "I used scandir to dump the contents of a network dir in under 15 seconds. 13 root dirs, 60,000 files in the structure. This will replace some old VBA code embedded in a spreadsheet that was taking 15-20 minutes to do the exact same thing." [via personal email]
Others have requested a PyPI package [4] for it, which has been created. See PyPI package [5].
GitHub stats don't mean too much, but scandir does have several watchers, issues, forks, etc. Here's the run-down as of July 7, 2014:
- Watchers: 17
- Stars: 57
- Forks: 20
- Issues: 4 open, 26 closed
Also, because this PEP will increase the speed of os.walk() significantly, there are thousands of developers and scripts, and a lot of production code, that would benefit from it. For example, on GitHub, there are almost as many uses of os.walk (194,000) as there are of os.mkdir (230,000).
Rejected ideas
Naming
The only other real contender for this function's name was iterdir(). However, iterX() functions in Python (mostly found in Python 2) tend to be simple iterator equivalents of their non-iterator counterparts. For example, dict.iterkeys() is just an iterator version of dict.keys(), but the objects returned are identical. In scandir()'s case, however, the return values are quite different objects (DirEntry objects vs filename strings), so this should probably be reflected by a difference in name -- hence scandir().
See some relevant discussion on python-dev.
Wildcard support
FindFirstFile/FindNextFile on Windows support passing a "wildcard" like *.jpg, so at first folks (this PEP's author included) felt it would be a good idea to include a windows_wildcard keyword argument to the scandir function so users could pass this in.
However, on further thought and discussion it was decided that this would be a bad idea, unless it could be made cross-platform (a pattern keyword argument or similar). This seems easy enough at first -- just use the OS wildcard support on Windows, and something like fnmatch or re afterwards on POSIX-based systems.
Unfortunately the exact Windows wildcard matching rules aren't really documented anywhere by Microsoft, and they're quite quirky (see this blog post), meaning it's very problematic to emulate using fnmatch or regexes.
So the consensus was that Windows wildcard support was a bad idea. It would be possible to add at a later date if there's a cross-platform way to achieve it, but not for the initial version.
Read more on this Nov 2012 python-ideas thread and this June 2014 python-dev thread on PEP 471.
Methods not following symlinks by default
There was much debate on python-dev (see messages in this thread) over whether the DirEntry methods should follow symbolic links or not (when the is_X() methods had no follow_symlinks parameter).
Initially they did not (see previous versions of this PEP and the scandir.py module), but Victor Stinner made a pretty compelling case on python-dev that following symlinks by default is a better idea, because:
- following links is usually what you want (in 92% of cases in the standard library, functions using os.listdir() and os.path.isdir() do follow symlinks)
- that's the precedent set by the similar functions os.path.isdir() and pathlib.Path.is_dir(), so to do otherwise would be confusing
- with the non-link-following approach, if you wanted to follow links you'd have to say something like if (entry.is_symlink() and os.path.isdir(entry.path)) or entry.is_dir(), which is clumsy
As a case in point that shows the non-symlink-following version is error prone, this PEP's author had a bug caused by getting this exact test wrong in his initial implementation of scandir.walk() in scandir.py (see Issue #4 here).
In the end there was not total agreement that the methods should follow symlinks, but there was basic consensus among the most involved participants, and this PEP's author believes that the above case is strong enough to warrant following symlinks by default.
In addition, it's straightforward to call the relevant methods with follow_symlinks=False if the other behaviour is desired.
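A quick sketch of both behaviours (the helper functions are our own illustration):

```python
import os

def count_dirs(path):
    # is_dir() follows symlinks by default, matching os.path.isdir()
    return sum(1 for entry in os.scandir(path) if entry.is_dir())

def count_dirs_nofollow(path):
    # follow_symlinks=False asks about the directory entry itself
    return sum(1 for entry in os.scandir(path)
               if entry.is_dir(follow_symlinks=False))
```
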
DirEntry attributes being properties
In some ways it would be nicer for the DirEntry is_X() and stat() to be properties instead of methods, to indicate they're very cheap or free. However, this isn't quite the case, as stat() will require an OS call on POSIX-based systems but not on Windows. Even is_dir() and friends may perform an OS call on POSIX-based systems if the dirent.d_type value is DT_UNKNOWN (on certain file systems).
Also, people would expect an attribute access such as entry.is_dir to only ever raise AttributeError, not OSError, in the case where it makes a system call under the covers. Calling code would have to wrap what looks like a simple attribute access in a try/except, so it's much better to make them methods.
See this May 2013 python-dev thread where this PEP author makes this case and there's agreement from core developers.
DirEntry fields being "static" attribute-only objects
In this July 2014 python-dev message, Paul Moore suggested a solution that was a "thin wrapper round the OS feature", where the DirEntry object had only static attributes: name, path, and is_X, with the st_X attributes only present on Windows. The idea was to use this simpler, lower-level function as a building block for higher-level functions.
At first there was general agreement that simplifying in this way was a good thing. However, there were two problems with this approach. First, it assumes the is_dir and similar attributes are always present on POSIX, which isn't the case (if d_type is not present or is DT_UNKNOWN). Second, it's a much harder-to-use API in practice: because the is_dir attributes aren't always present on POSIX, callers would need to test for them with hasattr() and call os.stat() when they're missing.
See this July 2014 python-dev response from this PEP's author detailing why this option is a non-ideal solution, and the subsequent reply from Paul Moore voicing agreement.
DirEntry fields being static with an ensure_lstat option
Another seemingly simpler and attractive option was suggested by Nick Coghlan in this June 2014 python-dev message: make DirEntry.is_X and DirEntry.lstat_result properties, and populate DirEntry.lstat_result at iteration time, but only if the new argument ensure_lstat=True was specified on the scandir() call.
This does have the advantage over the above in that you can easily get the stat result from scandir() if you need it. However, it has the serious disadvantage that fine-grained error handling is messy, because stat() will be called (and hence potentially raise OSError) during iteration, leading to a rather ugly, hand-made iteration loop:
it = os.scandir(path)
while True:
    try:
        entry = next(it)
    except OSError as error:
        handle_error(path, error)
    except StopIteration:
        break
Or it means that scandir() would have to accept an onerror argument -- a function to call when stat() errors occur during iteration. This seems to this PEP's author neither as direct nor as Pythonic as try/except around a DirEntry.stat() call.
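The preferred try/except style can be sketched as follows (the helper function is our own illustration):

```python
import os

def sizes(path):
    # the style PEP 471 prefers: stat() is called lazily inside the loop
    # body, so a plain try/except gives fine-grained, per-entry handling
    result = {}
    for entry in os.scandir(path):
        try:
            result[entry.name] = entry.stat().st_size
        except OSError:
            result[entry.name] = None  # e.g. entry removed mid-iteration
    return result
```
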
Another drawback is that os.scandir() is written to make code faster. Always calling os.lstat() on POSIX would not bring any speedup. In most cases, you don't need the full stat_result object -- the is_X() methods are enough and this information is already known.
See Ben Hoyt's July 2014 reply to the discussion summarizing this and detailing why he thinks the original PEP 471 proposal is "the right one" after all.
Return values being (name, stat_result) two-tuples
Initially this PEP's author proposed this concept as a function called iterdir_stat() which yielded two-tuples of (name, stat_result). This does have the advantage that there are no new types introduced. However, the stat_result is only partially filled on POSIX-based systems (most fields set to None and other quirks), so they're not really stat_result objects at all, and this would have to be thoroughly documented as different from os.stat().
Also, Python has good support for proper objects with attributes and methods, which makes for a saner and simpler API than two-tuples. It also makes the DirEntry objects more extensible and future-proof as operating systems add functionality and we want to include this in DirEntry.
See also some previous discussion:
- May 2013 python-dev thread where Nick Coghlan makes the original case for a DirEntry-style object.
- June 2014 python-dev thread where Nick Coghlan makes (another) good case against the two-tuple approach.
Return values being overloaded stat_result objects
Another alternative discussed was making the return values overloaded stat_result objects with name and path attributes. However, apart from this being a strange (and strained!) kind of overloading, it has the same problems mentioned above -- most of the stat_result information is not fetched by readdir() on POSIX systems, only (part of) the st_mode value.
Return values being pathlib.Path objects
With Antoine Pitrou's new standard library pathlib module, it at first seems like a great idea for scandir() to return instances of pathlib.Path. However, pathlib.Path's is_X() and stat() functions are explicitly not cached, whereas scandir has to cache them by design, because it's (often) returning values from the original directory iteration system call.
And if the pathlib.Path instances returned by scandir cached stat values, but the ordinary pathlib.Path objects explicitly don't, that would be more than a little confusing.
Guido van Rossum explicitly rejected pathlib.Path caching stat in the context of scandir here, making pathlib.Path objects a bad choice for scandir return values.
Possible improvements
There are many possible improvements one could make to scandir, but here is a short list of some this PEP's author has in mind:
- scandir could potentially be further sped up by calling readdir / FindNextFile say 50 times per Py_BEGIN_ALLOW_THREADS block, so that it stays in the C extension module for longer and may be somewhat faster as a result. This approach hasn't been tested, but was suggested on Issue 11406 [2] by Antoine Pitrou.
- scandir could use a free list to avoid the cost of memory allocation for each iteration -- a short free list of 10 or maybe even 1 may help. Suggested by Victor Stinner on a python-dev thread on June 27 [6].
Previous discussion
- Original November 2012 thread Ben Hoyt started on python-ideas about speeding up os.walk()
- Python Issue 11406 [2], which includes the original proposal for a scandir-like function
- Further May 2013 thread Ben Hoyt started on python-dev that refined the scandir() API, including Nick Coghlan's suggestion of scandir yielding DirEntry-like objects
- November 2013 thread Ben Hoyt started on python-dev to discuss the interaction between scandir and the new pathlib module
- June 2014 thread Ben Hoyt started on python-dev to discuss the first version of this PEP, with extensive discussion about the API
- First July 2014 thread Ben Hoyt started on python-dev to discuss his updates to PEP 471
- Second July 2014 thread Ben Hoyt started on python-dev to discuss the remaining decisions needed to finalize PEP 471, specifically whether the DirEntry methods should follow symlinks by default
- Question on StackOverflow about why os.walk() is slow and pointers on how to fix it (this inspired the author of this PEP early on)
- BetterWalk, this PEP's author's previous attempt at this, on which the scandir code is based
References
| [1] | https://github.com/benhoyt/scandir#benchmarks |
| [2] | http://bugs.python.org/issue11406 |
| [3] | https://github.com/benhoyt/scandir |
| [4] | https://github.com/benhoyt/scandir/issues/12 |
| [5] | https://pypi.python.org/pypi/scandir |
| [6] | https://mail.python.org/pipermail/python-dev/2014-June/135232.html |
Copyright
This document has been placed in the public domain.
pep-0472 Support for indexing with keyword arguments
| PEP: | 472 |
|---|---|
| Title: | Support for indexing with keyword arguments |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Stefano Borini, Joseph Martinot-Lagarde |
| Discussions-To: | python-ideas at python.org |
| Status: | Draft |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 24-Jun-2014 |
| Python-Version: | 3.6 |
| Post-History: | 02-Jul-2014 |
Contents
Abstract
This PEP proposes an extension of the indexing operation to support keyword arguments. Notations in the form a[K=3, R=2] would become legal syntax. For future-proofing considerations, notations such as a[1:2, K=3, R=4] are considered and may be allowed as well, depending on the choice of implementation. In addition to a change in the parser, the index protocol (__getitem__, __setitem__ and __delitem__) will also potentially require adaptation.
Motivation
The indexing syntax carries a strong semantic content, differentiating it from a method call: it implies referring to a subset of data. We believe this semantic association to be important, and wish to expand the strategies allowed to refer to this data.
As a general observation, the number of indices needed by an indexing operation depends on the dimensionality of the data: one-dimensional data (e.g. a list) requires one index (e.g. a[3]), two-dimensional data (e.g. a matrix) requires two indices (e.g. a[2,3]) and so on. Each index is a selector along one of the axes of the dimensionality, and the position in the index tuple is the metainformation needed to associate each index to the corresponding axis.
The current Python syntax focuses exclusively on position to express the association to the axes, and also contains syntactic sugar to refer to non-punctiform selection (slices):
>>> a[3]       # returns the fourth element of a
>>> a[1:10:2]  # slice notation (extract a non-trivial data subset)
>>> a[3,2]     # multiple indexes (for multidimensional arrays)
The additional notation proposed in this PEP would allow notations involving keyword arguments in the indexing operation, e.g.
>>> a[K=3, R=2]
which would allow referring to axes by conventional names.
One must additionally consider the extended form that allows both positional and keyword specification:
>>> a[3,R=3,K=4]
This PEP will explore different strategies to enable the use of these notations.
Use cases
The following practical use cases present two broad categories of usage for a keyword specification: indexing and contextual options. For indexing:
To provide a more communicative meaning to the index, preventing e.g. accidental inversion of indexes
>>> gridValues[x=3, y=5, z=8]
>>> rain[time=0:12, location=location]
In some domains, such as computational physics and chemistry, a notation such as Basis[Z=5] is a Domain Specific Language notation used to represent a level of accuracy:
>>> low_accuracy_energy = computeEnergy(molecule, BasisSet[Z=3])
In this case, the index operation would return a basis set at the chosen level of accuracy (represented by the parameter Z). The reason behind an indexing is that the BasisSet object could be internally represented as a numeric table, where rows (the "coefficient" axis, hidden to the user in this example) are associated to individual elements (e.g. rows 0:5 contain coefficients for element 1, rows 5:8 coefficients for element 2) and each column is associated to a given degree of accuracy ("accuracy" or "Z" axis), so that the first column is low accuracy, the second column is medium accuracy and so on. With that indexing, the user would obtain another object representing the contents of the column of the internal table for accuracy level 3.
Additionally, the keyword specification can be used as an option contextual to the indexing. Specifically:
A "default" option allows specifying a default return value when the index is not present
>>> lst = [1, 2, 3]
>>> value = lst[5, default=0]  # value is 0
For a sparse dataset, to specify an interpolation strategy to infer a missing point from e.g. its surrounding data.
>>> value = array[1, 3, interpolate=spline_interpolator]
A unit could be specified with the same mechanism
>>> value = array[1, 3, unit="degrees"]
How the notation is interpreted is up to the implementing class.
Current implementation
Currently, the indexing operation is handled by the methods __getitem__, __setitem__ and __delitem__. These methods accept one argument for the index (with __setitem__ accepting an additional argument for the value to set). In the following, we will analyze __getitem__(self, idx) exclusively, with the same considerations implied for the remaining two methods.
When an indexing operation is performed, __getitem__(self, idx) is called. Traditionally, the full content between square brackets is turned into a single object passed to argument idx:
- When a single element is passed, e.g. a[2], idx will be 2.
- When multiple elements are passed, they must be separated by commas: a[2, 3]. In this case, idx will be a tuple (2, 3). With a[2, 3, "hello", {}] idx will be (2, 3, "hello", {}).
- A slicing notation e.g. a[2:10] will produce a slice object, or a tuple containing slice objects if multiple values were passed.
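These rules can be verified today with a minimal probe class (our own illustration):

```python
class Probe:
    # returns whatever the interpreter passes as idx
    def __getitem__(self, idx):
        return idx

p = Probe()
assert p[2] == 2                        # single element -> the element
assert p[2, 3] == (2, 3)                # comma-separated -> tuple
assert p[2:10] == slice(2, 10)          # slice notation -> slice object
assert p[2:10, 3] == (slice(2, 10), 3)  # mixed -> tuple containing a slice
```
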
Except for its unique ability to handle slice notation, the indexing operation has similarities to a plain method call: it acts like one when invoked with only one element; if the number of elements is greater than one, the idx argument behaves like a *args. However, as stated in the Motivation section, an indexing operation carries the strong semantic implication of extracting a subset out of a larger set, which is not automatically associated with a regular method call unless appropriate naming is chosen. Moreover, its distinct visual style is important for readability.
Specifications
The implementation should try to preserve the current signature for __getitem__, or modify it in a backward-compatible way. We will present different alternatives, taking into account the possible cases that need to be addressed:
C0. a[1]; a[1,2]      # Traditional indexing
C1. a[Z=3]
C2. a[Z=3, R=4]
C3. a[1, Z=3]
C4. a[1, Z=3, R=4]
C5. a[1, 2, Z=3]
C6. a[1, 2, Z=3, R=4]
C7. a[1, Z=3, 2, R=4] # Interposed ordering
Strategy "Strict dictionary"
This strategy acknowledges that __getitem__ is special in accepting only one object, and the nature of that object must be non-ambiguous in its specification of the axes: it can be either by order, or by name. As a result of this assumption, in the presence of keyword arguments the passed entity is a dictionary, and all labels must be specified.
C0. a[1]; a[1,2] -> idx = 1; idx = (1, 2)
C1. a[Z=3] -> idx = {"Z": 3}
C2. a[Z=3, R=4] -> idx = {"Z": 3, "R": 4}
C3. a[1, Z=3] -> raise SyntaxError
C4. a[1, Z=3, R=4] -> raise SyntaxError
C5. a[1, 2, Z=3] -> raise SyntaxError
C6. a[1, 2, Z=3, R=4] -> raise SyntaxError
C7. a[1, Z=3, 2, R=4] -> raise SyntaxError
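Since the proposed syntax does not exist yet, the receiving side of this strategy can only be exercised today by passing the dictionary explicitly; a sketch (class name and behaviour are our own illustration):

```python
class Axes:
    # receiving side under the "strict dictionary" strategy: idx is a
    # value/tuple for positional access, or a dict for access by name --
    # a[Z=3, R=4] would arrive exactly like a[{"Z": 3, "R": 4}]
    def __getitem__(self, idx):
        if isinstance(idx, dict):
            return sorted(idx.items())  # axes selected by name
        return idx                      # axes selected by position

a = Axes()
assert a[1, 2] == (1, 2)                            # C0
assert a[{"Z": 3}] == [("Z", 3)]                    # what a[Z=3] would pass
assert a[{"Z": 3, "R": 4}] == [("R", 4), ("Z", 3)]  # what a[Z=3, R=4] would pass
```
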
Pros
- Strong conceptual similarity between the tuple case and the dictionary case. In the first case, we are specifying a tuple, so we are naturally defining a plain set of values separated by commas. In the second, we are specifying a dictionary, so we are specifying a homogeneous set of key/value pairs, as in dict(Z=3, R=4);
- Simple and easy to parse on the __getitem__ side: if it gets a tuple, determine the axes using positioning. If it gets a dictionary, use the keywords.
- C interface does not need changes.
Neutral
- Degeneracy of a[{"Z": 3, "R": 4}] with a[Z=3, R=4] means the notation is syntactic sugar.
Strategy "mixed dictionary"
This strategy relaxes the above constraint to return a dictionary containing both numbers and strings as keys.
C0. a[1]; a[1,2] -> idx = 1; idx = (1, 2)
C1. a[Z=3] -> idx = {"Z": 3}
C2. a[Z=3, R=4] -> idx = {"Z": 3, "R": 4}
C3. a[1, Z=3] -> idx = { 0: 1, "Z": 3}
C4. a[1, Z=3, R=4] -> idx = { 0: 1, "Z": 3, "R": 4}
C5. a[1, 2, Z=3] -> idx = { 0: 1, 1: 2, "Z": 3}
C6. a[1, 2, Z=3, R=4] -> idx = { 0: 1, 1: 2, "Z": 3, "R": 4}
C7. a[1, Z=3, 2, R=4] -> idx = { 0: 1, "Z": 3, 2: 2, "R": 4}
Pros
- Opens for mixed cases.
Cons
- Destroys ordering information for string keys. We have no way of saying if "Z" in C7 was in position 1 or 3.
- Implies switching from a tuple to a dict as soon as one specified index has a keyword argument. May be confusing to parse.
Strategy "named tuple"
Return a named tuple for idx instead of a tuple. Keyword arguments would obviously have their stated name as key, and positional arguments would have an underscore followed by their order:
C0. a[1]; a[1,2] -> idx = 1; idx = (_0=1, _1=2)
C1. a[Z=3] -> idx = (Z=3)
C2. a[Z=3, R=2] -> idx = (Z=3, R=2)
C3. a[1, Z=3] -> idx = (_0=1, Z=3)
C4. a[1, Z=3, R=2] -> idx = (_0=1, Z=3, R=2)
C5. a[1, 2, Z=3] -> idx = (_0=1, _1=2, Z=3)
C6. a[1, 2, Z=3, R=4] -> (_0=1, _1=2, Z=3, R=4)
C7. a[1, Z=3, 2, R=4] -> (_0=1, Z=3, _1=2, R=4)
or (_0=1, Z=3, _2=2, R=4)
or raise SyntaxError
The required typename of the namedtuple could be Index or the name of the argument in the function definition. This approach keeps the ordering and is easy to analyze using the _fields attribute. It is backward compatible, provided that C0 with more than one entry now passes a namedtuple instead of a plain tuple.
Pros
- Looks nice. namedtuple transparently replaces tuple and gracefully degrades to the old behavior.
- Does not require a change in the C interface
Cons
- According to some sources [4], namedtuple is not well developed. Including it as such an important object would probably require rework and improvement;
- The namedtuple fields, and thus the type, will have to change according to the passed arguments. This can be a performance bottleneck, and makes it impossible to guarantee that two subsequent index accesses get the same Index class;
- the _n "magic" fields are a bit unusual, but IPython already uses them for result history.
- Python currently has no builtin namedtuple. The current one is available in the "collections" module in the standard library.
- Unlike a function, the two notations gridValues[x=3, y=5, z=8] and gridValues[3,5,8] would not gracefully match if the order is modified at call time (e.g. we ask for gridValues[y=5, z=8, x=3]). In a function, we can pre-define argument names so that keyword arguments are properly matched. Not so in __getitem__, which leaves the task of interpreting and matching to __getitem__ itself.
Strategy "New argument contents"
In the current implementation, when many arguments are passed to __getitem__, they are grouped in a tuple and this tuple is passed to __getitem__ as the single argument idx. This strategy keeps the current signature, but expands the range of variability in type and contents of idx to more complex representations.
We identify four possible ways to implement this strategy:
- P1: uses a single dictionary for the keyword arguments.
- P2: uses individual single-item dictionaries.
- P3: similar to P2, but replaces single-item dictionaries with a (key, value) tuple.
- P4: similar to P2, but uses a special and additional new object: keyword()
Some of these possibilities lead to degenerate notations, i.e. indistinguishable from an already possible representation. Once again, the proposed notation becomes syntactic sugar for these representations.
Under this strategy, the old behavior for C0 is unchanged.
C0: a[1] -> idx = 1 # integer
a[1,2] -> idx = (1,2) # tuple
In C1, we can use either a dictionary or a tuple to represent the key/value pair for the specific indexing entry. For P3, we need a tuple containing a tuple in C1, because otherwise we could not differentiate a["Z", 3] from a[Z=3].
C1: a[Z=3] -> idx = {"Z": 3} # P1/P2 dictionary with single key
or idx = (("Z", 3),) # P3 tuple of tuples
or idx = keyword("Z", 3) # P4 keyword object
As you can see, notation P1/P2 implies that a[Z=3] and a[{"Z": 3}] will call __getitem__ passing the exact same value, and is therefore syntactic sugar for the latter. The same situation occurs, although with a different index, for P3. Using a keyword object as in P4 would remove this degeneracy.
For the C2 case:
C2. a[Z=3, R=4] -> idx = {"Z": 3, "R": 4} # P1 dictionary/ordereddict
or idx = ({"Z": 3}, {"R": 4}) # P2 tuple of two single-key dict
or idx = (("Z", 3), ("R", 4)) # P3 tuple of tuples
or idx = (keyword("Z", 3),
keyword("R", 4) ) # P4 keyword objects
P1 naturally maps to the traditional **kwargs behavior; however, it breaks the convention that two or more entries for the index produce a tuple. P2 preserves this behavior, and additionally preserves the order. Preserving the order would also be possible with an OrderedDict, as drafted by PEP-468 [5].
The remaining cases are here shown:
C3. a[1, Z=3] -> idx = (1, {"Z": 3}) # P1/P2
or idx = (1, ("Z", 3)) # P3
or idx = (1, keyword("Z", 3)) # P4
C4. a[1, Z=3, R=4] -> idx = (1, {"Z": 3, "R": 4}) # P1
or idx = (1, {"Z": 3}, {"R": 4}) # P2
or idx = (1, ("Z", 3), ("R", 4)) # P3
or idx = (1, keyword("Z", 3),
keyword("R", 4)) # P4
C5. a[1, 2, Z=3] -> idx = (1, 2, {"Z": 3}) # P1/P2
or idx = (1, 2, ("Z", 3)) # P3
or idx = (1, 2, keyword("Z", 3)) # P4
C6. a[1, 2, Z=3, R=4] -> idx = (1, 2, {"Z":3, "R": 4}) # P1
or idx = (1, 2, {"Z": 3}, {"R": 4}) # P2
or idx = (1, 2, ("Z", 3), ("R", 4)) # P3
or idx = (1, 2, keyword("Z", 3),
keyword("R", 4)) # P4
C7. a[1, Z=3, 2, R=4] -> idx = (1, 2, {"Z": 3, "R": 4}) # P1. Pack the keyword arguments. Ugly.
or raise SyntaxError # P1. Same behavior as in function calls.
or idx = (1, {"Z": 3}, 2, {"R": 4}) # P2
or idx = (1, ("Z", 3), 2, ("R", 4)) # P3
or idx = (1, keyword("Z", 3),
2, keyword("R", 4)) # P4
Pros
- Signature is unchanged;
- P2/P3 can preserve ordering of keyword arguments as specified at indexing,
- P1 needs an OrderedDict, but would destroy interposed ordering if allowed: all keyword indexes would be dumped into the dictionary;
- Stays within traditional types: tuples and dicts, possibly OrderedDict;
- Some proposed strategies are similar in behavior to a traditional function call;
- The C interface for PyObject_GetItem and family would remain unchanged.
Cons
- Apparently complex and wasteful;
- Degeneracy in notation (e.g. a[Z=3] and a[{"Z":3}] are equivalent and indistinguishable notations at the __[get|set|del]item__ level). This behavior may or may not be acceptable.
- for P4, an additional object similar in nature to slice() is needed, but only to disambiguate the above degeneracy.
- idx type and layout seems to change depending on the whims of the caller;
- May be complex to parse what is passed, especially in the case of tuple of tuples;
- P2 creates a lot of single-key dictionaries as members of a tuple. Looks ugly. P3 would be lighter and easier to use than the tuple of dicts, and still preserves order (unlike a regular dict), but would result in clumsy extraction of keywords.
Strategy "kwargs argument"
__getitem__ accepts an optional **kwargs argument, which should be keyword-only. idx also becomes optional, to support the case where no non-keyword arguments are passed. The signature would then be one of the following:
__getitem__(self, idx)
__getitem__(self, idx, **kwargs)
__getitem__(self, **kwargs)
Applied to our cases, this would produce:
C0. a[1,2] -> idx=(1,2); kwargs={}
C1. a[Z=3] -> idx=None ; kwargs={"Z":3}
C2. a[Z=3, R=4] -> idx=None ; kwargs={"Z":3, "R":4}
C3. a[1, Z=3] -> idx=1 ; kwargs={"Z":3}
C4. a[1, Z=3, R=4] -> idx=1 ; kwargs={"Z":3, "R":4}
C5. a[1, 2, Z=3] -> idx=(1,2); kwargs={"Z":3}
C6. a[1, 2, Z=3, R=4] -> idx=(1,2); kwargs={"Z":3, "R":4}
C7. a[1, Z=3, 2, R=4] -> raise SyntaxError # in agreement to function behavior
Empty indexing a[] of course remains invalid syntax.
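The receiving side of this strategy can be sketched today by calling __getitem__ directly, since a[1, Z=3] is not legal syntax (class name is our own illustration):

```python
class Array:
    # hypothetical receiving side under the "kwargs argument" strategy;
    # we invoke __getitem__ by hand to show what the interpreter would pass
    def __getitem__(self, idx=None, **kwargs):
        return idx, kwargs

a = Array()
assert a.__getitem__((1, 2)) == ((1, 2), {})                # C0: a[1, 2]
assert a.__getitem__(Z=3) == (None, {"Z": 3})               # C1: a[Z=3]
assert a.__getitem__(1, Z=3, R=4) == (1, {"Z": 3, "R": 4})  # C4: a[1, Z=3, R=4]
```
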
Pros
- Similar to function call, evolves naturally from it;
- Use of keyword indexing with an object whose __getitem__ doesn't accept kwargs will fail in an obvious way. That's not the case for the other strategies.
Cons
- It doesn't preserve order, unless an OrderedDict is used;
- Forbids C7, but is it really needed?
- Requires a change in the C interface to pass an additional PyObject for the keyword arguments.
C interface
As briefly introduced in the previous analysis, the C interface would potentially have to change to allow the new feature. Specifically, PyObject_GetItem and related routines would have to accept an additional PyObject *kw argument for Strategy "kwargs argument". The remaining strategies would not require a change in the C function signatures, but the different nature of the passed object would potentially require adaptation.
Strategy "named tuple" would behave correctly without any change: the namedtuple factory in collections returns a subclass of tuple, meaning that the PyTuple_* functions can handle the resulting object.
Alternative Solutions
In this section, we present alternative solutions that would work around the missing feature and could make the proposed enhancement not worth implementing.
Use a method
One could keep indexing as it is, and use a traditional get() method for those cases where basic indexing is not enough. This is a good point, but as already reported in the introduction, methods have a different semantic weight from indexing, and slices cannot be used directly in method calls. Compare e.g. a[1:3, Z=2] with a.get(slice(1, 3), Z=2).
The authors however recognize this argument as compelling: the advantage in semantic expressivity of keyword-based indexing may be offset by it being a rarely used feature that does not bring enough benefit and may see limited adoption.
Emulate requested behavior by abusing the slice object
This extremely creative method exploits the behavior of slice objects, provided that one accepts using strings (or properly named placeholder objects) for the keys, and using ":" instead of "=".
>>> a["K":3]
slice('K', 3, None)
>>> a["K":3, "R":4]
(slice('K', 3, None), slice('R', 4, None))
>>>
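A sketch of the receiving side of this workaround (class name is our own illustration):

```python
class Table:
    # slice-abuse workaround: a["K":3] hands __getitem__ slice("K", 3, None),
    # which we unpack into key/value pairs
    def __getitem__(self, idx):
        if isinstance(idx, slice):
            idx = (idx,)  # normalize the single-key case
        return {s.start: s.stop for s in idx}

t = Table()
assert t["K":3] == {"K": 3}
assert t["K":3, "R":4] == {"K": 3, "R": 4}
```
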
While clearly smart, this approach does not allow easy inspection of the key/value pairs, it's too clever and esoteric, and it does not allow passing a slice as in a[K=1:10:2].
However, Tim Delaney comments
"I really do think that a[b=c, d=e] should just be syntax sugar for a['b':c, 'd':e]. It's simple to explain, and gives the greatest backwards compatibility. In particular, libraries that already abused slices in this way will just continue to work with the new syntax."
We think this behavior would produce inconvenient results. The library Pandas uses strings as labels, allowing notation such as
>>> a[:, "A":"F"]
to extract data from column "A" to column "F". Under the above comment, this notation would be equally obtained with
>>> a[:, A="F"]
which is weird and collides with the intended meaning of keywords in indexing, that is, specifying the axis through conventional names rather than positioning.
Pass a dictionary as an additional index
>>> a[1, 2, {"K": 3}]
This notation, although less elegant, can already be used and achieves similar results. It's evident that the proposed Strategy "New argument contents" can be interpreted as syntactic sugar for this notation.
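A sketch of a class consuming this dict-as-last-index convention (class name and behaviour are our own illustration):

```python
class Grid:
    # a dict as the last index element stands in for keyword arguments,
    # e.g. g[1, 2, {"K": 3}]; the positional part and the keyword part
    # are returned separately here just to show the split
    def __getitem__(self, idx):
        if isinstance(idx, tuple) and idx and isinstance(idx[-1], dict):
            return idx[:-1], idx[-1]
        return (idx,), {}

g = Grid()
assert g[1, 2, {"K": 3}] == ((1, 2), {"K": 3})
assert g[5] == ((5,), {})
```
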
Additional Comments
Commenters also expressed the following relevant points:
Relevance of ordering of keyword arguments
As part of the discussion of this PEP, it's important to decide whether the ordering information of the keyword arguments is important, and whether indexes and keys can be ordered in an arbitrary way (e.g. a[1, Z=3, 2, R=4]). PEP-468 [5] tries to address the first point by proposing the use of an ordered dictionary; however, one would be inclined to accept that keyword arguments in indexing are equivalent to kwargs in function calls, and therefore as of today equally unordered, and with the same restrictions.
Need for homogeneity of behavior
Relative to Strategy "New argument contents", a comment from Ian Cordasco points out that
"it would be unreasonable for just one method to behave totally differently from the standard behaviour in Python. It would be confusing for only __getitem__ (and ostensibly, __setitem__) to take keyword arguments but instead of turning them into a dictionary, turn them into individual single-item dictionaries." We agree with his point; however, it must be pointed out that __getitem__ is already special in some regards when it comes to passed arguments.
Chris Angelico also states:
"it seems very odd to start out by saying "here, let's give indexing the option to carry keyword args, just like with function calls", and then come back and say "oh, but unlike function calls, they're inherently ordered and carried very differently"." Again, we agree on this point. The most straightforward strategy to keep homogeneity would be Strategy "kwargs argument", opening to a **kwargs argument on __getitem__.
One of the authors (Stefano Borini) thinks that only the "strict dictionary" strategy is worth implementing. It is non-ambiguous, simple, does not force complex parsing, and addresses the problem of referring to axes either by position or by name. The "options" use case is probably best handled with a different approach, and may be irrelevant for this PEP. The alternative "named tuple" is another valid choice.
Having .get() become obsolete for indexing with default fallback
Introducing a "default" keyword could make dict.get() obsolete, which would be replaced by d["key", default=3]. Chris Angelico however states:
"Currently, you need to write __getitem__ (which raises an exception on finding a problem) plus something else, e.g. get(), which returns a default instead. By your proposal, both branches would go inside __getitem__, which means they could share code; but there still need to be two branches."
Additionally, Chris continues:
"There'll be an ad-hoc and fairly arbitrary puddle of names (some will go default=, others will say that's way too long and go def=, except that that's a keyword so they'll use dflt= or something...), unless there's a strong force pushing people to one consistent name.".
This argument is valid but it's equally valid for any function call, and is generally fixed by established convention and documentation.
On degeneracy of notation
User Drekin commented: "The case of a[Z=3] and a[{"Z": 3}] is similar to current a[1, 2] and a[(1, 2)]. Even though one may argue that the parentheses are actually not part of tuple notation but are just needed because of syntax, it may look as degeneracy of notation when compared to function call: f(1, 2) is not the same thing as f((1, 2)).".
References
| [1] | "keyword-only args in __getitem__" (http://article.gmane.org/gmane.comp.python.ideas/27584) |
| [2] | "Accepting keyword arguments for __getitem__" (https://mail.python.org/pipermail/python-ideas/2014-June/028164.html) |
| [3] | "PEP pre-draft: Support for indexing with keyword arguments" (https://mail.python.org/pipermail/python-ideas/2014-July/028250.html) |
| [4] | "namedtuple is not as good as it should be" (https://mail.python.org/pipermail/python-ideas/2013-June/021257.html) |
| [5] | "Preserving the order of **kwargs in a function" (http://legacy.python.org/dev/peps/pep-0468/) |
Copyright
This document has been placed in the public domain.
pep-0473 Adding structured data to built-in exceptions
| PEP: | 473 |
|---|---|
| Title: | Adding structured data to built-in exceptions |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Sebastian Kreft <skreft at deezer.com> |
| Status: | Draft |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 29-Mar-2014 |
| Post-History: |
Contents
Abstract
Exceptions like AttributeError, IndexError, KeyError, LookupError, NameError, TypeError, and ValueError do not provide all the information required by programmers to debug and better understand what caused them. Furthermore, in some cases the messages even have slightly different formats, which makes it really difficult for tools to automatically provide additional information to diagnose the problem. To tackle the former and to lay the ground for the latter, it is proposed to expand these exceptions so that they hold both the offending and affected entities.
Rationale
The main issue this PEP aims to solve is the fact that currently error messages are not that expressive and lack some key information to resolve the exceptions. Additionally, the information present on the error message is not always in the same format, which makes it very difficult for third-party libraries to provide automated diagnosis of the error.
These automated tools could, for example, detect typos or display or log extra debug information. These could be particularly useful when running tests or in a long running application.
Although it is in theory possible to have such libraries, they need to resort to hacks in order to achieve the goal. One such example is python-improved-exceptions [1], which modifies the byte-code to keep references to the possibly interesting objects and also parses the error messages to extract information like types or names. Unfortunately, such an approach is extremely fragile and not portable.
A similar proposal [2] has been implemented for ImportError, and in the same fashion this idea has received support [3]. Additionally, almost 10 years ago Guido asked in [11] for a clean API to access the affected objects in exceptions like KeyError, AttributeError, NameError, and IndexError. Similar issues and proposal ideas have been raised over the last year. Other issues have been created as well, but despite receiving support they were eventually abandoned. References to the created issues are listed below:
- AttributeError: [11], [10], [5], [4], [3]
- IndexError: [11], [6], [3]
- KeyError: [11], [7], [3]
- LookupError: [11]
- NameError: [11], [10], [3]
- TypeError: [8]
- ValueError: [9]
To move forward with the development and to centralize the information and discussion, this PEP aims to be a meta-issue summarizing all the above discussions and ideas.
Examples
IndexError
The error message references neither the list's length nor the index used.
a = [1, 2, 3, 4, 5]
a[5]
IndexError: list index out of range
KeyError
By convention the key is the first element of the error's arguments, but there is no other information regarding the affected dictionary (key types, size, etc.)
b = {'foo': 1}
b['fo']
KeyError: 'fo'
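The convention can be seen directly: the key is recoverable from the exception's args today, but the affected dictionary is not, which is exactly the gap this proposal addresses.

```python
# The key travels as args[0] by convention, but nothing about the
# affected dictionary (its keys, size, ...) is available.
b = {'foo': 1}
try:
    b['fo']
except KeyError as e:
    recovered_key = e.args[0]
```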
AttributeError
The object's type and the offending attribute are part of the error message. However, there are some different formats and the information is not always available. Furthermore, although the object type is useful in some cases, given the dynamic nature of Python, it would be much more useful to have a reference to the object itself. Additionally the reference to the type is not fully qualified and in some cases the type is just too generic to provide useful information, for example in case of accessing a module's attribute.
c = object()
c.foo
AttributeError: 'object' object has no attribute 'foo'

import string
string.foo
AttributeError: 'module' object has no attribute 'foo'

a = string.Formatter()
a.foo
AttributeError: 'Formatter' object has no attribute 'foo'
NameError
The error message typically provides the name.
foo = 1
fo
NameError: global name 'fo' is not defined
Other Cases
Issues are even harder to debug when the target object is the result of another expression, for example:
a[b[c[0]]]
This issue is also related to the fact that opcodes only have line number information and not the offset. This proposal would help in this case but not as much as having offsets.
Proposal
Extend the exceptions AttributeError, IndexError, KeyError, LookupError, NameError, TypeError, and ValueError with the following:
- AttributeError: target (w), attribute
- IndexError: target (w), key (w), index (just an alias to key)
- KeyError: target (w), key (w)
- LookupError: target (w), key (w)
- NameError: name, scope?
- TypeError: unexpected_type
- ValueError: unexpected_value (w)
Attributes marked with (w) may need to be weak references [12] to prevent reference cycles. However, this may add unnecessary extra complexity, as noted by R. David Murray [13]. This is especially true given that built-in types do not support being weakly referenced.
TODO(skreft): expand this with examples of corner cases.
To remain backwards compatible, these new attributes will be optional and keyword-only.
It is proposed to add this information, rather than just improve the error messages, as the former would enable new debugging frameworks and tools, and would also allow switching to lazily generated messages in the future. Generated messages are discussed in [2], although they are not implemented at the moment. They would not only save some resources, but also make the messages uniform.
The stdlib will then be gradually changed to start using these new attributes.
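A minimal sketch of what the proposed extension could look like for KeyError; the class name and constructor are illustrative only (the PEP proposes adding the keyword-only attributes to the built-in exception itself, not a subclass):

```python
# Hypothetical sketch -- `target` and `key` follow the attribute names
# proposed above, but this subclass is not a proposed API.
class StructuredKeyError(KeyError):
    def __init__(self, *args, target=None, key=None):
        super().__init__(*args)
        self.target = target  # proposed: the affected container (possibly weakly referenced)
        self.key = key        # proposed: the offending key

b = {'foo': 1}
try:
    raise StructuredKeyError('fo', target=b, key='fo')
except StructuredKeyError as e:
    failed_key = e.key
    container = e.target
```

Because the new arguments are keyword-only and default to None, existing code that raises or catches KeyError keeps working unchanged.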
Potential Uses
An automated tool could, for example, search for similar keys within the object, allowing it to display the following:
a = {'foo': 1}
a['fo']
KeyError: 'fo'. Did you mean 'foo'?
foo = 1
fo
NameError: global name 'fo' is not defined. Did you mean 'foo'?
See [3] for the output a TestRunner could display.
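The kind of "Did you mean" diagnosis shown above becomes straightforward once the affected container travels on the exception. A minimal sketch using the standard library's difflib (the `suggest` helper is illustrative, not a proposed API):

```python
import difflib

# Given the offending key and the affected container's keys, propose
# the closest match, as a debugging tool could do automatically.
def suggest(key, candidates):
    matches = difflib.get_close_matches(key, list(candidates), n=1)
    return matches[0] if matches else None

a = {'foo': 1}
hint = suggest('fo', a.keys())
```

A tool catching the extended KeyError could then append "Did you mean 'foo'?" to the message it displays or logs.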
Performance
Filling these new attributes would only require two extra parameters with data already available, so the impact should be marginal. However, special care may be needed for KeyError, as the following pattern is already widespread.
try:
    a[foo] = a[foo] + 1
except:
    a[foo] = 0
Note as well that storing these objects into the error itself would allow the lazy generation of the error message, as discussed in [2].
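Lazy message generation, as discussed in [2], could be sketched as follows: by storing the objects on the exception, the message string is only built if and when it is actually displayed, so the widespread catch-and-ignore pattern above pays nothing for formatting. Names here are illustrative, not a proposed API.

```python
# Hypothetical sketch of a lazily generated error message.
class LazyKeyError(KeyError):
    def __init__(self, key, target):
        super().__init__(key)
        self.key = key
        self.target = target

    def __str__(self):
        # Built on demand: the fast path (catch and discard) never
        # runs this formatting code.
        return "%r not found among %d keys" % (self.key, len(self.target))

err = LazyKeyError('fo', {'foo': 1})
```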
References
| [1] | Python Exceptions Improved (https://www.github.com/sk-/python-exceptions-improved) |
| [2] | ImportError needs attributes for module and file name (http://bugs.python.org/issue1559549) |
| [3] | Enhance exceptions by attaching some more information to them (https://mail.python.org/pipermail/python-ideas/2014-February/025601.html) |
| [4] | Specifity in AttributeError (https://mail.python.org/pipermail/python-ideas/2013-April/020308.html) |
| [5] | Add an 'attr' attribute to AttributeError (http://bugs.python.org/issue18156) |
| [6] | Add index attribute to IndexError (http://bugs.python.org/issue18162) |
| [7] | Add a 'key' attribute to KeyError (http://bugs.python.org/issue18163) |
| [8] | Add 'unexpected_type' to TypeError (http://bugs.python.org/issue18165) |
| [9] | 'value' attribute for ValueError (http://bugs.python.org/issue18166) |
| [10] | making builtin exceptions more informative (http://bugs.python.org/issue1182143) |
| [11] | LookupError etc. need API to get the key (http://bugs.python.org/issue614557) |
| [12] | weakref - Weak References (https://docs.python.org/3/library/weakref.html) |
| [13] | Message by R. David Murray: Weak refs on exceptions? (http://bugs.python.org/issue18163#msg190791) |
Copyright
This document has been placed in the public domain.
pep-0474 Creating forge.python.org
| PEP: | 474 |
|---|---|
| Title: | Creating forge.python.org |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Nick Coghlan <ncoghlan at gmail.com> |
| Status: | Draft |
| Type: | Process |
| Content-Type: | text/x-rst |
| Created: | 19-Jul-2014 |
| Post-History: | 19-Jul-2014, 08-Jan-2015, 01-Feb-2015 |
Contents
Abstract
This PEP proposes setting up a new PSF provided resource, forge.python.org, as a location for maintaining various supporting repositories (such as the repository for Python Enhancement Proposals) in a way that is more accessible to new contributors, and easier to manage for core developers.
This PEP does not propose any changes to the core development workflow for CPython itself (see PEP 462 in relation to that).
Proposal
This PEP proposes that an instance of the self-hosted Kallithea code repository management system be deployed as "forge.python.org".
Individual repositories (such as the developer guide or the PEPs repository) may then be migrated from the existing hg.python.org infrastructure to the new forge.python.org infrastructure on a case-by-case basis. Each migration will need to decide whether to retain a read-only mirror on hg.python.org, or whether to just migrate wholesale to the new location.
In addition to supporting read-only mirrors on hg.python.org, forge.python.org will also aim to support hosting mirrors on popular proprietary hosting sites like GitHub and BitBucket. The aim will be to allow users familiar with these sites to submit and discuss pull requests using their preferred workflow, with forge.python.org automatically bringing those contributions over to the master repository.
Given the availability and popularity of commercially backed "free for open source projects" repository hosting services, this would not be a general purpose hosting site for arbitrary Python projects. The initial focus will be specifically on CPython and other repositories currently hosted on hg.python.org. In the future, this could potentially be expanded to consolidating other PSF managed repositories that are currently externally hosted to gain access to a pull request based workflow, such as the repository for the python.org Django application. As with the initial migrations, any such future migrations would be considered on a case-by-case basis, taking into account the preferences of the primary users of each repository.
Rationale
Currently, hg.python.org hosts more than just the core CPython repository; it also hosts other repositories such as those for the CPython developer guide and for Python Enhancement Proposals, along with various "sandbox" repositories for core developer experimentation.
While the simple "pull request" style workflow made popular by code hosting sites like GitHub and BitBucket isn't adequate for the complex branching model needed for parallel maintenance and development of the various CPython releases, it's a good fit for several of the ancillary projects that surround CPython that we don't wish to move to a proprietary hosting site.
The key requirements proposed for a PSF provided software forge are:
- MUST support simple "pull request" style workflows
- MUST support online editing for simple changes
- MUST be backed by an active development organisation (community or commercial)
Additional recommended requirements that are satisfied by this proposal, but may be negotiable if a sufficiently compelling alternative is presented:
- SHOULD support self-hosting on PSF infrastructure without ongoing fees
- SHOULD be a fully open source application written in Python
- SHOULD support Mercurial (for consistency with existing tooling)
- SHOULD support Git (to provide that option to users that prefer it)
- SHOULD allow users of git and Mercurial clients to transparently collaborate on the same repository
- SHOULD be open to customisation to meet the needs of CPython core development, including providing a potential path forward for the proposed migration to a core reviewer model in PEP 462
The preference for self-hosting without ongoing fees rules out the free-as-in-beer providers like GitHub and BitBucket, in addition to the various proprietary source code management offerings.
The preference for Mercurial support not only rules out GitHub, but also other Git-only solutions like GitLab and Gitorious.
The hard requirement for online editing support rules out the Apache Allura/HgForge combination.
The preference for a fully open source solution rules out RhodeCode.
Of the various options considered by the author of this proposal, that leaves Kallithea SCM as the proposed foundation for a forge.python.org service.
Kallithea is a fully GPLv3-licensed application (derived from the clearly and unambiguously GPLv3 licensed components of RhodeCode) that is being developed under the auspices of the Software Freedom Conservancy. The Conservancy has affirmed that the Kallithea codebase is completely and validly licensed under GPLv3. In addition to their role in building the initial Kallithea community, the Conservancy is also the legal home of both the Mercurial and Git projects. Other SFC member projects that may be familiar to Python users include Twisted, Gevent, BuildBot and PyPy.
Intended Benefits
The primary benefit of deploying Kallithea as forge.python.org is that supporting repositories such as the developer guide and the PEP repo could potentially be managed using pull requests and online editing. This would be much simpler than the current workflow which requires PEP editors and other core developers to act as intermediaries to apply updates suggested by other users.
The richer administrative functionality would also make it substantially easier to grant users access to particular repositories for collaboration purposes, without having to grant them general access to the entire installation. This helps lower barriers to entry, as trust can more readily be granted and earned incrementally, rather than being an all-or-nothing decision around granting core developer access.
Sustaining Engineering Considerations
Even with its current workflow, CPython itself remains one of the largest open source projects in the world (in the top 2% of projects tracked on OpenHub). Unfortunately, we have been significantly less effective at encouraging contributions to the projects that make up CPython's workflow infrastructure, including ensuring that our installations track upstream, and that wherever feasible, our own customisations are contributed back to the original project.
As such, a core component of this proposal is to actively engage with the upstream Kallithea community to lower the barriers to working with and on the Kallithea SCM, as well as with the PSF Infrastructure team to ensure the forge.python.org service integrates cleanly with the PSF's infrastructure automation.
This approach aims to provide a number of key benefits:
- allowing those of us contributing to maintenance of this service to be as productive as possible in the time we have available
- offering a compelling professional development opportunity to those volunteers that choose to participate in maintenance of this service
- making the Kallithea project itself more attractive to other potential users by making it as easy as possible to adopt, deploy and manage
- as a result of the above benefits, attracting sufficient contributors both in the upstream Kallithea community, and within the CPython infrastructure community, to allow the forge.python.org service to evolve effectively to meet changing developer expectations
Some initial steps have already been taken to address these sustaining engineering concerns:
- Tymoteusz Jankowski has been working with Donald Stufft to work out what would be involved in deploying Kallithea using the PSF's Salt based infrastructure automation.
- Graham Dumpleton and I have been working on making it easy to deploy demonstration Kallithea instances to the free tier of Red Hat's open source hosting service, OpenShift Online. (See the comments on that post, or the quickstart issue tracker for links to Graham's follow-on work)
The next major step to be undertaken is to come up with a local development workflow that allows contributors on Windows, Mac OS X and Linux to run the Kallithea tests locally, without interfering with the operation of their own system. The currently planned approach for this is to focus on Vagrant, which is a popular automated virtual machine management system specifically aimed at developers running local VMs for testing purposes. The Vagrant based development guidelines for OpenShift Origin provide an extended example of the kind of workflow this approach enables. It's also worth noting that Vagrant is one of the options for working with a local build of the main python.org website.
If these workflow proposals end up working well for Kallithea, they may also be worth proposing for use by the upstream projects backing other PSF and CPython infrastructure services, including Roundup, BuildBot, and the main python.org web site.
Funding of development
As several aspects of this proposal and PEP 462 align with various workflow improvements under consideration for Red Hat's Beaker open source hardware integration testing system and other work-related projects, I have arranged to be able to devote ~1 day a week to working on CPython infrastructure projects.
Together with Rackspace's existing contributions to maintaining the pypi.python.org infrastructure, I personally believe this arrangement is indicative of a more general recognition amongst CPython redistributors and major users that there is merit in helping to sustain upstream infrastructure through direct contributions of developer time, rather than expecting volunteer contributors to maintain that infrastructure entirely in their spare time or funding it indirectly through the PSF (with the additional management overhead that would entail). I consider this a positive trend, and one that I will continue to encourage as best I can.
Personal Motivation
As of March 2015, having moved from Boeing Defence Australia (where I had worked since September 1998) to Red Hat back in June 2011, I now work for Red Hat as a software development workflow designer and process architect, focusing on the open source cross-platform Atomic Developer Bundle, which is part of the tooling ecosystem for the Project Atomic container hosting platform. Two of the key pieces of that bundle will be familiar to many readers: Docker for container management, and Vagrant for cross-platform local development VM management.
However, rather than being a developer for the downstream Red Hat Enterprise Linux Container Development Kit, I work with the development teams for a range of Red Hat's internal services, encouraging the standardisation of internal development tooling and processes on the Atomic Developer Bundle, contributing upstream as required to ensure it meets our needs and expectations. As with other Red Hat community web service development projects like PatternFly, this approach helps enable standardisation across internal services, community projects, and commercial products, while still leaving individual development teams with significant scope to appropriately prioritise their process improvement efforts by focusing on the limitations currently causing the most difficulties for them and their users.
In that role, I'll be focusing on effectively integrating the Developer Bundle with tools and technologies used across Red Hat's project and product portfolio. As Red Hat is an open source system integrator, that means touching on a wide range of services and technologies, including GitHub, GerritHub, standalone Gerrit, GitLab, Bugzilla, JIRA, Jenkins, Docker, Kubernetes, OpenShift, OpenStack, oVirt, Ansible, Puppet, and more.
However, as noted above in the section on sustaining engineering considerations, I've also secured agreement to spend a portion of my work time on similarly applying these cross-platform tools to improving the developer experience for the maintenance of Python Software Foundation infrastructure, starting with this proposal for a Kallithea-based forge.python.org service.
Between them, my day job and my personal open source engagement have given me visibility into a lot of what the popular source code management services do well and what they do poorly. While Kallithea certainly has plenty of flaws of its own, it's the one I consider most fixable from a personal perspective, as it allows me to get directly involved in tailoring it to meet the needs of the CPython core development community in a way that wouldn't be possible with a proprietary service like GitHub or BitBucket, or practical with a PHP-based service like Phabricator or a Ruby-based service like GitLab.
Technical Concerns and Challenges
Introducing a new service into the CPython infrastructure presents a number of interesting technical concerns and challenges. This section covers several of the most significant ones.
Service hosting
The default position of this PEP is that the new forge.python.org service will be integrated into the existing PSF Salt infrastructure and hosted on the PSF's Rackspace cloud infrastructure.
However, other hosting options will also be considered, in particular, possible deployment as a Kubernetes hosted web service on either Google Container Engine or the next generation of Red Hat's OpenShift Online service, by using either GCEPersistentDisk or the open source GlusterFS distributed filesystem to hold the source code repositories.
Ongoing infrastructure maintenance
Ongoing infrastructure maintenance is an area of concern within the PSF, as we currently lack a system administrator mentorship program equivalent to the Fedora Infrastructure Apprentice or GNOME Infrastructure Apprentice programs.
Instead, systems tend to be maintained largely by developers as a part time activity on top of their development related contributions, rather than seeking to recruit folks that are more interested in operations (i.e. keeping existing systems running well) than they are in development (i.e. making changes to the services to provide new features or a better user experience, or to address existing issues).
While I'd personally like to see the PSF operating such a program at some point in the future, I don't consider setting one up to be a feasible near term goal. However, I do consider it feasible to continue laying the groundwork for such a program by extending the PSF's existing usage of modern infrastructure technologies like OpenStack and Salt to cover more services, as well as starting to explore the potential benefits of containers and container platforms when it comes to maintaining and enhancing PSF provided services.
I also plan to look into the question of whether or not an open source cloud management platform like ManageIQ may help us bring our emerging "cloud sprawl" problem across Rackspace, Google, Amazon and other services more under control.
User account management
Ideally we'd like to be able to offer a single account that spans all python.org services, including Kallithea, Roundup/Rietveld, PyPI and the back end for the new python.org site, but actually implementing that would be a distinct infrastructure project, independent of this PEP. (It's also worth noting that the fine-grained control of ACLs offered by such a capability is a prerequisite for setting up an effective system administrator mentorship program)
For the initial rollout of forge.python.org, we will likely create yet another identity silo within the PSF infrastructure. A potentially superior alternative would be to add support for python-social-auth to Kallithea, but actually doing so would not be a requirement for the initial rollout of the service (the main technical concern there is that Kallithea is a Pylons application that has not yet been ported to Pyramid, so integration will require either adding a Pylons backend to python-social-auth, or else embarking on the Pyramid migration in Kallithea).
Breaking existing SSH access and links for Mercurial repositories
This PEP proposes leaving the existing hg.python.org installation alone, and setting up Kallithea on a new host. This approach minimises the risk of interfering with the development of CPython itself (and any other projects that don't migrate to the new software forge), but does make any migrations of existing repos more disruptive (since existing checkouts will break).
Integration with Roundup
Kallithea provides configurable issue tracker integration. This will need to be set up appropriately to integrate with the Roundup issue tracker at bugs.python.org before the initial rollout of the forge.python.org service.
Accepting pull requests on GitHub and BitBucket
The initial rollout of forge.python.org would support publication of read-only mirrors, both on hg.python.org and other services, as that is a relatively straightforward operation that can be implemented in a commit hook.
While a highly desirable feature, accepting pull requests on external services, and mirroring them as submissions to the master repositories on forge.python.org is a more complex problem, and would likely not be included as part of the initial rollout of the forge.python.org service.
Transparent Git and Mercurial interoperability
Kallithea's native support for both Git and Mercurial offers an opportunity to make it relatively straightforward for developers to use the client of their choice to interact with repositories hosted on forge.python.org.
This transparent interoperability does not exist yet, but running our own multi-VCS repository hosting service provides the opportunity to make this capability a reality, rather than passively waiting for a proprietary provider to deign to provide a feature that likely isn't in their commercial interest. There's a significant misalignment of incentives between open source communities and commercial providers in this particular area, as even though offering VCS client choice can significantly reduce community friction by eliminating the need for projects to make autocratic decisions that force particular tooling choices on potential contributors, top down enforcement of tool selection (regardless of developer preference) is currently still the norm in the corporate and other organisational environments that produce GitHub and Atlassian's paying customers.
Prior to acceptance, in the absence of transparent interoperability, this PEP should propose specific recommendations for inclusion in the CPython developer's guide section for git users for creating pull requests against forge.python.org hosted Mercurial repositories.
Pilot Objectives and Timeline
This proposal is part of Brett Cannon's current evaluation of improvement proposals for various aspects of the CPython development workflow. Key dates in that timeline are:
- Feb 1: Draft proposal published (for Kallithea, this PEP)
- Apr 8: Discussion of final proposals at Python Language Summit
- May 1: Brett's decision on which proposal to accept
- Sep 13: Python 3.5 released, adopting new workflows for Python 3.6
If this proposal is selected for further development, it is proposed to start with the rollout of the following pilot deployment:
- a reference implementation operational at kallithea-pilot.python.org, containing at least the developer guide and PEP repositories. This will be a "throwaway" instance, allowing core developers and other contributors to experiment freely without worrying about the long term consequences for the repository history.
- read-only live mirrors of the Kallithea hosted repositories on GitHub and BitBucket. As with the pilot service itself, these would be temporary repos, to be discarded after the pilot period ends.
- clear documentation on using those mirrors to create pull requests against Kallithea hosted Mercurial repositories (for the pilot, this will likely not include using the native pull request workflows of those hosted services)
- automatic linking of issue references in code review comments and commit messages to the corresponding issues on bugs.python.org
- draft updates to PEP 1 explaining the Kallithea based PEP editing and submission workflow
The following items would be needed for a production migration, but there doesn't appear to be an obvious way to trial an updated implementation as part of the pilot:
- adjusting the PEP publication process and the developer guide publication process to be based on the relocated Mercurial repos
The following items would be objectives of the overall workflow improvement process, but are considered "desirable, but not essential" for the initial adoption of the new service in September (if this proposal is the one selected and the proposed pilot deployment is successful):
- allowing the use of python-social-auth to authenticate against the PSF hosted Kallithea instance
- allowing the use of the GitHub and BitBucket pull request workflows to submit pull requests to the main Kallithea repo
- allowing easy triggering of forced BuildBot runs based on Kallithea hosted repos and pull requests (prior to the implementation of PEP 462, this would be intended for use with sandbox repos rather than the main CPython repo)
Future Implications for CPython Core Development
The workflow requirements for the main CPython development repository are significantly more complex than those for the repositories being discussed in this PEP. These concerns are covered in more detail in PEP 462.
Given Guido's recommendation to replace Rietveld with a more actively maintained code review system, my current plan is to rewrite that PEP to use Kallithea as the proposed glue layer, with enhanced Kallithea pull requests eventually replacing the current practice of uploading patch files directly to the issue tracker.
I've also started working with Pierre-Yves David on a custom Mercurial extension that automates some aspects of the CPython core development workflow.
Copyright
This document has been placed in the public domain.
pep-0475 Retry system calls failing with EINTR
| PEP: | 475 |
|---|---|
| Title: | Retry system calls failing with EINTR |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Charles-François Natali <cf.natali at gmail.com>, Victor Stinner <victor.stinner at gmail.com> |
| BDFL-Delegate: | Antoine Pitrou <solipsis@pitrou.net> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 29-Jul-2014 |
| Python-Version: | 3.5 |
| Resolution: | https://mail.python.org/pipermail/python-dev/2015-February/138018.html |
Abstract
System call wrappers provided in the standard library should be retried automatically when they fail with EINTR, to relieve application code from the burden of doing so.
By system calls, we mean the functions exposed by the standard C library pertaining to I/O or handling of other system resources.
Rationale
Interrupted system calls
On POSIX systems, signals are common. Code calling system calls must be prepared to handle them. Examples of signals:
- The most common signal is SIGINT, the signal sent when CTRL+c is pressed. By default, Python raises a KeyboardInterrupt exception when this signal is received.
- When running subprocesses, the SIGCHLD signal is sent when a child process exits.
- Resizing the terminal sends the SIGWINCH signal to the applications running in the terminal.
- Putting the application in the background (e.g. press CTRL-z and then type the bg command) sends the SIGCONT signal.
Writing a C signal handler is difficult: only "async-signal-safe" functions can be called (for example, printf() and malloc() are not async-signal safe), and there are issues with reentrancy. Therefore, when a signal is received by a process during the execution of a system call, the system call can fail with the EINTR error to give the program an opportunity to handle the signal without the restriction on signal-safe functions.
This behaviour is system-dependent: on certain systems, using the SA_RESTART flag, some system calls are retried automatically instead of failing with EINTR. Regardless, Python's signal.signal() function clears the SA_RESTART flag when setting the signal handler: all system calls will probably fail with EINTR in Python.
Since receiving a signal is a non-exceptional occurrence, robust POSIX code must be prepared to handle EINTR (which, in most cases, means retry in a loop in the hope that the call eventually succeeds). Without special support from Python, this can make application code much more verbose than it needs to be.
Status in Python 3.4
In Python 3.4, handling the InterruptedError exception (EINTR's dedicated exception class) is duplicated at every call site on a case by case basis. Only a few Python modules actually handle this exception, and fixes usually took several years to cover a whole module. Example of code retrying file.read() on InterruptedError:
while True:
try:
data = file.read(size)
break
except InterruptedError:
continue
List of Python modules in the standard library which handle InterruptedError:
- asyncio
- asyncore
- io, _pyio
- multiprocessing
- selectors
- socket
- socketserver
- subprocess
Other programming languages like Perl, Java and Go retry system calls failing with EINTR at a lower level, so that libraries and applications needn't bother.
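The per-call-site handling that Python 3.4 duplicates across the modules above can be factored into a small helper like this one; PEP 475 instead moves the equivalent loop into the C-level wrappers themselves (`retry_on_eintr` is illustrative, not a stdlib function):

```python
# Generic retry-on-EINTR wrapper, mirroring what each call site in the
# 3.4 stdlib had to spell out by hand.
def retry_on_eintr(func, *args, **kwargs):
    while True:
        try:
            return func(*args, **kwargs)
        except InterruptedError:
            continue
```

Usage would look like `data = retry_on_eintr(file.read, size)`, replacing the explicit loop shown earlier.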
Use Case 1: Don't Bother With Signals
In most cases, you don't want to be interrupted by signals and you don't expect to get InterruptedError exceptions. For example, do you really want to write such complex code for a "Hello World" example?
while True:
    try:
        print("Hello World")
        break
    except InterruptedError:
        continue
InterruptedError can happen in unexpected places. For example, os.close() and FileIO.close() may raise InterruptedError: see the article close() and EINTR.
The Python issues related to EINTR section below gives examples of bugs caused by EINTR.
The expectation in this use case is that Python hides the InterruptedError and retries system calls automatically.
Use Case 2: Be notified of signals as soon as possible
Sometimes, however, you expect certain signals and want to handle them as soon as possible. For example, you may want to quit a program immediately using the CTRL+c keyboard shortcut.
Besides, some signals are not interesting and should not disrupt the application. There are two options to interrupt an application on only some signals:
- Set up a custom signal handler which raises an exception, such as KeyboardInterrupt for SIGINT.
- Use an I/O multiplexing function like select() together with Python's signal wakeup file descriptor: see the function signal.set_wakeup_fd().
The expectation in this use case is for the Python signal handler to be executed promptly, and for the system call to fail if the handler raises an exception -- otherwise, the call is restarted.
Proposal
This PEP proposes to handle EINTR and retries at the lowest level, i.e. in the wrappers provided by the stdlib (as opposed to higher-level libraries and applications).
Specifically, when a system call fails with EINTR, its Python wrapper must call the given signal handler (using PyErr_CheckSignals()). If the signal handler raises an exception, the Python wrapper bails out and fails with the exception.
If the signal handler returns successfully, the Python wrapper retries the system call automatically. If the system call involves a timeout parameter, the timeout is recomputed.
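The retry-with-recomputed-timeout semantics can be sketched in pure Python (retry_on_eintr is a hypothetical helper for illustration only; the actual change lives inside the C wrappers):

```python
import time

def retry_on_eintr(syscall, timeout=None):
    """Hypothetical helper illustrating the proposed semantics."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        remaining = (None if deadline is None
                     else max(0.0, deadline - time.monotonic()))
        try:
            return syscall(remaining)
        except InterruptedError:
            # In C, this is where PyErr_CheckSignals() runs the Python
            # signal handler; if that handler raises, the exception
            # propagates instead of retrying.
            pass  # otherwise: retry with the recomputed timeout
```

Because the remaining time is recomputed on every retry, repeated signals cannot extend the overall deadline of a call with a timeout.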
Modified functions
Example of standard library functions that need to be modified to comply with this PEP:
- open(), os.open(), io.open()
- functions of the faulthandler module
- os functions:
- os.fchdir()
- os.fchmod()
- os.fchown()
- os.fdatasync()
- os.fstat()
- os.fstatvfs()
- os.fsync()
- os.ftruncate()
- os.mkfifo()
- os.mknod()
- os.posix_fadvise()
- os.posix_fallocate()
- os.pread()
- os.pwrite()
- os.read()
- os.readv()
- os.sendfile()
- os.wait3()
- os.wait4()
- os.wait()
- os.waitid()
- os.waitpid()
- os.write()
- os.writev()
- special cases: os.close() and os.dup2() now ignore the EINTR error; the syscall is not retried
- select.select(), select.poll.poll(), select.epoll.poll(), select.kqueue.control(), select.devpoll.poll()
- socket.socket() methods:
- accept()
- connect() (except for non-blocking sockets)
- recv()
- recvfrom()
- recvmsg()
- send()
- sendall()
- sendmsg()
- sendto()
- signal.sigtimedwait(), signal.sigwaitinfo()
- time.sleep()
(Note: the selectors module already retries on InterruptedError, but it doesn't recompute the timeout yet.)
os.close(), the close() methods, and os.dup2() are a special case: they will ignore EINTR instead of retrying. The reasons are complex but involve behaviour under Linux and the fact that the file descriptor may really be closed even if EINTR is returned. See these articles:
- Returning EINTR from close()
- (LKML) Re: [patch 7/7] uml: retry host close() on EINTR
- close() and EINTR
The socket.socket.connect() method does not retry connect() for non-blocking sockets if it is interrupted by a signal (fails with EINTR). The connection runs asynchronously in the background. The caller is responsible for waiting until the socket becomes writable (e.g. using select.select()) and then calling socket.socket.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR) to check whether the connection succeeded (getsockopt() returns 0) or failed.
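The non-blocking connect pattern described above can be sketched as follows (a local listening socket stands in for a remote server so the example is self-contained):

```python
import select
import socket

# Self-contained stand-in for a remote server.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)

client = socket.socket()
client.setblocking(False)
# connect_ex() returns the errno instead of raising; for a
# non-blocking socket this is typically errno.EINPROGRESS.
client.connect_ex(listener.getsockname())

# Wait until the socket becomes writable...
select.select([], [client], [], 5.0)
# ...then check whether the connection succeeded (0) or failed.
err = client.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
print("connected" if err == 0 else "connect failed: %d" % err)

client.close()
listener.close()
```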
InterruptedError handling
Since interrupted system calls are automatically retried, the InterruptedError exception should not occur anymore when calling those system calls. Therefore, manual handling of InterruptedError as described in Status in Python 3.4 can be removed, which will simplify standard library code.
Backward compatibility
Applications relying on the fact that system calls are interrupted with InterruptedError will hang. The authors of this PEP don't think that such applications exist, since they would be exposed to other issues such as race conditions (there is an opportunity for deadlock if the signal comes before the system call). Besides, such code would be non-portable.
In any case, those applications must be fixed to handle signals differently, to have a reliable behaviour on all platforms and all Python versions. A possible strategy is to set up a signal handler raising a well-defined exception, or use a wakeup file descriptor.
For applications using event loops, signal.set_wakeup_fd() is the recommended option to handle signals. Python's low-level signal handler will write signal numbers into the file descriptor, and the event loop will be awakened to read them. The event loop can handle those signals without the restrictions of signal handlers (for example, the loop can be woken up in any thread, not just the main thread).
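A minimal sketch of the wakeup file descriptor mechanism on Unix (using a pipe; a real event loop would register the read end with its selector instead of reading it directly):

```python
import os
import signal

r, w = os.pipe()
os.set_blocking(r, False)
os.set_blocking(w, False)  # set_wakeup_fd() requires a non-blocking fd

signal.set_wakeup_fd(w)
# A Python-level handler must be installed so that Python's C-level
# handler runs; here it does nothing -- the loop reads the fd instead.
signal.signal(signal.SIGUSR1, lambda signum, frame: None)

os.kill(os.getpid(), signal.SIGUSR1)

# Since Python 3.3, the signal number itself is written to the fd.
data = os.read(r, 16)
print(list(data))  # the pending signal numbers, e.g. [signal.SIGUSR1]

signal.set_wakeup_fd(-1)  # restore the default
```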
Appendix
Wakeup file descriptor
Since Python 3.3, signal.set_wakeup_fd() writes the signal number into the file descriptor, whereas before it only wrote a null byte. This makes it possible to distinguish between signals using the wakeup file descriptor.
Linux has a signalfd() system call which provides more information on each signal. For example, it's possible to know the pid and uid of the process that sent the signal. This function is not exposed in Python yet (see issue 12304).
On Unix, the asyncio module uses the wakeup file descriptor to wake up its event loop.
Multithreading
A C signal handler can be called from any thread, but Python signal handlers will always be called in the main Python thread.
Python's C API provides the PyErr_SetInterrupt() function which calls the SIGINT signal handler in order to interrupt the main Python thread.
Signals on Windows
Control events
Windows uses "control events":
- CTRL_BREAK_EVENT: Break (SIGBREAK)
- CTRL_CLOSE_EVENT: Close event
- CTRL_C_EVENT: CTRL+C (SIGINT)
- CTRL_LOGOFF_EVENT: Logoff
- CTRL_SHUTDOWN_EVENT: Shutdown
The SetConsoleCtrlHandler() function can be used to install a control handler.
The CTRL_C_EVENT and CTRL_BREAK_EVENT events can be sent to a process using the GenerateConsoleCtrlEvent() function. This function is exposed in Python as os.kill().
Signals
The following signals are supported on Windows:
- SIGABRT
- SIGBREAK (CTRL_BREAK_EVENT): signal only available on Windows
- SIGFPE
- SIGILL
- SIGINT (CTRL_C_EVENT)
- SIGSEGV
- SIGTERM
SIGINT
The default Python signal handler for SIGINT sets a Windows event object: sigint_event.
time.sleep() is implemented with WaitForSingleObjectEx(): it waits for the sigint_event object, using the time.sleep() parameter as the timeout, so the sleep can be interrupted by SIGINT.
_winapi.WaitForMultipleObjects() automatically adds sigint_event to the list of watched handles, so it can also be interrupted.
PyOS_StdioReadline() also uses sigint_event when fgets() fails, to check whether Ctrl-C or Ctrl-Z was pressed.
Implementation
The implementation is tracked in issue 23285. It was committed on February 07, 2015.
Copyright
This document has been placed in the public domain.
pep-0476 Enabling certificate verification by default for stdlib http clients
| PEP: | 476 |
|---|---|
| Title: | Enabling certificate verification by default for stdlib http clients |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Alex Gaynor <alex.gaynor at gmail.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 28-August-2014 |
Contents
Abstract
Currently when a standard library http client (the urllib, urllib2, http, and httplib modules) encounters an https:// URL it will wrap the network HTTP traffic in a TLS stream, as is necessary to communicate with such a server. However, during the TLS handshake it will not actually check that the server's X509 certificate is signed by a CA in any trust root, nor will it verify that the Common Name (or Subject Alternate Name) on the presented certificate matches the requested host.
The failure to do these checks means that anyone with a privileged network position is able to trivially execute a man in the middle attack against a Python application using either of these HTTP clients, and change traffic at will.
This PEP proposes to enable verification of X509 certificate signatures, as well as hostname verification for Python's HTTP clients by default, subject to opt-out on a per-call basis. This change would be applied to Python 2.7, Python 3.4, and Python 3.5.
Rationale
The "S" in "HTTPS" stands for secure. When Python's users type "HTTPS" they are expecting a secure connection, and Python should adhere to a reasonable standard of care in delivering this. Currently we are failing at this, and in doing so, APIs which appear simple are misleading users.
When asked, many Python users state that they were not aware that Python failed to perform these validations, and are shocked.
The popularity of requests (which enables these checks by default) demonstrates that these checks are not overly burdensome in any way, and the fact that it is widely recommended as a major security improvement over the standard library clients demonstrates that many expect a higher standard for "security by default" from their tools.
The failure of various applications to note Python's negligence in this matter is a source of regular CVE assignment [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11].
| [1] | https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2010-4340 |
| [2] | https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2012-3533 |
| [3] | https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2012-5822 |
| [4] | https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2012-5825 |
| [5] | https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2013-1909 |
| [6] | https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2013-2037 |
| [7] | https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2013-2073 |
| [8] | https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2013-2191 |
| [9] | https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2013-4111 |
| [10] | https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2013-6396 |
| [11] | https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2013-6444 |
Technical Details
Python would use the system-provided certificate database on all platforms. Failure to locate such a database would be an error, and users would need to explicitly specify a location to fix it.
This will be achieved by adding a new ssl._create_default_https_context function, which is the same as ssl.create_default_context.
http.client can then replace its usage of ssl._create_stdlib_context with the ssl._create_default_https_context.
Additionally ssl._create_stdlib_context is renamed ssl._create_unverified_context (an alias is kept around for backwards compatibility reasons).
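The practical difference between the two context factories can be observed directly (behaviour as of Python 3.4+, where both functions exist):

```python
import ssl

# The default HTTPS context verifies certificates and hostnames.
verified = ssl.create_default_context()
print(verified.verify_mode == ssl.CERT_REQUIRED)  # True
print(verified.check_hostname)                    # True

# The renamed legacy context does neither.
unverified = ssl._create_unverified_context()
print(unverified.verify_mode == ssl.CERT_NONE)    # True
print(unverified.check_hostname)                  # False
```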
Trust database
This PEP proposes using the system-provided certificate database. Previous discussions have suggested bundling Mozilla's certificate database and using that by default. This was decided against for several reasons:
- Using the platform trust database imposes a lower maintenance burden on the Python developers -- shipping our own trust database would require doing a release every time a certificate was revoked.
- Linux vendors, and other downstreams, would unbundle the Mozilla certificates, resulting in a more fragmented set of behaviors.
- Using the platform stores makes it easier to handle situations such as corporate internal CAs.
OpenSSL also has a pair of environment variables, SSL_CERT_DIR and SSL_CERT_FILE which can be used to point Python at a different certificate database.
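These environment variable names, along with the default database locations OpenSSL was compiled with, can be inspected via ssl.get_default_verify_paths():

```python
import ssl

paths = ssl.get_default_verify_paths()
print(paths.openssl_cafile_env)  # 'SSL_CERT_FILE'
print(paths.openssl_capath_env)  # 'SSL_CERT_DIR'
print(paths.openssl_cafile)      # platform-dependent path
```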
Backwards compatibility
This change will have the appearance of causing some HTTPS connections to "break", because they will now raise an Exception during handshake.
This is misleading, however: in fact these connections are presently failing silently. An HTTPS URL indicates an expectation of confidentiality and authentication, and the fact that Python does not actually verify that the user's request has been made is a bug. Further: "Errors should never pass silently."
Nevertheless, users who have a need to access servers with self-signed or incorrect certificates would be able to do so by providing a context with custom trust roots or which disables validation (documentation should strongly recommend the former where possible). Users will also be able to add necessary certificates to system trust stores in order to trust them globally.
Twisted's 14.0 release made this same change, and it has been met with almost no opposition.
Opting out
Users who wish to opt out of certificate verification on a single connection can do so by providing the context argument to urllib.urlopen:
import urllib
import ssl
# This restores the same behavior as before.
context = ssl._create_unverified_context()
urllib.urlopen("https://no-valid-cert", context=context)
It is also possible, though highly discouraged, to globally disable verification by monkeypatching the ssl module in versions of Python that implement this PEP:
import ssl
try:
_create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
# Legacy Python that doesn't verify HTTPS certificates by default
pass
else:
# Handle target environment that doesn't support HTTPS verification
ssl._create_default_https_context = _create_unverified_https_context
This guidance is aimed primarily at system administrators that wish to adopt newer versions of Python that implement this PEP in legacy environments that do not yet support certificate verification on HTTPS connections. For example, an administrator may opt out by adding the monkeypatch above to sitecustomize.py in their Standard Operating Environment for Python. Applications and libraries SHOULD NOT be making this change process wide (except perhaps in response to a system administrator controlled configuration setting).
Particularly security sensitive applications should always provide an explicit application defined SSL context rather than relying on the default behaviour of the underlying Python implementation.
Other protocols
This PEP only proposes requiring this level of validation for HTTP clients, not for other protocols such as SMTP.
This is because while a high percentage of HTTPS servers have correct certificates, as a result of the validation performed by browsers, for other protocols self-signed or otherwise incorrect certificates are far more common. Note that for SMTP at least, this appears to be changing and should be reviewed for a potential similar PEP in the future.
Python Versions
This PEP describes changes that will occur on the 2.7.x, 3.4.x, and 3.5 branches. For 2.7.x this will require backporting the context (SSLContext) argument to httplib, in addition to the features already backported in PEP 466.
Implementation
- LANDED: Issue 22366 adds the context argument to urllib.request.urlopen.
- Issue 22417 implements the substance of this PEP.
Copyright
This document has been placed into the public domain.
pep-0477 Backport ensurepip (PEP 453) to Python 2.7
| PEP: | 477 |
|---|---|
| Title: | Backport ensurepip (PEP 453) to Python 2.7 |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Donald Stufft <donald at stufft.io> Nick Coghlan <ncoghlan at gmail.com> |
| BDFL-Delegate: | Benjamin Peterson <benjamin@python.org> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 26-Aug-2014 |
| Post-History: | 1-Sep-2014 |
| Resolution: | https://mail.python.org/pipermail/python-dev/2014-September/136238.html |
Contents
Abstract
This PEP proposes that the ensurepip module, added to Python 3.4 by PEP 453, be backported to Python 2.7. It also proposes that automatic invocation of ensurepip be added to the Python 2.7 Windows and OSX installers. However it does not propose that automatic invocation be added to the Makefile.
It also proposes that the documentation changes for the package distribution and installation guides be updated to match that in 3.4, which references using the ensurepip module to bootstrap the installer.
Rationale
Python 2.7 is effectively an LTS release of Python which represents the end of the 2.x series, and there is still a very large contingent of users who are using Python 2.7 as their primary version. These users, in order to participate in the wider Python ecosystem, must manually go out and find the correct way to bootstrap the packaging tools.
It is the opinion of this PEP that making it as easy as possible for end users to participate in the wider Python ecosystem is important for several primary reasons:
- The Python 2.x to 3.x migration has a number of pain points that are eased by a number of third party modules such as six [1], modernize [2], or future [3]. However, relying on these tools requires that everyone who uses the project have a tool to install these packages.
- In addition to tooling to aid in migration from Python 2.x to 3.x, there are also a number of modules that are new in Python 3 for which there are backports available on PyPI. This can also aid in the ability for people to write 2.x and 3.x compatible software as well as enable them to use some of the newer features of Python 3 on Python 2.
- Users will also need a number of tools in order to create Python packages that conform to the newer standards being proposed. Tools like setuptools [4], Wheel [5], and twine [6] enable a safer, faster, and more reliable packaging toolchain. These tools can be difficult for people to use if they must first be told how to go out and install the package manager.
- One of Python's biggest strengths is the huge ecosystem of libraries and projects that have been built on top of it, most of which are distributed through PyPI. However, meaningfully benefiting from this wide ecosystem requires end users, some of whom will be new, to decide which package manager they should get, figure out how to get it, and then actually install it first.
Furthermore, alternative implementations of Python are recognizing the benefits of PEP 453 and both PyPy and Jython have plans to backport ensurepip to their 2.7 runtimes.
Automatic Invocation
PEP 453 has ensurepip automatically invoked by default in the Makefile and the Windows and OSX installers. This allowed it to ensure that, by default, all users would get Python with pip already installed. This PEP, however, believes that while this is fine for the Python 2.7 Windows and Mac OS X installers, it is not acceptable for the Python 2.7 Makefile in general.
The primary consumers of the Makefile are downstream package managers which distribute Python themselves. These downstream distributors typically do not want pip to be installed via ensurepip and would prefer that end users install it with their own package manager. Not invoking ensurepip automatically from the Makefile would allow these distributors to simply ignore the fact that ensurepip has been backported and still not end up with pip installed via it.
The primary consumers of the OSX and Windows installers are end users who are attempting to install Python on their own machine. There is no package manager available through which these users could install pip into their Python via a more supported mechanism. For this reason it is the belief of this PEP that installing by default on OSX and Windows is the best course of action.
Documentation
As part of this PEP, the updated packaging distribution and installation guides for Python 3.4 would be backported to Python 2.7.
Disabling ensurepip by Downstream Distributors
Due to its use in the venv module, downstream distributors cannot disable the ensurepip module in Python 3.4. However, since Python 2.7 has no venv module, downstream distributors are explicitly allowed to patch the ensurepip module to prevent it from installing anything.
If a downstream distributor wishes to disable ensurepip completely in Python 2.7, they should still at least provide the module and allow python -m ensurepip style invocation. However it should raise errors or otherwise exit with a non-zero exit code and print out an error on stderr directing users to what they can/should use instead of ensurepip.
Copyright
This document has been placed in the public domain.
pep-0478 Python 3.5 Release Schedule
| PEP: | 478 |
|---|---|
| Title: | Python 3.5 Release Schedule |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Larry Hastings <larry at hastings.org> |
| Status: | Active |
| Type: | Informational |
| Content-Type: | text/x-rst |
| Created: | 22-Sep-2014 |
| Python-Version: | 3.5 |
Abstract
This document describes the development and release schedule for Python 3.5. The schedule primarily concerns itself with PEP-sized items.
Release Manager and Crew
- 3.5 Release Manager: Larry Hastings
- Windows installers: Steve Dower
- Mac installers: Ned Deily
- Documentation: Georg Brandl
Release Schedule
The releases:
- 3.5.0 alpha 1: February 8, 2015
- 3.5.0 alpha 2: March 9, 2015
- 3.5.0 alpha 3: March 29, 2015
- 3.5.0 alpha 4: April 19, 2015
- 3.5.0 beta 1: May 24, 2015
- 3.5.0 beta 2: May 31, 2015
- 3.5.0 beta 3: July 5, 2015
- 3.5.0 beta 4: July 26, 2015
- 3.5.0 candidate 1: August 9, 2015
- 3.5.0 candidate 2: August 23, 2015
- 3.5.0 candidate 3: September 6, 2015
- 3.5.0 final: September 13, 2015
(Beta 1 is also "feature freeze"--no new features beyond this point.)
Features for 3.5
Implemented / Final PEPs:
- PEP 465, a new matrix multiplication operator
- PEP 461, %-formatting for binary strings
- PEP 471, os.scandir()
- PEP 479, change StopIteration handling inside generators
- PEP 441, improved Python zip application support
- PEP 448, additional unpacking generalizations
- PEP 486, make the Python Launcher aware of virtual environments
- PEP 475, retrying system calls that fail with EINTR
- PEP 492, coroutines with async and await syntax
- PEP 488, elimination of PYO files
- PEP 484, type hints
- PEP 489, redesigning extension module loading
- PEP 485, math.isclose(), a function for testing approximate equality
Proposed changes for 3.5:
- PEP 431, improved support for time zone databases
- PEP 432, simplifying Python's startup sequence
- PEP 436, a build tool generating boilerplate for extension modules
- PEP 447, support for __locallookup__ metaclass method
- PEP 455, key transforming dictionary
- PEP 468, preserving the order of **kwargs in a function
Copyright
This document has been placed in the public domain.
pep-0479 Change StopIteration handling inside generators
| PEP: | 479 |
|---|---|
| Title: | Change StopIteration handling inside generators |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Chris Angelico <rosuav at gmail.com>, Guido van Rossum <guido at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 15-Nov-2014 |
| Python-Version: | 3.5 |
| Post-History: | 15-Nov-2014, 19-Nov-2014, 5-Dec-2014 |
Contents
Abstract
This PEP proposes a change to generators: when StopIteration is raised inside a generator, it is replaced with RuntimeError. (More precisely, this happens when the exception is about to bubble out of the generator's stack frame.) Because the change is backwards incompatible, the feature is initially introduced using a __future__ statement.
Acceptance
This PEP was accepted by the BDFL on November 22, 2014. Because of the exceptionally short period from first draft to acceptance, the main objections brought up after acceptance were carefully considered and have been reflected in the "Alternate proposals" section below. However, none of the discussion changed the BDFL's mind and the PEP's acceptance is now final. (Suggestions for clarifying edits are still welcome -- unlike IETF RFCs, the text of a PEP is not cast in stone after its acceptance, although the core design/plan/specification should not change after acceptance.)
Rationale
The interaction of generators and StopIteration is currently somewhat surprising, and can conceal obscure bugs. An unexpected exception should not result in subtly altered behaviour, but should cause a noisy and easily-debugged traceback. Currently, StopIteration raised accidentally inside a generator function will be interpreted as the end of the iteration by the loop construct driving the generator.
The main goal of the proposal is to ease debugging in the situation where an unguarded next() call (perhaps several stack frames deep) raises StopIteration and causes the iteration controlled by the generator to terminate silently. (Whereas, when some other exception is raised, a traceback is printed pinpointing the cause of the problem.)
This is particularly pernicious in combination with the yield from construct of PEP 380 [1], as it breaks the abstraction that a subgenerator may be factored out of a generator. That PEP notes this limitation, but notes that "use cases for these [are] rare to non-existent". Unfortunately while intentional use is rare, it is easy to stumble on these cases by accident:
import contextlib

@contextlib.contextmanager
def transaction():
    print('begin')
    try:
        yield from do_it()
    except:
        print('rollback')
        raise
    else:
        print('commit')

def do_it():
    print('Refactored initial setup')
    yield  # Body of with-statement is executed here
    print('Refactored finalization of successful transaction')

def gene():
    for i in range(2):
        with transaction():
            yield i
            # return
            raise StopIteration  # This is wrong
            print('Should not be reached')

for i in gene():
    print('main: i =', i)
Here factoring out do_it into a subgenerator has introduced a subtle bug: if the wrapped block raises StopIteration, under the current behavior this exception will be swallowed by the context manager; and, worse, the finalization is silently skipped! Similarly problematic behavior occurs when an asyncio coroutine raises StopIteration, causing it to terminate silently, or when next is used to take the first result from an iterator that unexpectedly turns out to be empty, for example:
# using the same context manager as above
import pathlib

with transaction():
    print('commit file {}'.format(
        # I can never remember what the README extension is
        next(pathlib.Path('/some/dir').glob('README*'))))
In both cases, the refactoring abstraction of yield from breaks in the presence of bugs in client code.
Additionally, the proposal reduces the difference between list comprehensions and generator expressions, preventing surprises such as the one that started this discussion [2]. Henceforth, the following statements will produce the same result if either produces a result at all:
a = list(F(x) for x in xs if P(x))
a = [F(x) for x in xs if P(x)]
With the current state of affairs, it is possible to write a function F(x) or a predicate P(x) that causes the first form to produce a (truncated) result, while the second form raises an exception (namely, StopIteration). With the proposed change, both forms will raise an exception at this point (albeit RuntimeError in the first case and StopIteration in the second).
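This divergence can be demonstrated directly (the behaviour shown is the default from Python 3.7 onward; P is a deliberately buggy predicate):

```python
def P(x):
    if x == 3:
        raise StopIteration  # simulate a buggy predicate
    return True

xs = [1, 2, 3, 4]

try:
    a = list(x for x in xs if P(x))
except RuntimeError:
    genexp_error = "RuntimeError"     # converted inside the generator

try:
    a = [x for x in xs if P(x)]
except StopIteration:
    listcomp_error = "StopIteration"  # propagates unchanged

print(genexp_error, listcomp_error)
```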
Finally, the proposal also clears up the confusion about how to terminate a generator: the proper way is return, not raise StopIteration.
As an added bonus, the above changes bring generator functions much more in line with regular functions. If you wish to take a piece of code presented as a generator and turn it into something else, you can usually do this fairly simply, by replacing every yield with a call to print() or list.append(); however, if there are any bare next() calls in the code, you have to be aware of them. If the code was originally written without relying on StopIteration terminating the function, the transformation would be that much easier.
Background information
When a generator frame is (re)started as a result of a __next__() (or send() or throw()) call, one of three outcomes can occur:
- A yield point is reached, and the yielded value is returned.
- The frame is returned from; StopIteration is raised.
- An exception is raised, which bubbles out.
In the latter two cases the frame is abandoned (and the generator object's gi_frame attribute is set to None).
Proposal
If a StopIteration is about to bubble out of a generator frame, it is replaced with RuntimeError, which causes the next() call (which invoked the generator) to fail, passing that exception out. From then on it's just like any old exception. [3]
This affects the third outcome listed above, without altering any other effects. Furthermore, it only affects this outcome when the exception raised is StopIteration (or a subclass thereof).
Note that the proposed replacement happens at the point where the exception is about to bubble out of the frame, i.e. after any except or finally blocks that could affect it have been exited. The StopIteration raised by returning from the frame is not affected (the point being that StopIteration means that the generator terminated "normally", i.e. it did not raise an exception).
A subtle issue is what will happen if the caller, having caught the RuntimeError, calls the generator object's __next__() method again. The answer is that from this point on it will raise StopIteration -- the behavior is the same as when any other exception was raised by the generator.
Another logical consequence of the proposal: if someone uses g.throw(StopIteration) to throw a StopIteration exception into a generator, if the generator doesn't catch it (which it could do using a try/except around the yield), it will be transformed into RuntimeError.
During the transition phase, the new feature must be enabled per-module using:
from __future__ import generator_stop
Any generator function constructed under the influence of this directive will have the REPLACE_STOPITERATION flag set on its code object, and generators with the flag set will behave according to this proposal. Once the feature becomes standard, the flag may be dropped; code should not inspect generators for it.
A proof-of-concept patch has been created to facilitate testing. [4]
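The proposed conversion, which became the default behaviour in Python 3.7, can be observed as follows:

```python
def broken():
    yield 1
    raise StopIteration  # deliberately wrong; 'return' is the proper way

g = broken()
print(next(g))  # 1
try:
    next(g)
except RuntimeError as exc:
    # The original StopIteration is chained as __cause__.
    print(type(exc.__cause__).__name__)
```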
Consequences for existing code
This change will affect existing code that depends on StopIteration bubbling up. The pure Python reference implementation of groupby [5] currently has comments "Exit on StopIteration" where it is expected that the exception will propagate and then be handled. This will be unusual, but not unknown, and such constructs will fail. Other examples abound, e.g. [6], [7].
(Nick Coghlan comments: """If you wanted to factor out a helper function that terminated the generator you'd have to do "return yield from helper()" rather than just "helper()".""")
There are also examples of generator expressions floating around that rely on a StopIteration raised by the expression, the target or the predicate (rather than by the __next__() call implied in the for loop proper).
Writing backwards and forwards compatible code
With the exception of hacks that raise StopIteration to exit a generator expression, it is easy to write code that works equally well under older Python versions as under the new semantics.
This is done by enclosing those places in the generator body where a StopIteration is expected (e.g. bare next() calls or in some cases helper functions that are expected to raise StopIteration) in a try/except construct that returns when StopIteration is raised. The try/except construct should appear directly in the generator function; doing this in a helper function that is not itself a generator does not work. If raise StopIteration occurs directly in a generator, simply replace it with return.
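For example, a generator that consumes its input two items at a time can be written compatibly like this (pairs() is an illustrative name, not a stdlib function):

```python
def pairs(iterable):
    it = iter(iterable)
    while True:
        try:
            a = next(it)
            b = next(it)
        except StopIteration:
            return  # correct under both the old and the new semantics
        yield (a, b)

print(list(pairs([1, 2, 3, 4, 5])))  # [(1, 2), (3, 4)]
```

Note that the try/except appears directly in the generator body, as required: moving it into a non-generator helper would not prevent the StopIteration from escaping the generator frame.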
Examples of breakage
Generators which explicitly raise StopIteration can generally be changed to simply return instead. This will be compatible with all existing Python versions, and will not be affected by __future__. Here are some illustrations from the standard library.
Lib/ipaddress.py:
if other == self:
    raise StopIteration
Becomes:
if other == self:
    return
In some cases, this can be combined with yield from to simplify the code, such as Lib/difflib.py:
if context is None:
    while True:
        yield next(line_pair_iterator)
Becomes:
if context is None:
    yield from line_pair_iterator
    return
(The return is necessary for a strictly-equivalent translation, though in this particular file, there is no further code, and the return can be omitted.) For compatibility with pre-3.3 versions of Python, this could be written with an explicit for loop:
if context is None:
    for line in line_pair_iterator:
        yield line
    return
More complicated iteration patterns will need explicit try/except constructs. For example, a hypothetical parser like this:
def parser(f):
    while True:
        data = next(f)
        while True:
            line = next(f)
            if line == "- end -": break
            data += line
        yield data
would need to be rewritten as:
def parser(f):
    while True:
        try:
            data = next(f)
            while True:
                line = next(f)
                if line == "- end -": break
                data += line
            yield data
        except StopIteration:
            return
or possibly:
def parser(f):
    for data in f:
        while True:
            line = next(f)
            if line == "- end -": break
            data += line
        yield data
The latter form obscures the iteration by purporting to iterate over the file with a for loop, but then also fetches more data from the same iterator during the loop body. It does, however, clearly differentiate between a "normal" termination (StopIteration instead of the initial line) and an "abnormal" termination (failing to find the end marker in the inner loop, which will now raise RuntimeError).
This effect of StopIteration has been used to cut a generator expression short, creating a form of takewhile:
def stop():
    raise StopIteration

print(list(x for x in range(10) if x < 5 or stop()))
# prints [0, 1, 2, 3, 4]
Under the current proposal, this form of non-local flow control is not supported, and would have to be rewritten in statement form:
def gen():
    for x in range(10):
        if x >= 5: return
        yield x

print(list(gen()))
# prints [0, 1, 2, 3, 4]
While this is a small loss of functionality, it is functionality that often comes at the cost of readability, and just as lambda has restrictions compared to def, so does a generator expression have restrictions compared to a generator function. In many cases, the transformation to full generator function will be trivially easy, and may improve structural clarity.
Explanation of generators, iterators, and StopIteration
The proposal does not change the relationship between generators and iterators: a generator object is still an iterator, and not all iterators are generators. Generators have additional methods that iterators don't have, like send and throw. All this is unchanged. Nothing changes for generator users -- only authors of generator functions may have to learn something new. (This includes authors of generator expressions that depend on early termination of the iteration by a StopIteration raised in a condition.)
An iterator is an object with a __next__ method. Like many other special methods, it may either return a value, or raise a specific exception - in this case, StopIteration - to signal that it has no value to return. In this, it is similar to __getattr__ (can raise AttributeError), __getitem__ (can raise KeyError), and so on. A helper function for an iterator can be written to follow the same protocol; for example:
def helper(x, y):
    if x > y: return 1 / (x - y)
    raise StopIteration

def __next__(self):
    if self.a: return helper(self.b, self.c)
    return helper(self.d, self.e)
Both forms of signalling are carried through: a returned value is returned, an exception bubbles up. The helper is written to match the protocol of the calling function.
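Filled out into a self-contained class (the class and attribute names are illustrative, not from the original text), the fragment above works like this:

```python
def helper(x, y):
    # Iterator-protocol helper: returns a value or raises StopIteration.
    if x > y:
        return 1 / (x - y)
    raise StopIteration

class Chooser:
    # Illustrative iterator delegating to the helper; both the returned
    # value and the StopIteration pass through __next__ unchanged.
    def __init__(self, a, b, c, d, e):
        self.a, self.b, self.c, self.d, self.e = a, b, c, d, e

    def __iter__(self):
        return self

    def __next__(self):
        if self.a:
            return helper(self.b, self.c)
        return helper(self.d, self.e)

assert next(Chooser(True, 3, 1, 0, 0)) == 0.5
assert list(Chooser(False, 0, 0, 1, 2)) == []  # StopIteration ends the loop
```

Note that `Chooser.__next__` is an ordinary method, not a generator, so this pattern is unaffected by the proposal.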
A generator function is one which contains a yield expression. Each time it is (re)started, it may either yield a value, or return (including "falling off the end"). A helper function for a generator can also be written, but it must also follow generator protocol:
def helper(x, y):
    if x > y: yield 1 / (x - y)

def gen(self):
    if self.a: return (yield from helper(self.b, self.c))
    return (yield from helper(self.d, self.e))
In both cases, any unexpected exception will bubble up. Due to the nature of generators and iterators, an unexpected StopIteration inside a generator will be converted into RuntimeError, but beyond that, all exceptions will propagate normally.
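The generator pair can likewise be made self-contained (module-level functions rather than methods, purely for illustration):

```python
def helper(x, y):
    # Generator helper: yields a value, or simply returns (yields nothing).
    if x > y:
        yield 1 / (x - y)

def gen(a, b, c, d, e):
    # yield from passes the helper's yields through; the helper's return
    # value (None here) becomes gen's return value as well.
    if a:
        return (yield from helper(b, c))
    return (yield from helper(d, e))

assert list(gen(True, 3, 1, 0, 0)) == [0.5]
assert list(gen(True, 1, 3, 0, 0)) == []  # helper returned without yielding
```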
Transition plan
- Python 3.5: Enable new semantics under __future__ import; silent deprecation warning if StopIteration bubbles out of a generator not under __future__ import.
- Python 3.6: Non-silent deprecation warning.
- Python 3.7: Enable new semantics everywhere.
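On Python 3.7 and later the new semantics are the default (earlier versions opt in via the `__future__` import, named `generator_stop` in the implementation); the conversion can be observed directly:

```python
def leaky():
    yield 1
    next(iter([]))  # raises StopIteration inside the generator body
    yield 2  # never reached

g = leaky()
assert next(g) == 1
try:
    next(g)
except RuntimeError as exc:
    # Under the new semantics the escaping StopIteration is converted
    # to RuntimeError, with the original exception chained as __cause__.
    assert isinstance(exc.__cause__, StopIteration)
```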
Alternate proposals
Raising something other than RuntimeError
Rather than the generic RuntimeError, it might make sense to raise a new exception type UnexpectedStopIteration. This has the downside of implicitly encouraging that it be caught; the correct action is to catch the original StopIteration, not the chained exception.
Supplying a specific exception to raise on return
Nick Coghlan suggested a means of providing a specific StopIteration instance to the generator; if any other instance of StopIteration is raised, it is an error, but if that particular one is raised, the generator has properly completed. This subproposal has been withdrawn in favour of better options, but is retained for reference.
Making return-triggered StopIterations obvious
For certain situations, a simpler and fully backward-compatible solution may be sufficient: when a generator returns, instead of raising StopIteration, it raises a specific subclass of StopIteration (GeneratorReturn) which can then be detected. If it is not that subclass, it is an escaping exception rather than a return statement.
The inspiration for this alternative proposal was Nick's observation [8] that if an asyncio coroutine [9] accidentally raises StopIteration, it currently terminates silently, which may present a hard-to-debug mystery to the developer. The main proposal turns such accidents into clearly distinguishable RuntimeError exceptions, but if that is rejected, this alternate proposal would enable asyncio to distinguish between a return statement and an accidentally-raised StopIteration exception.
Of the three outcomes listed above, two change:
- If a yield point is reached, the value, obviously, would still be returned.
- If the frame is returned from, GeneratorReturn (rather than StopIteration) is raised.
- If an instance of GeneratorReturn would be raised, instead an instance of StopIteration would be raised. Any other exception bubbles up normally.
In the third case, the StopIteration would have the value of the original GeneratorReturn, and would reference the original exception in its __cause__. If uncaught, this would clearly show the chaining of exceptions.
This alternative does not affect the discrepancy between generator expressions and list comprehensions, but allows generator-aware code (such as the contextlib and asyncio modules) to reliably differentiate between the second and third outcomes listed above.
However, once code exists that depends on this distinction between GeneratorReturn and StopIteration, a generator that invokes another generator and relies on the latter's StopIteration to bubble out would still be potentially wrong, depending on the use made of the distinction between the two exception types.
Converting the exception inside next()
Mark Shannon suggested [10] that the problem could be solved in next() rather than at the boundary of generator functions. By having next() catch StopIteration and raise ValueError instead, all unexpected StopIteration bubbling would be prevented; however, the backward-incompatibility concerns are far more serious than for the current proposal, as every next() call would need to be rewritten to guard against ValueError instead of StopIteration - not to mention that there would be no way to write one block of code which reliably works on multiple versions of Python. (Using a dedicated exception type, perhaps subclassing ValueError, would help with the latter; however, all code would still need to be rewritten.)
Note that calling next(it, default) catches StopIteration and substitutes the given default value; this feature is often useful to avoid a try/except block.
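A quick illustration of the two-argument form:

```python
it = iter([1, 2])
assert next(it, -1) == 1
assert next(it, -1) == 2
assert next(it, -1) == -1  # iterator exhausted: default returned, no exception
```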
Sub-proposal: decorator to explicitly request current behaviour
Nick Coghlan suggested [11] that the situations where the current behaviour is desired could be supported by means of a decorator:
from itertools import allow_implicit_stop

@allow_implicit_stop
def my_generator():
    ...
    yield next(it)
    ...
Which would be semantically equivalent to:
def my_generator():
    try:
        ...
        yield next(it)
        ...
    except StopIteration:
        return
but would be faster, as it could be implemented by simply permitting the StopIteration to bubble up directly.
Single-source Python 2/3 code would also benefit in a 3.7+ world, since libraries like six and python-future could just define their own version of "allow_implicit_stop" that referred to the new builtin in 3.5+, and was implemented as an identity function in other versions.
However, due to the implementation complexities required, the ongoing compatibility issues created, the subtlety of the decorator's effect, and the fact that it would encourage the "quick-fix" solution of just slapping the decorator onto all generators instead of properly fixing the code in question, this sub-proposal has been rejected. [12]
Criticism
Unofficial and apocryphal statistics suggest that this is seldom, if ever, a problem. [13] Code does exist which relies on the current behaviour (e.g. [3], [6], [7]), and there is the concern that this would be unnecessary code churn to achieve little or no gain.
Steven D'Aprano started an informal survey on comp.lang.python [14]; at the time of writing only two responses have been received: one was in favor of changing list comprehensions to match generator expressions (!), the other was in favor of this PEP's main proposal.
The existing model has been compared to the perfectly-acceptable issues inherent to every other case where an exception has special meaning. For instance, an unexpected KeyError inside a __getitem__ method will be interpreted as failure, rather than permitted to bubble up. However, there is a difference. Special methods use return to indicate normality, and raise to signal abnormality; generators yield to indicate data, and return to signal the abnormal state. This makes explicitly raising StopIteration entirely redundant, and potentially surprising. If other special methods had dedicated keywords to distinguish between their return paths, they too could turn unexpected exceptions into RuntimeError; the fact that they cannot should not preclude generators from doing so.
Why not fix all __next__() methods?
When implementing a regular __next__() method, the only way to indicate the end of the iteration is to raise StopIteration. So catching StopIteration here and converting it to RuntimeError would defeat the purpose. This is a reminder of the special status of generator functions: in a generator function, raising StopIteration is redundant since the iteration can be terminated by a simple return.
References
| [1] | PEP 380 - Syntax for Delegating to a Subgenerator (https://www.python.org/dev/peps/pep-0380) |
| [2] | Initial mailing list comment (https://mail.python.org/pipermail/python-ideas/2014-November/029906.html) |
| [3] | (1, 2) Proposal by GvR (https://mail.python.org/pipermail/python-ideas/2014-November/029953.html) |
| [4] | Tracker issue with Proof-of-Concept patch (http://bugs.python.org/issue22906) |
| [5] | Pure Python implementation of groupby (https://docs.python.org/3/library/itertools.html#itertools.groupby) |
| [6] | (1, 2) Split a sequence or generator using a predicate (http://code.activestate.com/recipes/578416-split-a-sequence-or-generator-using-a-predicate/) |
| [7] | (1, 2) wrap unbounded generator to restrict its output (http://code.activestate.com/recipes/66427-wrap-unbounded-generator-to-restrict-its-output/) |
| [8] | Post from Nick Coghlan mentioning asyncio (https://mail.python.org/pipermail/python-ideas/2014-November/029961.html) |
| [9] | Coroutines in asyncio (https://docs.python.org/3/library/asyncio-task.html#coroutines) |
| [10] | Post from Mark Shannon with alternate proposal (https://mail.python.org/pipermail/python-dev/2014-November/137129.html) |
| [11] | Idea from Nick Coghlan (https://mail.python.org/pipermail/python-dev/2014-November/137201.html) |
| [12] | Rejection of above idea by GvR (https://mail.python.org/pipermail/python-dev/2014-November/137243.html) |
| [13] | Response by Steven D'Aprano (https://mail.python.org/pipermail/python-ideas/2014-November/029994.html) |
| [14] | Thread on comp.lang.python started by Steven D'Aprano (https://mail.python.org/pipermail/python-list/2014-November/680757.html) |
Copyright
This document has been placed in the public domain.
pep-0480 Surviving a Compromise of PyPI: The Maximum Security Model
| PEP: | 480 |
|---|---|
| Title: | Surviving a Compromise of PyPI: The Maximum Security Model |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Trishank Karthik Kuppusamy <trishank at nyu.edu>, Vladimir Diaz <vladimir.diaz at nyu.edu>, Donald Stufft <donald at stufft.io>, Justin Cappos <jcappos at nyu.edu> |
| BDFL-Delegate: | Richard Jones <r1chardj0n3s@gmail.com> |
| Discussions-To: | DistUtils mailing list <distutils-sig at python.org> |
| Status: | Draft |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Requires: | 458 |
| Created: | 8-Oct-2014 |
Contents
Abstract
Proposed is an extension to PEP 458 that adds support for end-to-end signing and the maximum security model. End-to-end signing allows both PyPI and developers to sign for the distributions that are downloaded by clients. The minimum security model proposed by PEP 458 supports continuous delivery of distributions (because they are signed by online keys), but that model does not protect distributions in the event that PyPI is compromised. In the minimum security model, attackers may sign for malicious distributions by compromising the signing keys stored on PyPI infrastructure. The maximum security model, described in this PEP, retains the benefits of PEP 458 (e.g., immediate availability of distributions that are uploaded to PyPI), but additionally ensures that end-users are not at risk of installing forged software if PyPI is compromised.
This PEP discusses the changes made to PEP 458 but excludes its informational elements to primarily focus on the maximum security model. For example, an overview of The Update Framework or the basic mechanisms in PEP 458 are not covered here. The changes to PEP 458 include modifications to the snapshot process, key compromise analysis, auditing snapshots, and the steps that should be taken in the event of a PyPI compromise. The signing and key management process that PyPI MAY RECOMMEND is discussed but not strictly defined. How the release process should be implemented to manage keys and metadata is left to the implementors of the signing tools. That is, this PEP delineates the expected cryptographic key type and signature format included in metadata that MUST be uploaded by developers in order to support end-to-end verification of distributions.
Rationale
PEP 458 [1] proposes how PyPI should be integrated with The Update Framework (TUF) [2]. It explains how modern package managers like pip can be made more secure, and the types of attacks that can be prevented if PyPI is modified on the server side to include TUF metadata. Package managers can reference the TUF metadata available on PyPI to download distributions more securely.
PEP 458 also describes the metadata layout of the PyPI repository and employs the minimum security model, which supports continuous delivery of projects and uses online cryptographic keys to sign the distributions uploaded by developers. Although the minimum security model guards against most attacks on software updaters [5] [7], such as mix-and-match and extraneous dependencies attacks, it can be improved to support end-to-end signing and to prohibit forged distributions in the event that PyPI is compromised.
The main strength of PEP 458 and the minimum security model is the automated and simplified release process: developers may upload distributions and then have PyPI sign for their distributions. Much of the release process is handled in an automated fashion by online roles and this approach requires storing cryptographic signing keys on the PyPI infrastructure. Unfortunately, cryptographic keys that are stored online are vulnerable to theft. The maximum security model, proposed in this PEP, permits developers to sign for the distributions that they make available to PyPI users, and does not put end-users at risk of downloading malicious distributions if the online keys stored on PyPI infrastructure are compromised.
Threat Model
The threat model assumes the following:
- Offline keys are safe and securely stored.
- Attackers can compromise at least one of PyPI's trusted keys that are stored online, and may do so at once or over a period of time.
- Attackers can respond to client requests.
- Attackers may control any number of developer keys for projects a client does not want to install.
Attackers are considered successful if they can cause a client to install (or leave installed) something other than the most up-to-date version of the software the client is updating. When an attacker is preventing the installation of updates, the attacker's goal is that clients not realize that anything is wrong.
Definitions
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [13].
This PEP focuses on integrating TUF with PyPI; however, the reader is encouraged to read about TUF's design principles [2]. It is also RECOMMENDED that the reader be familiar with the TUF specification [3], and PEP 458 [1] (which this PEP is extending).
Terms used in this PEP are defined as follows:
- Projects: Projects are software components that are made available for integration. Projects include Python libraries, frameworks, scripts, plugins, applications, collections of data or other resources, and various combinations thereof. Public Python projects are typically registered on the Python Package Index [4].
- Releases: Releases are uniquely identified snapshots of a project [4].
- Distributions: Distributions are the packaged files that are used to publish and distribute a release.
- Simple index: The HTML page that contains internal links to the distributions of a project [4].
- Roles: There is one root role in PyPI. There are multiple roles whose responsibilities are delegated to them directly or indirectly by the root role. The term "top-level role" refers to the root role and any role delegated by the root role. Each role has a single metadata file that it is trusted to provide.
- Metadata: Metadata are files that describe roles, other metadata, and target files.
- Repository: A repository is a resource comprised of named metadata and target files. Clients request metadata and target files stored on a repository.
- Consistent snapshot: A set of TUF metadata and PyPI targets that capture the complete state of all projects on PyPI as they existed at some fixed point in time.
- The snapshot (release) role: In order to prevent confusion due to the different meanings of the term "release" used in PEP 426 [1] and the TUF specification [3], the release role is renamed to the snapshot role.
- Developer: Either the owner or maintainer of a project who is allowed to update TUF metadata, as well as distribution metadata and files for a given project.
- Online key: A private cryptographic key that MUST be stored on the PyPI server infrastructure. This usually allows automated signing with the key. An attacker who compromises the PyPI infrastructure will be able to immediately read these keys.
- Offline key: A private cryptographic key that MUST be stored independent of the PyPI server infrastructure. This prevents automated signing with the key. An attacker who compromises the PyPI infrastructure will not be able to immediately read these keys.
- Threshold signature scheme: A role can increase its resilience to key compromises by specifying that at least t out of n keys are REQUIRED to sign its metadata. A compromise of t-1 keys is insufficient to compromise the role itself. Saying that a role requires (t, n) keys denotes the threshold signature property.
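The (t, n) threshold property can be sketched with a toy verifier; HMAC stands in for a real signature scheme here, and all names are illustrative rather than part of TUF:

```python
import hashlib
import hmac

def verify(key, message, signature):
    # Toy "signature" check: HMAC-SHA256 stands in for a real scheme.
    expected = hmac.new(key, message, hashlib.sha256).digest()
    return hmac.compare_digest(expected, signature)

def threshold_met(keys, message, signatures, t):
    # True if at least t of the n registered keys produced a valid
    # signature over the message (each key counted at most once).
    valid = sum(
        1 for key in keys
        if any(verify(key, message, sig) for sig in signatures)
    )
    return valid >= t

# A (2, 3) role: two of three keys must sign; one compromised key is not enough.
keys = [b"k1", b"k2", b"k3"]
msg = b"targets metadata"
sigs = [hmac.new(k, msg, hashlib.sha256).digest() for k in (b"k1", b"k3")]
assert threshold_met(keys, msg, sigs, t=2)
assert not threshold_met(keys, msg, sigs[:1], t=2)
```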
Maximum Security Model
The maximum security model permits developers to sign their projects and to upload signed metadata to PyPI. If the PyPI infrastructure were compromised, attackers would be unable to serve malicious versions of a claimed project without having access to that project's developer key. Figure 1 depicts the changes made to the metadata layout of the minimum security model, namely that developer roles are now supported and that three new delegated roles exist: claimed, recently-claimed, and unclaimed. The bins role from the minimum security model has been renamed unclaimed and can contain any projects that have not been added to claimed. The unclaimed role functions just as before (i.e., as explained in PEP 458, projects added to this role are signed by PyPI with an online key). Offline keys provided by developers ensure the strength of the maximum security model over the minimum model. Although the minimum security model supports continuous delivery of projects, all projects are signed by an online key. That is, an attacker is able to corrupt packages in the minimum security model, but not in the maximum model, without also compromising a developer's key.
Figure 1: An overview of the metadata layout in the maximum security model. The maximum security model supports continuous delivery and survivable key compromise.
Projects that are signed by developers and uploaded to PyPI for the first time are added to the recently-claimed role. The recently-claimed role uses an online key, so projects uploaded for the first time are immediately available to clients. After some time has passed, PyPI administrators MAY periodically move (e.g., every month) projects listed in recently-claimed to the claimed role for maximum security. The claimed role uses an offline key, thus projects added to this role cannot be easily forged if PyPI is compromised.
The recently-claimed role is separate from the unclaimed role for usability and efficiency, not security. If new project delegations were prepended to unclaimed metadata, unclaimed would need to be re-downloaded every time a project obtained a key. By separating out new projects, the amount of data retrieved is reduced. From a usability standpoint, it also makes it easier for administrators to see which projects are now claimed. This information is needed when moving keys from recently-claimed to claimed, which is discussed in more detail in the "Producing Consistent Snapshots" section.
End-to-End Signing
End-to-end signing allows both PyPI and developers to sign for the metadata downloaded by clients. PyPI is trusted to make uploaded projects available to clients (PyPI signs the metadata for this part of the process), and developers sign the distributions that they upload to PyPI.
In order to delegate trust to a project, developers are required to submit a public key to PyPI. PyPI takes the project's public key and adds it to parent metadata that PyPI then signs. After the initial trust is established, developers are required to sign distributions that they upload to PyPI using the public key's corresponding private key. The signed TUF metadata that developers upload to PyPI includes information like the distribution's file size and hash, which package managers use to verify distributions that are downloaded.
The practical implications of end-to-end signing are the extra administrative work needed to delegate trust to a project, and the signed metadata that developers MUST upload to PyPI along with the distribution. Specifically, PyPI is expected to periodically sign metadata with an offline key by adding projects to the claimed metadata file and signing it. In contrast, projects are only ever signed with an online key in the minimum security model. End-to-end signing does require manual intervention to delegate trust (i.e., to sign metadata with an offline key), but this is a one-time cost and projects have stronger protections against PyPI compromises thereafter.
Metadata Signatures, Key Management, and Signing Distributions
This section discusses the tools, signature scheme, and signing methods that PyPI MAY recommend to implementors of the signing tools. Developers are expected to use these tools to sign and upload distributions to PyPI. To summarize the RECOMMENDED tools and schemes discussed in the subsections below, developers MAY generate cryptographic keys and sign metadata (with the Ed25519 signature scheme) in some automated fashion, where the metadata includes the information required to verify the authenticity of the distribution. Developers then upload metadata to PyPI, where it will be available for download by package managers such as pip (i.e., package managers that support TUF metadata). The entire process is transparent to the end-users (using a package manager that supports TUF) that download distributions from PyPI.
The first three subsections (Cryptographic Signature Scheme, Cryptographic Key Files, and Key Management) cover the cryptographic components of the developer release process. That is, which key type PyPI supports, how keys may be stored, and how keys may be generated. The two subsections that follow the first three discuss the PyPI modules that SHOULD be modified to support TUF metadata. For example, Twine and Distutils are two projects that SHOULD be modified. Finally, the last subsection goes over the automated key management and signing solution that is RECOMMENDED for the signing tools.
TUF's design is flexible with respect to cryptographic key types, signatures, and signing methods. The tools, modification, and methods discussed in the following sections are RECOMMENDATIONS for the implementors of the signing tools.
Cryptographic Signature Scheme: Ed25519
The package manager (pip) shipped with CPython MUST work on non-CPython interpreters and cannot have dependencies that have to be compiled (i.e., the PyPI+TUF integration MUST NOT require compilation of C extensions in order to verify cryptographic signatures). Verification of signatures MUST be done in Python, and verifying RSA [11] signatures in pure-Python may be impractical due to speed. Therefore, PyPI MAY use the Ed25519 [14] signature scheme.
Ed25519 [12] is a public-key signature system that uses small cryptographic signatures and keys. A pure-Python implementation [15] of the Ed25519 signature scheme is available. Verification of Ed25519 signatures is fast even when performed in Python.
Cryptographic Key Files
The implementation MAY encrypt key files with AES-256-CTR mode and strengthen passwords with PBKDF2-HMAC-SHA256 (100K iterations by default, though developers may override this). The current Python implementation of TUF can use any cryptographic library (support for PyCA cryptography will be added in the future), may override the default number of PBKDF2 iterations, and allows the KDF to be tweaked to taste.
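The password-strengthening step maps directly onto the standard library's PBKDF2 primitive (the AES-256-CTR encryption of the key file itself requires a third-party library and is omitted from this sketch; the function name is illustrative):

```python
import hashlib
import os

def derive_file_key(password: str, salt: bytes, iterations: int = 100_000) -> bytes:
    # PBKDF2-HMAC-SHA256, 100K iterations by default (overridable),
    # yielding a 32-byte key suitable for AES-256.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations, dklen=32)

salt = os.urandom(16)
key = derive_file_key("developer passphrase", salt)
assert len(key) == 32
# The same password and salt always derive the same key:
assert key == derive_file_key("developer passphrase", salt)
```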
Key Management: miniLock
An easy-to-use key management solution is needed. One solution is to derive a private key from a password so that developers do not have to manage cryptographic key files across multiple computers. miniLock [16] is an example of how this can be done. Developers may view the cryptographic key as a secondary password. miniLock also works well with a signature scheme like Ed25519, which only needs a very small key.
Third-party Upload Tools: Twine
Third-party tools like Twine [17] MAY be modified (if they wish to support distributions that include TUF metadata) to sign and upload developer projects to PyPI. Twine is a utility for interacting with PyPI that uses TLS to upload distributions, and prevents MITM attacks on usernames and passwords.
Distutils
Distutils [18] MAY be modified to sign metadata and to upload signed distributions to PyPI. Distutils comes packaged with CPython and is the most widely-used tool for uploading distributions to PyPI.
Automated Signing Solution
An easy-to-use key management solution is RECOMMENDED for developers. One approach is to generate a cryptographic private key from a user password, akin to miniLock. Although developer signatures can remain optional, this approach may be inadequate due to the great number of potentially unsigned dependencies each distribution may have. If any one of these dependencies is unsigned, it negates any benefit the project gains from signing its own distribution (i.e., attackers would only need to compromise one of the unsigned dependencies to attack end-users). Requiring developers to manually sign distributions and manage keys is expected to render key signing an unused feature.
A default, PyPI-mediated key management and package signing solution that is transparent [19] to developers and does not require a key escrow (sharing of encrypted private keys with PyPI) is RECOMMENDED for the signing tools. Additionally, the signing tools SHOULD circumvent the sharing of private keys across multiple machines of each developer.
The following outlines an automated signing solution that a new developer MAY follow to upload a distribution to PyPI:
- Register a PyPI project.
- Enter a secondary password (independent of the PyPI user account password).
- Optional: Add a new identity to the developer's PyPI user account from a second machine (after a password prompt).
- Upload project.
Step 1 is the normal procedure followed by developers to register a PyPI project [20].
Step 2 generates an encrypted key file (private), uploads an Ed25519 public key to PyPI, and signs the TUF metadata that is generated for the distribution.
Optionally adding a new identity from a second machine, by simply entering a password, in step 3 also generates an encrypted private key file and uploads an Ed25519 public key to PyPI. Separate identities MAY be created to allow a developer, or other project maintainers, to sign releases on multiple machines. An existing verified identity (its public key is contained in project metadata or has been uploaded to PyPI) signs for new identities. By default, project metadata has a signature threshold of "1" and other verified identities may create new releases to satisfy the threshold.
Step 4 uploads the distribution file and TUF metadata to PyPI. The "Snapshot Process" section discusses in detail the procedure followed by developers to upload a distribution to PyPI.
Generation of cryptographic files and signatures is transparent to the developers in the default case: developers need not be aware that packages are automatically signed. However, the signing tools should be flexible; a single project key may also be shared between multiple machines if manual key management is preferred (e.g., ssh-copy-id).
The repository [21] and developer [22] TUF tools currently support all of the recommendations previously mentioned, except for the automated signing solution, which SHOULD be added to Distutils, Twine, and other third-party signing tools. The automated signing solution calls available repository tool functions to sign metadata and to generate the cryptographic key files.
Snapshot Process
The snapshot process is fairly simple and SHOULD be automated. It MUST perform the following steps:
- Keep in memory the latest working set of root, targets, and delegated roles. (Recall that project transaction processes continuously inform the snapshot process about the latest delegated metadata in a concurrency-safe manner.)
- Every minute or so, sign for this latest working set. (The snapshot process will actually sign for a copy of the latest working set, while the latest working set in memory continues to be updated with information communicated by the project transaction processes.)
- Generate and sign new timestamp metadata that vouches for the metadata (root, targets, and delegated roles) generated in the previous step.
- Finally, make available to clients the new timestamp and snapshot metadata representing the latest snapshot.
A claimed or recently-claimed project will need to upload in its transaction to PyPI not just targets (a simple index as well as distributions) but also TUF metadata. The project MAY do so by uploading a ZIP file containing two directories, /metadata/ (containing delegated targets metadata files) and /targets/ (containing targets such as the project simple index and distributions that are signed by the delegated targets metadata).
Whenever the project uploads metadata or targets to PyPI, PyPI SHOULD check the project TUF metadata for at least the following properties:
- A threshold number of the developer keys registered with PyPI by that project MUST have signed the delegated targets metadata file that represents the "root" of targets for that project (e.g., metadata/targets/project.txt).
- The signatures of delegated targets metadata files MUST be valid.
- The delegated targets metadata files MUST NOT have expired.
- The delegated targets metadata MUST be consistent with the targets.
- A delegator MUST NOT delegate targets that were not delegated to itself by another delegator.
- A delegatee MUST NOT sign for targets that were not delegated to itself by a delegator.
If PyPI chooses to check the project TUF metadata, then PyPI MAY choose to reject publishing any set of metadata or targets that do not meet these requirements.
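Several of the checks listed above can be sketched as a validation function. The flat metadata dictionary, its field names, and the path-prefix delegation model are hypothetical simplifications of real TUF delegated-targets metadata, and signature verification itself is assumed to have already happened:

```python
import time

# Illustrative sketch of upload-time checks on project TUF metadata.
def check_project_metadata(md, registered_keyids, now=None):
    """Return a list of policy violations (empty list means acceptable)."""
    now = now if now is not None else time.time()
    errors = []
    # Threshold of registered developer keys must have signed.
    signers = {sig["keyid"] for sig in md["signatures"]}
    if len(signers & set(registered_keyids)) < md["threshold"]:
        errors.append("threshold of registered developer keys not met")
    # Delegated targets metadata must not have expired.
    if md["expires"] <= now:
        errors.append("delegated targets metadata has expired")
    # A delegatee may only sign targets under paths delegated to it.
    for target in md["targets"]:
        if not any(target.startswith(p) for p in md["delegated_paths"]):
            errors.append("target %r not covered by delegation" % target)
    return errors
```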
PyPI MUST enforce access control by ensuring that each project can only write to the TUF metadata for which it is responsible. It MUST do so by ensuring that project transaction processes write to the correct metadata as well as correct locations within those metadata. For example, a project transaction process for an unclaimed project MUST write to the correct target paths in the correct delegated unclaimed metadata for the targets of the project.
On rare occasions, PyPI MAY wish to extend the TUF metadata format for projects in a backward-incompatible manner. Note that PyPI will NOT be able to automatically rewrite existing TUF metadata on behalf of projects in order to upgrade the metadata to the new backward-incompatible format because this would invalidate the signatures of the metadata as signed by developer keys. Instead, package managers SHOULD be written to recognize and handle multiple incompatible versions of TUF metadata so that claimed and recently-claimed projects could be offered a reasonable time to migrate their metadata to newer but backward-incompatible formats.
If PyPI eventually runs out of disk space to produce a new consistent snapshot, then PyPI MAY then use something like a "mark-and-sweep" algorithm to delete sufficiently outdated consistent snapshots. That is, only outdated metadata like timestamp and snapshot that are no longer used are deleted. Specifically, in order to preserve the latest consistent snapshot, PyPI would walk objects -- beginning from the root (timestamp) -- of the latest consistent snapshot, mark all visited objects, and delete all unmarked objects. The last few consistent snapshots may be preserved in a similar fashion. Deleting a consistent snapshot will cause clients to see nothing except HTTP 404 responses to any request for a target of the deleted consistent snapshot. Clients SHOULD then retry (as before) their requests with the latest consistent snapshot.
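The mark-and-sweep cleanup above can be sketched as a reachability walk. The `refs()` adjacency function is a hypothetical stand-in for parsing metadata files for the hashes of the files they reference:

```python
# Sketch of "mark-and-sweep" over consistent snapshots: walk each
# snapshot to preserve, starting from its timestamp (the root of the
# reference graph), mark everything reachable, then report all
# unmarked files as candidates for deletion.

def sweep(all_files, snapshot_roots, refs):
    marked = set()
    stack = list(snapshot_roots)   # timestamps of snapshots to preserve
    while stack:
        f = stack.pop()
        if f in marked:
            continue
        marked.add(f)
        stack.extend(refs(f))      # files this metadata points to
    # Everything not reachable from a preserved snapshot may be deleted.
    return set(all_files) - marked
```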
All package managers that support TUF metadata MUST be modified to download every metadata and target file (except for timestamp metadata) by including, in the request for the file, the cryptographic hash of the file in the filename. Following the filename convention RECOMMENDED in the next subsection, a request for the file at filename.ext will be transformed into the equivalent request for the file at digest.filename.ext.
Finally, PyPI SHOULD use a transaction log [23] to record project transaction processes and queues so that it will be easier to recover from errors after a server failure.
Producing Consistent Snapshots
PyPI is responsible for updating, depending on the project, either the claimed, recently-claimed, or unclaimed metadata and associated delegated metadata. Every project MUST upload its set of metadata and targets in a single transaction. The uploaded set of files is called the "project transaction." How PyPI MAY validate files in a project transaction is discussed in a later section. The focus of this section is on how PyPI will respond to a project transaction.
Every metadata and target file MUST include in its filename the hex digest [24] of its SHA-256 [25] hash, which PyPI may prepend to filenames after the files have been uploaded. For this PEP, it is RECOMMENDED that PyPI adopt a simple convention of the form: digest.filename, where filename is the original filename without a copy of the hash, and digest is the hex digest of the hash.
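The RECOMMENDED digest.filename convention can be sketched with `hashlib`; the helper name below is an illustration, not part of any PyPI API:

```python
import hashlib

# Sketch of the consistent-snapshot filename convention: prepend the
# SHA-256 hex digest of the file's content to its original filename.
def consistent_filename(filename, content):
    digest = hashlib.sha256(content).hexdigest()
    return "%s.%s" % (digest, filename)

name = consistent_filename("foo-1.0.tar.gz", b"distribution bytes")
# The hex digest contains no dots, so the original filename is
# recoverable by stripping the first dot-separated field.
assert name.split(".", 1)[1] == "foo-1.0.tar.gz"
```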
When an unclaimed project uploads a new transaction, a project transaction process MUST add all new targets and relevant delegated unclaimed metadata. The project transaction process MUST inform the snapshot process about new delegated unclaimed metadata.
When a recently-claimed project uploads a new transaction, a project transaction process MUST add all new targets and delegated targets metadata for the project. If the project is new, then the project transaction process MUST also add new recently-claimed metadata with the public keys (which MUST be part of the transaction) for the project. recently-claimed projects have a threshold value of "1" set by the transaction process. Finally, the project transaction process MUST inform the snapshot process about new recently-claimed metadata, as well as the current set of delegated targets metadata for the project.
The transaction process for a claimed project is slightly different in that PyPI administrators periodically move (a manual process that MAY occur every two weeks to a month) projects from the recently-claimed role to the claimed role. (Moving a project from recently-claimed to claimed is a manual process because PyPI administrators have to use an offline key to sign the claimed project's distribution.) A project transaction process MUST then add new recently-claimed and claimed metadata to reflect this migration. As is the case for a recently-claimed project, the project transaction process MUST always add all new targets and delegated targets metadata for the claimed project. Finally, the project transaction process MUST inform the consistent snapshot process about new recently-claimed or claimed metadata, as well as the current set of delegated targets metadata for the project.
Project transaction processes SHOULD be automated, except when PyPI administrators move a project from the recently-claimed role to the claimed role. Project transaction processes MUST also be applied atomically: either all metadata and targets -- or none of them -- are added. The project transaction processes and snapshot process SHOULD work concurrently. Finally, project transaction processes SHOULD keep in memory the latest claimed, recently-claimed, and unclaimed metadata so that they will be correctly updated in new consistent snapshots.
The queue MAY be processed concurrently in order of appearance, provided that the following rules are observed:
- No pair of project transaction processes may concurrently work on the same project.
- No pair of project transaction processes may concurrently work on unclaimed projects that belong to the same delegated unclaimed role.
- No pair of project transaction processes may concurrently work on new recently-claimed projects.
- No pair of project transaction processes may concurrently work on new claimed projects.
- No project transaction process may work on a new claimed project while another project transaction process is working on a new recently-claimed project and vice versa.
These rules MUST be observed to ensure that metadata is not read from or written to inconsistently.
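One way the rules above might be enforced is to derive a set of lock keys for each transaction, allowing two transactions to run concurrently only when their lock-key sets are disjoint. The transaction fields below are hypothetical; rules 3 through 5 are folded into a single shared lock because new recently-claimed and new claimed projects both modify the recently-claimed metadata:

```python
# Illustrative sketch of concurrency control for project transactions.
def lock_keys(tx):
    keys = {("project", tx["project"])}          # rule 1: same project
    if tx["role"] == "unclaimed":
        keys.add(("bin", tx["delegated_bin"]))   # rule 2: same delegated unclaimed role
    if tx["is_new"] and tx["role"] in ("recently-claimed", "claimed"):
        keys.add(("new-claim",))                 # rules 3-5: new (recently-)claimed projects
    return keys

def may_run_concurrently(tx_a, tx_b):
    return not (lock_keys(tx_a) & lock_keys(tx_b))
```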
Auditing Snapshots
If a malicious party compromises PyPI, they can sign arbitrary files with any of the online keys. The roles with offline keys (i.e., root and targets) are still protected. To safely recover from a repository compromise, snapshots should be audited to ensure that files are only restored to trusted versions.
When a repository compromise has been detected, the integrity of three types of information must be validated:
- If the online keys of the repository have been compromised, they can be revoked by having the targets role sign new metadata, delegated to a new key.
- If the role metadata on the repository has been changed, this will impact the metadata that is signed by online keys. Any role information created since the compromise should be discarded. As a result, developers of new projects will need to re-register their projects.
- If the packages themselves may have been tampered with, they can be validated using the stored hash information for packages that existed in trusted metadata before the compromise. Also, new distributions that are signed by developers in the claimed role may be safely retained. However, any distributions signed by developers in the recently-claimed or unclaimed roles should be discarded.
In order to safely restore snapshots in the event of a compromise, PyPI SHOULD maintain a small number of its own mirrors to copy PyPI snapshots according to some schedule. The mirroring protocol can be used immediately for this purpose. The mirrors must be secured and isolated such that they are responsible only for mirroring PyPI. The mirrors can be checked against one another to detect accidental or malicious failures.
Another approach is to periodically generate the cryptographic hash of the snapshot metadata and tweet it. For example, upon seeing the tweet, a user may come forward with the actual metadata, and the repository maintainers are then able to verify the metadata's cryptographic hash. Alternatively, PyPI may periodically archive its own versions of the snapshot metadata rather than rely on externally provided metadata. In this case, PyPI SHOULD take the cryptographic hash of every package on the repository and store this data on an offline device. If any package hash has changed, this indicates an attack has occurred.
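The offline hash archive can be sketched as follows. The in-memory package mapping is hypothetical; a real deployment would hash files streamed from disk and keep the record on the offline device:

```python
import hashlib

# Sketch of the offline hash archive: record a hash of every package,
# store the record offline, and later diff it against a freshly
# computed record to detect tampering.

def hash_packages(packages):
    """packages: mapping of filename -> bytes. Returns filename -> hex digest."""
    return {name: hashlib.sha256(data).hexdigest()
            for name, data in packages.items()}

def detect_tampering(offline_record, current_record):
    """Return the packages whose hash changed or that disappeared."""
    return {name for name, digest in offline_record.items()
            if current_record.get(name) != digest}
```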
Attacks that serve different versions of metadata or that freeze a version of a package at a specific version can be handled by TUF with techniques such as implicit key revocation and metadata mismatch detection [2].
Key Compromise Analysis
This PEP has covered the maximum security model, the TUF roles that should be added to support continuous delivery of distributions, how to generate and sign the metadata of each role, and how to support distributions that have been signed by developers. The remaining sections discuss how PyPI SHOULD audit repository metadata, and the methods PyPI can use to detect and recover from a PyPI compromise.
Table 1 summarizes a few of the attacks possible when a threshold number of private cryptographic keys (belonging to any of the PyPI roles) are compromised. The leftmost column lists the roles (or a combination of roles) that have been compromised, and the columns to the right show whether the compromised roles leave clients susceptible to malicious updates, freeze attacks, or metadata inconsistency attacks.
| Role Compromise | Malicious Updates | Freeze Attack | Metadata Inconsistency Attacks |
|---|---|---|---|
| timestamp | NO snapshot and targets or any of the delegated roles need to cooperate | YES limited by earliest root, targets, or bin metadata expiry time | NO snapshot needs to cooperate |
| snapshot | NO timestamp and targets or any of the delegated roles need to cooperate | NO timestamp needs to cooperate | NO timestamp needs to cooperate |
| timestamp AND snapshot | NO targets or any of the delegated roles need to cooperate | YES limited by earliest root, targets, or bin metadata expiry time | YES limited by earliest root, targets, or bin metadata expiry time |
| targets OR claimed OR recently-claimed OR unclaimed OR project | NO timestamp and snapshot need to cooperate | NOT APPLICABLE need timestamp and snapshot | NOT APPLICABLE need timestamp and snapshot |
| (timestamp AND snapshot) AND project | YES | YES limited by earliest root, targets, or bin metadata expiry time | YES limited by earliest root, targets, or bin metadata expiry time |
| (timestamp AND snapshot) AND (recently-claimed OR unclaimed) | YES but only of projects not delegated by claimed | YES limited by earliest root, targets, claimed, recently-claimed, project, or unclaimed metadata expiry time | YES limited by earliest root, targets, claimed, recently-claimed, project, or unclaimed metadata expiry time |
| (timestamp AND snapshot) AND (targets OR claimed) | YES | YES limited by earliest root, targets, claimed, recently-claimed, project, or unclaimed metadata expiry time | YES limited by earliest root, targets, claimed, recently-claimed, project, or unclaimed metadata expiry time |
| root | YES | YES | YES |
Table 1: Attacks that are possible by compromising certain combinations of role keys. In September 2013 [26], it was shown how the latest version (at the time) of pip was susceptible to these attacks and how TUF could protect users against them [8]. Roles signed by offline keys are in bold.
Note that compromising targets or any delegated role (except for project targets metadata) does not immediately allow an attacker to serve malicious updates. The attacker must also compromise the timestamp and snapshot roles (which are both online and therefore more likely to be compromised). This means that in order to launch any attack, one must not only be able to act as a man-in-the-middle, but also compromise the timestamp key (or compromise the root keys and sign a new timestamp key). To launch any attack other than a freeze attack, one must also compromise the snapshot key. Finally, a compromise of the PyPI infrastructure MAY introduce malicious updates to recently-claimed projects because the keys for these roles are online.
In the Event of a Key Compromise
A key compromise means that a threshold of keys belonging to developers, the roles on PyPI, or the PyPI infrastructure has been compromised and used to sign new metadata on PyPI.
If a threshold number of developer keys of a project have been compromised, the project MUST take the following steps:
- The project metadata and targets MUST be restored to the last known good consistent snapshot where the project was not known to be compromised. This can be done by developers repackaging and resigning all targets with the new keys.
- The project's metadata MUST have its version numbers incremented, expiry times suitably extended, and signatures renewed.
Whereas PyPI MUST take the following steps:
- Revoke the compromised developer keys from the recently-claimed or claimed role. This is done by replacing the compromised developer keys with newly issued developer keys.
- A new timestamped consistent snapshot MUST be issued.
If a threshold number of timestamp, snapshot, recently-claimed, or unclaimed keys have been compromised, then PyPI MUST take the following steps:
- Revoke the timestamp, snapshot, and targets role keys from the root role. This is done by replacing the compromised timestamp, snapshot, and targets keys with newly issued keys.
- Revoke the recently-claimed and unclaimed keys from the targets role by replacing their keys with newly issued keys. Sign the new targets role metadata and discard the new keys (because, as we explained earlier, this increases the security of targets metadata).
- Clear all targets or delegations in the recently-claimed role and delete all associated delegated targets metadata. Recently registered projects SHOULD register their developer keys again with PyPI.
- All targets of the recently-claimed and unclaimed roles SHOULD be compared with the last known good consistent snapshot where none of the timestamp, snapshot, recently-claimed, or unclaimed keys were known to have been compromised. Added, updated, or deleted targets in the compromised consistent snapshot that do not match the last known good consistent snapshot SHOULD be restored to their previous versions. After ensuring the integrity of all unclaimed targets, the unclaimed metadata MUST be regenerated.
- The recently-claimed and unclaimed metadata MUST have their version numbers incremented, expiry times suitably extended, and signatures renewed.
- A new timestamped consistent snapshot MUST be issued.
This would preemptively protect all of these roles even though only one of them may have been compromised.
If a threshold number of the targets or claimed keys have been compromised, then there is little that an attacker would be able to do without the timestamp and snapshot keys. In this case, PyPI MUST simply revoke the compromised targets or claimed keys by replacing them with new keys in the root and targets roles, respectively.
If a threshold number of the timestamp, snapshot, and claimed keys have been compromised, then PyPI MUST take the following steps in addition to the steps taken when either the timestamp or snapshot keys are compromised:
- Revoke the claimed role keys from the targets role and replace them with newly issued keys.
- All project targets of the claimed roles SHOULD be compared with the last known good consistent snapshot where none of the timestamp, snapshot, or claimed keys were known to have been compromised. Added, updated, or deleted targets in the compromised consistent snapshot that do not match the last known good consistent snapshot MAY be restored to their previous versions. After ensuring the integrity of all claimed project targets, the claimed metadata MUST be regenerated.
- The claimed metadata MUST have their version numbers incremented, expiry times suitably extended, and signatures renewed.
Following these steps would preemptively protect all of these roles even though only one of them may have been compromised.
If a threshold number of root keys have been compromised, then PyPI MUST take the steps taken when the targets role has been compromised. All of the root keys must also be replaced.
It is also RECOMMENDED that PyPI sufficiently document compromises with security bulletins. These security bulletins will be most informative when users of pip-with-TUF are unable to install or update a project because the keys for the timestamp, snapshot, or root roles are no longer valid. Users could then visit the PyPI web site to consult security bulletins that would help to explain why users are no longer able to install or update, and then take action accordingly. When a threshold number of root keys have not been revoked due to a compromise, then new root metadata may be safely updated because a threshold number of existing root keys will be used to sign for the integrity of the new root metadata. TUF clients will be able to verify the integrity of the new root metadata with a threshold number of previously known root keys. This will be the common case. In the worst case, where a threshold number of root keys have been revoked due to a compromise, an end-user may choose to update new root metadata with out-of-band [27] mechanisms.
Appendix A: PyPI Build Farm and End-to-End Signing
PyPI administrators intend to support a central build farm. The PyPI build farm will auto-generate a Wheel [28] on PyPI infrastructure, for each supported platform, for every distribution uploaded by developers. Package managers will likely install projects by downloading these PyPI Wheels (which can be installed much faster than source distributions) rather than the source distributions signed by developers. The implications of having a central build farm with end-to-end signing SHOULD be investigated before the maximum security model is implemented.
An issue with a central build farm and end-to-end signing is that developers are unlikely to sign Wheel distributions once they have been generated on PyPI infrastructure. However, generating wheels from source distributions that are signed by developers can still be beneficial, provided that building Wheels is a deterministic process. If deterministic builds are infeasible, developers may delegate trust of these wheels to a PyPI role that signs for wheels with an online key.
References
| [1] | (1, 2, 3) https://www.python.org/dev/peps/pep-0458/ |
| [2] | (1, 2, 3) https://isis.poly.edu/~jcappos/papers/samuel_tuf_ccs_2010.pdf |
| [3] | (1, 2) https://github.com/theupdateframework/tuf/blob/develop/docs/tuf-spec.txt |
| [4] | (1, 2, 3) http://www.python.org/dev/peps/pep-0426/ |
| [5] | https://github.com/theupdateframework/pip/wiki/Attacks-on-software-repositories |
| [6] | https://mail.python.org/pipermail/distutils-sig/2013-September/022773.html |
| [7] | https://isis.poly.edu/~jcappos/papers/cappos_mirror_ccs_08.pdf |
| [8] | https://mail.python.org/pipermail/distutils-sig/2013-September/022755.html |
| [9] | https://pypi.python.org/security |
| [10] | https://mail.python.org/pipermail/distutils-sig/2013-August/022154.html |
| [11] | https://en.wikipedia.org/wiki/RSA_%28algorithm%29 |
| [12] | http://ed25519.cr.yp.to/ |
| [13] | http://www.ietf.org/rfc/rfc2119.txt |
| [14] | http://ed25519.cr.yp.to/ |
| [15] | https://github.com/pyca/ed25519 |
| [16] | https://github.com/kaepora/miniLock#-minilock |
| [17] | https://github.com/pypa/twine |
| [18] | https://docs.python.org/2/distutils/index.html#distutils-index |
| [19] | https://en.wikipedia.org/wiki/Transparency_%28human%E2%80%93computer_interaction%29 |
| [20] | https://pypi.python.org/pypi?:action=register_form |
| [21] | https://github.com/theupdateframework/tuf/blob/develop/tuf/README.md |
| [22] | https://github.com/theupdateframework/tuf/blob/develop/tuf/README-developer-tools.md |
| [23] | https://en.wikipedia.org/wiki/Transaction_log |
| [24] | http://docs.python.org/2/library/hashlib.html#hashlib.hash.hexdigest |
| [25] | https://en.wikipedia.org/wiki/SHA-2 |
| [26] | https://mail.python.org/pipermail/distutils-sig/2013-September/022755.html |
| [27] | https://en.wikipedia.org/wiki/Out-of-band#Authentication |
| [28] | http://wheel.readthedocs.org/en/latest/ |
Acknowledgements
This material is based upon work supported by the National Science Foundation under Grants No. CNS-1345049 and CNS-0959138. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
We thank Nick Coghlan, Daniel Holth and the distutils-sig community in general for helping us to think about how to usably and efficiently integrate TUF with PyPI.
Roger Dingledine, Sebastian Hahn, Nick Mathewson, Martin Peck and Justin Samuel helped us to design TUF from its predecessor Thandy of the Tor project.
We appreciate the efforts of Konstantin Andrianov, Geremy Condra, Zane Fisher, Justin Samuel, Tian Tian, Santiago Torres, John Ward, and Yuyu Zheng to develop TUF.
Copyright
This document has been placed in the public domain.
pep-0481 Migrate CPython to Git, Github, and Phabricator
| PEP: | 481 |
|---|---|
| Title: | Migrate CPython to Git, Github, and Phabricator |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Donald Stufft <donald at stufft.io> |
| Status: | Draft |
| Type: | Process |
| Content-Type: | text/x-rst |
| Created: | 29-Nov-2014 |
| Post-History: | 29-Nov-2014 |
Contents
Abstract
This PEP proposes migrating the repository hosting of CPython and the supporting repositories to Git and GitHub. It also proposes adding Phabricator as an alternative to GitHub Pull Requests to handle reviewing changes. This particular PEP is offered as an alternative to PEP 474 and PEP 462, which aim to achieve the same overall benefits but restrict themselves to tools that support Mercurial and are completely Open Source.
Rationale
CPython is an open source project which relies on a number of volunteers donating their time. As an open source project it relies on attracting new volunteers as well as retaining existing ones in order to continue to have a healthy amount of manpower available. In addition to increasing the amount of manpower that is available to the project, it also needs to allow for effective use of what manpower is available.
The current toolchain of the CPython project is a custom and unique combination of tools which mandates a workflow that is similar to one found in a lot of older projects, but which is becoming less and less popular as time goes on.
The one-off nature of the CPython toolchain and workflow means that any new contributor is going to need to spend time learning the tools and workflow before they can start contributing to CPython. Once a new contributor goes through the process of learning the CPython workflow, they are also unlikely to be able to take that knowledge and apply it to future projects they wish to contribute to. This acts as a barrier to contribution which will scare off potential new contributors.
In addition, the tooling that CPython uses is under-maintained and antiquated, and it lacks important features that would enable committers to use their time more effectively when reviewing and approving changes. Because it is under-maintained, bugs are likely to last longer, if they ever get fixed, and the services are more likely to go down for extended periods of time. Because it is antiquated, it does not effectively harness the capabilities of the modern web platform. Finally, because it lacks several important features, such as pre-testing of commits and an automatic merge tool, committers have to do needless busy work to commit even the simplest of changes.
Version Control System
The first decision that needs to be made is the VCS of the primary server side repository. Currently the CPython repository, as well as a number of supporting repositories, uses Mercurial. When evaluating the VCS we must consider the capabilities of the VCS itself as well as the network effect and mindshare of the community around that VCS.
There are really only two options for this: Mercurial and Git. Between the two of them the technical capabilities are largely equivalent. For this reason this PEP will largely ignore the technical arguments about the VCS itself and will instead focus on the social aspects.
It is not possible to get exact numbers for the number of projects or people which are using a particular VCS, however we can infer this by looking at several sources of information for what VCS projects are using.
The Open Hub (previously Ohloh) statistics [1] show that 37% of the repositories indexed by The Open Hub are using Git (second only to SVN which has 48%) while Mercurial has just 2% (beating only bazaar which has 1%). This has Git being just over 18 times as popular as Mercurial on The Open Hub.
Another source of information on the popularity of the different VCSs is PyPI itself. This source is more targeted at the Python community since it represents projects developed for Python. Unfortunately PyPI does not have a standard location for representing this information, so this requires manual processing. If we limit our search to the top 100 projects on PyPI (ordered by download counts), we can see that 62% of them use Git, 22% use Mercurial, and 13% use something else. This has Git being just under 3 times as popular as Mercurial for the top 100 projects on PyPI.
Obviously from these numbers Git is by far the more popular DVCS for open source projects and choosing the more popular VCS has a number of positive benefits.
For new contributors it increases the likelihood that they will have already learned the basics of Git as part of working with another project, or, if they are just now learning Git, that they will be able to take that knowledge and apply it to other projects. Additionally, a larger community means more people writing how-to guides, answering questions, and writing articles about Git, which makes it easier for a new user to find answers and information about the tool they are trying to learn.
Another benefit is that by nature of having a larger community, there will be more tooling written around it. This increases options for everything from GUI clients, helper scripts, repository hosting, etc.
Repository Hosting
This PEP proposes allowing GitHub Pull Requests to be submitted, however GitHub does not have a way to submit Pull Requests against a repository that is not hosted on GitHub. This PEP also proposes that in addition to GitHub Pull Requests Phabricator's Differential app can also be used to submit proposed changes and Phabricator does allow submitting changes against a repository that is not hosted on Phabricator.
For this reason this PEP proposes using GitHub as the canonical location of the repository with a read-only mirror located in Phabricator. If at some point in the future GitHub is no longer desired, then repository hosting can easily be moved to solely in Phabricator and the ability to accept GitHub Pull Requests dropped.
In addition to hosting the repositories on Github, a read only copy of all repositories will also be mirrored onto the PSF Infrastructure.
Code Review
Currently CPython uses a custom fork of Rietveld, modified to not run on Google App Engine, which currently can be maintained by only one person. In addition, it is missing features that are present in many modern code review tools.
This PEP proposes allowing both GitHub Pull Requests and Phabricator revisions for proposing and reviewing changes. It suggests both so that contributors can select whichever tool best enables them to submit changes, and reviewers can focus on reviewing changes in the tooling they like best.
GitHub Pull Requests
GitHub is a very popular code hosting site and is increasingly becoming the primary place people look when contributing to a project. Enabling users to contribute through GitHub means contributors can use tooling they are likely already familiar with; if they are not, the skills they learn are likely to transfer to other projects.
GitHub Pull Requests have a fairly major advantage over the older "submit a patch to a bug tracker" model. It allows developers to work completely within their VCS using standard VCS tooling so it does not require creating a patch file and figuring out what the right location is to upload it to. This lowers the barrier for sending a change to be reviewed.
On the reviewing side, GitHub Pull Requests are far easier to review: they have nice syntax-highlighted diffs which can operate in either unified or side-by-side views. They allow expanding the context on a diff up to and including the entire file. Finally, they allow commenting inline and on the pull request as a whole, and they present that in a nice unified way which will also hide comments that no longer apply. GitHub also provides a "rendered diff" view which enables easily viewing a diff of rendered markup (such as reStructuredText) instead of needing to review the diff of the raw markup.
The Pull Request work flow also makes it trivial to enable the ability to pre-test a change before actually merging it. Any particular pull request can have any number of different types of "commit statuses" applied to it, marking the commit (and thus the pull request) as either in a pending, successful, errored, or failure state. This makes it easy to see inline if the pull request is passing all of the tests, if the contributor has signed a CLA, etc.
Actually merging a GitHub Pull Request is quite simple: a core reviewer simply needs to press the "Merge" button once the status of all the checks on the Pull Request is green for successful.
GitHub also has a good workflow for submitting pull requests to a project completely through their web interface. This would enable the Python documentation to have "Edit on GitHub" buttons on every page and people who discover things like typos, inaccuracies, or just want to make improvements to the docs they are currently writing can simply hit that button and get an in browser editor that will let them make changes and submit a pull request all from the comfort of their browser.
Phabricator
In addition to GitHub Pull Requests this PEP also proposes setting up a Phabricator instance and pointing it at the GitHub hosted repositories. This will allow utilizing the Phabricator review applications of Differential and Audit.
Differential functions similarly to GitHub Pull Requests, except that it requires installing the arc command line tool to upload patches to Phabricator.
Whether to enable Phabricator for any particular repository can be chosen on a case-by-case basis. This PEP only proposes that it must be enabled for the CPython repository; for smaller repositories such as the PEP repository it may not be worth the effort.
Criticism
X is not written in Python
One feature of the current tooling (Mercurial, Rietveld) is that all of the pieces are written primarily in Python. It is this PEP's belief that we should focus on the best tools for the job, not the best tools that happen to be written in Python. Volunteer time is a precious resource for any open source project, and we can best respect and utilize that time by focusing on the benefits and downsides of the tools themselves rather than on what language their authors happened to write them in.
One concern is the ability to modify the tools to work for us; however, one of the goals here is to avoid modifying software to work for us and instead adapt ourselves to a more standard workflow. This standardization pays off in the ability to re-use tools out of the box, freeing up developer time to actually work on Python itself, as well as enabling knowledge sharing between projects.
However, if we do need to modify the tooling, Git itself is largely written in C, the same as CPython. It can also have commands written for it in any language, including Python. Phabricator is written in PHP, which is a fairly common language in the web world and fairly easy to pick up. GitHub itself is largely written in Ruby, but given that it is not Open Source there is no ability to modify it, so its implementation language is completely meaningless.
GitHub is not Free/Open Source
GitHub is a big part of this proposal, and someone who tends more toward ideology than practicality may be opposed to this PEP on those grounds alone. It is this PEP's belief that, while using entirely Free/Open Source software is an attractive idea and a noble goal, valuing the time of the contributors, by giving them good tooling that is well maintained and that they either already know or can apply to other projects once learned, is a more important concern than treating Free/Open Source as a hard requirement.
However, history has shown us that sometimes benevolent proprietary companies can stop being benevolent. This is hedged against in a few ways:
- We are not utilizing the GitHub Issue Tracker, both because it is not powerful enough for CPython and because, for the primary CPython repository, the ability to take our issues and put them somewhere else if we ever need to leave GitHub relies on GitHub continuing to allow API access.
- We are utilizing the GitHub Pull Request workflow, however all of those changes live inside of Git. So a mirror of the GitHub repositories can easily contain all of those Pull Requests. We would potentially lose any comments if GitHub suddenly turned "evil", but the changes themselves would still exist.
- We are utilizing the GitHub repository hosting feature; however, since this is just Git, moving away from GitHub is as simple as pushing the repository to a different location. Data portability for the repository itself is extremely high.
- We are also utilizing Phabricator to provide an alternative for people who do not wish to use GitHub. This also acts as a fallback option which will already be in place if we ever need to stop using GitHub.
Relying on GitHub comes with a number of benefits beyond just the benefits of the platform itself. Since it is a commercially backed venture it has a full time staff responsible for maintaining its services. This includes making sure they stay up, making sure they stay patched for various security vulnerabilities, and further improving the software and infrastructure as time goes on.
Mercurial is better than Git
Whether Mercurial or Git is better on a technical level is a highly subjective opinion. This PEP does not state whether the mechanics of Git or Mercurial are better, and instead focuses on the network effect available with either option. Since this PEP proposes switching to Git, this leaves out people who prefer Mercurial; however, those users can easily continue to work with Mercurial by using the hg-git [2] extension, which lets Mercurial work with a repository that is Git on the server side.
CPython Workflow is too Complicated
One sentiment that came out of previous discussions was that the multi-branch model of CPython is too complicated for the GitHub Pull Request workflow. It is the belief of this PEP that this statement is not accurate.
Currently, any particular change requires manually creating a patch for both 2.7 and 3.x; this will not change at all in that regard.
If someone submits a fix for the current stable branch (currently 3.4), the GitHub Pull Request workflow can be used to create, in the browser, a Pull Request to merge the current stable branch into the master branch (assuming there are no merge conflicts). If there is a merge conflict, it would need to be handled locally. This still provides an improvement over the current situation, where the merge must always happen locally.
Finally, if someone submits a fix for the current development branch, it has to be manually applied to the stable branch if it is desired to include it there as well. This must happen locally in the new workflow too; however, for minor changes it could easily be accomplished in the GitHub web editor.
Looking at this, I do not believe that any system can hide the complexities involved in maintaining several long running branches. The only thing that the tooling can do is make it as easy as possible to submit changes.
Example: Scientific Python
One of the key ideas behind the move to both Git and GitHub is that a feature of a DVCS, the repository hosting, and the workflow used is the social network and size of the community using said tools. We can see this is true by looking at an example from a sub-community of the Python community: the Scientific Python community. They have already migrated most of the key pieces of the SciPy stack onto GitHub using the Pull Request based workflow. This process started with IPython, and as more projects moved over it became a natural default for new projects in the community.
They claim to have seen a great benefit from this move, in that it enables casual contributors to easily move between different projects within their sub-community without having to learn a special, bespoke workflow and a different toolchain for each project. They've found that when people can spend their limited time actually contributing instead of learning different tools and workflows, not only do they contribute more to one project, but they also expand out and contribute to other projects. This move has also been credited with an increased tendency for members of that community to go so far as publishing their research and educational materials on GitHub as well.
This example showcases the real power behind moving to a highly popular toolchain and workflow, as each variance introduces yet another hurdle for new and casual contributors to get past and it makes the time spent learning that workflow less reusable with other projects.
References
| [1] | Open Hub Statistics <https://www.openhub.net/repositories/compare> |
| [2] | Hg-Git mercurial plugin <https://hg-git.github.io/> |
Copyright
This document has been placed in the public domain.
pep-0482 Literature Overview for Type Hints
| PEP: | 482 |
|---|---|
| Title: | Literature Overview for Type Hints |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Łukasz Langa <lukasz at langa.pl> |
| Discussions-To: | Python-Ideas <python-ideas at python.org> |
| Status: | Draft |
| Type: | Informational |
| Content-Type: | text/x-rst |
| Created: | 08-Jan-2015 |
| Post-History: | |
| Resolution: |
Contents
Abstract
This PEP is one of three related to type hinting. This PEP gives a literature overview of related work. The main spec is PEP 484.
Existing Approaches for Python
Reticulated Python
Reticulated Python [reticulated] by Michael Vitousek is an example of a slightly different approach to gradual typing for Python. It is described in an actual academic paper [reticulated-paper] written by Vitousek with Jeremy Siek and Jim Baker (the latter of Jython fame).
PyCharm
PyCharm by JetBrains has been providing a way to specify and check types for about four years. The type system suggested by PyCharm [pycharm] grew from simple class types to tuple types, generic types, function types, etc. based on feedback of many users who shared their experience of using type hints in their code.
Others
TBD: Add sections on pyflakes [pyflakes], pylint [pylint], numpy [numpy], Argument Clinic [argumentclinic], pytypedecl [pytypedecl], numba [numba], obiwan [obiwan].
Existing Approaches in Other Languages
ActionScript
ActionScript [actionscript] is a class-based, single inheritance, object-oriented superset of ECMAScript. It supports interfaces and strong runtime-checked static typing. Compilation supports a “strict dialect” where type mismatches are reported at compile-time.
Example code with types:
package {
    import flash.events.Event;

    public class BounceEvent extends Event {
        public static const BOUNCE:String = "bounce";
        private var _side:String = "none";

        public function get side():String {
            return _side;
        }

        public function BounceEvent(type:String, side:String){
            super(type, true);
            _side = side;
        }

        public override function clone():Event {
            return new BounceEvent(type, _side);
        }
    }
}
Dart
Dart [dart] is a class-based, single inheritance, object-oriented language with C-style syntax. It supports interfaces, abstract classes, reified generics, and optional typing.
Types are inferred when possible. The runtime differentiates between two modes of execution: checked mode, aimed at development (catching type errors at runtime), and production mode, recommended for fast execution (ignoring types and asserts).
Example code with types:
class Point {
    final num x, y;

    Point(this.x, this.y);

    num distanceTo(Point other) {
        var dx = x - other.x;
        var dy = y - other.y;
        return math.sqrt(dx * dx + dy * dy);
    }
}
Hack
Hack [hack] is a programming language that interoperates seamlessly with PHP. It provides opt-in static type checking, type aliasing, generics, nullable types, and lambdas.
Example code with types:
<?hh
class MyClass {
    private ?string $x = null;

    public function alpha(): int {
        return 1;
    }

    public function beta(): string {
        return 'hi test';
    }
}

function f(MyClass $my_inst): string {
    // Will generate a hh_client error
    return $my_inst->alpha();
}
TypeScript
TypeScript [typescript] is a typed superset of JavaScript that adds interfaces, classes, mixins and modules to the language.
Type checks are duck typed. Multiple valid function signatures are specified by supplying overloaded function declarations. Functions and classes can use generics as type parametrization. Interfaces can have optional fields. Interfaces can specify array and dictionary types. Classes can have constructors that implicitly add arguments as fields. Classes can have static fields. Classes can have private fields. Classes can have getters/setters for fields (like property). Types are inferred.
Example code with types:
interface Drivable {
    start(): void;
    drive(distance: number): boolean;
    getPosition(): number;
}

class Car implements Drivable {
    private _isRunning: boolean;
    private _distanceFromStart: number;

    constructor() {
        this._isRunning = false;
        this._distanceFromStart = 0;
    }

    public start() {
        this._isRunning = true;
    }

    public drive(distance: number): boolean {
        if (this._isRunning) {
            this._distanceFromStart += distance;
            return true;
        }
        return false;
    }

    public getPosition(): number {
        return this._distanceFromStart;
    }
}
References
| [mypy] | http://mypy-lang.org |
| [reticulated] | https://github.com/mvitousek/reticulated |
| [reticulated-paper] | http://wphomes.soic.indiana.edu/jsiek/files/2014/03/retic-python.pdf |
| [pycharm] | https://github.com/JetBrains/python-skeletons#types |
| [obiwan] | http://pypi.python.org/pypi/obiwan |
| [numba] | http://numba.pydata.org |
| [pytypedecl] | https://github.com/google/pytypedecl |
| [argumentclinic] | https://docs.python.org/3/howto/clinic.html |
| [numpy] | http://www.numpy.org |
| [typescript] | http://www.typescriptlang.org |
| [hack] | http://hacklang.org |
| [dart] | https://www.dartlang.org |
| [actionscript] | http://livedocs.adobe.com/specs/actionscript/3/ |
| [pyflakes] | https://github.com/pyflakes/pyflakes/ |
| [pylint] | http://www.pylint.org |
Copyright
This document has been placed in the public domain.
pep-0483 The Theory of Type Hints
| PEP: | 483 |
|---|---|
| Title: | The Theory of Type Hints |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Guido van Rossum <guido at python.org> |
| Discussions-To: | Python-Ideas <python-ideas at python.org> |
| Status: | Draft |
| Type: | Informational |
| Content-Type: | text/x-rst |
| Created: | 19-Dec-2014 |
| Post-History: | |
| Resolution: |
Contents
Introduction
This document lays out the theory of the new type hinting proposal for Python 3.5. It's not quite a full proposal or specification because there are many details that need to be worked out, but it lays out the theory without which it is hard to discuss more detailed specifications. We start by explaining gradual typing; then we state some conventions and general rules; then we define the new special types (such as Union) that can be used in annotations; and finally we define the approach to generic types. (TODO: The latter section needs more fleshing out; sorry!)
Specification
Summary of gradual typing
We define a new relationship, is-consistent-with, which is similar to is-subclass-of, except it is not transitive when the new type Any is involved. (Neither relationship is symmetric.) Assigning x to y is OK if the type of x is consistent with the type of y. (Compare this to "... if the type of x is a subclass of the type of y," which states one of the fundamentals of OO programming.) The is-consistent-with relationship is defined by three rules:
- A type t1 is consistent with a type t2 if t1 is a subclass of t2. (But not the other way around.)
- Any is consistent with every type. (But Any is not a subclass of every type.)
- Every type is a subclass of Any. (Which also makes every type consistent with Any, via rule 1.)
That's all! See Jeremy Siek's blog post What is Gradual Typing for a longer explanation and motivation. Note that rule 3 places Any at the root of the class graph. This makes it very similar to object. The difference is that object is not consistent with most types (e.g. you can't use an object() instance where an int is expected). IOW both Any and object mean "any type is allowed" when used to annotate an argument, but only Any can be passed no matter what type is expected (in essence, Any shuts up complaints from the static checker).
Here's an example showing how these rules work out in practice:
Say we have an Employee class, and a subclass Manager:
- class Employee: ...
- class Manager(Employee): ...
Let's say variable e is declared with type Employee:
- e = Employee() # type: Employee
Now it's okay to assign a Manager instance to e (rule 1):
- e = Manager()
It's not okay to assign an Employee instance to a variable declared with type Manager:
- m = Manager() # type: Manager
- m = Employee() # Fails static check
However, suppose we have a variable whose type is Any:
- a = some_func() # type: Any
Now it's okay to assign a to e (rule 2):
- e = a # OK
Of course it's also okay to assign e to a (rule 3), but we didn't need the concept of consistency for that:
- a = e # OK
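The three rules can also be modeled as a small runtime sketch. This is illustrative only: AnyType here is a hypothetical stand-in for Any (the real Any is a special form in the typing module, not an ordinary class, and real consistency checking is done statically, not at runtime):

```python
# Toy model of the is-consistent-with relationship, under the stated assumptions.
class AnyType: ...  # hypothetical stand-in for typing.Any

def is_consistent_with(t1: type, t2: type) -> bool:
    # Rule 2: Any is consistent with every type.
    # Rule 3: every type is a subclass of Any, hence consistent with it.
    if t1 is AnyType or t2 is AnyType:
        return True
    # Rule 1: a subclass is consistent with its superclass (not vice versa).
    return issubclass(t1, t2)

class Employee: ...
class Manager(Employee): ...

assert is_consistent_with(Manager, Employee)      # rule 1
assert not is_consistent_with(Employee, Manager)  # not symmetric
assert is_consistent_with(AnyType, Employee)      # rule 2
assert is_consistent_with(Employee, AnyType)      # rule 3
```

Note how the first check mirrors the e = Manager() assignment above, and the last two mirror the assignments involving a.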
Notational conventions
- t1, t2 etc. and u1, u2 etc. are types or classes. Sometimes we write ti or tj to refer to "any of t1, t2, etc."
- X, Y etc. are type variables (defined with TypeVar(), see below).
- C, D etc. are classes defined with a class statement.
- x, y etc. are objects or instances.
- We use the terms type and class interchangeably. Note that PEP 484 makes a distinction (a type is a concept for the type checker, while a class is a runtime concept). In this PEP we're only interested in the types anyway, and if this bothers you, you can reinterpret this PEP with every occurrence of "class" replaced by "type".
General rules
- Instance-ness is derived from class-ness, e.g. x is an instance of t1 if the type of x is a subclass of t1.
- No types defined below (i.e. Any, Union etc.) can be instantiated. (But non-abstract subclasses of Generic can be.)
- No types defined below can be subclassed, except for Generic and classes derived from it.
- Where a type is expected, None can be substituted for type(None); e.g. Union[t1, None] == Union[t1, type(None)].
Types
- Any. Every class is a subclass of Any; however, to the static type checker it is also consistent with every class (see above).
- Union[t1, t2, ...]. Classes that are subclasses of at least one of t1 etc. are subclasses of this. So are unions whose components are all subclasses of t1 etc. (Example: Union[int, str] is a subclass of Union[int, float, str].) The order of the arguments doesn't matter. (Example: Union[int, str] == Union[str, int].) If ti is itself a Union the result is flattened. (Example: Union[int, Union[float, str]] == Union[int, float, str].) If ti and tj have a subclass relationship, the less specific type survives. (Example: Union[Employee, Manager] == Union[Employee].) Union[t1] returns just t1. Union[] is illegal, so is Union[()]. Corollary: Union[..., Any, ...] returns Any; Union[..., object, ...] returns object; to cut a tie, Union[Any, object] == Union[object, Any] == Any.
- Optional[t1]. Alias for Union[t1, None], i.e. Union[t1, type(None)].
- Tuple[t1, t2, ..., tn]. A tuple whose items are instances of t1 etc. Example: Tuple[int, float] means a tuple of two items, the first is an int, the second a float; e.g., (42, 3.14). Tuple[u1, u2, ..., um] is a subclass of Tuple[t1, t2, ..., tn] if they have the same length (n==m) and each ui is a subclass of ti. To spell the type of the empty tuple, use Tuple[()]. A variadic homogeneous tuple type can be written Tuple[t1, ...]. (That's three dots, a literal ellipsis; and yes, that's a valid token in Python's syntax.)
- Callable[[t1, t2, ..., tn], tr]. A function with positional argument types t1 etc., and return type tr. The argument list may be empty (n==0). There is no way to indicate optional or keyword arguments, nor varargs, but you can say the argument list is entirely unchecked by writing Callable[..., tr] (again, a literal ellipsis). This is covariant in the return type, but contravariant in the arguments. "Covariant" here means that for two callable types that differ only in the return type, the subclass relationship for the callable types follows that of the return types. (Example: Callable[[], Manager] is a subclass of Callable[[], Employee].) "Contravariant" here means that for two callable types that differ only in the type of one argument, the subclass relationship for the callable types goes in the opposite direction as for the argument types. (Example: Callable[[Employee], None] is a subclass of Callable[[Manager], None]. Yes, you read that right.)
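Several of the algebraic rules for Union above are observable at runtime in the typing module itself, namely flattening, order-insensitivity, and collapsing of a single argument. (The subclass-collapsing rule is applied only by static checkers, so it is not demonstrated here.)

```python
from typing import Optional, Union

# Nested unions are flattened.
assert Union[int, Union[float, str]] == Union[int, float, str]

# Argument order does not matter.
assert Union[int, str] == Union[str, int]

# A union of one argument is just that argument.
assert Union[int] is int

# Optional[t1] is an alias for Union[t1, None].
assert Optional[int] == Union[int, None]
```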
We might add:
- Intersection[t1, t2, ...]. Classes that are subclasses of each of t1, etc. are subclasses of this. (Compare to Union, which has at least one instead of each in its definition.) The order of the arguments doesn't matter. Nested intersections are flattened, e.g. Intersection[int, Intersection[float, str]] == Intersection[int, float, str]. An intersection of fewer types is a subclass of an intersection of more types, e.g. Intersection[int, str] is a subclass of Intersection[int, float, str]. An intersection of one argument is just that argument, e.g. Intersection[int] is int. When arguments have a subclass relationship, the more specific class survives, e.g. Intersection[str, Employee, Manager] is Intersection[str, Manager]. Intersection[] is illegal, so is Intersection[()]. Corollary: Any disappears from the argument list, e.g. Intersection[int, str, Any] == Intersection[int, str]. Intersection[Any, object] is object. The interaction between Intersection and Union is complex but should be no surprise if you understand the interaction between intersections and unions in set theory (note that sets of types can be infinite in size, since there is no limit on the number of new subclasses).
Pragmatics
Some things are irrelevant to the theory but make practical use more convenient. (This is not a full list; I probably missed a few and some are still controversial or not fully specified.)
- Type aliases, e.g.
- Point = Tuple[float, float]
- def distance(p: Point) -> float: ...
- Forward references via strings, e.g.
- class C:
- def compare(self, other: 'C') -> int: ...
- If a default of None is specified, the type is implicitly Optional, e.g.
- def get(key: KT, default: VT = None) -> VT: ...
- Don't use dynamic type expressions; use builtins and imported types only. No 'if'.
- def display(message: str if WINDOWS else bytes): # NOT OK
- Type declaration in comments, e.g.
- x = [] # type: Sequence[int]
- Type declarations using Undefined, e.g.
- x = Undefined(str)
- Casts using cast(T, x), e.g.
- x = cast(Any, frobozz())
- Other things, e.g. overloading and stub modules; best left to an actual PEP.
Generic types
(TODO: Explain more. See also the mypy docs on generics.)
- X = TypeVar('X'). Declares a unique type variable. The name must match the variable name.
- Y = TypeVar('Y', t1, t2, ...). Ditto, constrained to t1 etc. Behaves like Union[t1, t2, ...] for most purposes, but when used as a type variable, subclasses of t1 etc. are replaced by the most-derived base class among t1 etc.
- Example of constrained type variables:
- AnyStr = TypeVar('AnyStr', str, bytes)
- def longest(a: AnyStr, b: AnyStr) -> AnyStr:
- return a if len(a) >= len(b) else b
- x = longest('a', 'abc') # The inferred type for x is str
- y = longest('a', b'abc') # Fails static type check
- In this example, both arguments to longest() must have the same type (str or bytes), and moreover, even if the arguments are instances of a common str subclass, the return type is still str, not that subclass (see next example).
- For comparison, if the type variable was unconstrained, the common subclass would be chosen as the return type, e.g.:
- S = TypeVar('S')
- def longest(a: S, b: S) -> S:
- return a if len(a) >= len(b) else b
- class MyStr(str): ...
- x = longest(MyStr('a'), MyStr('abc'))
- The inferred type of x is MyStr (whereas in the AnyStr example it would be str).
- Also for comparison, if a Union is used, the return type also has to be a Union:
- U = Union[str, bytes]
- def longest(a: U, b: U) -> U:
- return a if len(a) >= len(b) else b
- x = longest('a', 'abc')
- The inferred type of x is still Union[str, bytes], even though both arguments are str.
- class C(Generic[X, Y, ...]): ... Define a generic class C over type variables X etc. C itself becomes parameterizable, e.g. C[int, str, ...] is a specific class with substitutions X->int etc.
- TODO: Explain use of generic types in function signatures. E.g. Sequence[X], Sequence[int], Sequence[Tuple[X, Y, Z]], and mixtures. Think about co*variance. No gimmicks like deriving from Sequence[Union[int, str]] or Sequence[Union[int, X]].
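As a small runnable illustration of the TypeVar forms above: the constraints are recorded on the type variable object itself, and at runtime nothing is enforced (the restriction to str or bytes is applied only by a static checker).

```python
from typing import TypeVar

# Constrained type variable, as in the AnyStr example above.
AnyStr = TypeVar('AnyStr', str, bytes)
assert AnyStr.__constraints__ == (str, bytes)

# Unconstrained type variable.
S = TypeVar('S')
assert S.__constraints__ == ()

# At runtime the annotations restrict nothing; mixing str and bytes
# here would only be flagged by a static checker.
def longest(a: AnyStr, b: AnyStr) -> AnyStr:
    return a if len(a) >= len(b) else b

assert longest('a', 'abc') == 'abc'
```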
Predefined generic types and Protocols in typing.py
(See also the typing.py module.)
- Everything from collections.abc (but Set renamed to AbstractSet).
- Dict, List, Set, FrozenSet, a few more.
- re.Pattern[AnyStr], re.Match[AnyStr].
- IO[AnyStr], TextIO ~ IO[str], BinaryIO ~ IO[bytes].
Copyright
This document is licensed under the Open Publication License [1].
pep-0484 Type Hints
| PEP: | 484 |
|---|---|
| Title: | Type Hints |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Guido van Rossum <guido at python.org>, Jukka Lehtosalo <jukka.lehtosalo at iki.fi>, Ĺukasz Langa <lukasz at langa.pl> |
| BDFL-Delegate: | Mark Shannon |
| Discussions-To: | Python-Dev <python-dev at python.org> |
| Status: | Accepted |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 29-Sep-2014 |
| Post-History: | 16-Jan-2015,20-Mar-2015,17-Apr-2015,20-May-2015,22-May-2015 |
| Resolution: | https://mail.python.org/pipermail/python-dev/2015-May/140104.html |
Contents
- Abstract
- Rationale and Goals
- The meaning of annotations
- Type Definition Syntax
- Acceptable type hints
- Using None
- Type aliases
- Callable
- Generics
- User-defined generic types
- Instantiating generic classes and type erasure
- Arbitrary generic types as base classes
- Abstract generic types
- Type variables with an upper bound
- Covariance and contravariance
- The numeric tower
- The bytes types
- Forward references
- Union types
- The Any type
- Version and platform checking
- Default argument values
- Compatibility with other uses of function annotations
- Type comments
- Casts
- Stub Files
- Exceptions
- The typing Module
- Rejected Alternatives
- PEP Development Process
- Acknowledgements
- References
- Copyright
Abstract
PEP 3107 introduced syntax for function annotations, but the semantics were deliberately left undefined. There has now been enough 3rd party usage for static type analysis that the community would benefit from a standard vocabulary and baseline tools within the standard library.
This PEP introduces a provisional module to provide these standard definitions and tools, along with some conventions for situations where annotations are not available.
Note that this PEP still explicitly does NOT prevent other uses of annotations, nor does it require (or forbid) any particular processing of annotations, even when they conform to this specification. It simply enables better coordination, as PEP 333 did for web frameworks.
For example, here is a simple function whose argument and return type are declared in the annotations:
def greeting(name: str) -> str:
    return 'Hello ' + name
While these annotations are available at runtime through the usual __annotations__ attribute, no type checking happens at runtime. Instead, the proposal assumes the existence of a separate off-line type checker which users can run over their source code voluntarily. Essentially, such a type checker acts as a very powerful linter. (While it would of course be possible for individual users to employ a similar checker at run time for Design By Contract enforcement or JIT optimization, those tools are not yet as mature.)
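For instance, the annotations on greeting are ordinary runtime metadata; nothing enforces them when the function is called:

```python
def greeting(name: str) -> str:
    return 'Hello ' + name

# Annotations are stored at runtime, not enforced.
assert greeting.__annotations__ == {'name': str, 'return': str}
assert greeting('world') == 'Hello world'
```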
The proposal is strongly inspired by mypy [mypy]. For example, the type "sequence of integers" can be written as Sequence[int]. The square brackets mean that no new syntax needs to be added to the language. The example here uses a custom type Sequence, imported from a pure-Python module typing. The Sequence[int] notation works at runtime by implementing __getitem__() in the metaclass (but its significance is primarily to an offline type checker).
The type system supports unions, generic types, and a special type named Any which is consistent with (i.e. assignable to and from) all types. This latter feature is taken from the idea of gradual typing. Gradual typing and the full type system are explained in PEP 483.
Other approaches from which we have borrowed or to which ours can be compared and contrasted are described in PEP 482.
Rationale and Goals
PEP 3107 added support for arbitrary annotations on parts of a function definition. Although no meaning was assigned to annotations then, there has always been an implicit goal to use them for type hinting [gvr-artima], which is listed as the first possible use case in said PEP.
This PEP aims to provide a standard syntax for type annotations, opening up Python code to easier static analysis and refactoring, potential runtime type checking, and (perhaps, in some contexts) code generation utilizing type information.
Of these goals, static analysis is the most important. This includes support for off-line type checkers such as mypy, as well as providing a standard notation that can be used by IDEs for code completion and refactoring.
Non-goals
While the proposed typing module will contain some building blocks for runtime type checking -- in particular the get_type_hints() function -- third party packages would have to be developed to implement specific runtime type checking functionality, for example using decorators or metaclasses. Using type hints for performance optimizations is left as an exercise for the reader.
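As a minimal sketch of that building block: get_type_hints() returns the resolved annotations (evaluating any string annotations), which a hypothetical third-party runtime checker could then consult. The scale function below is an illustrative name, not part of the PEP:

```python
from typing import get_type_hints

def scale(x: float, factor: 'float' = 2.0) -> float:
    return x * factor

# get_type_hints() evaluates string annotations and returns real types,
# including the return type under the 'return' key.
assert get_type_hints(scale) == {'x': float, 'factor': float, 'return': float}
```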
It should also be emphasized that Python will remain a dynamically typed language, and the authors have no desire to ever make type hints mandatory, even by convention.
The meaning of annotations
Any function without annotations should be treated as having the most general type possible, or ignored, by any type checker. Functions with the @no_type_check decorator or with a # type: ignore comment should be treated as having no annotations.
It is recommended but not required that checked functions have annotations for all arguments and the return type. For a checked function, the default annotation for arguments and for the return type is Any. An exception is that the first argument of instance and class methods does not need to be annotated; it is assumed to have the type of the containing class for instance methods, and a type object type corresponding to the containing class object for class methods. For example, in class A the first argument of an instance method has the implicit type A. In a class method, the precise type of the first argument cannot be represented using the available type notation.
(Note that the return type of __init__ ought to be annotated with -> None. The reason for this is subtle. If __init__ assumed a return annotation of -> None, would that mean that an argument-less, un-annotated __init__ method should still be type-checked? Rather than leaving this ambiguous or introducing an exception to the exception, we simply say that __init__ ought to have a return annotation; the default behavior is thus the same as for other methods.)
A type checker is expected to check the body of a checked function for consistency with the given annotations. The annotations may also be used to check the correctness of calls appearing in other checked functions.
Type checkers are expected to attempt to infer as much information as necessary. The minimum requirement is to handle the builtin decorators @property, @staticmethod and @classmethod.
Type Definition Syntax
The syntax leverages PEP 3107-style annotations with a number of extensions described in sections below. In its basic form, type hinting is used by filling function annotation slots with classes:
def greeting(name: str) -> str:
    return 'Hello ' + name
This states that the expected type of the name argument is str. Similarly, the expected return type is str.
Expressions whose type is a subtype of a specific argument type are also accepted for that argument.
Acceptable type hints
Type hints may be built-in classes (including those defined in standard library or third-party extension modules), abstract base classes, types available in the types module, and user-defined classes (including those defined in the standard library or third-party modules).
While annotations are normally the best format for type hints, there are times when it is more appropriate to represent them by a special comment, or in a separately distributed stub file. (See below for examples.)
Annotations must be valid expressions that evaluate without raising exceptions at the time the function is defined (but see below for forward references).
Annotations should be kept simple or static analysis tools may not be able to interpret the values. For example, dynamically computed types are unlikely to be understood. (This is an intentionally somewhat vague requirement, specific inclusions and exclusions may be added to future versions of this PEP as warranted by the discussion.)
In addition to the above, the following special constructs defined below may be used: None, Any, Union, Tuple, Callable, all ABCs and stand-ins for concrete classes exported from typing (e.g. Sequence and Dict), type variables, and type aliases.
All newly introduced names used to support features described in following sections (such as Any and Union) are available in the typing module.
Using None
When used in a type hint, the expression None is considered equivalent to type(None).
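A function that returns no value is thus annotated -> None; a minimal runnable sketch:

```python
# In a hint, None is shorthand for type(None).
def greet(name: str) -> None:
    print('Hello ' + name)

# The body returns nothing, matching the annotation.
result = greet('world')
assert result is None
```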
Type aliases
Type aliases are defined by simple variable assignments:
Url = str

def retry(url: Url, retry_count: int) -> None: ...
Note that we recommend capitalizing alias names, since they represent user-defined types, which (like user-defined classes) are typically spelled that way.
Type aliases may be as complex as type hints in annotations -- anything that is acceptable as a type hint is acceptable in a type alias:
from typing import TypeVar, Iterable, Tuple
T = TypeVar('T', int, float, complex)
Vector = Iterable[Tuple[T, T]]
def inproduct(v: Vector) -> T:
return sum(x*y for x, y in v)
This is equivalent to:
from typing import TypeVar, Iterable, Tuple
T = TypeVar('T', int, float, complex)
def inproduct(v: Iterable[Tuple[T, T]]) -> T:
return sum(x*y for x, y in v)
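Either spelling runs unchanged at runtime; for instance (the sample vector below is illustrative):

```python
from typing import Iterable, Tuple, TypeVar

T = TypeVar('T', int, float, complex)
Vector = Iterable[Tuple[T, T]]

def inproduct(v: Vector) -> T:
    return sum(x * y for x, y in v)

# 1*2 + 3*4 == 14
assert inproduct([(1, 2), (3, 4)]) == 14
```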
Callable
Frameworks expecting callback functions of specific signatures might be type hinted using Callable[[Arg1Type, Arg2Type], ReturnType]. Examples:
from typing import Callable
def feeder(get_next_item: Callable[[], str]) -> None:
# Body
def async_query(on_success: Callable[[int], None],
on_error: Callable[[int, Exception], None]) -> None:
# Body
It is possible to declare the return type of a callable without specifying the call signature by substituting a literal ellipsis (three dots) for the list of arguments:
def partial(func: Callable[..., str], *args) -> Callable[..., str]:
# Body
Note that there are no square brackets around the ellipsis. The arguments of the callback are completely unconstrained in this case (and keyword arguments are acceptable).
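A minimal sketch of such a partial function (the body here is an assumption; only the signature is given above):

```python
from typing import Callable

def partial(func: Callable[..., str], *args) -> Callable[..., str]:
    # Hypothetical body: bind the leading arguments, defer the rest.
    def wrapper(*more: object) -> str:
        return func(*args, *more)
    return wrapper

greet = partial('{}, {}!'.format, 'Hello')
assert greet('world') == 'Hello, world!'
```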
Since using callbacks with keyword arguments is not perceived as a common use case, there is currently no support for specifying keyword arguments with Callable. Similarly, there is no support for specifying callback signatures with a variable number of arguments of a specific type.
Because typing.Callable does double-duty as a replacement for collections.abc.Callable, isinstance(x, typing.Callable) is implemented by deferring to isinstance(x, collections.abc.Callable). However, isinstance(x, typing.Callable[...]) is not supported.
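The deferral can be observed directly:

```python
import collections.abc
import typing

# isinstance() against the bare typing.Callable defers to the ABC:
assert isinstance(len, typing.Callable)
assert isinstance(len, collections.abc.Callable)
assert not isinstance(42, typing.Callable)
```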
Generics
Since type information about objects kept in containers cannot be statically inferred in a generic way, abstract base classes have been extended to support subscription to denote expected types for container elements. Example:
from typing import Mapping, Set

def notify_by_email(employees: Set[Employee], overrides: Mapping[str, str]) -> None: ...
Generics can be parametrized by using a new factory available in typing called TypeVar. Example:
from typing import Sequence, TypeVar
T = TypeVar('T') # Declare type variable
def first(l: Sequence[T]) -> T: # Generic function
return l[0]
In this case the contract is that the returned value is consistent with the elements held by the collection.
A TypeVar() expression must always directly be assigned to a variable (it should not be used as part of a larger expression). The argument to TypeVar() must be a string equal to the variable name to which it is assigned. Type variables must not be redefined.
TypeVar supports constraining parametric types to a fixed set of possible types. For example, we can define a type variable that ranges over just str and bytes. By default, a type variable ranges over all possible types. Example of constraining a type variable:
from typing import TypeVar
AnyStr = TypeVar('AnyStr', str, bytes)
def concat(x: AnyStr, y: AnyStr) -> AnyStr:
return x + y
The function concat can be called with either two str arguments or two bytes arguments, but not with a mix of str and bytes arguments.
There should be at least two constraints, if any; specifying a single constraint is disallowed.
Subtypes of types constrained by a type variable should be treated as their respective explicitly listed base types in the context of the type variable. Consider this example:
class MyStr(str): ...
x = concat(MyStr('apple'), MyStr('pie'))
The call is valid but the type variable AnyStr will be set to str and not MyStr. In effect, the inferred type of the return value assigned to x will also be str.
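The runtime behavior of str happens to mirror this inference, since str.__add__ returns a plain str even for subclass operands:

```python
class MyStr(str): ...

def concat(x, y):
    return x + y

result = concat(MyStr('apple'), MyStr('pie'))
# str.__add__ returns a plain str, just as AnyStr is inferred as str, not MyStr.
assert type(result) is str
assert result == 'applepie'
```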
Additionally, Any is a valid value for every type variable. Consider the following:
from typing import Any, List

def count_truthy(elements: List[Any]) -> int:
    return sum(1 for elem in elements if elem)
This is equivalent to omitting the generic notation and just saying elements: List.
User-defined generic types
You can include a Generic base class to define a user-defined class as generic. Example:
from typing import TypeVar, Generic
T = TypeVar('T')
class LoggedVar(Generic[T]):
def __init__(self, value: T, name: str, logger: Logger) -> None:
self.name = name
self.logger = logger
self.value = value
def set(self, new: T) -> None:
self.log('Set ' + repr(self.value))
self.value = new
def get(self) -> T:
self.log('Get ' + repr(self.value))
return self.value
def log(self, message: str) -> None:
self.logger.info('{}: {}'.format(self.name, message))
Generic[T] as a base class defines that the class LoggedVar takes a single type parameter T. This also makes T valid as a type within the class body.
The Generic base class uses a metaclass that defines __getitem__ so that LoggedVar[t] is valid as a type:
from typing import Iterable
def zero_all_vars(vars: Iterable[LoggedVar[int]]) -> None:
for var in vars:
var.set(0)
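Combining the two snippets into a runnable whole (Logger is left undefined above; the stdlib logging.Logger is assumed here):

```python
import logging
from typing import Generic, Iterable, TypeVar

T = TypeVar('T')

class LoggedVar(Generic[T]):
    def __init__(self, value: T, name: str, logger: logging.Logger) -> None:
        self.name = name
        self.logger = logger
        self.value = value

    def set(self, new: T) -> None:
        self.logger.info('%s: Set %r', self.name, self.value)
        self.value = new

    def get(self) -> T:
        self.logger.info('%s: Get %r', self.name, self.value)
        return self.value

def zero_all_vars(vars: Iterable[LoggedVar[int]]) -> None:
    for var in vars:
        var.set(0)

v = LoggedVar(3, 'x', logging.getLogger('demo'))
zero_all_vars([v])
assert v.get() == 0
```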
A generic type can have any number of type variables, and type variables may be constrained. This is valid:
from typing import TypeVar, Generic
...
T = TypeVar('T')
S = TypeVar('S')
class Pair(Generic[T, S]):
...
Each type variable argument to Generic must be distinct. This is thus invalid:
from typing import TypeVar, Generic
...
T = TypeVar('T')
class Pair(Generic[T, T]): # INVALID
...
You can use multiple inheritance with Generic:
from typing import TypeVar, Generic, Sized
T = TypeVar('T')
class LinkedList(Sized, Generic[T]):
...
Subclassing a generic class without specifying type parameters assumes Any for each position. In the following example, MyIterable is not generic but implicitly inherits from Iterable[Any]:
from typing import Iterable
class MyIterable(Iterable): # Same as Iterable[Any]
...
Generic metaclasses are not supported.
Instantiating generic classes and type erasure
Generic types like List or Sequence cannot be instantiated. However, user-defined classes derived from them can be instantiated. Suppose we write a Node class inheriting from Generic[T]:
from typing import TypeVar, Generic
T = TypeVar('T')
class Node(Generic[T]):
...
Now there are two ways we can instantiate this class; the type inferred by a type checker may be different depending on the form we use. The first way is to give the value of the type parameter explicitly -- this overrides whatever type inference the type checker would otherwise perform:
x = Node[T]() # The type inferred for x is Node[T].
y = Node[int]() # The type inferred for y is Node[int].
If no explicit types are given, the type checker is given some freedom. Consider this code:
x = Node()
The inferred type could be Node[Any], as there isn't enough context to infer a more precise type. Alternatively, a type checker may reject the line and require an explicit annotation, like this:
x = Node() # type: Node[int] # Inferred type is Node[int].
A type checker with more powerful type inference could look at how x is used elsewhere in the file and try to infer a more precise type such as Node[int] even without an explicit type annotation. However, it is probably impossible to make such type inference work well in all cases, since Python programs can be very dynamic.
This PEP doesn't specify the details of how type inference should work. We allow different tools to experiment with various approaches. We may give more explicit rules in future revisions.
At runtime the type is not preserved, and the class of x is just Node in all cases. This behavior is called "type erasure"; it is common practice in languages with generics (e.g. Java, TypeScript).
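This erasure is easy to observe (a minimal sketch):

```python
from typing import Generic, TypeVar

T = TypeVar('T')

class Node(Generic[T]):
    ...

x = Node[int]()
y = Node()
# The type parameter is erased: both objects have the same runtime class.
assert type(x) is Node
assert type(x) is type(y)
```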
Arbitrary generic types as base classes
Generic[T] is only valid as a base class -- it's not a proper type. However, user-defined generic types such as LinkedList[T] from the above example and built-in generic types and ABCs such as List[T] and Iterable[T] are valid both as types and as base classes. For example, we can define a subclass of Dict that specializes type arguments:
from typing import Dict, List, Optional
class Node:
...
class SymbolTable(Dict[str, List[Node]]):
def push(self, name: str, node: Node) -> None:
self.setdefault(name, []).append(node)
def pop(self, name: str) -> Node:
return self[name].pop()
def lookup(self, name: str) -> Optional[Node]:
nodes = self.get(name)
if nodes:
return nodes[-1]
return None
SymbolTable is a subclass of dict and a subtype of Dict[str, List[Node]].
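A runnable rendering of the class above, with a quick usage check of the specialized dict:

```python
from typing import Dict, List, Optional

class Node: ...

class SymbolTable(Dict[str, List[Node]]):
    def push(self, name: str, node: Node) -> None:
        self.setdefault(name, []).append(node)

    def pop(self, name: str) -> Node:
        return self[name].pop()

    def lookup(self, name: str) -> Optional[Node]:
        nodes = self.get(name)
        if nodes:
            return nodes[-1]
        return None

table = SymbolTable()
n = Node()
table.push('x', n)
assert table.lookup('x') is n
assert table.lookup('missing') is None
assert isinstance(table, dict)   # a dict subclass at runtime
```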
If a generic base class has a type variable as a type argument, this makes the defined class generic. For example, we can define a generic LinkedList class that is iterable and a container:
from typing import TypeVar, Iterable, Container
T = TypeVar('T')
class LinkedList(Iterable[T], Container[T]):
...
Now LinkedList[int] is a valid type. Note that we can use T multiple times in the base class list, as long as we don't use the same type variable T multiple times within Generic[...].
Also consider the following example:
from typing import TypeVar, Mapping
T = TypeVar('T')
class MyDict(Mapping[str, T]):
...
In this case MyDict has a single parameter, T.
Abstract generic types
The metaclass used by Generic is a subclass of abc.ABCMeta. A generic class can be an ABC by including abstract methods or properties, and generic classes can also have ABCs as base classes without a metaclass conflict.
Type variables with an upper bound
A type variable may specify an upper bound using bound=&lt;type&gt;. This means that an actual type substituted (explicitly or implicitly) for the type variable must be a subclass of the boundary type. A common example is the definition of a Comparable type that works well enough to catch the most common errors:
from abc import ABCMeta, abstractmethod
from typing import Any, TypeVar
class Comparable(metaclass=ABCMeta):
@abstractmethod
def __lt__(self, other: Any) -> bool: ...
... # __gt__ etc. as well
CT = TypeVar('CT', bound=Comparable)
def min(x: CT, y: CT) -> CT:
if x < y:
return x
else:
return y
min(1, 2) # ok, return type int
min('x', 'y') # ok, return type str
(Note that this is not ideal -- for example min('x', 1) is invalid at runtime but a type checker would simply infer the return type Comparable. Unfortunately, addressing this would require introducing a much more powerful and also much more complicated concept, F-bounded polymorphism. We may revisit this in the future.)
An upper bound cannot be combined with type constraints (as used for AnyStr in the example earlier); type constraints cause the inferred type to be _exactly_ one of the constraint types, while an upper bound just requires that the actual type is a subclass of the boundary type.
Covariance and contravariance
Consider a class Employee with a subclass Manager. Now suppose we have a function with an argument annotated with List[Employee]. Should we be allowed to call this function with a variable of type List[Manager] as its argument? Many people would answer "yes, of course" without even considering the consequences. But unless we know more about the function, a type checker should reject such a call: the function might append an Employee instance to the list, which would violate the variable's type in the caller.
It turns out such an argument acts _contravariantly_, whereas the intuitive answer (which is correct in case the function doesn't mutate its argument!) requires the argument to act _covariantly_. A longer introduction to these concepts can be found on Wikipedia [wiki-variance]; here we just show how to control a type checker's behavior.
By default type variables are considered _invariant_, which means that values for arguments annotated with types like List[Employee] must exactly match the type annotation -- no subclasses or superclasses of the type parameter (in this example Employee) are allowed.
To facilitate the declaration of container types where covariant type checking is acceptable, a type variable can be declared using covariant=True. For the (rare) case where contravariant behavior is desirable, pass contravariant=True. At most one of these may be passed.
A typical example involves defining an immutable (or read-only) container class:
from typing import TypeVar, Generic, Iterable, Iterator
T = TypeVar('T', covariant=True)
class ImmutableList(Generic[T]):
def __init__(self, items: Iterable[T]) -> None: ...
def __iter__(self) -> Iterator[T]: ...
...
class Employee: ...
class Manager(Employee): ...
def dump_employees(emps: ImmutableList[Employee]) -> None:
for emp in emps:
...
mgrs = ImmutableList([Manager()]) # type: ImmutableList[Manager]
dump_employees(mgrs) # OK
The read-only collection classes in typing are all defined using a covariant type variable (e.g. Mapping and Sequence). The mutable collection classes (e.g. MutableMapping and MutableSequence) are defined using regular invariant type variables. The one example of a contravariant type variable is the Generator type, which is contravariant in the send() argument type (see below).
Note: variance affects type parameters for generic types -- it does not affect regular parameters. For example, the following example is fine:
from typing import TypeVar
class Employee: ...
class Manager(Employee): ...
E = TypeVar('E', bound=Employee) # Invariant
def dump_employee(e: E) -> None: ...
dump_employee(Manager()) # OK
The numeric tower
PEP 3141 defines Python's numeric tower, and the stdlib module numbers implements the corresponding ABCs (Number, Complex, Real, Rational and Integral). There are some issues with these ABCs, but the built-in concrete numeric classes complex, float and int are ubiquitous (especially the latter two :-).
Rather than requiring that users write import numbers and then use numbers.Float etc., this PEP proposes a straightforward shortcut that is almost as effective: when an argument is annotated as having type float, an argument of type int is acceptable; similarly, for an argument annotated as having type complex, arguments of type float or int are acceptable. This does not handle classes implementing the corresponding ABCs or the fractions.Fraction class, but we believe those use cases are exceedingly rare.
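In other words, code like the following is expected to type-check, since int is acceptable where float is expected:

```python
def scale(x: float, factor: float) -> float:
    return x * factor

# int arguments are fine for float parameters under this shortcut,
# and at runtime Python's numeric coercion gives the expected results.
assert scale(2, 3) == 6
assert scale(2.5, 4) == 10.0
```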
The bytes types
There are three different builtin classes used for arrays of bytes (not counting the classes available in the array module): bytes, bytearray and memoryview. Of these, bytes and bytearray have many behaviors in common (though not all -- bytearray is mutable).
While there is an ABC ByteString defined in collections.abc and a corresponding type in typing, functions accepting bytes (of some form) are so common that it would be cumbersome to have to write typing.ByteString everywhere. So, as a shortcut similar to that for the builtin numeric classes, when an argument is annotated as having type bytes, arguments of type bytearray or memoryview are acceptable. (Again, there are situations where this isn't sound, but we believe those are exceedingly rare in practice.)
Forward references
When a type hint contains names that have not been defined yet, that definition may be expressed as a string literal, to be resolved later.
A situation where this occurs commonly is the definition of a container class, where the class being defined occurs in the signature of some of the methods. For example, the following code (the start of a simple binary tree implementation) does not work:
class Tree:
def __init__(self, left: Tree, right: Tree):
self.left = left
self.right = right
To address this, we write:
class Tree:
def __init__(self, left: 'Tree', right: 'Tree'):
self.left = left
self.right = right
The string literal should contain a valid Python expression (i.e., compile(lit, '', 'eval') should be a valid code object) and it should evaluate without errors once the module has been fully loaded. The local and global namespace in which it is evaluated should be the same namespaces in which default arguments to the same function would be evaluated.
Moreover, the expression should be parseable as a valid type hint, i.e., it is constrained by the rules from the section Acceptable type hints above.
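The resolution can be exercised with typing.get_type_hints(), which evaluates the string form once the class exists:

```python
import typing

class Tree:
    def __init__(self, left: 'Tree', right: 'Tree') -> None:
        self.left = left
        self.right = right

# Once the module is fully loaded, the string resolves to the class itself:
hints = typing.get_type_hints(Tree.__init__)
assert hints['left'] is Tree
assert hints['right'] is Tree
```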
It is allowable to use string literals as part of a type hint, for example:
class Tree:
...
def leaves(self) -> List['Tree']:
...
A common use for forward references is when e.g. Django models are needed in the signatures. Typically, each model is in a separate file, and has methods taking arguments whose type involves other models. Because of the way circular imports work in Python, it is often not possible to import all the needed models directly:
# File models/a.py
from models.b import B
class A(Model):
def foo(self, b: B): ...
# File models/b.py
from models.a import A
class B(Model):
def bar(self, a: A): ...
# File main.py
from models.a import A
from models.b import B
Assuming main is imported first, this will fail with an ImportError at the line from models.a import A in models/b.py, which is being imported from models/a.py before that module has defined class A. The solution is to switch to module-only imports and reference the models by their _module_._class_ name:
# File models/a.py
from models import b
class A(Model):
def foo(self, b: 'b.B'): ...
# File models/b.py
from models import a
class B(Model):
def bar(self, a: 'a.A'): ...
# File main.py
from models.a import A
from models.b import B
Union types
Since accepting a small, limited set of expected types for a single argument is common, there is a new special factory called Union. Example:
from typing import Union
def handle_employees(e: Union[Employee, Sequence[Employee]]) -> None:
if isinstance(e, Employee):
e = [e]
...
A type factored by Union[T1, T2, ...] responds True to issubclass checks for T1 and any of its subtypes, T2 and any of its subtypes, and so on.
One common case of union types is the optional type. By default, None is an invalid value for any type, unless a default value of None has been provided in the function definition. Examples:
def handle_employee(e: Union[Employee, None]) -> None: ...
As a shorthand for Union[T1, None] you can write Optional[T1]; for example, the above is equivalent to:
from typing import Optional

def handle_employee(e: Optional[Employee]) -> None: ...
An optional type is also automatically assumed when the default value is None, for example:
def handle_employee(e: Employee = None): ...
This is equivalent to:
def handle_employee(e: Optional[Employee] = None) -> None: ...
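The equivalence holds at runtime too:

```python
from typing import Optional, Union

class Employee: ...

# Optional[T1] is literally Union[T1, None]; argument order is irrelevant.
assert Optional[Employee] == Union[Employee, None]
assert Optional[Employee] == Union[None, Employee]
```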
The Any type
A special kind of type is Any. Every type is a subtype of Any. This is also true for the builtin type object. However, to the static type checker these are completely different.
When the type of a value is object, the type checker will reject almost all operations on it, and assigning it to a variable (or using it as a return value) of a more specialized type is a type error. On the other hand, when a value has type Any, the type checker will allow all operations on it, and a value of type Any can be assigned to a variable (or used as a return value) of a more constrained type.
Version and platform checking
Type checkers are expected to understand simple version and platform checks, e.g.:
import sys
if sys.version_info[0] >= 3:
# Python 3 specific definitions
else:
# Python 2 specific definitions
if sys.platform == 'win32':
# Windows specific definitions
else:
# Posix specific definitions
Don't expect a checker to understand obfuscations like "".join(reversed(sys.platform)) == "xunil".
Default argument values
In stubs it may be useful to declare an argument as having a default without specifying the actual default value. For example:
def foo(x: AnyStr, y: AnyStr = ...) -> AnyStr: ...
What should the default value look like? Any of the options "", b"" or None fails to satisfy the type constraint (actually, None will modify the type to become Optional[AnyStr]).
In such cases the default value may be specified as a literal ellipsis, i.e. the above example is literally what you would write.
Compatibility with other uses of function annotations
A number of existing or potential use cases for function annotations exist, which are incompatible with type hinting. These may confuse a static type checker. However, since type hinting annotations have no runtime behavior (other than evaluation of the annotation expression and storing annotations in the __annotations__ attribute of the function object), this does not make the program incorrect -- it just may cause a type checker to emit spurious warnings or errors.
To mark portions of the program that should not be covered by type hinting, you can use one or more of the following:
- a # type: ignore comment;
- a @no_type_check decorator on a class or function;
- a custom class or function decorator marked with @no_type_check_decorator.
For more details see later sections.
In order for maximal compatibility with offline type checking it may eventually be a good idea to change interfaces that rely on annotations to switch to a different mechanism, for example a decorator. In Python 3.5 there is no pressure to do this, however. See also the longer discussion under Rejected alternatives below.
Type comments
No first-class syntax support for explicitly marking variables as being of a specific type is added by this PEP. To help with type inference in complex cases, a comment of the following format may be used:
x = []                # type: List[Employee]
x, y, z = [], [], []  # type: List[int], List[int], List[str]
x, y, z = [], [], []  # type: (List[int], List[int], List[str])
x = [
    1,
    2,
]  # type: List[int]
Type comments should be put on the last line of the statement that contains the variable definition. They can also be placed on with statements and for statements, right after the colon.
Examples of type comments on with and for statements:
with frobnicate() as foo: # type: int
# Here foo is an int
...
for x, y in points: # type: float, float
# Here x and y are floats
...
In stubs it may be useful to declare the existence of a variable without giving it an initial value. This can be done using a literal ellipsis:
from typing import IO

stream = ...  # type: IO[str]
In non-stub code, there is a similar special case:
from typing import IO
stream = None # type: IO[str]
Type checkers should not complain about this (despite the value None not matching the given type), nor should they change the inferred type to Optional[...] (despite the rule that does this for annotated arguments with a default value of None). The assumption here is that other code will ensure that the variable is given a value of the proper type, and all uses can assume that the variable has the given type.
The # type: ignore comment should be put on the line that the error refers to:
import http.client
errors = {
'not_found': http.client.NOT_FOUND # type: ignore
}
A # type: ignore comment on a line by itself disables all type checking for the rest of the file.
If type hinting proves useful in general, a syntax for typing variables may be provided in a future Python version.
Casts
Occasionally the type checker may need a different kind of hint: the programmer may know that an expression is of a more constrained type than a type checker may be able to infer. For example:
from typing import List, cast
def find_first_str(a: List[object]) -> str:
index = next(i for i, x in enumerate(a) if isinstance(x, str))
# We only get here if there's at least one string in a
return cast(str, a[index])
Some type checkers may not be able to infer that the type of a[index] is str and only infer object or Any, but we know that (if the code gets to that point) it must be a string. The cast(t, x) call tells the type checker that we are confident that the type of x is t. At runtime a cast always returns the expression unchanged -- it does not check the type, and it does not convert or coerce the value.
Casts differ from type comments (see the previous section). When using a type comment, the type checker should still verify that the inferred type is consistent with the stated type. When using a cast, the type checker should blindly believe the programmer. Also, casts can be used in expressions, while type comments only apply to assignments.
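A quick runtime check of both properties of cast():

```python
from typing import List, cast

def find_first_str(a: List[object]) -> str:
    index = next(i for i, x in enumerate(a) if isinstance(x, str))
    return cast(str, a[index])

assert find_first_str([1, 'two', 3]) == 'two'
# cast() neither checks nor converts -- it returns the value unchanged:
assert cast(int, 'oops') == 'oops'
```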
Stub Files
Stub files are files containing type hints that are only for use by the type checker, not at runtime. There are several use cases for stub files:
- Extension modules
- Third-party modules whose authors have not yet added type hints
- Standard library modules for which type hints have not yet been written
- Modules that must be compatible with Python 2 and 3
- Modules that use annotations for other purposes
Stub files have the same syntax as regular Python modules. There is one feature of the typing module that may only be used in stub files: the @overload decorator described below.
The type checker should only check function signatures in stub files; it is recommended that function bodies in stub files just be a single ellipsis (...).
The type checker should have a configurable search path for stub files. If a stub file is found the type checker should not read the corresponding "real" module.
While stub files are syntactically valid Python modules, they use the .pyi extension to make it possible to maintain stub files in the same directory as the corresponding real module. This also reinforces the notion that no runtime behavior should be expected of stub files.
Additional notes on stub files:
- Modules and variables imported into the stub are not considered exported from the stub unless the import uses the import ... as ... form.
Function overloading
The @overload decorator allows describing functions that support multiple different combinations of argument types. This pattern is used frequently in builtin modules and types. For example, the __getitem__() method of the bytes type can be described as follows:
from typing import overload
class bytes:
...
@overload
def __getitem__(self, i: int) -> int: ...
@overload
def __getitem__(self, s: slice) -> bytes: ...
This description is more precise than would be possible using unions (which cannot express the relationship between the argument and return types):
from typing import Union
class bytes:
...
def __getitem__(self, a: Union[int, slice]) -> Union[int, bytes]: ...
Another example where @overload comes in handy is the type of the builtin map() function, which takes a different number of arguments depending on the type of the callable:
from typing import Callable, Iterable, Iterator, Tuple, TypeVar, overload
T1 = TypeVar('T1')
T2 = TypeVar('T2')
S = TypeVar('S')
@overload
def map(func: Callable[[T1], S], iter1: Iterable[T1]) -> Iterator[S]: ...
@overload
def map(func: Callable[[T1, T2], S],
iter1: Iterable[T1], iter2: Iterable[T2]) -> Iterator[S]: ...
# ... and we could add more items to support more than two iterables
Note that we could also easily add items to support map(None, ...):
@overload
def map(func: None, iter1: Iterable[T1]) -> Iterable[T1]: ...
@overload
def map(func: None,
iter1: Iterable[T1],
iter2: Iterable[T2]) -> Iterable[Tuple[T1, T2]]: ...
The @overload decorator may only be used in stub files. While it would be possible to provide a multiple dispatch implementation using this syntax, its implementation would require using sys._getframe(), which is frowned upon. Also, designing and implementing an efficient multiple dispatch mechanism is hard, which is why previous attempts were abandoned in favor of functools.singledispatch(). (See PEP 443, especially its section "Alternative approaches".) In the future we may come up with a satisfactory multiple dispatch design, but we don't want such a design to be constrained by the overloading syntax defined for type hints in stub files. In the meantime, using the @overload decorator or calling overload() directly raises RuntimeError.
A constrained TypeVar type can often be used instead of using the @overload decorator. For example, the definitions of concat1 and concat2 in this stub file are equivalent:
from typing import TypeVar
AnyStr = TypeVar('AnyStr', str, bytes)
def concat1(x: AnyStr, y: AnyStr) -> AnyStr: ...
@overload
def concat2(x: str, y: str) -> str: ...

@overload
def concat2(x: bytes, y: bytes) -> bytes: ...
Some functions, such as map or bytes.__getitem__ above, can't be represented precisely using type variables. However, unlike @overload, type variables can also be used outside stub files. We recommend that @overload is only used in cases where a type variable is not sufficient, due to its special stub-only status.
Another important difference between type variables such as AnyStr and using @overload is that the former can also be used to define constraints for generic class type parameters. For example, the type parameter of the generic class typing.IO is constrained (only IO[str], IO[bytes] and IO[Any] are valid):
class IO(Generic[AnyStr]): ...
Storing and distributing stub files
The easiest form of stub file storage and distribution is to put them alongside Python modules in the same directory. This makes them easy to find by both programmers and the tools. However, since package maintainers are free not to add type hinting to their packages, third-party stubs installable by pip from PyPI are also supported. In this case we have to consider three issues: naming, versioning, installation path.
This PEP does not provide a recommendation on a naming scheme that should be used for third-party stub file packages. Discoverability will hopefully be based on package popularity, like with Django packages for example.
Third-party stubs have to be versioned using the lowest version of the source package that is compatible. Example: FooPackage has versions 1.0, 1.1, 1.2, 1.3, 2.0, 2.1, 2.2. There are API changes in versions 1.1, 2.0 and 2.2. The stub file package maintainer is free to release stubs for all versions but at least 1.0, 1.1, 2.0 and 2.2 are needed to enable the end user to type check all versions. This is because the user knows that the closest lower or equal version of stubs is compatible. In the provided example, for FooPackage 1.3 the user would choose stubs version 1.1.
Note that if the user decides to use the "latest" available source package, using the "latest" stub files should generally also work if they're updated often.
Third-party stub packages can use any location for stub storage. Type checkers should search for them using PYTHONPATH. A default fallback directory that is always checked is shared/typehints/python3.5/ (or 3.6, etc.). Since there can only be one package installed for a given Python version per environment, no additional versioning is performed under that directory (just like bare directory installs by pip in site-packages). Stub file package authors might use the following snippet in setup.py:
...
data_files=[
(
'shared/typehints/python{}.{}'.format(*sys.version_info[:2]),
pathlib.Path(SRC_PATH).glob('**/*.pyi'),
),
],
...
The Typeshed Repo
There is a shared repository where useful stubs are being collected [typeshed]. Note that stubs for a given package will not be included here without the explicit consent of the package owner. Further policies regarding the stubs collected here will be decided at a later time, after discussion on python-dev, and reported in the typeshed repo's README.
Exceptions
No syntax for listing explicitly raised exceptions is proposed. Currently the only known use case for this feature is documentation, in which case the recommendation is to put this information in a docstring.
The typing Module
To open the usage of static type checking to Python 3.5 as well as older versions, a uniform namespace is required. For this purpose, a new module in the standard library is introduced called typing.
It defines the fundamental building blocks for constructing types (e.g. Any), types representing generic variants of builtin collections (e.g. List), types representing generic collection ABCs (e.g. Sequence), and a small collection of convenience definitions.
Fundamental building blocks:
- Any, used as def get(key: str) -> Any: ...
- Union, used as Union[Type1, Type2, Type3]
- Callable, used as Callable[[Arg1Type, Arg2Type], ReturnType]
- Tuple, used by listing the element types, for example Tuple[int, int, str]. Arbitrary-length homogeneous tuples can be expressed using one type and ellipsis, for example Tuple[int, ...]. (The ... here are part of the syntax, a literal ellipsis.)
- TypeVar, used as X = TypeVar('X', Type1, Type2, Type3) or simply Y = TypeVar('Y') (see above for more details)
- Generic, used to create user-defined generic classes
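The building blocks above might be combined as follows; the names Box, Coordinate and MaybeInt are invented for the example:

```python
from typing import Any, Callable, Generic, Tuple, TypeVar, Union

T = TypeVar('T')  # unconstrained type variable

def get(key: str) -> Any: ...
def register(callback: Callable[[int, str], bool]) -> None: ...

Coordinate = Tuple[int, int, str]  # fixed-length heterogeneous tuple
Row = Tuple[int, ...]              # arbitrary-length homogeneous tuple
MaybeInt = Union[int, None]        # type alias via plain assignment

class Box(Generic[T]):             # user-defined generic class
    def __init__(self, content: T) -> None:
        self.content = content

box = Box(42)
print(box.content)  # → 42
```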
Generic variants of builtin collections:
- Dict, used as Dict[key_type, value_type]
- List, used as List[element_type]
- Set, used as Set[element_type]. See remark for AbstractSet below.
- FrozenSet, used as FrozenSet[element_type]
Note: Dict, List, Set and FrozenSet are mainly useful for annotating return values. For arguments, prefer the abstract collection types defined below, e.g. Mapping, Sequence or AbstractSet.
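A small sketch of that guideline, with an invented invert() function: the argument is annotated with the abstract Mapping type (any mapping is accepted), while the return type is the concrete Dict that the function actually builds.

```python
from typing import Dict, Mapping

def invert(mapping: Mapping[str, int]) -> Dict[int, str]:
    # Accepts any mapping (dict, OrderedDict, ...), returns a plain dict.
    return {value: key for key, value in mapping.items()}

print(invert({'a': 1, 'b': 2}))  # → {1: 'a', 2: 'b'}
```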
Generic variants of container ABCs (and a few non-containers):
- ByteString
- Callable (see above, listed here for completeness)
- Container
- Generator, used as Generator[yield_type, send_type, return_type]. This represents the return value of generator functions. It is a subtype of Iterable and it has additional type variables for the type accepted by the send() method (which is contravariant -- a generator that accepts sending it Employee instances is valid in a context where a generator is required that accepts sending it Manager instances) and the return type of the generator.
- Hashable (not generic, but present for completeness)
- ItemsView
- Iterable
- Iterator
- KeysView
- Mapping
- MappingView
- MutableMapping
- MutableSequence
- MutableSet
- Sequence
- Set, renamed to AbstractSet. This name change was required because Set in the typing module means set() with generics.
- Sized (not generic, but present for completeness)
- ValuesView
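As an illustration of Generator's three type variables, here is an invented echo_round generator that yields ints, accepts floats via send(), and returns a str:

```python
from typing import Generator

def echo_round() -> Generator[int, float, str]:
    # Yields ints, accepts floats through send(), returns a str.
    sent = yield 0
    while sent >= 0:
        sent = yield round(sent)
    return 'Done'

gen = echo_round()
next(gen)             # start the generator; yields the initial 0
print(gen.send(1.6))  # → 2
```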
A few one-off types are defined that test for single special methods (similar to Hashable or Sized):
- Reversible, to test for __reversed__
- SupportsAbs, to test for __abs__
- SupportsComplex, to test for __complex__
- SupportsFloat, to test for __float__
- SupportsInt, to test for __int__
- SupportsRound, to test for __round__
- SupportsBytes, to test for __bytes__
Convenience definitions:
- Optional, defined by Optional[t] == Union[t, type(None)]
- AnyStr, defined as TypeVar('AnyStr', str, bytes)
- NamedTuple, used as NamedTuple(type_name, [(field_name, field_type), ...]) and equivalent to collections.namedtuple(type_name, [field_name, ...]). This is useful to declare the types of the fields of a named tuple type.
- cast(), described earlier
- @no_type_check, a decorator to disable type checking per class or function (see below)
- @no_type_check_decorator, a decorator to create your own decorators with the same meaning as @no_type_check (see below)
- @overload, described earlier
- get_type_hints(), a utility function to retrieve the type hints from a function or method. Given a function or method object, it returns a dict with the same format as __annotations__, but evaluating forward references (which are given as string literals) as expressions in the context of the original function or method definition.
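The forward-reference behavior of get_type_hints() can be sketched as follows; find() and Employee are invented names for the example:

```python
from typing import NamedTuple, get_type_hints

def find(name: str) -> 'Employee':  # forward reference as a string literal
    ...

# Employee is defined only after find(), so the annotation above had to be
# a string; get_type_hints() evaluates it once the name exists.
Employee = NamedTuple('Employee', [('name', str), ('id', int)])

hints = get_type_hints(find)
print(hints['return'] is Employee)  # → True
```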
Types available in the typing.io submodule:
- IO (generic over AnyStr)
- BinaryIO (a simple subtype of IO[bytes])
- TextIO (a simple subtype of IO[str])
Types available in the typing.re submodule:
- Match and Pattern, types of re.match() and re.compile() results (generic over AnyStr)
Rejected Alternatives
During discussion of earlier drafts of this PEP, various objections were raised and alternatives were proposed. We discuss some of these here and explain why we reject them.
Several main objections were raised.
Which brackets for generic type parameters?
Most people are familiar with the use of angular brackets (e.g. List<int>) in languages like C++, Java, C# and Swift to express the parametrization of generic types. The problem with these is that they are really hard to parse, especially for a simple-minded parser like Python. In most languages the ambiguities are usually dealt with by only allowing angular brackets in specific syntactic positions, where general expressions aren't allowed. (And also by using very powerful parsing techniques that can backtrack over an arbitrary section of code.)
But in Python, we'd like type expressions to be (syntactically) the same as other expressions, so that we can use e.g. variable assignment to create type aliases. Consider this simple type expression:
List<int>
From the Python parser's perspective, the expression begins with the same four tokens (NAME, LESS, NAME, GREATER) as a chained comparison:
a < b > c # I.e., (a < b) and (b > c)
We can even make up an example that could be parsed both ways:
a < b > [ c ]
Assuming we had angular brackets in the language, this could be interpreted as either of the following two:
(a<b>)[c]        # I.e., (a<b>).__getitem__(c)
a < b > ([c])    # I.e., (a < b) and (b > [c])
It would surely be possible to come up with a rule to disambiguate such cases, but to most users the rules would feel arbitrary and complex. It would also require us to dramatically change the CPython parser (and every other parser for Python). It should be noted that Python's current parser is intentionally "dumb" -- a simple grammar is easier for users to reason about.
For all these reasons, square brackets (e.g. List[int]) are (and have long been) the preferred syntax for generic type parameters. They can be implemented by defining the __getitem__() method on the metaclass, and no new syntax is required at all. This option works in all recent versions of Python (starting with Python 2.2). Python is not alone in this syntactic choice -- generic classes in Scala also use square brackets.
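A minimal toy sketch of the metaclass trick (this is not the actual typing implementation, just a demonstration that subscripting a class needs no new syntax):

```python
class GenericMeta(type):
    # Subscripting a class looks up __getitem__ on its metaclass,
    # so MyList[int] works without any grammar changes.
    def __getitem__(cls, item):
        return '{}[{}]'.format(cls.__name__, item.__name__)

class MyList(metaclass=GenericMeta):
    pass

print(MyList[int])  # → MyList[int]
```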
What about existing uses of annotations?
One line of argument points out that PEP 3107 explicitly supports the use of arbitrary expressions in function annotations. The new proposal is then considered incompatible with the specification of PEP 3107.
Our response to this is that, first of all, the current proposal does not introduce any direct incompatibilities, so programs using annotations in Python 3.4 will still work correctly and without prejudice in Python 3.5.
We do hope that type hints will eventually become the sole use for annotations, but this will require additional discussion and a deprecation period after the initial roll-out of the typing module with Python 3.5. The current PEP will have provisional status (see PEP 411) until Python 3.6 is released. The fastest conceivable scheme would introduce silent deprecation of non-type-hint annotations in 3.6, full deprecation in 3.7, and declare type hints as the only allowed use of annotations in Python 3.8. This should give authors of packages that use annotations plenty of time to devise another approach, even if type hints become an overnight success.
Another possible outcome would be that type hints will eventually become the default meaning for annotations, but that there will always remain an option to disable them. For this purpose the current proposal defines a decorator @no_type_check which disables the default interpretation of annotations as type hints in a given class or function. It also defines a meta-decorator @no_type_check_decorator which can be used to decorate a decorator (!), causing annotations in any function or class decorated with the latter to be ignored by the type checker.
There are also # type: ignore comments, and static checkers should support configuration options to disable type checking in selected packages.
Despite all these options, proposals have been circulated to allow type hints and other forms of annotations to coexist for individual arguments. One proposal suggests that if an annotation for a given argument is a dictionary literal, each key represents a different form of annotation, and the key 'type' would be used for type hints. The problem with this idea and its variants is that the notation becomes very "noisy" and hard to read. Also, in most cases where existing libraries use annotations, there would be little need to combine them with type hints. So the simpler approach of selectively disabling type hints appears sufficient.
The problem of forward declarations
The current proposal is admittedly sub-optimal when type hints must contain forward references. Python requires all names to be defined by the time they are used. Apart from circular imports this is rarely a problem: "use" here means "look up at runtime", and with most "forward" references there is no problem in ensuring that a name is defined before the function using it is called.
The problem with type hints is that annotations (per PEP 3107, and similar to default values) are evaluated at the time a function is defined, and thus any names used in an annotation must be already defined when the function is being defined. A common scenario is a class definition whose methods need to reference the class itself in their annotations. (More generally, it can also occur with mutually recursive classes.) This is natural for container types, for example:
class Node:
"""Binary tree node."""
def __init__(self, left: Node, right: Node):
self.left = left
self.right = right
As written this will not work, because of the peculiarity in Python that class names become defined once the entire body of the class has been executed. Our solution, which isn't particularly elegant, but gets the job done, is to allow using string literals in annotations. Most of the time you won't have to use this though -- most uses of type hints are expected to reference builtin types or types defined in other modules.
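With string literals, the Node example above becomes:

```python
class Node:
    """Binary tree node, with the class name given as a string literal
    in the annotations to work around the forward-reference problem."""
    def __init__(self, left: 'Node', right: 'Node'):
        self.left = left
        self.right = right

# The class body executes without error; the annotations stay as strings.
root = Node(None, None)
print(Node.__init__.__annotations__['left'])  # → Node
```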
A counterproposal would change the semantics of type hints so they aren't evaluated at runtime at all (after all, type checking happens off-line, so why would type hints need to be evaluated at runtime?). This of course would run afoul of backwards compatibility, since the Python interpreter doesn't actually know whether a particular annotation is meant to be a type hint or something else.
A compromise is possible where a __future__ import could enable turning all annotations in a given module into string literals, as follows:
from __future__ import annotations
class ImSet:
def add(self, a: ImSet) -> List[ImSet]: ...
assert ImSet.add.__annotations__ == {'a': 'ImSet', 'return': 'List[ImSet]'}
Such a __future__ import statement may be proposed in a separate PEP.
The double colon
A few creative souls have tried to invent solutions for this problem. For example, it was proposed to use a double colon (::) for type hints, solving two problems at once: disambiguating between type hints and other annotations, and changing the semantics to preclude runtime evaluation. There are several things wrong with this idea, however.
- It's ugly. The single colon in Python has many uses, and all of them look familiar because they resemble the use of the colon in English text. This is a general rule of thumb by which Python abides for most forms of punctuation; the exceptions are typically well known from other programming languages. But this use of :: is unheard of in English, and in other languages (e.g. C++) it is used as a scoping operator, which is a very different beast. In contrast, the single colon for type hints reads naturally -- and no wonder, since it was carefully designed for this purpose (the idea long predates PEP 3107 [gvr-artima]). It is also used in the same fashion in other languages from Pascal to Swift.
- What would you do for return type annotations?
- It's actually a feature that type hints are evaluated at runtime.
- Making type hints available at runtime allows runtime type checkers to be built on top of type hints.
- It catches mistakes even when the type checker is not run. Since it is a separate program, users may choose not to run it (or even install it), but might still want to use type hints as a concise form of documentation. Broken type hints are no use even for documentation.
- Because it's new syntax, using the double colon for type hints would limit them to code that works with Python 3.5 only. By using existing syntax, the current proposal can easily work for older versions of Python 3. (And in fact mypy supports Python 3.2 and newer.)
- If type hints become successful we may well decide to add new syntax in the future to declare the type for variables, for example var age: int = 42. If we were to use a double colon for argument type hints, for consistency we'd have to use the same convention for future syntax, perpetuating the ugliness.
Other forms of new syntax
A few other forms of alternative syntax have been proposed, e.g. the introduction of a where keyword [roberge], and Cobra-inspired requires clauses. But these all share a problem with the double colon: they won't work for earlier versions of Python 3. The same would apply to a new __future__ import.
Other backwards compatible conventions
The ideas put forward include:
- A decorator, e.g. @typehints(name=str, returns=str). This could work, but it's pretty verbose (an extra line, and the argument names must be repeated), and a far cry in elegance from the PEP 3107 notation.
- Stub files. We do want stub files, but they are primarily useful for adding type hints to existing code that doesn't lend itself to adding type hints, e.g. 3rd party packages, code that needs to support both Python 2 and Python 3, and especially extension modules. For most situations, having the annotations in line with the function definitions makes them much more useful.
- Docstrings. There is an existing convention for docstrings, based on the Sphinx notation (:type arg1: description). This is pretty verbose (an extra line per parameter), and not very elegant. We could also make up something new, but the annotation syntax is hard to beat (because it was designed for this very purpose).
It's also been proposed to simply wait another release. But what problem would that solve? It would just be procrastination.
PEP Development Process
A live draft for this PEP lives on GitHub [github]. There is also an issue tracker [issues], where much of the technical discussion takes place.
The draft on GitHub is updated regularly in small increments. The official PEPs repo [peps] is (usually) only updated when a new draft is posted to python-dev.
Acknowledgements
This document could not be completed without valuable input, encouragement and advice from Jim Baker, Jeremy Siek, Michael Matson Vitousek, Andrey Vlasovskikh, Radomir Dopieralski, Peter Ludemann, and the BDFL-Delegate, Mark Shannon.
Influences include existing languages, libraries and frameworks mentioned in PEP 482. Many thanks to their creators, in alphabetical order: Stefan Behnel, William Edwards, Greg Ewing, Larry Hastings, Anders Hejlsberg, Alok Menghrajani, Travis E. Oliphant, Joe Pamer, Raoul-Gabriel Urma, and Julien Verlaguet.
References
| [mypy] | http://mypy-lang.org |
| [gvr-artima] | (1, 2) http://www.artima.com/weblogs/viewpost.jsp?thread=85551 |
| [wiki-variance] | http://en.wikipedia.org/wiki/Covariance_and_contravariance_%28computer_science%29 |
| [typeshed] | https://github.com/JukkaL/typeshed/ |
| [pyflakes] | https://github.com/pyflakes/pyflakes/ |
| [pylint] | http://www.pylint.org |
| [roberge] | http://aroberge.blogspot.com/2015/01/type-hinting-in-python-focus-on.html |
| [github] | https://github.com/ambv/typehinting |
| [issues] | https://github.com/ambv/typehinting/issues |
| [peps] | https://hg.python.org/peps/file/tip/pep-0484.txt |
Copyright
This document has been placed in the public domain.
pep-0485 A Function for testing approximate equality
| PEP: | 485 |
|---|---|
| Title: | A Function for testing approximate equality |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Christopher Barker <Chris.Barker at noaa.gov> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 20-Jan-2015 |
| Python-Version: | 3.5 |
| Post-History: | |
| Resolution: | https://mail.python.org/pipermail/python-dev/2015-February/138598.html |
Contents
Abstract
This PEP proposes the addition of an isclose() function to the standard library math module that determines whether one value is approximately equal, or "close", to another value.
Rationale
Floating point values have limited precision, which makes them unable to exactly represent some values and causes errors to accumulate with repeated computation. As a result, it is common advice to only use an equality comparison in very specific situations. Often an inequality comparison fits the bill, but there are times (often in testing) where the programmer wants to determine whether a computed value is "close" to an expected value, without requiring them to be exactly equal. This is common enough, particularly in testing, and not always obvious how to do it, that it would be a useful addition to the standard library.
Existing Implementations
The standard library includes the unittest.TestCase.assertAlmostEqual method, but it:
- Is buried in the unittest.TestCase class
- Is an assertion, so you can't use it as a general test at the command line, etc. (easily)
- Is an absolute difference test. Often the measure of difference requires, particularly for floating point numbers, a relative error, i.e. "Are these two values within x% of each-other?", rather than an absolute error. Particularly when the magnitude of the values is unknown a priori.
The numpy package has the allclose() and isclose() functions, but they are only available with numpy.
The statistics package tests include an implementation, used for its unit tests.
One can also find discussion and sample implementations on Stack Overflow and other help sites.
Many other non-Python systems provide such a test, including the Boost C++ library and the APL language [4].
These existing implementations indicate that this is a common need and not trivial to write oneself, making it a candidate for the standard library.
Proposed Implementation
NOTE: this PEP is the result of extended discussions on the python-ideas list [1].
The new function will go into the math module, and have the following signature:
isclose(a, b, rel_tol=1e-9, abs_tol=0.0)
a and b: are the two values to be tested to relative closeness
rel_tol: is the relative tolerance -- it is the amount of error allowed, relative to the larger absolute value of a or b. For example, to set a tolerance of 5%, pass rel_tol=0.05. The default tolerance is 1e-9, which assures that the two values are the same within about 9 decimal digits. rel_tol must be greater than 0.0.
abs_tol: is a minimum absolute tolerance level -- useful for comparisons near zero.
Modulo error checking, etc, the function will return the result of:
abs(a-b) <= max( rel_tol * max(abs(a), abs(b)), abs_tol )
The name, isclose, is selected for consistency with the existing isnan and isinf.
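A pure-Python sketch of the comparison above (error checking and the special-value handling described below are omitted; this is not the C implementation):

```python
def isclose(a, b, rel_tol=1e-9, abs_tol=0.0):
    # Sketch of the proposed test: the difference must be within the
    # relative tolerance scaled by the larger magnitude, or within the
    # absolute tolerance, whichever is larger.
    if rel_tol < 0.0 or abs_tol < 0.0:
        raise ValueError('tolerances must be non-negative')
    return abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)

print(isclose(1.0, 1.0 + 1e-10))          # → True
print(isclose(1e-10, 0.0))                # → False (no absolute tolerance)
print(isclose(1e-10, 0.0, abs_tol=1e-9))  # → True
```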
Handling of non-finite numbers
The IEEE 754 special values of NaN, inf, and -inf will be handled according to IEEE rules. Specifically, NaN is not considered close to any other value, including NaN. inf and -inf are only considered close to themselves.
Non-float types
The primary use-case is expected to be floating point numbers. However, users may want to compare other numeric types similarly. In theory, it should work for any type that supports abs(), multiplication, comparisons, and subtraction. However, the implementation in the math module is written in C, and thus can not (easily) use Python's duck typing. Rather, the values passed into the function will be converted to the float type before the calculation is performed. Passing in types (or values) that cannot be converted to floats will raise an appropriate Exception (TypeError, ValueError, or OverflowError).
The code will be tested to accommodate at least some values of these types:
- Decimal
- int
- Fraction
- complex: For complex, a companion function will be added to the cmath module. In cmath.isclose(), the tolerances are specified as floats, and the absolute value of the complex values will be used for scaling and comparison. If a complex tolerance is passed in, the absolute value will be used as the tolerance.
NOTE: it may make sense to add a Decimal.isclose() that works properly and completely with the decimal type, but that is not included as part of this PEP.
Behavior near zero
Relative comparison is problematic if either value is zero. By definition, no value is small relative to zero. And computationally, if either value is zero, the difference is the absolute value of the other value, and the computed absolute tolerance will be rel_tol times that value. When rel_tol is less than one, the difference will never be less than the tolerance.
However, while mathematically correct, there are many use cases where a user will need to know if a computed value is "close" to zero. This calls for an absolute tolerance test. If the user needs to call this function inside a loop or comprehension, where some, but not all, of the expected values may be zero, it is important that both a relative tolerance and absolute tolerance can be tested for with a single function with a single set of parameters.
There is a similar issue if the two values to be compared straddle zero: if a is approximately equal to -b, then a and b will never be computed as "close".
To handle this case, an optional parameter, abs_tol, can be used to set a minimum tolerance used in the case of very small or zero computed relative tolerance. That is, the values will always be considered close if the difference between them is less than abs_tol.
The default absolute tolerance value is set to zero because there is no value that is appropriate for the general case. It is impossible to know an appropriate value without knowing the likely values expected for a given use case. If all the values tested are on order of one, then a value of about 1e-9 might be appropriate, but that would be far too large if expected values are on order of 1e-9 or smaller.
Any non-zero default might result in users' tests passing totally inappropriately. If, on the other hand, a test against zero fails the first time with defaults, the user will be prompted to select an appropriate value for the problem at hand in order to get the test to pass.
NOTE: that the author of this PEP has resolved to go back over many of his tests that use the numpy allclose() function, which provides a default absolute tolerance, and make sure that the default value is appropriate.
If the user sets the rel_tol parameter to 0.0, then only the absolute tolerance will effect the result. While not the goal of the function, it does allow it to be used as a purely absolute tolerance check as well.
Implementation
A sample implementation in Python is available (as of Jan 22, 2015) on GitHub:
https://github.com/PythonCHB/close_pep/blob/master/is_close.py
This implementation has a flag that lets the user select which relative tolerance test to apply -- this PEP does not suggest that that be retained, but rather that the weak test be selected.
There are also drafts of this PEP and test code there.
Relative Difference
There are essentially two ways to think about how close two numbers are to each-other:
Absolute difference: simply abs(a-b)
Relative difference: abs(a-b)/scale_factor [2].
The absolute difference is trivial enough that this proposal focuses on the relative difference.
Usually, the scale factor is some function of the values under consideration, for instance:
- The absolute value of one of the input values
- The maximum absolute value of the two
- The minimum absolute value of the two.
- The absolute value of the arithmetic mean of the two
This leads to the following possibilities for determining if two values, a and b, are close to each other.
- abs(a-b) <= tol*abs(a)
- abs(a-b) <= tol * max( abs(a), abs(b) )
- abs(a-b) <= tol * min( abs(a), abs(b) )
- abs(a-b) <= tol * (a + b)/2
NOTE: (2) and (3) can also be written as:
- (abs(a-b) <= abs(tol*a)) or (abs(a-b) <= abs(tol*b))
- (abs(a-b) <= abs(tol*a)) and (abs(a-b) <= abs(tol*b))
(Boost refers to these as the "weak" and "strong" formulations [3]) These can be a tiny bit more computationally efficient, and thus are used in the example code.
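The equivalence of the max-based form (2) and Boost's two-comparison "weak" form can be spot-checked with toy functions (weak_max and weak_or are invented names):

```python
def weak_max(a, b, tol):
    # Formulation (2): scale the tolerance by the larger magnitude.
    return abs(a - b) <= tol * max(abs(a), abs(b))

def weak_or(a, b, tol):
    # Boost's equivalent "weak" formulation using two comparisons.
    return (abs(a - b) <= abs(tol * a)) or (abs(a - b) <= abs(tol * b))

# The two forms agree, including at the boundary:
for a, b in [(10.0, 9.0), (9.0, 10.0), (-10.0, 9.0), (0.0, 10.0), (1e-9, 0.0)]:
    assert weak_max(a, b, 0.1) == weak_or(a, b, 0.1)
print('the two weak formulations agree')
```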
Each of these formulations can lead to slightly different results. However, if the tolerance value is small, the differences are quite small. In fact, often less than available floating point precision.
How much difference does it make?
When selecting a method to determine closeness, one might want to know how much of a difference it could make to use one test or the other -- i.e. how many values are there (or what range of values) that will pass one test, but not the other.
The largest difference is between options (2) and (3) where the allowable absolute difference is scaled by either the larger or smaller of the values.
Define delta to be the difference between the allowable absolute tolerance defined by the larger value and that defined by the smaller value. That is, the amount that the two input values need to be different in order to get a different result from the two tests. tol is the relative tolerance value.
Assume that a is the larger value and that both a and b are positive, to make the analysis a bit easier. delta is therefore:
delta = tol * (a-b)
or:
delta / tol = (a-b)
The largest absolute difference that would pass the test, (a-b), equals the tolerance times the larger value:
(a-b) = tol * a
Substituting into the expression for delta:
delta / tol = tol * a
so:
delta = tol**2 * a
For example, for a = 10, b = 9, tol = 0.1 (10%):
maximum tolerance tol * a == 0.1 * 10 == 1.0
minimum tolerance tol * b == 0.1 * 9.0 == 0.9
delta = (1.0 - 0.9) = 0.1, or tol**2 * a = 0.1**2 * 10 = 0.1
The absolute difference between the maximum and minimum tolerance tests in this case could be substantial. However, the primary use case for the proposed function is testing the results of computations. In that case a relative tolerance is likely to be selected of much smaller magnitude.
For example, a relative tolerance of 1e-8 is about half the precision available in a python float. In that case, the difference between the two tests is 1e-8**2 * a or 1e-16 * a, which is close to the limit of precision of a python float. If the relative tolerance is set to the proposed default of 1e-9 (or smaller), the difference between the two tests will be lost to the limits of precision of floating point. That is, each of the four methods will yield exactly the same results for all values of a and b.
In addition, in common use, tolerances are defined to 1 significant figure -- that is, 1e-9 is specifying about 9 decimal digits of accuracy. So the difference between the various possible tests is well below the precision to which the tolerance is specified.
Symmetry
A relative comparison can be either symmetric or non-symmetric. For a symmetric algorithm:
isclose(a,b) is always the same as isclose(b,a)
If a relative closeness test uses only one of the values (such as (1) above), then the result is asymmetric, i.e. isclose(a,b) is not necessarily the same as isclose(b,a).
Which approach is most appropriate depends on what question is being asked. If the question is: "are these two numbers close to each other?", there is no obvious ordering, and a symmetric test is most appropriate.
However, if the question is: "Is the computed value within x% of this known value?", then it is appropriate to scale the tolerance to the known value, and an asymmetric test is most appropriate.
From the previous section, it is clear that either approach would yield the same or similar results in the common use cases. In that case, the goal of this proposal is to provide a function that is least likely to produce surprising results.
The symmetric approach provides an appealing consistency -- it mirrors the symmetry of equality, and is less likely to confuse people. A symmetric test also relieves the user of the need to think about the order in which to set the arguments. It was also pointed out that there may be some cases where the order of evaluation may not be well defined, for instance in the case of comparing a set of values all against each other.
There may be cases when a user does need to know that a value is within a particular range of a known value. In that case, it is easy enough to simply write the test directly:
if a-b <= tol*a:
(assuming a > b in this case). There is little need to provide a function for this particular case.
This proposal uses a symmetric test.
Which symmetric test?
Three symmetric tests were considered: scaling the tolerance by the maximum of the two values, by the minimum, or by their arithmetic mean (options (2), (3) and (4) above).
The case that uses the arithmetic mean of the two values requires that the values either be added together before dividing by 2, which could result in extra overflow to inf for very large numbers, or that each value be divided by two before being added together, which could result in underflow to zero for very small numbers. This effect would only occur at the very limit of float values, but it was decided there was no benefit to the method worth reducing the range of functionality or adding the complexity of checking values to determine the order of computation.
This leaves the Boost "weak" test (2), which uses the larger value to scale the tolerance, and the Boost "strong" test (3), which uses the smaller of the values to scale the tolerance. For small tolerances they yield the same result, but this proposal uses the Boost "weak" test: it is symmetric and provides a more useful result for very large tolerances.
Large Tolerances
The most common use case is expected to be small tolerances -- on order of the default 1e-9. However, there may be use cases where a user wants to know if two fairly disparate values are within a particular range of each other: "is a within 200% (rel_tol = 2.0) of b?" In this case, the strong test would never indicate that two values are within that range of each other if one of them is zero. The weak case, however, would use the larger (non-zero) value for the test, and thus return true if one value is zero. For example: is 0 within 200% of 10? 200% of ten is 20, so the range within 200% of ten is -10 to +30. Zero falls within that range, so it will return True.
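The zero case can be demonstrated with toy versions of the two tests (isclose_weak and isclose_strong are invented names for this illustration):

```python
def isclose_weak(a, b, rel_tol):
    # "Weak" test: scale the tolerance by the larger magnitude.
    return abs(a - b) <= rel_tol * max(abs(a), abs(b))

def isclose_strong(a, b, rel_tol):
    # "Strong" test: scale the tolerance by the smaller magnitude.
    return abs(a - b) <= rel_tol * min(abs(a), abs(b))

print(isclose_weak(0.0, 10.0, 2.0))    # → True: 10 <= 2.0 * 10
print(isclose_strong(0.0, 10.0, 2.0))  # → False: 2.0 * 0 is still 0
```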
Defaults
Default values are required for the relative and absolute tolerance.
Relative Tolerance Default
The relative tolerance required for two values to be considered "close" is entirely use-case dependent. Nevertheless, the relative tolerance needs to be greater than 1e-16 (the approximate precision of a Python float). The value of 1e-9 was selected because it is the largest relative tolerance for which the various possible methods will yield the same result, and it is also about half of the precision available to a Python float. In the general case, a good numerical algorithm is not expected to lose more than about half of available digits of accuracy, and if a much larger tolerance is acceptable, the user should be considering the proper value in that case. Thus 1e-9 is expected to "just work" for many cases.
Absolute tolerance default
The absolute tolerance value will be used primarily for comparing to zero. The absolute tolerance required to determine if a value is "close" to zero is entirely use-case dependent. There is also essentially no bounds to the useful range -- expected values would conceivably be anywhere within the limits of a python float. Thus a default of 0.0 is selected.
If, for a given use case, a user needs to compare to zero, the test will be guaranteed to fail the first time, and the user can select an appropriate value.
It was suggested that comparing to zero is, in fact, a common use case (evidence suggests that the numpy functions are often used with zero). In this case, it would be desirable to have a "useful" default. Values around 1e-8 were suggested, being about half of the floating point precision for values around 1.0.
However, to quote The Zen: "In the face of ambiguity, refuse the temptation to guess." Guessing that users will most often be concerned with values close to 1.0 would lead to spurious passing tests when used with smaller values -- this is potentially more damaging than requiring the user to thoughtfully select an appropriate value.
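The risk of guessing can be illustrated with a sketch of the proposed test (this is our simplified rendering of the formula, not the final stdlib code): a guessed non-zero default such as 1e-8 makes comparisons between small values pass spuriously, while the 0.0 default forces an explicit choice.

```python
def isclose(a, b, rel_tol=1e-9, abs_tol=0.0):
    # the proposed test: whichever of the relative and absolute
    # tolerances allows the larger difference wins
    return abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)

# With a guessed default of abs_tol=1e-8, values that differ by a
# factor of nine would be reported "close":
print(isclose(1e-10, 9e-10, abs_tol=1e-8))  # True (spurious)

# With the chosen default of 0.0, the relative test governs:
print(isclose(1e-10, 9e-10))  # False
# ...and comparisons to zero fail until abs_tol is set explicitly:
print(isclose(0.0, 1e-12))    # False
```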
Expected Uses
The primary expected use case is various forms of testing -- "are the results computed near what I expect as a result?" This sort of test may or may not be part of a formal unit testing suite. Such testing could be used one-off at the command line, in an IPython notebook, as part of doctests, or as simple asserts in an if __name__ == "__main__" block.
It would also be an appropriate function to use for the termination criteria for a simple iterative solution to an implicit function:
guess = something
while True:
    new_guess = implicit_function(guess, *args)
    if isclose(new_guess, guess):
        break
    guess = new_guess
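A concrete instance of this pattern, using math.isclose (the function this PEP proposes, which landed in the math module in Python 3.5) and Heron's iteration for a square root:

```python
import math

def iterate_sqrt(x):
    # Heron's method: sqrt(x) is a fixed point of g -> (g + x/g) / 2.
    # Iterate until successive guesses are "close".
    guess = x
    while True:
        new_guess = (guess + x / guess) / 2
        if math.isclose(new_guess, guess):
            return new_guess
        guess = new_guess

print(iterate_sqrt(2.0))  # 1.41421356...
```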
Inappropriate uses
One use case for floating point comparison is testing the accuracy of a numerical algorithm. However, in this case, the numerical analyst ideally would be doing careful error propagation analysis, and should understand exactly what to test for. It is also likely that ULP (Unit in the Last Place) comparison may be called for. While this function may prove useful in such situations, it is not intended to be used in that way without careful consideration.
Other Approaches
unittest.TestCase.assertAlmostEqual
(https://docs.python.org/3/library/unittest.html#unittest.TestCase.assertAlmostEqual)
Tests that values are approximately (or not approximately) equal by computing the difference, rounding to the given number of decimal places (default 7), and comparing to zero.
This method is purely an absolute tolerance test, and does not address the need for a relative tolerance test.
numpy isclose()
http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.isclose.html
The numpy package provides the vectorized functions isclose() and allclose(), for similar use cases as this proposal:
isclose(a, b, rtol=1e-05, atol=1e-08, equal_nan=False)
Returns a boolean array where two arrays are element-wise equal within a tolerance.
The tolerance values are positive, typically very small numbers. The relative difference (rtol * abs(b)) and the absolute difference atol are added together to compare against the absolute difference between a and b.
In this approach, the absolute and relative tolerances are added together, rather than the "or" (maximum) method used in this proposal. This is computationally simpler, and if the relative tolerance is larger than the absolute tolerance, then the addition will have no effect. However, if the absolute and relative tolerances are of similar magnitude, then the allowed difference will be about twice as large as expected.
This makes the function harder to understand, with no computational advantage in this context.
Even more critically, if the values passed in are small compared to the absolute tolerance, then the relative tolerance will be completely swamped, perhaps unexpectedly.
This is why, in this proposal, the absolute tolerance defaults to zero -- the user will be required to choose a value appropriate for the values at hand.
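The difference between the two approaches can be sketched as follows (simplified scalar versions for illustration; neither is the actual numpy or stdlib implementation):

```python
def numpy_style(a, b, rtol=1e-05, atol=1e-08):
    # numpy: the tolerances are added, and only b scales the
    # relative term (hence the test is asymmetric)
    return abs(a - b) <= atol + rtol * abs(b)

def proposed(a, b, rel_tol=1e-9, abs_tol=0.0):
    # this proposal: the larger of the two allowed differences wins
    return abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)

# For values small compared to atol, the relative test is swamped:
print(numpy_style(1e-10, 2e-10))  # True -- atol dominates
print(proposed(1e-10, 2e-10))     # False -- values differ by 2x
```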
Boost floating-point comparison
The Boost project [3] provides a floating point comparison function. It is a symmetric approach, with both "weak" (larger of the two relative errors) and "strong" (smaller of the two relative errors) options. This proposal uses the Boost "weak" approach. There is no need to complicate the API by providing the option to select different methods when the results will be similar in most cases, and the user is unlikely to know which to select in any case.
Alternate Proposals
A Recipe
The primary alternate proposal was to not provide a standard library function at all, but rather, provide a recipe for users to refer to. This would have the advantage that the recipe could provide and explain the various options, and let the user select that which is most appropriate. However, that would require anyone needing such a test to, at the very least, copy the function into their code base, and select the comparison method to use.
zero_tol
One possibility was to provide a zero tolerance parameter, rather than the absolute tolerance parameter. This would be an absolute tolerance that would only be applied in the case of one of the arguments being exactly zero. This would have the advantage of retaining the full relative tolerance behavior for all non-zero values, while allowing tests against zero to work. However, it would also result in the potentially surprising result that a small value could be "close" to zero, but not "close" to an even smaller value. e.g., 1e-10 is "close" to zero, but not "close" to 1e-11.
No absolute tolerance
Given the issues with comparing to zero, another possibility would have been to only provide a relative tolerance, and let comparison to zero fail. In this case, the user would need to do a simple absolute test: abs(val) < zero_tol in the case where the comparison involved zero.
However, this would not allow the same call to be used for a sequence of values, such as in a loop or comprehension, making the function far less useful. It is noted that the default abs_tol=0.0 achieves the same effect if the default is not overridden.
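For example, with abs_tol overridden, a single expression covers a whole sequence, including elements that are exactly zero (isclose here is our sketch of the proposed formula, and the values are purely illustrative):

```python
def isclose(a, b, rel_tol=1e-9, abs_tol=0.0):
    # sketch of the proposed test: relative OR absolute tolerance
    return abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)

expected = [0.0, 1.0, 1e6]
computed = [1e-13, 1.0 + 1e-12, 1e6 + 1e-4]

# One call form works for every element, zero included:
print(all(isclose(e, c, abs_tol=1e-12)
          for e, c in zip(expected, computed)))  # True
```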
Other tests
The other tests considered are all discussed in the Relative Error section above.
References
| [2] | Wikipedia page on relative difference |
| [3] | (1, 2) Boost project floating-point comparison algorithms |
| [4] | R. H. Lathwell. APL comparison tolerance. Proceedings of the Eighth International Conference on APL, 1976, pages 255-258 |
Copyright
This document has been placed in the public domain.
pep-0486 Make the Python Launcher aware of virtual environments
| PEP: | 486 |
|---|---|
| Title: | Make the Python Launcher aware of virtual environments |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Paul Moore <p.f.moore at gmail.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 12-Feb-2015 |
| Python-Version: | 3.5 |
| Post-History: | 12-Feb-2015 |
| Resolution: | https://mail.python.org/pipermail/python-dev/2015-February/138579.html |
Contents
Abstract
The Windows installers for Python include a launcher that locates the correct Python interpreter to run (see PEP 397). However, the launcher is not aware of virtual environments (virtualenv [1] or PEP 405 based), and so cannot be used to run commands from the active virtualenv.
This PEP proposes making the launcher "virtualenv aware". This means that when run without specifying an explicit Python interpreter to use, the launcher will use the currently active virtualenv, if any, before falling back to the configured default Python.
Rationale
Windows users with multiple copies of Python installed need a means of selecting which one to use. The Python launcher provides this facility by means of a py command that can be used to run either a configured "default" Python or a specific interpreter, by means of command line arguments. So typical usage would be:
# Run the Python interactive interpreter
py

# Execute an installed module
py -m pip install pytest
py -m pytest
When using virtual environments, the py launcher is unaware that a virtualenv is active, and will continue to use the system Python. So different command invocations are needed to run the same commands in a virtualenv:
# Run the Python interactive interpreter
python

# Execute an installed module (these could use python -m,
# which is longer to type but is a little more similar to the
# launcher approach)
pip install pytest
py.test
Having to use different commands is error-prone, and in many cases the error is difficult to spot immediately. The PEP proposes making the py command usable with virtual environments, so that the first form of command can be used in all cases.
Implementation
Both virtualenv and the core venv module set an environment variable VIRTUAL_ENV when activating a virtualenv. This PEP proposes that the launcher checks for the VIRTUAL_ENV environment variable whenever it would run the "default" Python interpreter for the system (i.e., when no specific version flags such as py -2.7 are used) and if present, run the Python interpreter for the virtualenv rather than the default system Python.
The "default" Python interpreter referred to above is (as per PEP 397) either the latest version of Python installed on the system, or a version configured via the py.ini configuration file. When the user specifies an explicit Python version on the command line, this will always be used (as at present).
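The decision logic can be sketched in Python (the real launcher is a Windows C program; the function and parameter names below are illustrative only, not part of the launcher's code):

```python
import os

def choose_interpreter(version_flag=None, default_python='python.exe'):
    # An explicit version on the command line (e.g. "py -2.7")
    # always wins, exactly as at present.
    if version_flag is not None:
        return 'python' + version_flag + '.exe'
    # Otherwise, prefer the active virtualenv, if any...
    venv = os.environ.get('VIRTUAL_ENV')
    if venv:
        return venv + '/Scripts/python.exe'
    # ...before falling back to the configured default
    # (py.ini, or the latest installed Python).
    return default_python
```

For instance, with VIRTUAL_ENV set by an activated virtualenv, a bare py would run that environment's interpreter.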
Impact on Script Launching
As well as interactive use, the launcher is used as the Windows file association for Python scripts. In that case, a "shebang" (#!) line at the start of the script is used to identify the interpreter to run. A fully-qualified path can be used, or a version-specific Python (python3 or python2, or even python3.5), or the generic python, which means to use the default interpreter.
The launcher also looks for the specific shebang line #!/usr/bin/env python. On Unix, the env program searches for a command on $PATH and runs the command so located. Similarly, with this shebang line, the launcher will look for a copy of python.exe on the user's current %PATH% and will run that copy.
As activating a virtualenv means that it is added to PATH, no special handling is needed to run scripts with the active virtualenv - they just need to use the #!/usr/bin/env python shebang line, exactly as on Unix. (If there is no activated virtualenv, and no python.exe on PATH, the launcher will look for a default Python exactly as if the shebang line had said #!python).
Exclusions
The PEP makes no attempt to promote the use of the launcher for running Python on Windows. Most existing documentation assumes the use of python as the command to run Python, and (for example) pip to run an installed Python command. This documentation is not expected to change, and users who choose to manage their PATH environment variable can continue to use this form. The focus of this PEP is purely on allowing users who prefer to use the launcher when dealing with their system Python installations to be able to continue to do so when using virtual environments.
Reference Implementation
A patch implementing the proposed behaviour is available at http://bugs.python.org/issue23465
Copyright
This document has been placed in the public domain.
pep-0487 Simpler customisation of class creation
| PEP: | 487 |
|---|---|
| Title: | Simpler customisation of class creation |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Martin Teichmann <lkb.teichmann at gmail.com>, |
| Status: | Draft |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 27-Feb-2015 |
| Python-Version: | 3.5 |
| Post-History: | 27-Feb-2015 |
| Replaces: | 422 |
Contents
Abstract
Currently, customising class creation requires the use of a custom metaclass. This custom metaclass then persists for the entire lifecycle of the class, creating the potential for spurious metaclass conflicts.
This PEP proposes to instead support a wide range of customisation scenarios through a new namespace parameter in the class header, and a new __init_subclass__ hook in the class body.
The new mechanism should be easier to understand and use than implementing a custom metaclass, and thus should provide a gentler introduction to the full power of Python's metaclass machinery.
Connection to other PEPs
This is a competing proposal to PEP 422 by Nick Coghlan and Daniel Urban. It shares most of the PEP text and proposed code, but has major differences in how it achieves its goals.
Background
For an already created class cls, the term "metaclass" has a clear meaning: it is the value of type(cls).
During class creation, it has another meaning: it is also used to refer to the metaclass hint that may be provided as part of the class definition. While in many cases these two meanings end up referring to one and the same object, there are two situations where that is not the case:
- If the metaclass hint refers to an instance of type, then it is considered as a candidate metaclass along with the metaclasses of all of the parents of the class being defined. If a more appropriate metaclass is found amongst the candidates, then it will be used instead of the one given in the metaclass hint.
- Otherwise, an explicit metaclass hint is assumed to be a factory function and is called directly to create the class object. In this case, the final metaclass will be determined by the factory function definition. In the typical case (where the factory function just calls type, or, in Python 3.3 or later, types.new_class) the actual metaclass is then determined based on the parent classes.
It is notable that only the actual metaclass is inherited - a factory function used as a metaclass hook sees only the class currently being defined, and is not invoked for any subclasses.
In Python 3, the metaclass hint is provided using the metaclass=Meta keyword syntax in the class header. This allows the __prepare__ method on the metaclass to be used to create the locals() namespace used during execution of the class body (for example, specifying the use of collections.OrderedDict instead of a regular dict).
In Python 2, there was no __prepare__ method (that API was added for Python 3 by PEP 3115). Instead, a class body could set the __metaclass__ attribute, and the class creation process would extract that value from the class namespace to use as the metaclass hint. There is published code [1] that makes use of this feature.
Another new feature in Python 3 is the zero-argument form of the super() builtin, introduced by PEP 3135. This feature uses an implicit __class__ reference to the class being defined to replace the "by name" references required in Python 2. Just as code invoked during execution of a Python 2 metaclass could not call methods that referenced the class by name (as the name had not yet been bound in the containing scope), similarly, Python 3 metaclasses cannot call methods that rely on the implicit __class__ reference (as it is not populated until after the metaclass has returned control to the class creation machinery).
Finally, when a class uses a custom metaclass, it can pose additional challenges to the use of multiple inheritance, as a new class cannot inherit from parent classes with unrelated metaclasses. This means that it is impossible to add a metaclass to an already published class: such an addition is a backwards incompatible change due to the risk of metaclass conflicts.
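The conflict described above is easy to reproduce (this snippet only demonstrates the error; it is not part of the proposal):

```python
class MetaA(type):
    pass

class MetaB(type):
    pass

class A(metaclass=MetaA):
    pass

class B(metaclass=MetaB):
    pass

try:
    # No metaclass is a subclass of both MetaA and MetaB,
    # so combining A and B fails at class creation time.
    class C(A, B):
        pass
except TypeError as exc:
    conflict_message = str(exc)
    print(conflict_message)  # "metaclass conflict: ..."
```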
Proposal
This PEP proposes that a new mechanism to customise class creation be added to Python 3.5 that meets the following criteria:
- Integrates nicely with class inheritance structures (including mixins and multiple inheritance),
- Integrates nicely with the implicit __class__ reference and zero-argument super() syntax introduced by PEP 3135,
- Can be added to an existing base class without a significant risk of introducing backwards compatibility problems, and
- Restores the ability for class namespaces to have some influence on the class creation process (above and beyond populating the namespace itself), but potentially without the full flexibility of the Python 2 style __metaclass__ hook.
Those goals can be achieved by adding two functionalities:
- An __init_subclass__ hook that initializes all subclasses of a given class, and
- A new keyword parameter namespace to the class creation statement, that provides the initial namespace.
As an example, the first proposal looks as follows:
class SpamBase:
    # this is implicitly a @classmethod
    def __init_subclass__(cls, ns, **kwargs):
        # This is invoked after a subclass is created, but before
        # explicit decorators are called.
        # The usual super() mechanisms are used to correctly support
        # multiple inheritance.
        # ns is the class's namespace
        # **kwargs are the keyword arguments to the subclass's
        # class creation statement
        super().__init_subclass__(ns, **kwargs)
class Spam(SpamBase):
    pass

# the new hook is called on Spam
To simplify the cooperative multiple inheritance case, object will gain a default implementation of the hook that does nothing:
class object:
    def __init_subclass__(cls, ns):
        pass
Note that this method has no keyword arguments, meaning that all methods which are more specialized have to process all keyword arguments.
This general proposal is not a new idea (it was first suggested for inclusion in the language definition more than 10 years ago [2], and a similar mechanism has long been supported by Zope's ExtensionClass [3]), but the situation has changed sufficiently in recent years that the idea is worth reconsidering for inclusion.
The second part of the proposal is to have a namespace keyword argument to the class declaration statement. If present, its value will be called without arguments to initialize a subclass's namespace, very much like a metaclass __prepare__ method would do.
In addition, the introduction of the metaclass __prepare__ method in PEP 3115 allows a further enhancement that was not possible in Python 2: this PEP also proposes that type.__prepare__ be updated to accept a factory function as a namespace keyword-only argument. If present, the value provided as the namespace argument will be called without arguments to create the result of type.__prepare__ instead of using a freshly created dictionary instance. For example, the following will use an ordered dictionary as the class namespace:
class OrderedBase(namespace=collections.OrderedDict):
    pass

class Ordered(OrderedBase):
    # cls.__dict__ is still a read-only proxy to the class namespace,
    # but the underlying storage is an OrderedDict instance
    pass
Note
This PEP, along with the existing ability to use __prepare__ to share a single namespace amongst multiple class objects, highlights a possible issue with the attribute lookup caching: when the underlying mapping is updated by other means, the attribute lookup cache is not invalidated correctly (this is a key part of the reason class __dict__ attributes produce a read-only view of the underlying storage).
Since the optimisation provided by that cache is highly desirable, the use of a preexisting namespace as the class namespace may need to be declared as officially unsupported (since the observed behaviour is rather strange when the caches get out of sync).
Key Benefits
Easier use of custom namespaces for a class
Currently, to use a different type (such as collections.OrderedDict) for a class namespace, or to use a pre-populated namespace, it is necessary to write and use a custom metaclass. With this PEP, using a custom namespace becomes as simple as specifying an appropriate factory function in the class header.
Easier inheritance of definition time behaviour
Understanding Python's metaclasses requires a deep understanding of the type system and the class construction process. This is legitimately seen as challenging, due to the need to keep multiple moving parts (the code, the metaclass hint, the actual metaclass, the class object, instances of the class object) clearly distinct in your mind. Even when you know the rules, it's still easy to make a mistake if you're not being extremely careful.
Understanding the proposed implicit class initialization hook only requires ordinary method inheritance, which isn't quite as daunting a task. The new hook provides a more gradual path towards understanding all of the phases involved in the class definition process.
Reduced chance of metaclass conflicts
One of the big issues that makes library authors reluctant to use metaclasses (even when they would be appropriate) is the risk of metaclass conflicts. These occur whenever two unrelated metaclasses are used by the desired parents of a class definition. This risk also makes it very difficult to add a metaclass to a class that has previously been published without one.
By contrast, adding an __init_subclass__ method to an existing type poses a similar level of risk to adding an __init__ method: technically, there is a risk of breaking poorly implemented subclasses, but when that occurs, it is recognised as a bug in the subclass rather than the library author breaching backwards compatibility guarantees.
Integrates cleanly with PEP 3135
Given that the method is called on already existing classes, the new hook will be able to freely invoke class methods that rely on the implicit __class__ reference introduced by PEP 3135, including methods that use the zero argument form of super().
Replaces many use cases for dynamic setting of __metaclass__
For use cases that don't involve completely replacing the defined class, Python 2 code that dynamically set __metaclass__ can now dynamically set __init_subclass__ instead. For more advanced use cases, introduction of an explicit metaclass (possibly made available as a required base class) will still be necessary in order to support Python 3.
A path of introduction into Python
Most of the benefits of this PEP can already be implemented using a simple metaclass. For the __init_subclass__ hook this works all the way down to Python 2.7, while the namespace keyword needs Python 3.0 to work. Such a class has been uploaded to PyPI [4].
The only drawback of such a metaclass are the mentioned problems with metaclasses and multiple inheritance. Two classes using such a metaclass can only be combined if they use exactly the same metaclass. This fact calls for the inclusion of such a class into the standard library, let's call it SubclassMeta, with a base class using it called SubclassInit. Once all users use this standard library metaclass, classes from different packages can easily be combined.
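Such a metaclass might be sketched as follows. The names SubclassMeta and SubclassInit come from the text above, but the hook is renamed __init_subclass_ns__ here so the sketch also runs on Python 3.6 and later, where a different, namespace-less __init_subclass__ exists natively; everything else is an illustrative emulation, not the actual code of the PyPI package.

```python
class SubclassMeta(type):
    def __new__(mcls, name, bases, ns, **kwargs):
        hook = ns.get('__init_subclass_ns__')
        if hook is not None and not isinstance(hook, classmethod):
            # the proposal makes the hook an implicit classmethod
            ns['__init_subclass_ns__'] = classmethod(hook)
        cls = super().__new__(mcls, name, bases, dict(ns))
        # look the hook up on the parents only, so the class that
        # defines the hook is itself skipped
        parent_hook = getattr(super(cls, cls), '__init_subclass_ns__', None)
        if parent_hook is not None:
            parent_hook(ns, **kwargs)
        return cls

    def __init__(cls, name, bases, ns, **kwargs):
        # absorb class keyword arguments so type.__init__ is not upset
        super().__init__(name, bases, ns)

class SubclassInit(metaclass=SubclassMeta):
    # default no-op implementation, as the proposal adds to object
    def __init_subclass_ns__(cls, ns, **kwargs):
        pass

# Subclass registration, using the emulation:
class PluginBase(SubclassInit):
    subclasses = []

    def __init_subclass_ns__(cls, ns, **kwargs):
        super().__init_subclass_ns__(ns, **kwargs)
        cls.subclasses.append(cls)

class Plugin(PluginBase):
    pass

print(PluginBase.subclasses)  # the list now contains Plugin
```

Note that the hook runs for Plugin but not for PluginBase itself, because the lookup deliberately starts above the class being created.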
But such classes still cannot be easily combined with other classes using other metaclasses. Authors of metaclasses should bear that in mind and inherit from the standard metaclass if it seems useful for users of the metaclass to add more functionality. Ultimately, if the need to combine with other metaclasses is strong enough, the proposed functionality may be introduced into Python's type.
These arguments strongly suggest the following procedure for including the proposed functionality into Python:
- The metaclass implementing this proposal is put onto PyPI, so that it can be used and scrutinized.
- Once the code is properly mature, it can be added to the Python standard library. There should be a new module called metaclass which collects tools for metaclass authors, as well as documentation of the best practices for writing metaclasses.
- If the need to combine this metaclass with other metaclasses is strong enough, it may be included into Python itself.
New Ways of Using Classes
This proposal has many use cases, such as the following. In the examples, we still inherit from the SubclassInit base class. This would become unnecessary once this PEP is included in Python directly.
Subclass registration
Especially when writing a plugin system, one likes to register new subclasses of a plugin base class. This can be done as follows:
class PluginBase(SubclassInit):
    subclasses = []

    def __init_subclass__(cls, ns, **kwargs):
        super().__init_subclass__(ns, **kwargs)
        cls.subclasses.append(cls)
One should note that this also works nicely as a mixin class.
Trait descriptors
There are many designs of Python descriptors in the wild which, for example, check boundaries of values. Often those "traits" need some support from a metaclass to work. This is how they would look with this PEP:
class Trait:
    def __get__(self, instance, owner):
        return instance.__dict__[self.key]

    def __set__(self, instance, value):
        instance.__dict__[self.key] = value

class Int(Trait):
    def __set__(self, instance, value):
        # some boundary check code here
        super().__set__(instance, value)

class HasTraits(SubclassInit):
    def __init_subclass__(cls, ns, **kwargs):
        super().__init_subclass__(ns, **kwargs)
        for k, v in ns.items():
            if isinstance(v, Trait):
                v.key = k
The new namespace keyword in the class header enables a number of interesting options for controlling the way a class is initialised, including some aspects of the object models of both JavaScript and Ruby.
Order preserving classes
class OrderedClassBase(namespace=collections.OrderedDict):
    pass

class OrderedClass(OrderedClassBase):
    a = 1
    b = 2
    c = 3
Prepopulated namespaces
seed_data = dict(a=1, b=2, c=3)

class PrepopulatedClass(namespace=seed_data.copy):
    pass
Cloning a prototype class
class NewClass(namespace=Prototype.__dict__.copy):
    pass
Rejected Design Options
Calling the hook on the class itself
Adding an __autodecorate__ hook that would be called on the class itself was the proposed idea of PEP 422. Most examples work the same way or even better if the hook is called only on subclasses. In general, it is much easier to explicitly call the hook on the class in which it is defined (to opt in to such behavior) than to opt out, i.e., to prevent the hook from being called on the class that defines it.
This becomes most evident if the class in question is designed as a mixin: it is very unlikely that the code of the mixin is to be executed for the mixin class itself, as it is not supposed to be a complete class on its own.
The original proposal also made major changes in the class initialization process, rendering it impossible to back-port the proposal to older python versions.
Other variants of calling the hook
Other names for the hook were presented, namely __decorate__ and __autodecorate__. This proposal opts for __init_subclass__ as it is very close to the __init__ method, just for the subclass, while it is not very close to decorators, as it does not return the class.
Requiring an explicit decorator on __init_subclass__
One could require the explicit use of @classmethod on the __init_subclass__ method. It was made implicit since there's no sensible interpretation for leaving it out, and that case would need to be detected anyway in order to give a useful error message.
This decision was reinforced after noticing that the user experience of defining __prepare__ and forgetting the @classmethod method decorator is singularly incomprehensible (particularly since PEP 3115 documents it as an ordinary method, and the current documentation doesn't explicitly say anything one way or the other).
Passing in the namespace directly rather than a factory function
At one point, PEP 422 proposed that the class namespace be passed directly as a keyword argument, rather than passing a factory function. However, this encourages an unsupported behaviour (that is, passing the same namespace to multiple classes, or retaining direct write access to a mapping used as a class namespace), so the API was switched to the factory function version.
Possible Extensions
Some extensions to this PEP are imaginable, which are postponed to a later PEP:
- A __new_subclass__ method could be defined which acts like a __new__ for classes. This would be very close to __autodecorate__ in PEP 422.
- __subclasshook__ could be made a classmethod in a class instead of a method in the metaclass.
References
| [1] | http://mail.python.org/pipermail/python-dev/2012-June/119878.html |
| [2] | http://mail.python.org/pipermail/python-dev/2001-November/018651.html |
| [3] | http://docs.zope.org/zope_secrets/extensionclass.html |
| [4] | https://pypi.python.org/pypi/metaclass |
Copyright
This document has been placed in the public domain.
pep-0488 Elimination of PYO files
| PEP: | 488 |
|---|---|
| Title: | Elimination of PYO files |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Brett Cannon <brett at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 20-Feb-2015 |
| Python-Version: | 3.5 |
| Post-History: | 06-Mar-2015, 13-Mar-2015, 20-Mar-2015 |
Contents
Abstract
This PEP proposes eliminating the concept of PYO files from Python. To continue the support of the separation of bytecode files based on their optimization level, this PEP proposes extending the PYC file name to include the optimization level in the bytecode repository directory when there are optimizations applied.
Rationale
As of today, bytecode files come in two flavours: PYC and PYO. A PYC file is the bytecode file generated and read from when no optimization level is specified at interpreter startup (i.e., -O is not specified). A PYO file represents the bytecode file that is read/written when any optimization level is specified (i.e., when -O or -OO is specified). This means that while PYC files clearly delineate the optimization level used when they were generated -- namely no optimizations beyond the peepholer -- the same is not true for PYO files. To put this in terms of optimization levels and the file extension:
- 0: .pyc
- 1 (-O): .pyo
- 2 (-OO): .pyo
The reuse of the .pyo file extension for both level 1 and 2 optimizations means that there is no clear way to tell what optimization level was used to generate the bytecode file. In terms of reading PYO files, this can lead to an interpreter using a mixture of optimization levels with its code if the user was not careful to make sure all PYO files were generated using the same optimization level (typically done by blindly deleting all PYO files and then using the compileall module to compile all-new PYO files [1]). This issue is only compounded when people optimize Python code beyond what the interpreter natively supports, e.g., using the astoptimizer project [2].
In terms of writing PYO files, the need to delete all PYO files every time one either changes the optimization level they want to use or are unsure of what optimization was used the last time PYO files were generated leads to unnecessary file churn. The change proposed by this PEP also allows for all optimization levels to be pre-compiled for bytecode files ahead of time, something that is currently impossible thanks to the reuse of the .pyo file extension for multiple optimization levels.
As for distributing bytecode-only modules, having to distribute both .pyc and .pyo files is unnecessary for the common use-case of code obfuscation and smaller file deployments. This means that bytecode-only modules will only load from their non-optimized .pyc file name.
Proposal
To eliminate the ambiguity that PYO files present, this PEP proposes eliminating the concept of PYO files and their accompanying .pyo file extension. To allow for the optimization level to be unambiguous as well as to avoid having to regenerate optimized bytecode files needlessly in the __pycache__ directory, the optimization level used to generate the bytecode file will be incorporated into the bytecode file name. When no optimization level is specified, the pre-PEP .pyc file name will be used (i.e., no optimization level will be specified in the file name). For example, a source file named foo.py in CPython 3.5 could have the following bytecode files based on the interpreter's optimization level (none, -O, and -OO):
- 0: foo.cpython-35.pyc (i.e., no change)
- 1: foo.cpython-35.opt-1.pyc
- 2: foo.cpython-35.opt-2.pyc
Currently bytecode file names are created by importlib.util.cache_from_source(), approximately using the following expression defined by PEP 3147 [3], [4], [5]:
'{name}.{cache_tag}.pyc'.format(name=module_name,
                                cache_tag=sys.implementation.cache_tag)
This PEP proposes to change the expression when an optimization level is specified to:
'{name}.{cache_tag}.opt-{optimization}.pyc'.format(
    name=module_name,
    cache_tag=sys.implementation.cache_tag,
    optimization=str(sys.flags.optimize))
The "opt-" prefix was chosen to provide a visual separator from the cache tag, and the optimization level was placed after the cache tag to preserve the lexicographic sort order of bytecode file names based on module name and cache tag, which does not vary for a single interpreter. The "opt-" prefix was chosen over "o" so as to be somewhat self-documenting, and over "O" to avoid any confusion with "0" as the leading character of the optimization level.
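The sort-order property can be checked directly (the file names below are illustrative):

```python
# Bytecode file names for two modules at the three optimization levels.
names = [
    'foo.cpython-35.pyc',
    'foo.cpython-35.opt-1.pyc',
    'foo.cpython-35.opt-2.pyc',
    'bar.cpython-35.pyc',
]
# Lexicographic sorting still groups files by module name first.
assert sorted(names)[0].startswith('bar.')
assert all(n.startswith('foo.') for n in sorted(names)[1:])
```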
A period was chosen over a hyphen as a separator so as to distinguish clearly that the optimization level is not part of the interpreter version as specified by the cache tag. It also lends to the use of the period in the file name to delineate semantically different concepts.
For example, if -OO had been passed to the interpreter then instead of importlib.cpython-35.pyo the file name would be importlib.cpython-35.opt-2.pyc.
Leaving out the new opt- tag when no optimization level is applied should increase backwards compatibility. It is also more accommodating of Python implementations which have no use for optimization levels (e.g., PyPy [10]).
It should be noted that this change in no way affects the performance of import. Since the import system looks for a single bytecode file based on the optimization level of the interpreter already and generates a new bytecode file if it doesn't exist, the introduction of potentially more bytecode files in the __pycache__ directory has no effect in terms of stat calls. The interpreter will continue to look for only a single bytecode file based on the optimization level and thus no increase in stat calls will occur.
The only potentially negative result of this PEP is the probable increase in the number of .pyc files and thus increase in storage use. But for platforms where this is an issue, sys.dont_write_bytecode exists to turn off bytecode generation so that it can be controlled offline.
Implementation
An implementation of this PEP is available [11].
importlib
As importlib.util.cache_from_source() is the API that exposes bytecode file paths as well as being directly used by importlib, it requires the most critical change. As of Python 3.4, the function's signature is:
importlib.util.cache_from_source(path, debug_override=None)
This PEP proposes changing the signature in Python 3.5 to:
importlib.util.cache_from_source(path, debug_override=None, *, optimization=None)
The introduced optimization keyword-only parameter will control what optimization level is specified in the file name. If the argument is None then the current optimization level of the interpreter will be assumed (including no optimization). Any argument given for optimization will be passed to str() and must have str.isalnum() be true, else ValueError will be raised (this prevents invalid characters being used in the file name). If the empty string is passed in for optimization then the addition of the optimization will be suppressed, reverting to the file name format which predates this PEP.
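A minimal sketch of this parameter handling, following the rules just described (illustrative only, not CPython's actual implementation):

```python
import sys

def optimization_tag(optimization=None):
    """Return the file-name fragment contributed by the optimization level."""
    if optimization is None:
        # Assume the interpreter's current level; level 0 means no tag.
        optimization = sys.flags.optimize
        if optimization == 0:
            optimization = ''
    optimization = str(optimization)
    if optimization == '':
        return ''  # suppressed: pre-PEP file name format
    if not optimization.isalnum():
        raise ValueError('optimization tag must be alphanumeric, '
                         'got {!r}'.format(optimization))
    return '.opt-' + optimization

assert optimization_tag('') == ''
assert optimization_tag(2) == '.opt-2'
```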
It is expected that beyond Python's own two optimization levels, third-party code will use a hash of optimization names to specify the optimization level, e.g. hashlib.sha256(','.join(['no dead code', 'const folding']).encode('utf-8')).hexdigest(). While this might lead to long file names, it is assumed that most users never look at the contents of the __pycache__ directory and so this won't be an issue.
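Since hashlib.sha256() requires bytes, the joined string must be encoded first; a runnable version of this idea, with illustrative optimization names:

```python
import hashlib

# Hypothetical third-party optimization names, joined into a stable tag.
optimizations = ['no dead code', 'const folding']
tag = hashlib.sha256(','.join(optimizations).encode('utf-8')).hexdigest()

assert tag.isalnum()   # satisfies the str.isalnum() requirement above
assert len(tag) == 64  # sha256 hex digests are 64 characters long
```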
The debug_override parameter will be deprecated. A False value will be equivalent to optimization=1 while a True value will represent optimization='' (a None argument will continue to mean the same as for optimization). A deprecation warning will be raised when debug_override is given a value other than None, but there are no plans for the complete removal of the parameter at this time (but removal will be no later than Python 4).
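The documented equivalence could be sketched as follows (the function name resolve_optimization is hypothetical, for illustration only):

```python
import warnings

def resolve_optimization(debug_override=None, optimization=None):
    """Map the deprecated debug_override flag onto the new parameter (sketch)."""
    if debug_override is not None:
        warnings.warn('debug_override is deprecated; use optimization',
                      DeprecationWarning)
        # False meant "optimized" (level 1); True meant "debug" (no tag).
        optimization = '' if debug_override else 1
    return optimization

assert resolve_optimization(debug_override=True) == ''
assert resolve_optimization(debug_override=False) == 1
assert resolve_optimization() is None
```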
The various module attributes for importlib.machinery which relate to bytecode file suffixes will be updated [7]. The DEBUG_BYTECODE_SUFFIXES and OPTIMIZED_BYTECODE_SUFFIXES will both be documented as deprecated and set to the same value as BYTECODE_SUFFIXES (removal of DEBUG_BYTECODE_SUFFIXES and OPTIMIZED_BYTECODE_SUFFIXES is not currently planned, but will be not later than Python 4).
The various finders and loaders will also be updated as necessary, but updating the previously mentioned parts of importlib should be all that is required.
Rest of the standard library
The various functions exposed by the py_compile and compileall modules will be updated as necessary to make sure they follow the new bytecode file name semantics [6], [1]. The CLI for the compileall module will not be directly affected (the -b flag will become implicit, as compileall will no longer generate .pyo files when -O is specified).
Compatibility Considerations
Any code directly manipulating bytecode files from Python 3.2 on will need to consider the impact of this change (prior to Python 3.2 -- including all of Python 2 -- there was no __pycache__, which already necessitates bifurcating bytecode file handling support). Code that sets the debug_override argument to importlib.util.cache_from_source() will need care if it wants the path to a bytecode file with an optimization level of 2. Otherwise only code not using importlib.util.cache_from_source() will need updating.
As for people who distribute bytecode-only modules (i.e., use a bytecode file instead of a source file), they will have to choose which optimization level they want their bytecode files generated at, since distributing a .pyo file alongside a .pyc file will no longer be of any use. Since people typically only distribute bytecode files for code obfuscation purposes or for smaller distribution size, only having to distribute a single .pyc file should actually be beneficial to these use-cases. And since the magic number for bytecode files changed in Python 3.5 to support PEP 465, there is no need to support pre-existing .pyo files [8].
Rejected Ideas
Completely dropping optimization levels from CPython
Some have suggested that instead of accommodating the various optimization levels in CPython, we should instead drop them entirely. The argument is that significant performance gains would occur from runtime optimizations through something like a JIT and not through pre-execution bytecode optimizations.
This idea is rejected for this PEP as that ignores the fact that there are people who do find the pre-existing optimization levels for CPython useful. It also assumes that no other Python interpreter would find what this PEP proposes useful.
Alternative formatting of the optimization level in the file name
Using the "opt-" prefix and placing the optimization level between the cache tag and file extension is not critical. All options which have been considered are:
- importlib.cpython-35.opt-1.pyc
- importlib.cpython-35.opt1.pyc
- importlib.cpython-35.o1.pyc
- importlib.cpython-35.O1.pyc
- importlib.cpython-35.1.pyc
- importlib.cpython-35-O1.pyc
- importlib.O1.cpython-35.pyc
- importlib.o1.cpython-35.pyc
- importlib.1.cpython-35.pyc
These were rejected because they would change the sort order of bytecode files, introduce possible ambiguity with the cache tag, or were not self-documenting enough. An informal poll was taken and people clearly preferred the formatting proposed by the PEP [9]. Since this topic is non-technical and a matter of personal preference, the issue is considered resolved.
Embedding the optimization level in the bytecode metadata
Some have suggested that rather than embedding the optimization level of bytecode in the file name that it be included in the file's metadata instead. This would mean every interpreter had a single copy of bytecode at any time. Changing the optimization level would thus require rewriting the bytecode, but there would also only be a single file to care about.
This has been rejected due to the fact that Python is often installed as a root-level application, and thus modifying the bytecode files for modules in the standard library is not always possible. In this situation, integrators would need to guess what a reasonable optimization level was for users in any/all situations. By allowing multiple optimization levels to co-exist simultaneously, it frees integrators from having to guess what users want and allows users to utilize the optimization level they want.
References
| [1] | The compileall module (https://docs.python.org/3/library/compileall.html#module-compileall) |
| [2] | The astoptimizer project (https://pypi.python.org/pypi/astoptimizer) |
| [3] | importlib.util.cache_from_source() (https://docs.python.org/3.5/library/importlib.html#importlib.util.cache_from_source) |
| [4] | Implementation of importlib.util.cache_from_source() from CPython 3.4.3rc1 (https://hg.python.org/cpython/file/038297948389/Lib/importlib/_bootstrap.py#l437) |
| [5] | PEP 3147, PYC Repository Directories, Warsaw (http://www.python.org/dev/peps/pep-3147) |
| [6] | The py_compile module (https://docs.python.org/3/library/py_compile.html#module-py_compile) |
| [7] | The importlib.machinery module (https://docs.python.org/3/library/importlib.html#module-importlib.machinery) |
| [8] | importlib.util.MAGIC_NUMBER (https://docs.python.org/3/library/importlib.html#importlib.util.MAGIC_NUMBER) |
| [9] | Informal poll of file name format options on Google+ (https://plus.google.com/u/0/+BrettCannon/posts/fZynLNwHWGm) |
| [10] | The PyPy Project (http://pypy.org/) |
| [11] | Implementation of PEP 488 (http://bugs.python.org/issue23731) |
Copyright
This document has been placed in the public domain.
pep-0489 Multi-phase extension module initialization
| PEP: | 489 |
|---|---|
| Title: | Multi-phase extension module initialization |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Petr Viktorin <encukou at gmail.com>, Stefan Behnel <stefan_ml at behnel.de>, Nick Coghlan <ncoghlan at gmail.com> |
| BDFL-Delegate: | Eric Snow <ericsnowcurrently@gmail.com> |
| Discussions-To: | import-sig at python.org |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 11-Aug-2013 |
| Python-Version: | 3.5 |
| Post-History: | 23-Aug-2013, 20-Feb-2015, 16-Apr-2015, 7-May-2015, 18-May-2015 |
| Resolution: | https://mail.python.org/pipermail/python-dev/2015-May/140108.html |
Contents
- Abstract
- Motivation
- The current process
- The proposal
- Pseudo-code Overview
- Module Creation Phase
- Module Execution Phase
- Legacy Init
- Built-In modules
- Subinterpreters and Interpreter Reloading
- Functions incompatible with multi-phase initialization
- Module state and C-level callbacks
- New Functions
- Export Hook Name
- Module Reloading
- Multiple modules in one library
- Testing and initial implementations
- Summary of API Changes and Additions
- Possible Future Extensions
- Implementation
- Previous Approaches
- References
- Copyright
Abstract
This PEP proposes a redesign of the way in which built-in and extension modules interact with the import machinery. This was last revised for Python 3.0 in PEP 3121, but did not solve all problems at the time. The goal is to solve import-related problems by bringing extension modules closer to the way Python modules behave; specifically to hook into the ModuleSpec-based loading mechanism introduced in PEP 451.
This proposal draws inspiration from PyType_Spec of PEP 384 to allow extension authors to only define features they need, and to allow future additions to extension module declarations.
Extension modules are created in a two-step process, fitting better into the ModuleSpec architecture, with parallels to __new__ and __init__ of classes.
Extension modules can safely store arbitrary C-level per-module state in the module that is covered by normal garbage collection and supports reloading and sub-interpreters. Extension authors are encouraged to take these issues into account when using the new API.
The proposal also allows extension modules with non-ASCII names.
Not all problems tackled in PEP 3121 are solved in this proposal. In particular, problems with run-time module lookup (PyState_FindModule) are left to a future PEP.
Motivation
Python modules and extension modules are not set up in the same way. For Python modules, the module object is created and set up first, then the module code is executed (PEP 302). A ModuleSpec object (PEP 451) is used to hold information about the module, and passed to the relevant hooks.
For extensions (i.e. shared libraries) and built-in modules, the module init function is executed straight away and does both the creation and initialization. The initialization function is not passed the ModuleSpec, or any information it contains, such as the __file__ or fully-qualified name. This hinders relative imports and resource loading.
In Python 3, the module is also not added to sys.modules until after its init function completes, which means that a (potentially transitive) re-import of the module will really try to re-import it and thus run into an infinite loop when it executes the module init function again. Without access to the fully-qualified module name, it is not trivial to correctly add the module to sys.modules either. This is specifically a problem for Cython-generated modules, for which it's not uncommon that the module init code has the same level of complexity as that of any 'regular' Python module. Also, the lack of __file__ and __name__ information hinders the compilation of "__init__.py" modules, i.e. packages, especially when relative imports are being used at module init time.
Furthermore, the majority of currently existing extension modules have problems with sub-interpreter support and/or interpreter reloading, and, while it is possible with the current infrastructure to support these features, it is neither easy nor efficient. Addressing these issues was the goal of PEP 3121, but many extensions, including some in the standard library, took the least-effort approach to porting to Python 3, leaving these issues unresolved. This PEP keeps backwards compatibility, which should reduce pressure and give extension authors adequate time to consider these issues when porting.
The current process
Currently, extension and built-in modules export an initialization function named "PyInit_modulename", named after the file name of the shared library. This function is executed by the import machinery and must return a fully initialized module object. The function receives no arguments, so it has no way of knowing about its import context.
During its execution, the module init function creates a module object based on a PyModuleDef object. It then continues to initialize it by adding attributes to the module dict, creating types, etc.
Behind the scenes, the shared library loader keeps a note of the fully qualified module name of the last module that it loaded, and when a module gets created that has a matching name, this global variable is used to determine the fully qualified name of the module object. This is not entirely safe, as it relies on the module init function creating its own module object first, but this assumption usually holds in practice.
The proposal
The initialization function (PyInit_modulename) will be allowed to return a pointer to a PyModuleDef object. The import machinery will be in charge of constructing the module object, calling hooks provided in the PyModuleDef in the relevant phases of initialization (as described below).
This multi-phase initialization is an additional possibility. Single-phase initialization, the current practice of returning a fully initialized module object, will still be accepted, so existing code will work unchanged, including binary compatibility.
The PyModuleDef structure will be changed to contain a list of slots, similarly to PEP 384's PyType_Spec for types. To keep binary compatibility, and avoid needing to introduce a new structure (which would introduce additional supporting functions and per-module storage), the currently unused m_reload pointer of PyModuleDef will be changed to hold the slots. The structures are defined as:
typedef struct {
    int slot;
    void *value;
} PyModuleDef_Slot;
typedef struct PyModuleDef {
    PyModuleDef_Base m_base;
    const char* m_name;
    const char* m_doc;
    Py_ssize_t m_size;
    PyMethodDef *m_methods;
    PyModuleDef_Slot *m_slots;  /* changed from `inquiry m_reload;` */
    traverseproc m_traverse;
    inquiry m_clear;
    freefunc m_free;
} PyModuleDef;
The m_slots member must be either NULL, or point to an array of PyModuleDef_Slot structures, terminated by a slot with id set to 0 (i.e. {0, NULL}).
To specify a slot, a unique slot ID must be provided. New Python versions may introduce new slot IDs, but slot IDs will never be recycled. Slots may get deprecated, but will continue to be supported throughout Python 3.x.
A slot's value pointer may not be NULL, unless specified otherwise in the slot's documentation.
The following slots are currently available, and described later:
- Py_mod_create
- Py_mod_exec
Unknown slot IDs will cause the import to fail with SystemError.
When using multi-phase initialization, the m_name field of PyModuleDef will not be used during importing; the module name will be taken from the ModuleSpec.
Before it is returned from PyInit_*, the PyModuleDef object must be initialized using the newly added PyModuleDef_Init function. This sets the object type (which cannot be done statically on certain compilers), refcount, and internal bookkeeping data (m_index). For example, an extension module "example" would be exported as:
static PyModuleDef example_def = {...};

PyMODINIT_FUNC
PyInit_example(void)
{
    return PyModuleDef_Init(&example_def);
}
The PyModuleDef object must be available for the lifetime of the module created from it – usually, it will be declared statically.
Pseudo-code Overview
Here is an overview of how the modified importers will operate. Details such as logging or handling of errors and invalid states are left out, and C code is presented with a concise Python-like syntax.
The framework that calls the importers is explained in PEP 451 [8].
importlib/_bootstrap.py:
class BuiltinImporter:
    def create_module(self, spec):
        module = _imp.create_builtin(spec)

    def exec_module(self, module):
        _imp.exec_dynamic(module)

    def load_module(self, name):
        # use a backwards compatibility shim
        _load_module_shim(self, name)
importlib/_bootstrap_external.py:
class ExtensionFileLoader:
    def create_module(self, spec):
        module = _imp.create_dynamic(spec)

    def exec_module(self, module):
        _imp.exec_dynamic(module)

    def load_module(self, name):
        # use a backwards compatibility shim
        _load_module_shim(self, name)
Python/import.c (the _imp module):
def create_dynamic(spec):
    name = spec.name
    path = spec.origin

    # Find an already loaded module that used single-phase init.
    # For multi-phase initialization, mod is NULL, so a new module
    # is always created.
    mod = _PyImport_FindExtensionObject(name, name)
    if mod:
        return mod

    return _PyImport_LoadDynamicModuleWithSpec(spec)

def exec_dynamic(module):
    if not isinstance(module, types.ModuleType):
        # non-modules are skipped -- PyModule_GetDef fails on them
        return
    def = PyModule_GetDef(module)
    state = PyModule_GetState(module)
    if state is NULL:
        PyModule_ExecDef(module, def)
def create_builtin(spec):
    name = spec.name

    # Find an already loaded module that used single-phase init.
    # For multi-phase initialization, mod is NULL, so a new module
    # is always created.
    mod = _PyImport_FindExtensionObject(name, name)
    if mod:
        return mod

    for initname, initfunc in PyImport_Inittab:
        if name == initname:
            m = initfunc()
            if isinstance(m, PyModuleDef):
                def = m
                return PyModule_FromDefAndSpec(def, spec)
            else:
                # fall back to single-phase initialization
                module = m
                _PyImport_FixupExtensionObject(module, name, name)
                return module
Python/importdl.c:
def _PyImport_LoadDynamicModuleWithSpec(spec):
    path = spec.origin
    package, dot, name = spec.name.rpartition('.')

    # see the "Non-ASCII module names" section for export_hook_name
    hook_name = export_hook_name(name)

    # call platform-specific function for loading exported function
    # from shared library
    exportfunc = _find_shared_funcptr(hook_name, path)

    m = exportfunc()
    if isinstance(m, PyModuleDef):
        def = m
        return PyModule_FromDefAndSpec(def, spec)

    module = m
    # fall back to single-phase initialization
    ....
Objects/moduleobject.c:
def PyModule_FromDefAndSpec(def, spec):
    name = spec.name
    create = None
    for slot, value in def.m_slots:
        if slot == Py_mod_create:
            create = value
    if create:
        m = create(spec, def)
    else:
        m = PyModule_New(name)

    if isinstance(m, types.ModuleType):
        m.md_state = None
        m.md_def = def

    if def.m_methods:
        PyModule_AddFunctions(m, def.m_methods)
    if def.m_doc:
        PyModule_SetDocString(m, def.m_doc)

def PyModule_ExecDef(module, def):
    if isinstance(module, types.ModuleType):
        if module.md_state is NULL:
            # allocate a block of zeroed-out memory
            module.md_state = _alloc(module.md_size)

    if def.m_slots is NULL:
        return

    for slot, value in def.m_slots:
        if slot == Py_mod_exec:
            value(module)
Module Creation Phase
Creation of the module object – that is, the implementation of ExecutionLoader.create_module – is governed by the Py_mod_create slot.
The Py_mod_create slot
The Py_mod_create slot is used to support custom module subclasses. The value pointer must point to a function with the following signature:
PyObject* (*PyModuleCreateFunction)(PyObject *spec, PyModuleDef *def)
The function receives a ModuleSpec instance, as defined in PEP 451, and the PyModuleDef structure. It should return a new module object, or set an error and return NULL.
This function is not responsible for setting import-related attributes specified in PEP 451 [1] (such as __name__ or __loader__) on the new module.
There is no requirement for the returned object to be an instance of types.ModuleType. Any type can be used, as long as it supports setting and getting attributes, including at least the import-related attributes. However, only ModuleType instances support module-specific functionality such as per-module state and processing of execution slots. If something other than a ModuleType subclass is returned, no execution slots may be defined; if any are, a SystemError is raised.
Note that when this function is called, the module's entry in sys.modules is not populated yet. Attempting to import the same module again (possibly transitively) may lead to an infinite loop. Extension authors are advised to keep Py_mod_create minimal, and in particular to not call user code from it.
Multiple Py_mod_create slots may not be specified. If they are, import will fail with SystemError.
If Py_mod_create is not specified, the import machinery will create a normal module object using PyModule_New. The name is taken from spec.
Post-creation steps
If the Py_mod_create function returns an instance of types.ModuleType (or a subclass thereof), or if a Py_mod_create slot is not present, the import machinery will associate the PyModuleDef with the module. This also makes the PyModuleDef accessible to the execution phase, to the PyModule_GetDef function, and to garbage collection routines (traverse, clear, free).
If the Py_mod_create function does not return a module subclass, then m_size must be 0, and m_traverse, m_clear and m_free must all be NULL. Otherwise, SystemError is raised.
Additionally, initial attributes specified in the PyModuleDef are set on the module object, regardless of its type:
- The docstring is set from m_doc, if non-NULL.
- The module's functions are initialized from m_methods, if any.
Module Execution Phase
Module execution -- that is, the implementation of ExecutionLoader.exec_module -- is governed by "execution slots". This PEP only adds one, Py_mod_exec, but others may be added in the future.
The execution phase is done on the PyModuleDef associated with the module object. For objects that are not instances of PyModule_Type (for which PyModule_GetDef would fail), the execution phase is skipped.
Execution slots may be specified multiple times, and are processed in the order they appear in the slots array. When using the default import machinery, they are processed after import-related attributes specified in PEP 451 [1] (such as __name__ or __loader__) are set and the module is added to sys.modules.
Pre-Execution steps
Before processing the execution slots, per-module state is allocated for the module. From this point on, per-module state is accessible through PyModule_GetState.
The Py_mod_exec slot
The entry in this slot must point to a function with the following signature:
int (*PyModuleExecFunction)(PyObject* module)
It will be called to initialize a module. Usually, this amounts to setting the module's initial attributes. The "module" argument receives the module object to initialize.
The function must return 0 on success, or, on error, set an exception and return -1.
If the Py_mod_exec function replaces the module's entry in sys.modules, the new object will be used and returned by the importlib machinery after all execution slots are processed. This is a feature of the import machinery itself. The slots themselves are all processed using the module returned from the creation phase; sys.modules is not consulted during the execution phase. (Note that for extension modules, implementing Py_mod_create is usually a better solution for using custom module objects.)
Legacy Init
The backwards-compatible single-phase initialization continues to be supported. In this scheme, the PyInit function returns a fully initialized module rather than a PyModuleDef object. In this case, the PyInit hook implements the creation phase, and the execution phase is a no-op.
Modules that need to work unchanged on older versions of Python should stick to single-phase initialization, because the benefits it brings can't be back-ported. Here is an example of a module that supports multi-phase initialization, and falls back to single-phase when compiled for an older version of CPython. It is included mainly as an illustration of the changes needed to enable multi-phase init:
#include <Python.h>

static int spam_exec(PyObject *module) {
    PyModule_AddStringConstant(module, "food", "spam");
    return 0;
}

#ifdef Py_mod_exec
static PyModuleDef_Slot spam_slots[] = {
    {Py_mod_exec, spam_exec},
    {0, NULL}
};
#endif

static PyModuleDef spam_def = {
    PyModuleDef_HEAD_INIT,                    /* m_base */
    "spam",                                   /* m_name */
    PyDoc_STR("Utilities for cooking spam"),  /* m_doc */
    0,                                        /* m_size */
    NULL,                                     /* m_methods */
#ifdef Py_mod_exec
    spam_slots,                               /* m_slots */
#else
    NULL,
#endif
    NULL,                                     /* m_traverse */
    NULL,                                     /* m_clear */
    NULL,                                     /* m_free */
};

PyMODINIT_FUNC
PyInit_spam(void) {
#ifdef Py_mod_exec
    return PyModuleDef_Init(&spam_def);
#else
    PyObject *module;
    module = PyModule_Create(&spam_def);
    if (module == NULL) return NULL;
    if (spam_exec(module) != 0) {
        Py_DECREF(module);
        return NULL;
    }
    return module;
#endif
}
Built-In modules
Any extension module can be used as a built-in module by linking it into the executable, and including it in the inittab (either at runtime with PyImport_AppendInittab, or at configuration time, using tools like freeze).
To keep this possibility, all changes to extension module loading introduced in this PEP will also apply to built-in modules. The only exception is non-ASCII module names, explained below.
Subinterpreters and Interpreter Reloading
Extensions using the new initialization scheme are expected to support subinterpreters and multiple Py_Initialize/Py_Finalize cycles correctly, avoiding the issues mentioned in Python documentation [9]. The mechanism is designed to make this easy, but care is still required on the part of the extension author. No user-defined functions, methods, or instances may leak to different interpreters. To achieve this, all module-level state should be kept in either the module dict, or in the module object's storage reachable by PyModule_GetState. A simple rule of thumb is: Do not define any static data, except built-in types with no mutable or user-settable class attributes.
Functions incompatible with multi-phase initialization
The PyModule_Create function will fail when used on a PyModuleDef structure with a non-NULL m_slots pointer. The function doesn't have access to the ModuleSpec object necessary for multi-phase initialization.
The PyState_FindModule function will return NULL, and PyState_AddModule and PyState_RemoveModule will also fail on modules with non-NULL m_slots. PyState registration is disabled because multiple module objects may be created from the same PyModuleDef.
Module state and C-level callbacks
Due to the unavailability of PyState_FindModule, any function that needs access to module-level state (including functions, classes or exceptions defined at the module level) must receive a reference to the module object (or the particular object it needs), either directly or indirectly. This is currently difficult in two situations:
- Methods of classes, which receive a reference to the class, but not to the class's module
- Libraries with C-level callbacks, unless the callbacks can receive custom data set at callback registration
Fixing these cases is outside of the scope of this PEP, but will be needed for the new mechanism to be useful to all modules. Proper fixes have been discussed on the import-sig mailing list [7].
As a rule of thumb, modules that rely on PyState_FindModule are, at the moment, not good candidates for porting to the new mechanism.
New Functions
A new function and macro implementing the module creation phase will be added. These are similar to PyModule_Create and PyModule_Create2, except they take an additional ModuleSpec argument, and handle module definitions with non-NULL slots:
PyObject * PyModule_FromDefAndSpec(PyModuleDef *def, PyObject *spec)
PyObject * PyModule_FromDefAndSpec2(PyModuleDef *def, PyObject *spec,
                                    int module_api_version)
A new function implementing the module execution phase will be added. This allocates per-module state (if not allocated already), and always processes execution slots. The import machinery calls this method when a module is executed, unless the module is being reloaded:
PyAPI_FUNC(int) PyModule_ExecDef(PyObject *module, PyModuleDef *def)
Another function will be introduced to initialize a PyModuleDef object. This idempotent function fills in the type, refcount, and module index. It returns its argument cast to PyObject*, so it can be returned directly from a PyInit function:
PyObject * PyModuleDef_Init(PyModuleDef *);
Additionally, two helpers will be added for setting the docstring and methods on a module:
int PyModule_SetDocString(PyObject *, const char *)
int PyModule_AddFunctions(PyObject *, PyMethodDef *)
Export Hook Name
As portable C identifiers are limited to ASCII, module names must be encoded to form the PyInit hook name.
For ASCII module names, the import hook is named PyInit_<modulename>, where <modulename> is the name of the module.
For module names containing non-ASCII characters, the import hook is named PyInitU_<encodedname>, where the name is encoded using CPython's "punycode" encoding (Punycode [4] with a lowercase suffix), with hyphens ("-") replaced by underscores ("_").
In Python:
def export_hook_name(name):
    try:
        suffix = b'_' + name.encode('ascii')
    except UnicodeEncodeError:
        suffix = b'U_' + name.encode('punycode').replace(b'-', b'_')
    return b'PyInit' + suffix
Examples:
| Module name | Init hook name |
|---|---|
| spam | PyInit_spam |
| lančmít | PyInitU_lanmt_2sa6t |
| スパム | PyInitU_zck5b2b |
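The table can be checked by running the export_hook_name function from above as a standalone script (reproduced here verbatim):

```python
def export_hook_name(name):
    """Compute the PyInit/PyInitU export hook name, per the rule above."""
    try:
        suffix = b'_' + name.encode('ascii')
    except UnicodeEncodeError:
        suffix = b'U_' + name.encode('punycode').replace(b'-', b'_')
    return b'PyInit' + suffix

# The three examples from the table:
assert export_hook_name('spam') == b'PyInit_spam'
assert export_hook_name('lančmít') == b'PyInitU_lanmt_2sa6t'
assert export_hook_name('スパム') == b'PyInitU_zck5b2b'
```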
For modules with non-ASCII names, single-phase initialization is not supported.
In the initial implementation of this PEP, built-in modules with non-ASCII names will not be supported.
Module Reloading
Reloading an extension module using importlib.reload() will continue to have no effect, except re-setting import-related attributes.
Due to limitations in shared library loading (both dlopen on POSIX and LoadLibraryEx on Windows), it is not generally possible to load a modified library after it has changed on disk.
Use cases for reloading other than trying out a new version of the module are too rare to require all module authors to keep reloading in mind. If reload-like functionality is needed, authors can export a dedicated function for it.
Multiple modules in one library
To support multiple Python modules in one shared library, the library can export additional PyInit* symbols besides the one that corresponds to the library's filename.
Note that this mechanism can currently only be used to load extra modules, but not to find them. (This is a limitation of the loader mechanism, which this PEP does not try to modify.) To work around the lack of a suitable finder, code like the following can be used:
import importlib.machinery
import importlib.util

loader = importlib.machinery.ExtensionFileLoader(name, path)
spec = importlib.util.spec_from_loader(name, loader)
module = importlib.util.module_from_spec(spec)
loader.exec_module(module)
return module
On platforms that support symbolic links, these may be used to install one library under multiple names, exposing all exported modules to normal import machinery.
Testing and initial implementations
For testing, a new built-in module _testmultiphase will be created. The library will export several additional modules using the mechanism described in "Multiple modules in one library".
The _testcapi module will be unchanged, and will use single-phase initialization indefinitely (or until it is no longer supported).
The array and xx* modules will be converted to use multi-phase initialization as part of the initial implementation.
Summary of API Changes and Additions
New functions:
- PyModule_FromDefAndSpec (macro)
- PyModule_FromDefAndSpec2
- PyModule_ExecDef
- PyModule_SetDocString
- PyModule_AddFunctions
- PyModuleDef_Init
New macros:
- Py_mod_create
- Py_mod_exec
New types:
- PyModuleDef_Type will be exposed
New structures:
- PyModuleDef_Slot
Other changes:
- PyModuleDef.m_reload changes to PyModuleDef.m_slots.
- BuiltinImporter and ExtensionFileLoader will now implement create_module and exec_module.
- The internal _imp module will have backwards incompatible changes: create_builtin, create_dynamic, and exec_dynamic will be added; init_builtin and load_dynamic will be removed.
- The undocumented functions imp.load_dynamic and imp.init_builtin will be replaced by backwards-compatible shims.
Backwards Compatibility
Existing modules will continue to be source- and binary-compatible with new versions of Python. Modules that use multi-phase initialization will not be compatible with versions of Python that do not implement this PEP.
The functions init_builtin and load_dynamic will be removed from the _imp module (but not from the imp module).
All changed loaders (BuiltinImporter and ExtensionFileLoader) will remain backwards-compatible; the load_module method will be replaced by a shim.
Internal functions of Python/import.c and Python/importdl.c will be removed. (Specifically, these are _PyImport_GetDynLoadFunc, _PyImport_GetDynLoadWindows, and _PyImport_LoadDynamicModule.)
Possible Future Extensions
The slots mechanism, inspired by PyType_Slot from PEP 384, allows later extensions.
Some extension modules export many constants; for example, _ssl has a long list of calls of the form:
PyModule_AddIntConstant(m, "SSL_ERROR_ZERO_RETURN",
PY_SSL_ERROR_ZERO_RETURN);
Converting this to a declarative list, similar to PyMethodDef, would reduce boilerplate, and provide free error-checking which is often missing.
String constants and types can be handled similarly. (Note that non-default bases for types cannot be portably specified statically; this case would need a Py_mod_exec function that runs before the slots are added. The free error-checking would still be beneficial, though.)
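In C the declarative list would parallel a PyMethodDef-style array; the idea can be sketched at the Python level (the table contents and the add_int_constants helper are illustrative, not proposed API):

```python
import types

# Hypothetical declarative table, analogous to a PyMethodDef-style array.
INT_CONSTANTS = [
    ("SSL_ERROR_ZERO_RETURN", 6),
    ("SSL_ERROR_WANT_READ", 2),
]

def add_int_constants(module, table):
    """Add each (name, value) pair to the module, failing loudly on bad entries."""
    for name, value in table:
        if not isinstance(value, int):
            raise TypeError("constant %s is not an int" % name)
        setattr(module, name, value)

m = types.ModuleType("_example")
add_int_constants(m, INT_CONSTANTS)
print(m.SSL_ERROR_ZERO_RETURN)  # 6
```

A single loop over a table replaces many near-identical calls, and the error check runs for every entry rather than being copied (or forgotten) each time.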
Another possibility is providing a "main" function that would be run when the module is given to Python's -m switch. For this to work, the runpy module will need to be modified to take advantage of ModuleSpec-based loading introduced in PEP 451. Also, it will be necessary to add a mechanism for setting up a module according to slots it wasn't originally defined with.
Implementation
A work-in-progress implementation is available in a GitHub repository [5]; a patchset is at [6].
Previous Approaches
Stefan Behnel's initial proto-PEP [2] had a "PyInit_modulename" hook that would create a module class, whose __init__ would then be called to create the module. This proposal did not correspond to the (then nonexistent) PEP 451, where module creation and initialization are broken into distinct steps. It also did not support loading an extension into pre-existing module objects.
Nick Coghlan proposed "Create" and "Exec" hooks, and wrote a prototype implementation [3]. At this time PEP 451 was still not implemented, so the prototype does not use ModuleSpec.
The original version of this PEP used Create and Exec hooks, and allowed loading into arbitrary pre-constructed objects with Exec hook. The proposal made extension module initialization closer to how Python modules are initialized, but it was later recognized that this isn't an important goal. The current PEP describes a simpler solution.
A further iteration used a "PyModuleExport" hook as an alternative to PyInit, where PyInit was used for the existing scheme and PyModuleExport for multi-phase initialization. However, not being able to determine the hook name based on the module name complicated automatic generation of PyImport_Inittab by tools like freeze. Keeping only the PyInit hook name, even if it's not entirely appropriate for exporting a definition, yielded a much simpler solution.
References
| [1] | (1, 2) https://www.python.org/dev/peps/pep-0451/#attributes |
| [2] | https://mail.python.org/pipermail/python-dev/2013-August/128087.html |
| [3] | https://mail.python.org/pipermail/python-dev/2013-August/128101.html |
| [4] | http://tools.ietf.org/html/rfc3492 |
| [5] | https://github.com/encukou/cpython/commits/pep489 |
| [6] | https://github.com/encukou/cpython/compare/master...encukou:pep489.patch |
| [7] | https://mail.python.org/pipermail/import-sig/2015-April/000959.html |
| [8] | https://www.python.org/dev/peps/pep-0451/#how-loading-will-work |
| [9] | https://docs.python.org/3/c-api/init.html#sub-interpreter-support |
Copyright
This document has been placed in the public domain.
pep-0490 Chain exceptions at C level
| PEP: | 490 |
|---|---|
| Title: | Chain exceptions at C level |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Victor Stinner <victor.stinner at gmail.com> |
| Status: | Draft |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 25-March-2015 |
| Python-Version: | 3.6 |
Contents
Abstract
Chain exceptions at C level, as already done at Python level.
Rationale
Python 3 introduced a new killer feature: exceptions are chained by default (PEP 3134).
Example:
try:
raise TypeError("err1")
except TypeError:
raise ValueError("err2")
Output:
Traceback (most recent call last):
File "test.py", line 2, in <module>
raise TypeError("err1")
TypeError: err1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "test.py", line 4, in <module>
raise ValueError("err2")
ValueError: err2
Exceptions are chained by default in Python code, but not in extensions written in C.
A new private _PyErr_ChainExceptions() function was introduced in Python 3.4.3 and 3.5 to chain exceptions. Currently, it must be called explicitly to chain exceptions and its usage is not trivial.
Example of _PyErr_ChainExceptions() usage from the zipimport module to chain the previous OSError to a new ZipImportError exception:
PyObject *exc, *val, *tb;
PyErr_Fetch(&exc, &val, &tb);
PyErr_Format(ZipImportError, "can't open Zip file: %R", archive);
_PyErr_ChainExceptions(exc, val, tb);
This PEP proposes to also chain exceptions automatically at C level to stay consistent and give more information on failures to help debugging. The previous example becomes simply:
PyErr_Format(ZipImportError, "can't open Zip file: %R", archive);
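The goal is to make C code behave like Python code, where the chaining is implicit and the original exception is recorded on the new exception's __context__ attribute:

```python
try:
    try:
        raise OSError("can't open Zip file")
    except OSError:
        # No explicit "from": the OSError is attached implicitly.
        raise ValueError("bad archive")
except ValueError as err:
    exc = err

# The original OSError is still reachable from the new exception.
print(type(exc.__context__).__name__)  # OSError
```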
Proposal
Modify PyErr_*() functions to chain exceptions
Modify C functions raising exceptions of the Python C API to automatically chain exceptions: modify PyErr_SetString(), PyErr_Format(), PyErr_SetNone(), etc.
Modify functions to not chain exceptions
Keeping the previous exception is not always interesting, especially when the new exception contains the information of the previous exception (or even more information) and the two exceptions have the same type.
Example of a useless exception chain with int(str):
TypeError: a bytes-like object is required, not 'type'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: int() argument must be a string, a bytes-like object or a number, not 'type'
The new TypeError exception contains more information than the previous exception. The previous exception should be hidden.
The PyErr_Clear() function can be called to clear the current exception before raising a new one, so that the current exception is not chained with the new exception.
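At the Python level, the analogous tool is raise ... from None (PEP 415), which suppresses the context the way PyErr_Clear() would in C:

```python
def parse(value):
    try:
        return int(value)
    except TypeError:
        # Suppress the uninteresting original TypeError, as PyErr_Clear()
        # would before raising the new exception at C level.
        raise TypeError("argument must be a string or a number") from None

try:
    parse(object)  # int(object) raises TypeError
except TypeError as err:
    caught = err

print(caught.__suppress_context__)  # True
```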
Modify functions to chain exceptions
Some functions save and then restore the current exception. If a new exception is raised in the meantime, it is currently displayed to sys.stderr or ignored, depending on the function. Some of these functions should be modified to chain exceptions instead.
Examples of functions ignoring the new exception(s):
- ptrace_enter_call(): ignore exception
- subprocess_fork_exec(): ignore exception raised by enable_gc()
- t_bootstrap() of the _thread module: ignore exception raised by trying to display the bootstrap function to sys.stderr
- PyDict_GetItem(), _PyDict_GetItem_KnownHash(): ignore exception raised by looking for a key in the dictionary
- _PyErr_TrySetFromCause(): ignore exception
- PyFrame_LocalsToFast(): ignore exception raised by dict_to_map()
- _PyObject_Dump(): ignore exception. _PyObject_Dump() is used to debug, to inspect a running process, it should not modify the Python state.
- Py_ReprLeave(): ignore exception "because there is no way to report them"
- type_dealloc(): ignore exception raised by remove_all_subclasses()
- PyObject_ClearWeakRefs(): ignore exception?
- call_exc_trace(), call_trace_protected(): ignore exception
- remove_importlib_frames(): ignore exception
- do_mktuple(), helper used by Py_BuildValue() for example: ignore exception?
- flush_io(): ignore exception
- sys_write(), sys_format(): ignore exception
- _PyTraceback_Add(): ignore exception
- PyTraceBack_Print(): ignore exception
Examples of functions displaying the new exception to sys.stderr:
- atexit_callfuncs(): display exceptions with PyErr_Display(); the function calls multiple callbacks and only returns the latest exception
- sock_dealloc(): log the ResourceWarning exception with PyErr_WriteUnraisable()
- slot_tp_del(): display exception with PyErr_WriteUnraisable()
- _PyGen_Finalize(): display gen_close() exception with PyErr_WriteUnraisable()
- slot_tp_finalize(): display exception raised by the __del__() method with PyErr_WriteUnraisable()
- PyErr_GivenExceptionMatches(): display exception raised by PyType_IsSubtype() with PyErr_WriteUnraisable()
Backward compatibility
A side effect of chaining exceptions is that exceptions store traceback objects, which store frame objects, which store local variables. Local variables are thus kept alive by exceptions. A common issue is a reference cycle between local variables and exceptions: an exception is stored in a local variable, and the frame is indirectly stored in the exception. The cycle only impacts applications storing exceptions.
The reference cycle can now be fixed with the new traceback.TracebackException object introduced in Python 3.5. It stores the information required to format a full textual traceback without storing local variables.
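For example, traceback.TracebackException captures everything needed to render the traceback without keeping frames (and therefore local variables) alive:

```python
import traceback

try:
    raise ValueError("boom")
except ValueError as err:
    # Capture the traceback as plain data; no frame objects are retained,
    # so storing `te` does not create a reference cycle.
    te = traceback.TracebackException.from_exception(err)

text = "".join(te.format())
print("ValueError: boom" in text)  # True
```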
The asyncio module is impacted by the reference cycle issue. The module is also maintained outside the Python standard library to provide a version for Python 3.3. traceback.TracebackException may be backported into a private asyncio module to fix the reference cycle issues.
Alternatives
No change
The new private _PyErr_ChainExceptions() function is enough to chain exceptions manually.
Exceptions will only be chained explicitly where it makes sense.
New helpers to chain exceptions
Functions like PyErr_SetString() don't chain exceptions automatically. To make the usage of _PyErr_ChainExceptions() easier, new private functions are added:
- _PyErr_SetStringChain(exc_type, message)
- _PyErr_FormatChain(exc_type, format, ...)
- _PyErr_SetNoneChain(exc_type)
- _PyErr_SetObjectChain(exc_type, exc_value)
Helper functions to raise specific exceptions like _PyErr_SetKeyError(key) or PyErr_SetImportError(message, name, path) don't chain exceptions. The generic _PyErr_ChainExceptions(exc_type, exc_value, exc_tb) should be used to chain exceptions with these helper functions.
Appendix
PEPs
- PEP 3134 -- Exception Chaining and Embedded Tracebacks (Python 3.0): new __context__ and __cause__ attributes for exceptions
- PEP 415 - Implement context suppression with exception attributes (Python 3.3): raise exc from None
- PEP 409 - Suppressing exception context (superseded by the PEP 415)
Python C API
The header file Include/pyerrors.h declares functions related to exceptions.
Functions raising exceptions:
- PyErr_SetNone(exc_type)
- PyErr_SetObject(exc_type, exc_value)
- PyErr_SetString(exc_type, message)
- PyErr_Format(exc, format, ...)
Helpers to raise specific exceptions:
- PyErr_BadArgument()
- PyErr_BadInternalCall()
- PyErr_NoMemory()
- PyErr_SetFromErrno(exc)
- PyErr_SetFromWindowsErr(err)
- PyErr_SetImportError(message, name, path)
- _PyErr_SetKeyError(key)
- _PyErr_TrySetFromCause(prefix_format, ...)
Manage the current exception:
- PyErr_Clear(): clear the current exception, like except: pass
- PyErr_Fetch(exc_type, exc_value, exc_tb)
- PyErr_Restore(exc_type, exc_value, exc_tb)
- PyErr_GetExcInfo(exc_type, exc_value, exc_tb)
- PyErr_SetExcInfo(exc_type, exc_value, exc_tb)
Other functions to handle exceptions:
- PyErr_ExceptionMatches(exc): check to implement except exc: ...
- PyErr_GivenExceptionMatches(exc1, exc2)
- PyErr_NormalizeException(exc_type, exc_value, exc_tb)
- _PyErr_ChainExceptions(exc_type, exc_value, exc_tb)
Python Issues
Chain exceptions:
- Issue #23763: Chain exceptions in C
- Issue #23696: zipimport: chain ImportError to OSError
- Issue #21715: Chaining exceptions at C level: added _PyErr_ChainExceptions()
- Issue #18488: sqlite: finalize() method of user function may be called with an exception set if a call to step() method failed
- Issue #23781: Add private _PyErr_ReplaceException() in 2.7
- Issue #23782: Leak in _PyTraceback_Add
Changes preventing exceptions from being lost:
Copyright
This document has been placed in the public domain.
pep-0491 The Wheel Binary Package Format 1.9
| PEP: | 491 |
|---|---|
| Title: | The Wheel Binary Package Format 1.9 |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Daniel Holth <dholth at gmail.com> |
| Discussions-To: | <distutils-sig at python.org> |
| Status: | Draft |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 16 April 2015 |
Contents
- Abstract
- Rationale
- Details
- FAQ
- Wheel defines a .data directory. Should I put all my data there?
- Why does wheel include attached signatures?
- Why does wheel allow JWS signatures?
- Why does wheel also allow S/MIME signatures?
- What's the deal with "purelib" vs. "platlib"?
- Is it possible to import Python code directly from a wheel file?
- References
- Appendix
- Copyright
Abstract
This PEP describes the second version of a built-package format for Python called "wheel". Wheel provides a Python-specific, relocatable package format that allows people to install software more quickly and predictably than re-building from source each time.
A wheel is a ZIP-format archive with a specially formatted file name and the .whl extension. It contains a single distribution nearly as it would be installed according to PEP 376 with a particular installation scheme. Simple wheels can be unpacked onto sys.path and used directly but wheels are usually installed with a specialized installer.
This version of the wheel specification adds support for installing distributions into many different directories, and adds a way to find those files after they have been installed.
Rationale
Wheel 1.0 is best at installing files into site-packages and a few other locations specified by distutils, but users would like to install files from a single distribution into many directories -- perhaps separate locations for docs, data, and code. Unfortunately not everyone agrees on where these install locations should be relative to the root directory. This version of the format adds many more categories, each of which can be installed to a different destination based on policy. Since it might also be important to locate the installed files at runtime, this version of the format also adds a way to record the installed paths in a way that can be read by the installed software.
Details
Installing a wheel 'distribution-1.0-py32-none-any.whl'
Wheel installation notionally consists of two phases:
- Unpack.
- Parse distribution-1.0.dist-info/WHEEL.
- Check that installer is compatible with Wheel-Version. Warn if minor version is greater, abort if major version is greater.
- If Root-Is-Purelib == 'true', unpack archive into purelib (site-packages).
- Else unpack archive into platlib (site-packages).
- Spread.
- Unpacked archive includes distribution-1.0.dist-info/ and (if there is data) distribution-1.0.data/.
- Move each subtree of distribution-1.0.data/ onto its destination path. Each subdirectory of distribution-1.0.data/ is a key into a dict of destination directories, such as distribution-1.0.data/(purelib|platlib|headers|scripts|data).
- Update scripts starting with #!python to point to the correct interpreter. (Note: Python scripts are usually handled by package metadata, and not included verbatim in wheel.)
- Update distribution-1.0.dist-info/RECORD with the installed paths.
- If empty, remove the distribution-1.0.data directory.
- Compile any installed .py to .pyc. (Uninstallers should be smart enough to remove .pyc even if it is not mentioned in RECORD.)
In practice, installers will usually extract files directly from the archive to their destinations without writing a temporary distribution-1.0.data/ directory.
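The spread step above can be sketched as moving each .data/ subtree onto a destination looked up by key (the spread function and the paths here are illustrative):

```python
import os
import shutil
import tempfile

def spread(data_dir, destinations):
    """Move each subdirectory of <dist>-<ver>.data/ onto its destination path."""
    for key in os.listdir(data_dir):
        src = os.path.join(data_dir, key)
        dest = destinations[key]  # e.g. 'scripts' -> a bin directory
        os.makedirs(dest, exist_ok=True)
        for name in os.listdir(src):
            shutil.move(os.path.join(src, name), os.path.join(dest, name))
    # The .data directory is now empty (of files) and can be removed.
    shutil.rmtree(data_dir)

# Build a toy unpacked wheel layout in a temporary directory.
root = tempfile.mkdtemp()
data = os.path.join(root, "distribution-1.0.data")
os.makedirs(os.path.join(data, "scripts"))
open(os.path.join(data, "scripts", "tool"), "w").close()

spread(data, {"scripts": os.path.join(root, "bin")})
print(os.path.exists(os.path.join(root, "bin", "tool")))  # True
```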
Recommended installer features
- Rewrite #!python.
In wheel, verbatim scripts are packaged in {distribution}-{version}.data/scripts/. If the first line of a file in scripts/ starts with exactly b'#!python', rewrite to point to the correct interpreter. Unix installers may need to add the +x bit to these files if the archive was created on Windows.
The b'#!pythonw' convention is allowed. b'#!pythonw' indicates a GUI script instead of a console script.
- Generate script wrappers.
- Python scripts are more commonly represented as a module:callable string in package metadata, and are not included verbatim in the wheel archive's scripts directory. This kind of script gives the installer an opportunity to generate platform specific wrappers.
Recommended archiver features
- Place .dist-info at the end of the archive.
- Archivers are encouraged to place the .dist-info files physically at the end of the archive. This enables some potentially interesting ZIP tricks including the ability to amend the metadata without rewriting the entire archive.
File Format
File name convention
The wheel filename is {distribution}-{version}(-{build tag})?-{python tag}-{abi tag}-{platform tag}.whl.
- distribution
- Distribution name, e.g. 'django', 'pyramid'.
- version
- Distribution version, e.g. 1.0.
- build tag
- Optional build number. Must start with a digit. A tie breaker if two wheels have the same version. Sort as the empty string if unspecified, else sort the initial digits as a number, and the remainder lexicographically.
- language implementation and version tag
- E.g. 'py27', 'py2', 'py3'.
- abi tag
- E.g. 'cp33m', 'abi3', 'none'.
- platform tag
- E.g. 'linux_x86_64', 'any'.
For example, distribution-1.0-1-py27-none-any.whl is the first build of a package called 'distribution', and is compatible with Python 2.7 (any Python 2.7 implementation), with no ABI (pure Python), on any CPU architecture.
The last three components of the filename before the extension are called "compatibility tags." The compatibility tags express the package's basic interpreter requirements and are detailed in PEP 425.
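The build tag ordering described above (empty first, then the initial digits as a number, then the remainder lexicographically) can be sketched as a sort key (build_tag_key is an illustrative name, not part of the specification):

```python
import re

def build_tag_key(tag):
    # An unspecified build tag sorts before any real tag.
    if not tag:
        return (0,)
    match = re.match(r"\d+", tag)  # a build tag must start with a digit
    return (1, int(match.group()), tag[match.end():])

print(sorted(["2a", "10", "", "1"], key=build_tag_key))  # ['', '1', '2a', '10']
```

Note that a plain string sort would put '10' before '2a'; treating the digit prefix as a number gives the intended tie-breaking order.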
Escaping and Unicode
Each component of the filename is escaped by replacing runs of non-alphanumeric characters with an underscore _:
re.sub(r"[^\w\d.]+", "_", distribution, flags=re.UNICODE)
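For example, each run of characters outside the permitted set collapses to a single underscore (the escape wrapper is illustrative; the substitution itself is the one above):

```python
import re

def escape(component):
    # Replace runs of non-alphanumeric characters with a single underscore.
    return re.sub(r"[^\w\d.]+", "_", component, flags=re.UNICODE)

print(escape("my-distribution"))  # my_distribution
print(escape("1.0+local"))        # 1.0_local
```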
The archive filename is Unicode. The packaging tools may only support ASCII package names, but Unicode filenames are supported in this specification.
The filenames inside the archive are encoded as UTF-8. Although some ZIP clients in common use do not properly display UTF-8 filenames, the encoding is supported by both the ZIP specification and Python's zipfile.
File contents
The contents of a wheel file, where {distribution} is replaced with the name of the package, e.g. beaglevote and {version} is replaced with its version, e.g. 1.0.0, consist of:
/, the root of the archive, contains all files to be installed in purelib or platlib as specified in WHEEL. purelib and platlib are usually both site-packages.
{distribution}-{version}.dist-info/ contains metadata.
{distribution}-{version}.data/ contains one subdirectory for each non-empty install scheme key not already covered, where the subdirectory name is an index into a dictionary of install paths (e.g. data, scripts, include, purelib, platlib).
Python scripts must appear in scripts and begin with exactly b'#!python' in order to enjoy script wrapper generation and #!python rewriting at install time. They may have any or no extension.
{distribution}-{version}.dist-info/METADATA is Metadata version 1.1 or greater format metadata.
{distribution}-{version}.dist-info/WHEEL is metadata about the archive itself in the same basic key: value format:
Wheel-Version: 1.9
Generator: bdist_wheel 1.9
Root-Is-Purelib: true
Tag: py2-none-any
Tag: py3-none-any
Build: 1
Install-Paths-To: wheel/_paths.py
Install-Paths-To: wheel/_paths.json
Wheel-Version is the version number of the Wheel specification.
Generator is the name and optionally the version of the software that produced the archive.
Root-Is-Purelib is true if the top level directory of the archive should be installed into purelib; otherwise the root should be installed into platlib.
Tag is the wheel's expanded compatibility tags; in the example the filename would contain py2.py3-none-any.
Build is the build number and is omitted if there is no build number.
Install-Paths-To is a location relative to the archive that will be overwritten with the install-time paths of each category in the install scheme. See the install paths section. May appear 0 or more times.
A wheel installer should warn if Wheel-Version is greater than the version it supports, and must fail if Wheel-Version has a greater major version than the version it supports.
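The version rule can be sketched as follows (the function name, the supported version, and the use of the warnings module are illustrative):

```python
import warnings

SUPPORTED = (1, 9)  # the highest Wheel-Version this installer understands

def check_wheel_version(wheel_version):
    major, minor = (int(part) for part in wheel_version.split(".")[:2])
    if major > SUPPORTED[0]:
        # Newer major version: the installer must refuse to proceed.
        raise ValueError("unsupported Wheel-Version: %s" % wheel_version)
    if (major, minor) > SUPPORTED:
        # Newer minor version: install anyway, but warn.
        warnings.warn("Wheel-Version %s is newer than supported" % wheel_version)

check_wheel_version("1.9")  # ok, silent
```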
Wheel, being an installation format that is intended to work across multiple versions of Python, does not generally include .pyc files.
Wheel does not contain setup.py or setup.cfg.
The .dist-info directory
- Wheel .dist-info directories include at a minimum METADATA, WHEEL, and RECORD.
- METADATA is the package metadata, the same format as PKG-INFO as found at the root of sdists.
- WHEEL is the wheel metadata specific to a build of the package.
- RECORD is a list of (almost) all the files in the wheel and their secure hashes. Unlike PEP 376, every file except RECORD, which cannot contain a hash of itself, must include its hash. The hash algorithm must be sha256 or better; specifically, md5 and sha1 are not permitted, as signed wheel files rely on the strong hashes in RECORD to validate the integrity of the archive.
- PEP 376's INSTALLER and REQUESTED are not included in the archive.
- RECORD.jws is used for digital signatures. It is not mentioned in RECORD.
- RECORD.p7s is allowed as a courtesy to anyone who would prefer to use S/MIME signatures to secure their wheel files. It is not mentioned in RECORD.
- During extraction, wheel installers verify all the hashes in RECORD against the file contents. Apart from RECORD and its signatures, installation will fail if any file in the archive is not both mentioned and correctly hashed in RECORD.
The .data directory
Any file that is not normally installed inside site-packages goes into the .data directory, named as the .dist-info directory but with the .data/ extension:
distribution-1.0.dist-info/
distribution-1.0.data/
The .data directory contains subdirectories with the scripts, headers, documentation and so forth from the distribution. During installation the contents of these subdirectories are moved onto their destination paths.
If a subdirectory is not found in the install scheme, the installer should emit a warning, and it should be installed at distribution-1.0.data/... as if the package was unpacked by a standard unzip tool.
Install paths
In addition to the distutils install paths, wheel now includes the listed categories based on GNU autotools. This expanded scheme should help installers to implement system policy, but installers may root each category at any location.
A UNIX install scheme might map the categories to their installation paths like this:
{
'bindir': '$eprefix/bin',
'sbindir': '$eprefix/sbin',
'libexecdir': '$eprefix/libexec',
'sysconfdir': '$prefix/etc',
'sharedstatedir': '$prefix/com',
'localstatedir': '$prefix/var',
'libdir': '$eprefix/lib',
'static_libdir': r'$prefix/lib',
'includedir': '$prefix/include',
'datarootdir': '$prefix/share',
'datadir': '$datarootdir',
'mandir': '$datarootdir/man',
'infodir': '$datarootdir/info',
'localedir': '$datarootdir/locale',
'docdir': '$datarootdir/doc/$dist_name',
'htmldir': '$docdir',
'dvidir': '$docdir',
'psdir': '$docdir',
'pdfdir': '$docdir',
'pkgdatadir': '$datadir/$dist_name'
}
If a package needs to find its files at runtime, it can request they be written to a specified file or files by the installer and included in those same files inside the archive itself, relative to their location within the archive (so a wheel is still installed correctly if unpacked with a standard unzip tool, or perhaps not unpacked at all).
If the WHEEL metadata contains these files:
Install-Paths-To: wheel/_paths.py
Install-Paths-To: wheel/_paths.json
Then the wheel installer, when it is about to unpack wheel/_paths.py from the archive, replaces it with the actual paths used at install time. The paths may be absolute or relative to the generated file.
If the filename ends with .py then a Python script is written. The script MUST be executed to get the paths, but it will probably look like this:
data='../wheel-0.26.0.dev1.data/data'
headers='../wheel-0.26.0.dev1.data/headers'
platlib='../wheel-0.26.0.dev1.data/platlib'
purelib='../wheel-0.26.0.dev1.data/purelib'
scripts='../wheel-0.26.0.dev1.data/scripts'
# ...
If the filename ends with .json then a JSON document is written:
{ "data": "../wheel-0.26.0.dev1.data/data", ... }
Only the categories actually used by a particular wheel must be written to this file.
These files are designed to be written to a location that can be found by the installed package without introducing any dependency on a packaging library.
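Generating the .py variant can be sketched as writing one assignment per category (write_paths_py is an illustrative name; the real installer would compute the relative paths from the install scheme):

```python
def write_paths_py(categories):
    """Render the install-time paths as a small Python module."""
    lines = ["%s=%r" % (name, path) for name, path in sorted(categories.items())]
    return "\n".join(lines) + "\n"

source = write_paths_py({
    "purelib": "../wheel-0.26.0.dev1.data/purelib",
    "scripts": "../wheel-0.26.0.dev1.data/scripts",
})

# The generated file is executable Python, as the specification requires.
namespace = {}
exec(source, namespace)
print(namespace["scripts"])  # ../wheel-0.26.0.dev1.data/scripts
```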
Signed wheel files
Wheel files include an extended RECORD that enables digital signatures. PEP 376's RECORD is altered to include a secure hash digestname=urlsafe_b64encode_nopad(digest) (urlsafe base64 encoding with no trailing = characters) as the second column instead of an md5sum. All possible entries are hashed, including any generated files such as .pyc files, but not RECORD, which cannot contain its own hash. For example:
file.py,sha256=AVTFPZpEKzuHr7OvQZmhaU3LvwKz06AJw8mT_pNh2yI,3144
distribution-1.0.dist-info/RECORD,,
The signature file(s) RECORD.jws and RECORD.p7s are not mentioned in RECORD at all since they can only be added after RECORD is generated. Every other file in the archive must have a correct hash in RECORD or the installation will fail.
If JSON web signatures are used, one or more JSON Web Signature JSON Serialization (JWS-JS) signatures are stored in a file RECORD.jws adjacent to RECORD. JWS is used to sign RECORD by including the SHA-256 hash of RECORD as the signature's JSON payload:
{ "hash": "sha256=ADD-r2urObZHcxBW3Cr-vDCu5RJwT4CaRTHiFmbcIYY" }
(The hash value is the same format used in RECORD.)
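The urlsafe_b64encode_nopad form used in RECORD (and in the JWS payload above) can be reproduced with hashlib and base64 (the record_hash helper name is illustrative):

```python
import base64
import hashlib

def record_hash(data):
    """Hash file contents the way RECORD stores them: sha256, urlsafe
    base64, with the trailing '=' padding stripped."""
    digest = hashlib.sha256(data).digest()
    return "sha256=" + base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")

contents = b"print('hi')\n"
line = "file.py,%s,%d" % (record_hash(contents), len(contents))
print(line.startswith("file.py,sha256="))  # True
```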
If RECORD.p7s is used, it must contain a detached S/MIME format signature of RECORD.
A wheel installer is not required to understand digital signatures but MUST verify the hashes in RECORD against the extracted file contents. When the installer checks file hashes against RECORD, a separate signature checker only needs to establish that RECORD matches the signature.
See
Comparison to .egg
- Wheel is an installation format; egg is importable. Wheel archives do not need to include .pyc and are less tied to a specific Python version or implementation. Wheel can install (pure Python) packages built with previous versions of Python so you don't always have to wait for the packager to catch up.
- Wheel uses .dist-info directories; egg uses .egg-info. Wheel is compatible with the new world of Python packaging and the new concepts it brings.
- Wheel has a richer file naming convention for today's multi-implementation world. A single wheel archive can indicate its compatibility with a number of Python language versions and implementations, ABIs, and system architectures. Historically the ABI has been specific to a CPython release; wheel is ready for the stable ABI.
- Wheel is lossless. The first wheel implementation bdist_wheel always generates egg-info, and then converts it to a .whl. It is also possible to convert existing eggs and bdist_wininst distributions.
- Wheel is versioned. Every wheel file contains the version of the wheel specification and the implementation that packaged it. Hopefully the next migration can simply be to Wheel 2.0.
- Wheel is a reference to the other Python.
FAQ
Wheel defines a .data directory. Should I put all my data there?
This specification does not have an opinion on how you should organize your code. The .data directory is just a place for any files that are not normally installed inside site-packages or on the PYTHONPATH. In other words, you may continue to use pkgutil.get_data(package, resource) even though those files will usually not be distributed in wheel's .data directory.
Why does wheel include attached signatures?
Attached signatures are more convenient than detached signatures because they travel with the archive. Since only the individual files are signed, the archive can be recompressed without invalidating the signature or individual files can be verified without having to download the whole archive.
Why does wheel allow JWS signatures?
The JOSE specifications of which JWS is a part are designed to be easy to implement, a feature that is also one of wheel's primary design goals. JWS yields a useful, concise pure-Python implementation.
Why does wheel also allow S/MIME signatures?
S/MIME signatures are allowed for users who need or want to use existing public key infrastructure with wheel.
Signed packages are only a basic building block in a secure package update system. Wheel only provides the building block.
What's the deal with "purelib" vs. "platlib"?
Wheel preserves the "purelib" vs. "platlib" distinction, which is significant on some platforms. For example, Fedora installs pure Python packages to '/usr/lib/pythonX.Y/site-packages' and platform dependent packages to '/usr/lib64/pythonX.Y/site-packages'.
A wheel with "Root-Is-Purelib: false" with all its files in {name}-{version}.data/purelib is equivalent to a wheel with "Root-Is-Purelib: true" with those same files in the root, and it is legal to have files in both the "purelib" and "platlib" categories.
In practice, a wheel should contain only one of "purelib" or "platlib", depending on whether it is pure Python or not, and those files should be at the root with the appropriate setting given for "Root-Is-Purelib".
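For reference, the Root-Is-Purelib setting is recorded in the WHEEL metadata file inside the archive's .dist-info directory; a typical file (the Generator line is illustrative) looks like:

```
Wheel-Version: 1.0
Generator: bdist_wheel 1.0
Root-Is-Purelib: true
Tag: py3-none-any
```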
Is it possible to import Python code directly from a wheel file?
Technically, due to the combination of supporting installation via simple extraction and using an archive format that is compatible with zipimport, a subset of wheel files do support being placed directly on sys.path. However, while this behaviour is a natural consequence of the format design, actually relying on it is generally discouraged.
Firstly, wheel is designed primarily as a distribution format, so skipping the installation step also means deliberately avoiding any reliance on features that assume full installation (such as being able to use standard tools like pip and virtualenv to capture and manage dependencies in a way that can be properly tracked for auditing and security update purposes, or integrating fully with the standard build machinery for C extensions by publishing header files in the appropriate place).
Secondly, while some Python software is written to support running directly from a zip archive, it is still common for code to be written assuming it has been fully installed. When that assumption is broken by trying to run the software from a zip archive, the failures can often be obscure and hard to diagnose (especially when they occur in third party libraries). The two most common sources of problems with this are the fact that importing C extensions from a zip archive is not supported by CPython (since doing so is not supported directly by the dynamic loading machinery on any platform) and that when running from a zip archive the __file__ attribute no longer refers to an ordinary filesystem path, but to a combination path that includes both the location of the zip archive on the filesystem and the relative path to the module inside the archive. Even when software correctly uses the abstract resource APIs internally, interfacing with external components may still require the availability of an actual on-disk file.
Like metaclasses, monkeypatching and metapath importers, if you're not already sure you need to take advantage of this feature, you almost certainly don't need it. If you do decide to use it anyway, be aware that many projects will require a failure to be reproduced with a fully installed package before accepting it as a genuine bug.
References
| [1] | PEP acceptance (http://mail.python.org/pipermail/python-dev/2013-February/124103.html) |
Appendix
Example urlsafe-base64-nopad implementation:
# urlsafe-base64-nopad for Python 3
import base64

def urlsafe_b64encode_nopad(data):
    return base64.urlsafe_b64encode(data).rstrip(b'=')

def urlsafe_b64decode_nopad(data):
    pad = b'=' * (4 - (len(data) & 3))
    return base64.urlsafe_b64decode(data + pad)
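A quick round-trip of the appendix helpers (repeated here so the sketch is self-contained) confirms the padding arithmetic for inputs whose encodings end in zero, one, or two stripped '=' characters:

```python
import base64

def urlsafe_b64encode_nopad(data):
    return base64.urlsafe_b64encode(data).rstrip(b'=')

def urlsafe_b64decode_nopad(data):
    pad = b'=' * (4 - (len(data) & 3))
    return base64.urlsafe_b64decode(data + pad)

# Round-trip across inputs of different lengths.
for raw in (b"a", b"ab", b"abcd", b"abcde", b"hello world"):
    encoded = urlsafe_b64encode_nopad(raw)
    assert b'=' not in encoded            # padding really is stripped
    assert urlsafe_b64decode_nopad(encoded) == raw
```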
Copyright
This document has been placed into the public domain.
pep-0492 Coroutines with async and await syntax
| PEP: | 492 |
|---|---|
| Title: | Coroutines with async and await syntax |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Yury Selivanov <yselivanov at sprymix.com> |
| Discussions-To: | <python-dev at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 09-Apr-2015 |
| Python-Version: | 3.5 |
| Post-History: | 17-Apr-2015, 21-Apr-2015, 27-Apr-2015, 29-Apr-2015, 05-May-2015 |
Contents
- Abstract
- Rationale and Goals
- Specification
- Glossary
- List of functions and methods
- Transition Plan
- Design Considerations
- PEP 3152
- Coroutine-generators
- Why "async" and "await" keywords
- Why "__aiter__" returns awaitable
- Importance of "async" keyword
- Why "async def"
- Why not "await for" and "await with"
- Why "async def" and not "def async"
- Why not a __future__ import
- Why magic methods start with "a"
- Why not reuse existing magic names
- Why not reuse existing "for" and "with" statements
- Comprehensions
- Async lambda functions
- Performance
- Reference Implementation
- Acceptance
- Implementation
- References
- Acknowledgments
- Copyright
Abstract
The growth of the Internet and general connectivity has triggered a proportionate need for responsive and scalable code. This proposal aims to answer that need by making writing explicitly asynchronous, concurrent Python code easier and more Pythonic.
It is proposed to make coroutines a proper standalone concept in Python, and introduce new supporting syntax. The ultimate goal is to help establish a common, easily approachable, mental model of asynchronous programming in Python and make it as close to synchronous programming as possible.
This PEP assumes that the asynchronous tasks are scheduled and coordinated by an Event Loop similar to that of stdlib module asyncio.events.AbstractEventLoop. While the PEP is not tied to any specific Event Loop implementation, it is relevant only to the kind of coroutine that uses yield as a signal to the scheduler, indicating that the coroutine will be waiting until an event (such as IO) is completed.
We believe that the changes proposed here will help keep Python relevant and competitive in a quickly growing area of asynchronous programming, as many other languages have adopted, or are planning to adopt, similar features: [2], [5], [6], [7], [8], [10].
Rationale and Goals
Current Python supports implementing coroutines via generators (PEP 342), further enhanced by the yield from syntax introduced in PEP 380. This approach has a number of shortcomings:
- It is easy to confuse coroutines with regular generators, since they share the same syntax; this is especially true for new developers.
- Whether or not a function is a coroutine is determined by the presence of yield or yield from statements in its body, which can lead to unobvious errors when such statements appear in or disappear from the function body during refactoring.
- Support for asynchronous calls is limited to expressions where yield is allowed syntactically, limiting the usefulness of syntactic features, such as with and for statements.
This proposal makes coroutines a native Python language feature, and clearly separates them from generators. This removes generator/coroutine ambiguity, and makes it possible to reliably define coroutines without reliance on a specific library. This also enables linters and IDEs to improve static code analysis and refactoring.
Native coroutines and the associated new syntax features make it possible to define context manager and iteration protocols in asynchronous terms. As shown later in this proposal, the new async with statement lets Python programs perform asynchronous calls when entering and exiting a runtime context, and the new async for statement makes it possible to perform asynchronous calls in iterators.
Specification
This proposal introduces new syntax and semantics to enhance coroutine support in Python.
This specification presumes knowledge of the implementation of coroutines in Python (PEP 342 and PEP 380). Motivation for the syntax changes proposed here comes from the asyncio framework (PEP 3156) and the "Cofunctions" proposal (PEP 3152, now rejected in favor of this specification).
From this point on, this document uses the term native coroutine to refer to functions declared using the new syntax. generator-based coroutine is used where necessary to refer to coroutines that are based on generator syntax. coroutine is used in contexts where both definitions are applicable.
New Coroutine Declaration Syntax
The following new syntax is used to declare a native coroutine:
async def read_data(db):
    pass
Key properties of coroutines:
async def functions are always coroutines, even if they do not contain await expressions.
It is a SyntaxError to have yield or yield from expressions in an async function.
Internally, two new code object flags were introduced:
- CO_COROUTINE is used to mark native coroutines (defined with new syntax.)
- CO_ITERABLE_COROUTINE is used to make generator-based coroutines compatible with native coroutines (set by types.coroutine() function).
All coroutines have CO_GENERATOR flag set.
Regular generators, when called, return a generator object; similarly, coroutines return a coroutine object.
StopIteration exceptions are not propagated out of coroutines, and are replaced with a RuntimeError. For regular generators such behavior requires a future import (see PEP 479).
When a coroutine is garbage collected, a RuntimeWarning is raised if it was never awaited on (see also Debugging Features.)
See also Coroutine objects section.
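These properties can be observed directly. The sketch below (read_data is an illustrative name) drives a coroutine by hand with send(), which is the same mechanism an event loop uses underneath:

```python
import inspect

async def read_data():
    return 42                    # async def alone makes this a coroutine

coro = read_data()               # calling it does NOT start execution
assert inspect.iscoroutine(coro)

# Drive it the way an event loop would: send(None) starts the body;
# the return value travels in StopIteration.value.
try:
    coro.send(None)
    result = None
except StopIteration as exc:
    result = exc.value
```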
types.coroutine()
A new function coroutine(gen) is added to the types module. It allows interoperability between existing generator-based coroutines in asyncio and native coroutines introduced by this PEP:
@types.coroutine
def process_data(db):
    data = yield from read_data(db)
    ...
The function applies the CO_ITERABLE_COROUTINE flag to the generator function's code object, making it return a coroutine object.
The function can be used as a decorator, since it modifies generator functions in place and returns them.
Note that the CO_COROUTINE flag is not applied by types.coroutine(), to make it possible to distinguish native coroutines defined with the new syntax from generator-based coroutines.
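A minimal interoperability sketch (the nap and main names are illustrative): once decorated with types.coroutine(), a plain generator becomes awaitable from a native coroutine:

```python
import types

@types.coroutine
def nap():
    yield "tick"                 # suspension point, like a Future signalling

async def main():
    await nap()                  # legal: types.coroutine() made nap awaitable
    return "done"

c = main()
tick = c.send(None)              # runs until the yield inside nap()
try:
    c.send(None)                 # resume; nap finishes, main returns
    result = None
except StopIteration as exc:
    result = exc.value
```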
Await Expression
The following new await expression is used to obtain a result of coroutine execution:
async def read_data(db):
    data = await db.fetch('SELECT ...')
    ...
await, similarly to yield from, suspends execution of the read_data coroutine until the db.fetch awaitable completes and returns the result data.
It uses the yield from implementation with an extra step of validating its argument. await only accepts an awaitable, which can be one of:
- A native coroutine object returned from a native coroutine function.
- A generator-based coroutine object returned from a generator function decorated with types.coroutine().
- An object with an __await__ method returning an iterator.
- An object defined with the CPython C API with a tp_as_async->am_await function returning an iterator (similar to the __await__ method).
Any yield from chain of calls ends with a yield. This is a fundamental mechanism of how Futures are implemented. Since, internally, coroutines are a special kind of generator, every await is suspended by a yield somewhere down the chain of await calls (please refer to PEP 3156 for a detailed explanation).
To enable this behavior for coroutines, a new magic method called __await__ is added. In asyncio, for instance, to enable Future objects in await expressions, the only change is to add an __await__ = __iter__ line to the asyncio.Future class.
Objects with an __await__ method are called Future-like objects in the rest of this PEP.
Also, please note that the __aiter__ method (see its definition below) cannot be used for this purpose: it is a different protocol, and would be like using __iter__ instead of __call__ for regular callables.
It is a TypeError if __await__ returns anything but an iterator.
It is a SyntaxError to use await outside of an async def function (just as it is a SyntaxError to use yield outside of a def function).
It is a TypeError to pass anything other than an awaitable object to an await expression.
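As an illustration, a minimal Future-like object can be awaited and driven by hand (the Ready and use_it names are hypothetical):

```python
class Ready:
    """Hypothetical Future-like object: __await__ returns an iterator."""
    def __init__(self, value):
        self.value = value

    def __await__(self):
        # A generator function; its return value becomes the result of
        # the await expression (delivered via StopIteration.value).
        if False:
            yield
        return self.value

async def use_it():
    return await Ready(99)

c = use_it()
try:
    c.send(None)
    result = None
except StopIteration as exc:
    result = exc.value
```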
Updated operator precedence table
await keyword is defined as follows:
power ::= await ["**" u_expr]
await ::= ["await"] primary
where "primary" represents the most tightly bound operations of the language. Its syntax is:
primary ::= atom | attributeref | subscription | slicing | call
See Python Documentation [12] and Grammar Updates section of this proposal for details.
The key difference between await and the yield and yield from operators is that await expressions do not require parentheses around them most of the time.
Also, yield from allows any expression as its argument, including expressions like yield from a() + b(), which is parsed as yield from (a() + b()) and is almost always a bug. In general, the result of an arithmetic operation is not an awaitable object. To avoid this kind of mistake, it was decided to make await precedence lower than [], (), and ., but higher than **.
| Operator | Description |
|---|---|
| yield x, yield from x | Yield expression |
| lambda | Lambda expression |
| if -- else | Conditional expression |
| or | Boolean OR |
| and | Boolean AND |
| not x | Boolean NOT |
| in, not in, is, is not, <, <=, >, >=, !=, == | Comparisons, including membership tests and identity tests |
| | | Bitwise OR |
| ^ | Bitwise XOR |
| & | Bitwise AND |
| <<, >> | Shifts |
| +, - | Addition and subtraction |
| *, @, /, //, % | Multiplication, matrix multiplication, division, remainder |
| +x, -x, ~x | Positive, negative, bitwise NOT |
| ** | Exponentiation |
| await x | Await expression |
| x[index], x[index:index], x(arguments...), x.attribute | Subscription, slicing, call, attribute reference |
| (expressions...), [expressions...], {key: value...}, {expressions...} | Binding or tuple display, list display, dictionary display, set display |
Examples of "await" expressions
Valid syntax examples:
| Expression | Will be parsed as |
|---|---|
| if await fut: pass | if (await fut): pass |
| if await fut + 1: pass | if (await fut) + 1: pass |
| pair = await fut, 'spam' | pair = (await fut), 'spam' |
| with await fut, open(): pass | with (await fut), open(): pass |
| await foo()['spam'].baz()() | await ( foo()['spam'].baz()() ) |
| return await coro() | return ( await coro() ) |
| res = await coro() ** 2 | res = (await coro()) ** 2 |
| func(a1=await coro(), a2=0) | func(a1=(await coro()), a2=0) |
| await foo() + await bar() | (await foo()) + (await bar()) |
| -await foo() | -(await foo()) |
Invalid syntax examples:
| Expression | Should be written as |
|---|---|
| await await coro() | await (await coro()) |
| await -coro() | await (-coro()) |
Asynchronous Context Managers and "async with"
An asynchronous context manager is a context manager that is able to suspend execution in its enter and exit methods.
To make this possible, a new protocol for asynchronous context managers is proposed. Two new magic methods are added: __aenter__ and __aexit__. Both must return an awaitable.
An example of an asynchronous context manager:
class AsyncContextManager:
    async def __aenter__(self):
        await log('entering context')

    async def __aexit__(self, exc_type, exc, tb):
        await log('exiting context')
New Syntax
A new statement for asynchronous context managers is proposed:
async with EXPR as VAR:
BLOCK
which is semantically equivalent to:
mgr = (EXPR)
aexit = type(mgr).__aexit__
aenter = type(mgr).__aenter__(mgr)
exc = True

VAR = await aenter
try:
    BLOCK
except:
    if not await aexit(mgr, *sys.exc_info()):
        raise
else:
    await aexit(mgr, None, None, None)
As with regular with statements, it is possible to specify multiple context managers in a single async with statement.
It is an error to pass a regular context manager without __aenter__ and __aexit__ methods to async with. It is a SyntaxError to use async with outside of an async def function.
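Putting the rules together, here is a minimal runnable sketch (the Managed and main names are illustrative; asyncio.run() is the modern entry point, added in Python 3.7):

```python
import asyncio

class Managed:
    """Illustrative async context manager (not from the PEP)."""
    async def __aenter__(self):
        await asyncio.sleep(0)   # pretend to acquire a resource
        return "resource"

    async def __aexit__(self, exc_type, exc, tb):
        await asyncio.sleep(0)   # pretend to release it
        return False             # do not suppress exceptions

async def main():
    async with Managed() as res:
        return res

result = asyncio.run(main())
```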
Example
With asynchronous context managers it is easy to implement proper database transaction managers for coroutines:
async def commit(session, data):
    ...

    async with session.transaction():
        ...
        await session.update(data)
        ...
Code that needs locking also looks lighter:
async with lock:
    ...

instead of:

with (yield from lock):
    ...
Asynchronous Iterators and "async for"
An asynchronous iterable is able to call asynchronous code in its iter implementation, and an asynchronous iterator can call asynchronous code in its next method. To support asynchronous iteration:
- An object must implement an __aiter__ method returning an awaitable resulting in an asynchronous iterator object.
- An asynchronous iterator object must implement an __anext__ method returning an awaitable.
- To stop iteration __anext__ must raise a StopAsyncIteration exception.
An example of asynchronous iterable:
class AsyncIterable:
    async def __aiter__(self):
        return self

    async def __anext__(self):
        data = await self.fetch_data()
        if data:
            return data
        else:
            raise StopAsyncIteration

    async def fetch_data(self):
        ...
New Syntax
A new statement for iterating through asynchronous iterators is proposed:
async for TARGET in ITER:
    BLOCK
else:
    BLOCK2
which is semantically equivalent to:
iter = (ITER)
iter = await type(iter).__aiter__(iter)
running = True
while running:
    try:
        TARGET = await type(iter).__anext__(iter)
    except StopAsyncIteration:
        running = False
    else:
        BLOCK
else:
    BLOCK2
It is a TypeError to pass a regular iterable without __aiter__ method to async for. It is a SyntaxError to use async for outside of an async def function.
Like the regular for statement, async for has an optional else clause.
Example 1
With asynchronous iteration protocol it is possible to asynchronously buffer data during iteration:
async for data in cursor:
    ...
Where cursor is an asynchronous iterator that prefetches N rows of data from a database after every N iterations.
The following code illustrates new asynchronous iteration protocol:
class Cursor:
    def __init__(self):
        self.buffer = collections.deque()

    async def _prefetch(self):
        ...

    async def __aiter__(self):
        return self

    async def __anext__(self):
        if not self.buffer:
            self.buffer = await self._prefetch()
        if not self.buffer:
            raise StopAsyncIteration
        return self.buffer.popleft()
then the Cursor class can be used as follows:
async for row in Cursor():
    print(row)
which would be equivalent to the following code:
i = await Cursor().__aiter__()
while True:
    try:
        row = await i.__anext__()
    except StopAsyncIteration:
        break
    else:
        print(row)
Example 2
The following is a utility class that transforms a regular iterable to an asynchronous one. While this is not a very useful thing to do, the code illustrates the relationship between regular and asynchronous iterators.
class AsyncIteratorWrapper:
    def __init__(self, obj):
        self._it = iter(obj)

    async def __aiter__(self):
        return self

    async def __anext__(self):
        try:
            value = next(self._it)
        except StopIteration:
            raise StopAsyncIteration
        return value

async for letter in AsyncIteratorWrapper("abc"):
    print(letter)
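Note that after this PEP was finalized, the protocol was adjusted (in Python 3.5.2) so that __aiter__ returns the asynchronous iterator directly rather than an awaitable. A complete, runnable version of the wrapper on current Python therefore looks like:

```python
import asyncio

class AsyncIteratorWrapper:
    """As above, but with the refined protocol: __aiter__ returns self
    directly instead of an awaitable (Python 3.5.2+)."""
    def __init__(self, obj):
        self._it = iter(obj)

    def __aiter__(self):
        return self

    async def __anext__(self):
        try:
            return next(self._it)
        except StopIteration:
            raise StopAsyncIteration

async def collect():
    letters = []
    async for letter in AsyncIteratorWrapper("abc"):
        letters.append(letter)
    return letters

letters = asyncio.run(collect())
```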
Why StopAsyncIteration?
Coroutines are still based on generators internally. So, before PEP 479, there was no fundamental difference between
def g1():
    yield from fut
    return 'spam'

and

def g2():
    yield from fut
    raise StopIteration('spam')
And since PEP 479 is accepted and enabled by default for coroutines, the following example will have its StopIteration wrapped into a RuntimeError
async def a1():
    await fut
    raise StopIteration('spam')
The only way to tell the outside code that the iteration has ended is to raise something other than StopIteration. Therefore, a new built-in exception class StopAsyncIteration was added.
Moreover, with semantics from PEP 479, all StopIteration exceptions raised in coroutines are wrapped in RuntimeError.
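This wrapping can be observed directly on a modern interpreter (the bad name is illustrative; asyncio.run() requires Python 3.7+):

```python
import asyncio

async def bad():
    raise StopIteration('spam')  # PEP 479 semantics apply to coroutines

# The StopIteration does not leak out as-is: it surfaces as a RuntimeError.
wrapped = False
try:
    asyncio.run(bad())
except RuntimeError:
    wrapped = True
```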
Coroutine objects
Differences from generators
This section applies only to native coroutines with CO_COROUTINE flag, i.e. defined with the new async def syntax.
The behavior of existing *generator-based coroutines* in asyncio remains unchanged.
Great effort has been made to make sure that coroutines and generators are treated as distinct concepts:
Native coroutine objects do not implement __iter__ and __next__ methods. Therefore, they cannot be iterated over or passed to iter(), list(), tuple() and other built-ins. They also cannot be used in a for..in loop.
An attempt to use __iter__ or __next__ on a native coroutine object will result in a TypeError.
Plain generators cannot yield from native coroutines: doing so will result in a TypeError.
generator-based coroutines (for asyncio code must be decorated with @asyncio.coroutine) can yield from native coroutine objects.
inspect.isgenerator() and inspect.isgeneratorfunction() return False for native coroutine objects and native coroutine functions.
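A quick check of these differences (illustrative names; runnable on any modern Python):

```python
import inspect

async def coro():
    pass

c = coro()
assert inspect.iscoroutine(c)
assert not inspect.isgenerator(c)
assert not inspect.isgeneratorfunction(coro)

# Native coroutines implement no __iter__/__next__: iter() raises TypeError.
not_iterable = False
try:
    iter(c)
except TypeError:
    not_iterable = True
c.close()                        # avoid a "never awaited" RuntimeWarning
```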
Coroutine object methods
Coroutines are based on generators internally, thus they share the implementation. Similarly to generator objects, coroutines have throw(), send() and close() methods. StopIteration and GeneratorExit play the same role for coroutines (although PEP 479 is enabled by default for coroutines). See PEP 342, PEP 380, and Python Documentation [11] for details.
throw(), send() methods for coroutines are used to push values and raise errors into Future-like objects.
Debugging Features
A common beginner mistake is forgetting to use yield from on coroutines:
@asyncio.coroutine
def useful():
    asyncio.sleep(1)  # this will do nothing without 'yield from'
For debugging this kind of mistake there is a special debug mode in asyncio, in which the @coroutine decorator wraps all functions with a special object whose destructor logs a warning. Whenever a wrapped generator is garbage collected, a detailed logging message is generated with information about where exactly the decorated function was defined, a stack trace of where it was garbage collected, etc. The wrapper object also provides a convenient __repr__ function with detailed information about the generator.
The only problem is how to enable these debug capabilities. Since debug facilities should be a no-op in production mode, the @coroutine decorator makes the decision of whether to wrap or not to wrap based on an OS environment variable, PYTHONASYNCIODEBUG. This way it is possible to run asyncio programs with asyncio's own functions instrumented. EventLoop.set_debug, a different debug facility, has no impact on the @coroutine decorator's behavior.
With this proposal, coroutines become a native concept, distinct from generators. In addition to the RuntimeWarning raised for coroutines that were never awaited, it is proposed to add two new functions to the sys module: set_coroutine_wrapper and get_coroutine_wrapper. These enable advanced debugging facilities in asyncio and other frameworks (such as displaying where exactly a coroutine was created, and a more detailed stack trace of where it was garbage collected).
New Standard Library Functions
- types.coroutine(gen). See types.coroutine() section for details.
- inspect.iscoroutine(obj) returns True if obj is a coroutine object.
- inspect.iscoroutinefunction(obj) returns True if obj is a coroutine function.
- inspect.isawaitable(obj) returns True if obj can be used in await expression. See Await Expression for details.
- sys.set_coroutine_wrapper(wrapper) allows intercepting the creation of coroutine objects. wrapper must be either a callable that accepts one argument (a coroutine object), or None. None resets the wrapper. If called twice, the new wrapper replaces the previous one. The function is thread-specific. See Debugging Features for more details.
- sys.get_coroutine_wrapper() returns the current wrapper object. Returns None if no wrapper was set. The function is thread-specific. See Debugging Features for more details.
New Abstract Base Classes
In order to allow better integration with existing frameworks (such as Tornado, see [13]) and compilers (such as Cython, see [16]), two new Abstract Base Classes (ABC) are added:
- collections.abc.Awaitable ABC for Future-like classes, that implement __await__ method.
- collections.abc.Coroutine ABC for coroutine objects, that implement send(value), throw(type, exc, tb), close() and __await__() methods.
To allow easy testing if objects support asynchronous iteration, two more ABCs are added:
- collections.abc.AsyncIterable -- tests for __aiter__ method.
- collections.abc.AsyncIterator -- tests for __aiter__ and __anext__ methods.
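Since these ABCs use structural __subclasshook__ checks, any object with the right methods is recognized; a small sketch with illustrative names:

```python
from collections.abc import Awaitable, Coroutine, AsyncIterable, AsyncIterator

async def c():
    pass

obj = c()
awaitable_ok = isinstance(obj, Awaitable) and isinstance(obj, Coroutine)
obj.close()                      # avoid a "never awaited" RuntimeWarning

class It:
    """Minimal shape recognized by the async-iteration ABCs."""
    def __aiter__(self):
        return self

    async def __anext__(self):
        raise StopAsyncIteration

iter_ok = isinstance(It(), AsyncIterable) and isinstance(It(), AsyncIterator)
```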
Glossary
- Native coroutine function
- A coroutine function is declared with async def. It uses await and return value; see New Coroutine Declaration Syntax for details.
- Native coroutine
- Returned from a native coroutine function. See Await Expression for details.
- Generator-based coroutine function
- Coroutines based on generator syntax. The most common examples are functions decorated with @asyncio.coroutine.
- Generator-based coroutine
- Returned from a generator-based coroutine function.
- Coroutine
- Either native coroutine or generator-based coroutine.
- Coroutine object
- Either native coroutine object or generator-based coroutine object.
- Future-like object
- An object with an __await__ method, or a C object with tp_as_async->am_await function, returning an iterator. Can be consumed by an await expression in a coroutine. A coroutine waiting for a Future-like object is suspended until the Future-like object's __await__ completes, and returns the result. See Await Expression for details.
- Awaitable
- A Future-like object or a coroutine object. See Await Expression for details.
- Asynchronous context manager
- An asynchronous context manager has __aenter__ and __aexit__ methods and can be used with async with. See Asynchronous Context Managers and "async with" for details.
- Asynchronous iterable
- An object with an __aiter__ method, which must return an asynchronous iterator object. Can be used with async for. See Asynchronous Iterators and "async for" for details.
- Asynchronous iterator
- An asynchronous iterator has an __anext__ method. See Asynchronous Iterators and "async for" for details.
List of functions and methods
| Method | Can contain | Can't contain |
|---|---|---|
| async def func | await, return value | yield, yield from |
| async def __a*__ | await, return value | yield, yield from |
| def __a*__ | return awaitable | await |
| def __await__ | yield, yield from, return iterable | await |
| generator | yield, yield from, return value | await |
Where:
- "async def func": native coroutine;
- "async def __a*__": __aiter__, __anext__, __aenter__, __aexit__ defined with the async keyword;
- "def __a*__": __aiter__, __anext__, __aenter__, __aexit__ defined without the async keyword, must return an awaitable;
- "def __await__": __await__ method to implement Future-like objects;
- generator: a "regular" generator, a function defined with def that contains at least one yield or yield from expression.
Transition Plan
To avoid backwards compatibility issues with the async and await keywords, it was decided to modify tokenizer.c in such a way that it:
- recognizes async def NAME tokens combination;
- keeps track of regular def and async def indented blocks;
- while tokenizing async def block, it replaces 'async' NAME token with ASYNC, and 'await' NAME token with AWAIT;
- while tokenizing def block, it yields 'async' and 'await' NAME tokens as is.
This approach allows for seamless combination of new syntax features (all of them available only in async functions) with any existing code.
An example of having "async def" and "async" attribute in one piece of code:
class Spam:
    async = 42

async def ham():
    print(getattr(Spam, 'async'))

# The coroutine can be executed and will print '42'
Backwards Compatibility
This proposal preserves 100% backwards compatibility.
asyncio
The asyncio module was adapted and tested to work with coroutines and the new statements. Backwards compatibility is 100% preserved, i.e. all existing code will work as-is.
The required changes are mainly:
- Modify @asyncio.coroutine decorator to use new types.coroutine() function.
- Add __await__ = __iter__ line to asyncio.Future class.
- Add ensure_future() as an alias for async() function. Deprecate async() function.
asyncio migration strategy
Because plain generators cannot yield from native coroutine objects (see Differences from generators section for more details), it is advised to make sure that all generator-based coroutines are decorated with @asyncio.coroutine before starting to use the new syntax.
async/await in CPython code base
There is no use of await names in CPython.
async is mostly used by asyncio. We are addressing this by renaming the async() function to ensure_future() (see the asyncio section for details).
Another use of the async keyword is in Lib/xml/dom/xmlbuilder.py, which defines an async = False attribute for the DocumentLS class. There is no documentation or tests for it, and it is not used anywhere else in CPython. It is replaced with a getter that raises a DeprecationWarning, advising the use of the async_ attribute instead.
Grammar Updates
Grammar changes are fairly minimal:
decorated: decorators (classdef | funcdef | async_funcdef)
async_funcdef: ASYNC funcdef

compound_stmt: (if_stmt | while_stmt | for_stmt | try_stmt | with_stmt
              | funcdef | classdef | decorated | async_stmt)
async_stmt: ASYNC (funcdef | with_stmt | for_stmt)

power: atom_expr ['**' factor]
atom_expr: [AWAIT] atom trailer*
Transition Period Shortcomings
There is just one.
Until async and await become proper keywords, it is not possible (or at least very hard) to fix tokenizer.c to recognize them on the same line as the def keyword:
# async and await will always be parsed as variables

async def outer():                           # 1
    def nested(a=(await fut)):
        pass

async def foo(): return (await fut)          # 2
Since await and async in such cases are parsed as NAME tokens, a SyntaxError will be raised.
To work around these issues, the above examples can easily be rewritten in a more readable form:
async def outer():                           # 1
    a_default = await fut
    def nested(a=a_default):
        pass

async def foo():                             # 2
    return (await fut)
This limitation will go away as soon as async and await are proper keywords.
Deprecation Plans
async and await names will be softly deprecated in CPython 3.5 and 3.6. In 3.7 we will transform them to proper keywords. Making async and await proper keywords before 3.7 might make it harder for people to port their code to Python 3.
Design Considerations
PEP 3152
PEP 3152 by Gregory Ewing proposes a different mechanism for coroutines (called "cofunctions"). Some key points:
A new keyword codef to declare a cofunction. A cofunction is always a generator, even if there are no cocall expressions inside it. Maps to async def in this proposal.
A new keyword cocall to call a cofunction. Can only be used inside a cofunction. Maps to await in this proposal (with some differences, see below.)
It is not possible to call a cofunction without a cocall keyword.
cocall grammatically requires parentheses after it:
atom: cocall | <existing alternatives for atom>
cocall: 'cocall' atom cotrailer* '(' [arglist] ')'
cotrailer: '[' subscriptlist ']' | '.' NAME

cocall f(*args, **kwds) is semantically equivalent to yield from f.__cocall__(*args, **kwds).
Differences from this proposal:
There is no equivalent of __cocall__ in this PEP; in PEP 3152 it is called and its result is passed to yield from in the cocall expression. The await keyword expects an awaitable object, validates the type, and executes yield from on it. The __await__ method is similar to __cocall__, but is only used to define Future-like objects.
await is defined in almost the same way as yield from in the grammar (it is later enforced that await can only be used inside async def). It is possible to simply write await future, whereas cocall always requires parentheses.
To make asyncio work with PEP 3152 it would be required to modify the @asyncio.coroutine decorator to wrap all functions in an object with a __cocall__ method, or to implement __cocall__ on generators. To call cofunctions from existing generator-based coroutines it would be required to use a costart(cofunc, *args, **kwargs) built-in.
Since it is impossible to call a cofunction without the cocall keyword, it automatically prevents the common mistake of forgetting to use yield from on generator-based coroutines. This proposal addresses this problem with a different approach; see Debugging Features.
A shortcoming of requiring a cocall keyword to call a coroutine is that if it is decided to implement coroutine-generators -- coroutines with yield or async yield expressions -- we wouldn't need a cocall keyword to call them. So we would end up having __cocall__ and no __call__ for regular coroutines, and __call__ and no __cocall__ for coroutine-generators.
Requiring parentheses grammatically also introduces a whole lot of new problems.
The following code:
await fut
await function_returning_future()
await asyncio.gather(coro1(arg1, arg2), coro2(arg1, arg2))
would look like:
cocall fut()  # or cocall costart(fut)
cocall (function_returning_future())()
cocall asyncio.gather(costart(coro1, arg1, arg2),
                      costart(coro2, arg1, arg2))

There are no equivalents of async for and async with in PEP 3152.
Coroutine-generators
With the async for keyword it is desirable to have a concept of a coroutine-generator -- a coroutine with yield and yield from expressions. To avoid any ambiguity with regular generators, we would likely require an async keyword before yield, and async yield from would raise a StopAsyncIteration exception.
While it is possible to implement coroutine-generators, we believe that they are out of the scope of this proposal. It is an advanced concept that should be carefully considered and balanced, and it requires non-trivial changes in the implementation of current generator objects. This is a matter for a separate PEP.
Why "async" and "await" keywords
async/await is not a new concept in programming languages:
- C# has had it for a long time [5];
- there is a proposal to add async/await to ECMAScript 7 [2]; see also the Traceur project [9];
- Facebook's Hack/HHVM [6];
- Google's Dart language [7];
- Scala [8];
- there is a proposal to add async/await to C++ [10];
- and many other less popular languages.
This is a huge benefit, as some users already have experience with async/await, and because it makes working with many languages in one project easier (Python with ECMAScript 7 for instance).
Why "__aiter__" returns awaitable
In principle, __aiter__ could be a regular function. There are several good reasons to make it a coroutine:
- as most of the __anext__, __aenter__, and __aexit__ methods are coroutines, users would often mistakenly define __aiter__ as a coroutine anyway;
- there might be a need to run some asynchronous operations in __aiter__, for instance to prepare DB queries or perform some file operation.
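As an illustration, here is a minimal asynchronous iterator (the class name is hypothetical). Note that CPython 3.5.2 later revised the protocol so that __aiter__ returns the iterator directly rather than an awaitable; this runnable sketch follows that later form, while keeping __anext__ a coroutine so it can perform asynchronous setup work before producing each item:

```python
import asyncio

class Ticker:
    # Hypothetical async iterator: __anext__ is a coroutine, so it can
    # await asynchronous work (e.g. a DB query) before yielding a value.
    def __init__(self, count):
        self._count = count
        self._i = 0

    def __aiter__(self):
        return self

    async def __anext__(self):
        if self._i >= self._count:
            raise StopAsyncIteration
        await asyncio.sleep(0)  # stand-in for a real async operation
        self._i += 1
        return self._i

async def collect():
    items = []
    async for item in Ticker(3):
        items.append(item)
    return items

print(asyncio.run(collect()))  # [1, 2, 3]
```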
Importance of "async" keyword
While it is possible to implement just the await expression and treat all functions with at least one await as coroutines, this approach makes API design, code refactoring, and long-term support harder.
Let's pretend that Python only has await keyword:
def useful():
...
await log(...)
...
def important():
await useful()
If the useful() function is refactored and someone removes all await expressions from it, it becomes a regular Python function, and all code that depends on it, including important(), is broken. To mitigate this issue a decorator similar to @asyncio.coroutine has to be introduced.
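By contrast, with the design in this PEP the async def marker, not the presence of await, is what makes a function a coroutine, so the refactoring described above is safe. A minimal sketch (function bodies illustrative):

```python
import asyncio

async def useful():
    # Imagine all await expressions were removed during refactoring;
    # because of the async def marker this is still a coroutine function.
    return 42

async def important():
    return await useful()  # still valid: useful() remains awaitable

print(asyncio.run(important()))  # 42
```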
Why "async def"
For some people the bare async name(): pass syntax might look more appealing than async def name(): pass. It is certainly easier to type. But on the other hand, it breaks the symmetry between async def, async with and async for, where async is a modifier stating that the statement is asynchronous. Using async def is also more consistent with the existing grammar.
Why not "await for" and "await with"
async is an adjective, and hence it is a better choice for a statement qualifier keyword. await for/with would imply that something is awaiting the completion of a for or with statement.
Why "async def" and not "def async"
The async keyword is a statement qualifier. Good analogies are the "static", "public", and "unsafe" keywords from other languages. "async for" is an asynchronous "for" statement, "async with" is an asynchronous "with" statement, and "async def" is an asynchronous function.
Having "async" after the main statement keyword might introduce some confusion: "for async item in iterator" can be read as "for each asynchronous item in iterator".
Having the async keyword before def, with and for also makes the language grammar simpler, and "async def" better separates coroutines from regular functions visually.
Why not a __future__ import
The Transition Plan section explains how the tokenizer is modified to treat async and await as keywords only in async def blocks. Hence async def fills the role that a module-level compiler declaration like from __future__ import async_await would otherwise fill.
Why magic methods start with "a"
The new asynchronous magic methods __aiter__, __anext__, __aenter__, and __aexit__ all start with the same prefix "a". An alternative proposal was to use an "async" prefix, so that __aiter__ becomes __async_iter__. However, to align the new magic methods with existing ones, such as __radd__ and __iadd__, it was decided to use the shorter version.
Why not reuse existing magic names
An alternative idea about new asynchronous iterators and context managers was to reuse existing magic methods, by adding an async keyword to their declarations:
class CM:
async def __enter__(self): # instead of __aenter__
...
This approach has the following downsides:
- it would not be possible to create an object that works in both with and async with statements;
- it would break backwards compatibility, as nothing prohibits returning Future-like objects from __enter__ and/or __exit__ in Python <= 3.4;
- one of the main points of this proposal is to make native coroutines as simple and foolproof as possible, hence the clear separation of the protocols.
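With the separate protocols that the PEP does adopt, a single object can support both statements, which addresses the first downside above. A minimal sketch (class name hypothetical):

```python
import asyncio

class DualCM:
    # Synchronous protocol
    def __enter__(self):
        return 'sync'
    def __exit__(self, *exc):
        return False
    # Asynchronous protocol, kept cleanly separate via __aenter__/__aexit__
    async def __aenter__(self):
        return 'async'
    async def __aexit__(self, *exc):
        return False

with DualCM() as mode:
    print(mode)       # sync

async def main():
    async with DualCM() as mode:
        return mode

print(asyncio.run(main()))  # async
```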
Why not reuse existing "for" and "with" statements
The vision behind existing generator-based coroutines and this proposal is to make it easy for users to see where the code might be suspended. Making the existing "for" and "with" statements recognize asynchronous iterators and context managers would inevitably create implicit suspension points, making it harder to reason about the code.
Comprehensions
Syntax for asynchronous comprehensions could be provided, but this construct is outside of the scope of this PEP.
Async lambda functions
Syntax for asynchronous lambda functions could be provided, but this construct is outside of the scope of this PEP.
Performance
Overall Impact
This proposal introduces no observable performance impact. Here is the output of Python's official set of benchmarks [4]:
python perf.py -r -b default ../cpython/python.exe ../cpython-aw/python.exe

[skipped]

Report on Darwin ysmac 14.3.0 Darwin Kernel Version 14.3.0:
Mon Mar 23 11:59:05 PDT 2015; root:xnu-2782.20.48~5/RELEASE_X86_64 x86_64 i386
Total CPU cores: 8

### etree_iterparse ###
Min: 0.365359 -> 0.349168: 1.05x faster
Avg: 0.396924 -> 0.379735: 1.05x faster
Significant (t=9.71)
Stddev: 0.01225 -> 0.01277: 1.0423x larger

The following not significant results are hidden, use -v to show them:
django_v2, 2to3, etree_generate, etree_parse, etree_process, fastpickle,
fastunpickle, json_dump_v2, json_load, nbody, regex_v8, tornado_http.
Tokenizer modifications
There is no observable slowdown when parsing Python files with the modified tokenizer: parsing one 12 MB file (Lib/test/test_binop.py repeated 1000 times) takes the same amount of time.
async/await
The following micro-benchmark was used to determine performance difference between "async" functions and generators:
import sys
import time
def binary(n):
if n <= 0:
return 1
l = yield from binary(n - 1)
r = yield from binary(n - 1)
return l + 1 + r
async def abinary(n):
if n <= 0:
return 1
l = await abinary(n - 1)
r = await abinary(n - 1)
return l + 1 + r
def timeit(gen, depth, repeat):
t0 = time.time()
for _ in range(repeat):
list(gen(depth))
t1 = time.time()
print('{}({}) * {}: total {:.3f}s'.format(
gen.__name__, depth, repeat, t1-t0))
The result is that there is no observable performance difference. Minimum timing of 3 runs:

abinary(19) * 30: total 12.985s
binary(19) * 30: total 12.953s
Note that a depth of 19 means 1,048,575 calls.
Reference Implementation
The reference implementation can be found here: [3].
List of high-level changes and new protocols
- New syntax for defining coroutines: async def and new await keyword.
- New __await__ method for Future-like objects, and new tp_as_async->am_await slot in PyTypeObject.
- New syntax for asynchronous context managers: async with. And associated protocol with __aenter__ and __aexit__ methods.
- New syntax for asynchronous iteration: async for. And associated protocol with __aiter__, __anext__ and a new built-in exception StopAsyncIteration. New tp_as_async->am_aiter and tp_as_async->am_anext slots in PyTypeObject.
- New AST nodes: AsyncFunctionDef, AsyncFor, AsyncWith, Await.
- New functions: sys.set_coroutine_wrapper(callback), sys.get_coroutine_wrapper(), types.coroutine(gen), inspect.iscoroutinefunction(func), inspect.iscoroutine(obj), and inspect.isawaitable(obj).
- New CO_COROUTINE and CO_ITERABLE_COROUTINE bit flags for code objects.
- New ABCs: collections.abc.Awaitable, collections.abc.Coroutine, collections.abc.AsyncIterable, and collections.abc.AsyncIterator.
While the list of changes and new things is not short, it is important to understand that most users will not use these features directly. They are intended to be used in frameworks and libraries to provide users with convenient and unambiguous APIs built on the async def, await, async for and async with syntax.
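The introspection helpers listed above can be exercised directly. A sketch, assuming the behaviour of CPython 3.5+:

```python
import inspect
import types
from collections.abc import Awaitable, Coroutine

async def native():
    return 1

@types.coroutine
def generator_based():
    yield

# A native coroutine function and its coroutine object
assert inspect.iscoroutinefunction(native)
coro = native()
assert inspect.iscoroutine(coro)
assert inspect.isawaitable(coro)
assert isinstance(coro, Awaitable)
assert isinstance(coro, Coroutine)
coro.close()  # avoid a "coroutine was never awaited" warning

# types.coroutine sets CO_ITERABLE_COROUTINE, making the
# generator-based coroutine awaitable as well
assert inspect.isawaitable(generator_based())
```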
Working example
All concepts proposed in this PEP are implemented [3] and can be tested.
import asyncio
async def echo_server():
print('Serving on localhost:8000')
await asyncio.start_server(handle_connection,
'localhost', 8000)
async def handle_connection(reader, writer):
print('New connection...')
while True:
data = await reader.read(8192)
if not data:
break
print('Sending {:.10}... back'.format(repr(data)))
writer.write(data)
loop = asyncio.get_event_loop()
loop.run_until_complete(echo_server())
try:
loop.run_forever()
finally:
loop.close()
Implementation
The implementation is tracked in issue 24017 [15]. It was committed on May 11, 2015.
References
| [1] | https://docs.python.org/3/library/asyncio-task.html#asyncio.coroutine |
| [2] | (1, 2) http://wiki.ecmascript.org/doku.php?id=strawman:async_functions |
| [3] | (1, 2) https://github.com/1st1/cpython/tree/await |
| [4] | https://hg.python.org/benchmarks |
| [5] | (1, 2) https://msdn.microsoft.com/en-us/library/hh191443.aspx |
| [6] | (1, 2) http://docs.hhvm.com/manual/en/hack.async.php |
| [7] | (1, 2) https://www.dartlang.org/articles/await-async/ |
| [8] | (1, 2) http://docs.scala-lang.org/sips/pending/async.html |
| [9] | https://github.com/google/traceur-compiler/wiki/LanguageFeatures#async-functions-experimental |
| [10] | (1, 2) http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3722.pdf (PDF) |
| [11] | https://docs.python.org/3/reference/expressions.html#generator-iterator-methods |
| [12] | https://docs.python.org/3/reference/expressions.html#primaries |
| [13] | https://mail.python.org/pipermail/python-dev/2015-May/139851.html |
| [14] | https://mail.python.org/pipermail/python-dev/2015-May/139844.html |
| [15] | http://bugs.python.org/issue24017 |
| [16] | https://github.com/python/asyncio/issues/233 |
Acknowledgments
I thank Guido van Rossum, Victor Stinner, Elvis Pranskevichus, Andrew Svetlov, Łukasz Langa, Greg Ewing, Stephen J. Turnbull, Jim J. Jewett, Brett Cannon, Nick Coghlan, Steven D'Aprano, Paul Moore, Nathaniel Smith, Ethan Furman, Stefan Behnel, Paul Sokolovsky, Victor Petrovykh, and many others for their feedback, ideas, edits, criticism, code reviews, and discussions around this PEP.
Copyright
This document has been placed in the public domain.
pep-0493 HTTPS verification recommendations for Python 2.7 redistributors
| PEP: | 493 |
|---|---|
| Title: | HTTPS verification recommendations for Python 2.7 redistributors |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Nick Coghlan <ncoghlan at gmail.com>, Robert Kuska <rkuska at redhat.com> |
| Status: | Draft |
| Type: | Informational |
| Content-Type: | text/x-rst |
| Created: | 10-May-2015 |
Contents
Abstract
PEP 476 updated Python's default handling of HTTPS certificates to be appropriate for communication over the public internet. The Python 2.7 long term maintenance series was judged to be in scope for this change, with the new behaviour introduced in the Python 2.7.9 maintenance release.
This PEP provides recommendations to downstream redistributors wishing to provide a smoother migration experience when helping their users to manage this change in Python's default behaviour.
Note that this PEP is not currently accepted, so it is a *proposed* recommendation, rather than an active one.
Rationale
PEP 476 changed Python's default behaviour to better match the needs and expectations of developers operating over the public internet, a category which appears to include most new Python developers. It is the position of the authors of this PEP that this was a correct decision.
However, it is also the case that this change does cause problems for infrastructure administrators operating private intranets that rely on self-signed certificates, or otherwise encounter problems with the new default certificate verification settings.
The long term answer for such environments is to update their internal certificate management to at least match the standards set by the public internet, but in the meantime, it is desirable to offer these administrators a way to continue receiving maintenance updates to the Python 2.7 series, without having to gate that on upgrades to their certificate management infrastructure.
PEP 476 did attempt to address this question, by covering how to revert the new settings process wide by monkeypatching the ssl module to restore the old behaviour. Unfortunately, the sitecustomize.py based technique proposed to allow system administrators to disable the feature by default in their Standard Operating Environment definition has been determined to be insufficient in at least some cases. The specific case of interest to the authors of this PEP is the one where a Linux distributor aims to provide their users with a smoother migration path than the standard one provided by consuming upstream CPython 2.7 releases directly, but other potential challenges have also been pointed out with updating embedded Python runtimes and other user level installations of Python.
Rather than allowing a plethora of mutually incompatible migration techniques to bloom, this PEP proposes two alternative approaches that redistributors may take when addressing these problems. Redistributors may choose to implement one, both, or neither of these approaches based on their assessment of the needs of their particular userbase.
These designs are being proposed as a recommendation for redistributors, rather than as new upstream features, as they are needed purely to support legacy environments migrating from older versions of Python 2.7. Neither approach is being proposed as an upstream Python 2.7 feature, nor as a feature in any version of Python 3 (whether published directly by the Python Software Foundation or by a redistributor).
Recommendation for an environment variable based security downgrade
Some redistributors may wish to provide a per-application option to disable certificate verification in selected applications that run on or embed CPython without needing to modify the application itself.
In these cases, a configuration mechanism is needed that provides:
- an opt-out model that allows certificate verification to be selectively turned off for particular applications after upgrading to a version of Python that verifies certificates by default
- the ability for all users to configure this setting on a per-application basis, rather than on a per-system, or per-Python-installation basis
This approach may be used for any redistributor provided version of Python 2.7, including those that advertise themselves as providing Python 2.7.9 or later.
Recommended modifications to the Python standard library
The recommended approach to providing a per-application configuration setting for HTTPS certificate verification that doesn't require modifications to the application itself is to:
- modify the ssl module to read the PYTHONHTTPSVERIFY environment variable when the module is first imported into a Python process
- set the ssl._create_default_https_context function to be an alias for ssl._create_unverified_context if this environment variable is present and set to '0'
- otherwise, set the ssl._create_default_https_context function to be an alias for ssl.create_default_context as usual
Example implementation
def _get_https_context_factory():
config_setting = os.environ.get('PYTHONHTTPSVERIFY')
if config_setting == '0':
return _create_unverified_context
return create_default_context
_create_default_https_context = _get_https_context_factory()
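The selection behaviour can be mimicked and verified without touching the real ssl module. In this sketch the returned labels stand in for the actual context factories, and the helper name is illustrative:

```python
def pick_factory(environ):
    # Mirrors the recommended logic: only the exact value '0' opts out
    # of certificate verification; everything else keeps the default.
    if environ.get('PYTHONHTTPSVERIFY') == '0':
        return 'unverified'
    return 'verified'

assert pick_factory({}) == 'verified'
assert pick_factory({'PYTHONHTTPSVERIFY': '0'}) == 'unverified'
assert pick_factory({'PYTHONHTTPSVERIFY': '1'}) == 'verified'
```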
Security Considerations
Relative to an unmodified version of CPython 2.7.9 or later, this approach does introduce a new downgrade attack against the default security settings that potentially allows a sufficiently determined attacker to revert Python to the vulnerable configuration used in CPython 2.7.8 and earlier releases. Such an attack requires the ability to modify the execution environment of a Python process prior to the import of the ssl module.
Redistributors should balance this marginal increase in risk against the ability to offer a smoother migration path to their users when deciding whether or not it is appropriate for them to implement this per-application "opt out" model.
Recommendation for backporting to earlier Python versions
Some redistributors, most notably Linux distributions, may choose to backport the PEP 476 HTTPS verification changes to modified Python versions based on earlier Python 2 maintenance releases. In these cases, a configuration mechanism is needed that provides:
- an opt-in model that allows the decision to enable HTTPS certificate verification to be made independently of the decision to upgrade to the Python version where the feature was first backported
- the ability for system administrators to set the default behaviour of Python applications and scripts run directly in the system Python installation
- the ability for the redistributor to consider changing the default behaviour of new installations at some point in the future without impacting existing installations that have been explicitly configured to skip verifying HTTPS certificates by default
This approach should not be used for any Python installation that advertises itself as providing Python 2.7.9 or later, as most Python users will have the reasonable expectation that all such environments will validate HTTPS certificates by default.
Recommended modifications to the Python standard library
The recommended approach to backporting the PEP 476 modifications to an earlier point release is to implement the following changes relative to the default PEP 476 behaviour implemented in Python 2.7.9+:
- modify the ssl module to read a system wide configuration file when the module is first imported into a Python process
- define a platform default behaviour (either verifying or not verifying HTTPS certificates) to be used if this configuration file is not present
- support selection between the following three modes of operation:
- ensure HTTPS certificate verification is enabled
- ensure HTTPS certificate verification is disabled
- delegate the decision to the redistributor providing this Python version
- set the ssl._create_default_https_context function to be an alias for either ssl.create_default_context or ssl._create_unverified_context based on the given configuration setting.
Recommended file location
This approach is currently only defined for *nix system Python installations.
The recommended configuration file name is /etc/python/cert-verification.cfg.
The .cfg filename extension is recommended for consistency with the pyvenv.cfg used by the venv module in Python 3's standard library.
Recommended file format
The configuration file should use a ConfigParser ini-style format with a single section named [https] containing one required setting verify.
Permitted values for verify are:
- enable: ensure HTTPS certificate verification is enabled by default
- disable: ensure HTTPS certificate verification is disabled by default
- platform_default: delegate the decision to the redistributor providing this particular Python version
If the [https] section or the verify setting are missing, or if the verify setting is set to an unknown value, it should be treated as if the configuration file is not present.
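A sample configuration file following the recommended format, here opting out of verification system-wide (contents illustrative):

```ini
[https]
verify = disable
```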
Example implementation
def _get_https_context_factory():
# Check for a system-wide override of the default behaviour
config_file = '/etc/python/cert-verification.cfg'
context_factories = {
'enable': create_default_context,
'disable': _create_unverified_context,
'platform_default': _create_unverified_context, # For now :)
}
import ConfigParser
config = ConfigParser.RawConfigParser()
config.read(config_file)
try:
verify_mode = config.get('https', 'verify')
except (ConfigParser.NoSectionError, ConfigParser.NoOptionError):
verify_mode = 'platform_default'
default_factory = context_factories.get('platform_default')
return context_factories.get(verify_mode, default_factory)
_create_default_https_context = _get_https_context_factory()
Security Considerations
The specific recommendations for the backporting case are designed to work for privileged, security sensitive processes, even those being run in the following locked down configuration:
- run from a locked down administrator controlled directory rather than a normal user directory (preventing sys.path[0] based privilege escalation attacks)
- run using the -E switch (preventing PYTHON* environment variable based privilege escalation attacks)
- run using the -s switch (preventing user site directory based privilege escalation attacks)
- run using the -S switch (preventing sitecustomize based privilege escalation attacks)
The intent is that the only reason HTTPS verification should be getting turned off system wide when using this approach is because:
- an end user is running a redistributor provided version of CPython rather than running upstream CPython directly
- that redistributor has decided to provide a smoother migration path to verifying HTTPS certificates by default than that being provided by the upstream project
- either the redistributor or the local infrastructure administrator has determined that it is appropriate to override the default upstream behaviour (at least for the time being)
Using an administrator controlled configuration file rather than an environment variable has the essential feature of providing a smoother migration path, even for applications being run with the -E switch.
Combining the recommendations
If a redistributor chooses to implement both recommendations, then the environment variable should take precedence over the system-wide configuration setting. This allows the setting to be changed for a given user, virtual environment or application, regardless of the system-wide default behaviour.
In this case, if the PYTHONHTTPSVERIFY environment variable is defined, and set to anything other than '0', then HTTPS certificate verification should be enabled.
Example implementation
def _get_https_context_factory():
# Check for an environment variable override of the default behaviour
config_setting = os.environ.get('PYTHONHTTPSVERIFY')
if config_setting is not None:
if config_setting == '0':
return _create_unverified_context
return create_default_context
# Check for a system-wide override of the default behaviour
config_file = '/etc/python/cert-verification.cfg'
context_factories = {
'enable': create_default_context,
'disable': _create_unverified_context,
'platform_default': _create_unverified_context, # For now :)
}
import ConfigParser
config = ConfigParser.RawConfigParser()
config.read(config_file)
try:
verify_mode = config.get('https', 'verify')
except (ConfigParser.NoSectionError, ConfigParser.NoOptionError):
verify_mode = 'platform_default'
default_factory = context_factories.get('platform_default')
return context_factories.get(verify_mode, default_factory)
_create_default_https_context = _get_https_context_factory()
Copyright
This document has been placed in the public domain.
pep-0494 Python 3.6 Release Schedule
| PEP: | 494 |
|---|---|
| Title: | Python 3.6 Release Schedule |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Ned Deily <nad at acm.org> |
| Status: | Active |
| Type: | Informational |
| Content-Type: | text/x-rst |
| Created: | 30-May-2015 |
| Python-Version: | 3.6 |
Abstract
This document describes the development and release schedule for Python 3.6. The schedule primarily concerns itself with PEP-sized items.
Release Manager and Crew
- 3.6 Release Manager: Ned Deily
- Windows installers: Steve Dower
- Mac installers: Ned Deily
- Documentation: Georg Brandl
Release Schedule
The releases:
- 3.6.0 alpha 1: TBD
- 3.6.0 beta 1: TBD
- 3.6.0 candidate 1: TBD
- 3.6.0 final: TBD (late 2016?)
(Beta 1 is also "feature freeze"--no new features beyond this point.)
Copyright
This document has been placed in the public domain.
pep-0628 Add math.tau
| PEP: | 628 |
|---|---|
| Title: | Add math.tau |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Nick Coghlan <ncoghlan at gmail.com> |
| Status: | Deferred |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 2011-06-28 |
| Python-Version: | 3.x |
| Post-History: | 2011-06-28 |
| Resolution: | TBD |
Abstract
In honour of Tau Day 2011, this PEP proposes the addition of the circle constant math.tau to the Python standard library.
The concept of tau (τ) is based on the observation that the ratio of a circle's circumference to its radius is far more fundamental and interesting than the ratio between its circumference and diameter. It is simply a matter of assigning a name to the value 2 * pi (2π).
PEP Deferral
The idea in this PEP was first proposed in the auspiciously named issue 12345 [1]. The immediate negative reactions I received from other core developers on that issue made it clear to me that there wasn't likely to be much collective interest in being part of a movement towards greater clarity in the explanation of profound mathematical concepts that are unnecessarily obscured by a historical quirk of notation.
Accordingly, this PEP is being submitted in a Deferred state, in the hope that it may someday be revisited if the mathematical and educational establishment choose to adopt a more enlightened and informative notation for dealing with radians.
Converts to the merits of tau as the more fundamental circle constant should feel free to start their mathematical code with tau = 2 * math.pi.
The Rationale for Tau
pi is defined as the ratio of a circle's circumference to its diameter. However, a circle is defined by its centre point and its radius. This is shown clearly when we note that the parameter of integration to go from a circle's circumference to its area is the radius, not the diameter. If we use the diameter instead we have to divide by four to get rid of the extraneous multiplier.
When working with radians, it is trivial to convert any given fraction of a circle to a value in radians in terms of tau. A quarter circle is tau/4, a half circle is tau/2, seven 25ths is 7*tau/25, etc. In contrast with the equivalent expressions in terms of pi (pi/2, pi, 14*pi/25), the unnecessary and needlessly confusing multiplication by two is gone.
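The fractions above are easy to check numerically, using the PEP's own suggested workaround tau = 2 * math.pi:

```python
import math

tau = 2 * math.pi  # the workaround the PEP suggests in the meantime

# Fractions of a circle expressed in radians
assert math.isclose(tau / 4, math.pi / 2)             # quarter circle
assert math.isclose(tau / 2, math.pi)                 # half circle
assert math.isclose(7 * tau / 25, 14 * math.pi / 25)  # seven 25ths
```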
Other Resources
I've barely skimmed the surface of the many examples put forward to point out just how much easier and more sensible many aspects of mathematics become when conceived in terms of tau rather than pi. If you don't find my specific examples sufficiently persuasive, here are some more resources that may be of interest:
- Michael Hartl is the primary instigator of Tau Day in his Tau Manifesto [2]
- Bob Palais, the author of the original mathematics journal article highlighting the problems with pi has a page of resources [5] on the topic
- For those that prefer videos to written text, Pi is wrong! [4] and Pi is (still) wrong [3] are available on YouTube
References
| [1] | http://bugs.python.org/issue12345 |
| [2] | http://tauday.com/ |
| [3] | http://www.youtube.com/watch?v=jG7vhMMXagQ |
| [4] | http://www.youtube.com/watch?v=IF1zcRoOVN0 |
| [5] | http://www.math.utah.edu/~palais/pi.html |
Copyright
This document has been placed in the public domain.
pep-0666 Reject Foolish Indentation
| PEP: | 666 |
|---|---|
| Title: | Reject Foolish Indentation |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Laura Creighton <lac at strakt.com> |
| Status: | Rejected |
| Type: | Standards Track |
| Created: | 3-Dec-2001 |
| Python-Version: | 2.2 |
| Post-History: | 5-Dec-2001 |
Abstract
Everybody agrees that mixing tabs and spaces is a bad idea. Some
people want more than this. I propose that we let people define
whatever Python behaviour they want, so it will only run the way
they like it, and will not run the way they don't like it. We
will do this with a command line switch. Programs that aren't
formatted the way the programmer wants things will raise
IndentationError:
Python -TNone will refuse to run when there are any tabs.
Python -Tn will refuse to run when tabs are not exactly n spaces
Python -TOnly will refuse to run when blocks are indented by anything
other than tabs
People who mix tabs and spaces, naturally, will find that their
programs do not run. Alas, we haven't found a way to give them an
electric shock as from a cattle prod remotely. (Though if somebody
finds out a way to do this, I will be pleased to add this option to
the PEP.)
Rationale
Python-list@python.org (a.k.a. comp.lang.python) is periodically awash with discussions about tabs and spaces. This is inevitable, given that indentation is syntactically significant in Python. This has never solved anything, and just makes various people frustrated and angry. Eventually they start saying rude things to each other which is sad for all of us. And it is also sad that they are wasting their valuable time which they could spend creating something with Python.

Moreover, for the Python community as a whole, from a public relations point of view, this is quite unfortunate. The people who aren't posting about tabs and spaces, are, (unsurprisingly) invisible, while the people who are posting make the rest of us look somewhat foolish.

The problem is that there is no polite way to say 'Stop wasting your valuable time and mine.' People who are already in the middle of a flame war are not well disposed to believe that you are acting out of compassion for them, and quite rightly insist that their own time is their own to do with as they please. They are stuck like flies in treacle in this wretched argument, and it is self-evident that they cannot disengage or they would have already done so.

But today I had to spend time cleaning my keyboard because the 'n' key is sticking. So, in addition to feeling compassion for these people, I am pretty annoyed.

I figure if I make this PEP, we can then ask Guido to quickly reject it, and then when this argument next starts up again, we can say 'Guido isn't changing things to suit the tab-haters or the only-tabbers, so this conversation is a waste of time.' Then everybody can quietly believe that a) they are correct and b) other people are fools and c) they are undeniably fortunate to not have to share a lab with idiots, (which is something the arguers could do _now_, but apparently have forgotten). And python-list can go back to worrying if it is too smug, rather than whether it is too hostile for newcomers.
Possibly somebody could get around to explaining to me what is the difference between __getattr__ and __getattribute__ in non-Classic classes in 2.2, a question I have foolishly posted in the middle of the current tab thread. I would like to know the answer to that question [2].

This proposal, if accepted, will probably mean a heck of a lot of work for somebody. But since I don't want it accepted, I don't care.
References
[1] PEP 1, PEP Purpose and Guidelines
http://www.python.org/dev/peps/pep-0001/
[2] Tim Peters already has (private correspondence). My early 2.2
didn't have a __getattribute__, and __getattr__ was
implemented like __getattribute__ now is. This has been
fixed. The important conclusion is that my Decorator Pattern
is safe and all is right with the world.
Copyright
This document has been placed in the public domain.
pep-0754 IEEE 754 Floating Point Special Values
| PEP: | 754 |
|---|---|
| Title: | IEEE 754 Floating Point Special Values |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Gregory R. Warnes <gregory_r_warnes at groton.pfizer.com> (Pfizer, Inc.) |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 28-Mar-2003 |
| Python-Version: | 2.3 |
| Post-History: |
Contents
Rejection Notice
This PEP has been rejected. After sitting open for four years, it has failed to generate sufficient community interest.
Several ideas of this PEP were implemented for Python 2.6. float('inf') and repr(float('inf')) are now guaranteed to work on every supported platform with IEEE 754 semantics. However, the eval(repr(float('inf'))) roundtrip is still not supported unless you define inf and nan yourself:
>>> inf = float('inf')
>>> inf, 1E400
(inf, inf)
>>> neginf = float('-inf')
>>> neginf, -1E400
(-inf, -inf)
>>> nan = float('nan')
>>> nan, inf * 0.
(nan, nan)
The math and sys modules have also gained additional features: sys.float_info, math.isinf(), math.isnan(), and math.copysign().
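The Python 2.6 additions mentioned in this rejection notice behave uniformly across platforms and cover the ground this PEP aimed at:

```python
import math
import sys

inf = float('inf')
nan = float('nan')

# math.isinf/math.isnan test for the special values portably.
assert math.isinf(inf) and math.isinf(-inf)
assert math.isnan(nan)

# math.copysign transfers the sign bit, including from infinities.
assert math.copysign(1.0, -inf) == -1.0

# sys.float_info exposes the platform's IEEE 754 double parameters.
print(sys.float_info.max)  # largest representable finite float
```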
Abstract
This PEP proposes an API and provides a reference module that generates and tests for IEEE 754 double-precision special values: positive infinity, negative infinity, and not-a-number (NaN).
Rationale
The IEEE 754 standard defines a set of binary representations and algorithmic rules for floating point arithmetic. Included in the standard is a set of constants for representing special values, including positive infinity, negative infinity, and indeterminate or non-numeric results (NaN). Most modern CPUs implement the IEEE 754 standard, including the (Ultra)SPARC, PowerPC, and x86 processor series.
Currently, the handling of IEEE 754 special values in Python depends on the underlying C library. Unfortunately, there is little consistency between C libraries in how or whether these values are handled. For instance, on some systems "float('Inf')" will properly return the IEEE 754 constant for positive infinity. On many systems, however, this expression will instead generate an error message.
The output string representation for an IEEE 754 special value also varies by platform. For example, the expression "float(1e3000)", which is large enough to generate an overflow, should return a string representation corresponding to IEEE 754 positive infinity. Python 2.1.3 on x86 Debian Linux returns "inf". On Sparc Solaris 8 with Python 2.2.1, this same expression returns "Infinity", and on MS-Windows 2000 with Active Python 2.2.1, it returns "1.#INF".
Adding to the confusion, some platforms generate one string on conversion from floating point and accept a different string for conversion to floating point. On these systems
float(str(x))
will generate an error when "x" is an IEEE special value.
In the past, some have recommended that programmers use expressions like:
PosInf = 1e300**2
NaN = PosInf/PosInf
to obtain positive infinity and not-a-number constants. However, the first expression generates an error on current Python interpreters. A possible alternative is to use:
PosInf = 1e300000
NaN = PosInf/PosInf
While this does not generate an error with current Python interpreters, it is still an ugly and potentially non-portable hack. In addition, defining NaN in this way does not solve the problem of detecting such values. First, the IEEE 754 standard provides for an entire set of constant values for Not-a-Number. Second, the standard requires that
NaN != X
for all possible values of X, including NaN. As a consequence
NaN == NaN
should always evaluate to false. However, this behavior also is not consistently implemented. [e.g. Cygwin Python 2.2.2]
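On platforms whose comparison operators do follow the standard, that self-inequality is itself the classic NaN test:

```python
def is_nan(x):
    # IEEE 754 mandates NaN != x for every x, including NaN itself,
    # so any value that compares unequal to itself must be a NaN.
    # (As noted above, platforms of the PEP's era did not all
    # implement this comparison behavior consistently.)
    return x != x

assert is_nan(float('nan'))
assert not is_nan(0.0)
assert not is_nan(float('inf'))
```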
Due to the many platform and library inconsistencies in handling IEEE special values, it is impossible to consistently set or detect IEEE 754 floating point values in normal Python code without resorting to directly manipulating bit-patterns.
This PEP proposes a standard Python API and provides a reference module implementation which allows for consistent handling of IEEE 754 special values on all supported platforms.
API Definition
Constants
- NaN
- Non-signalling IEEE 754 "Not a Number" value
- PosInf
- IEEE 754 Positive Infinity value
- NegInf
- IEEE 754 Negative Infinity value
Functions
- isNaN(value)
- Determine if the argument is an IEEE 754 NaN (Not a Number) value.
- isPosInf(value)
- Determine if the argument is an IEEE 754 positive infinity value.
- isNegInf(value)
- Determine if the argument is an IEEE 754 negative infinity value.
- isFinite(value)
- Determine if the argument is a finite IEEE 754 value (i.e., is not NaN, positive infinity, or negative infinity).
- isInf(value)
- Determine if the argument is an infinite IEEE 754 value (positive or negative infinity).
Example
(Run under Python 2.2.1 on Solaris 8.)
>>> import fpconst
>>> val = 1e30000 # should cause overflow and result in "Inf"
>>> val
Infinity
>>> fpconst.isInf(val)
1
>>> fpconst.PosInf
Infinity
>>> nval = val/val # should result in NaN
>>> nval
NaN
>>> fpconst.isNaN(nval)
1
>>> fpconst.isNaN(val)
0
Implementation
The reference implementation is provided in the module "fpconst" [1], which is written in pure Python by taking advantage of the "struct" standard module to directly set or test for the bit patterns that define IEEE 754 special values. Care has been taken to generate proper results on both big-endian and little-endian machines. The current implementation is pure Python, but some efficiency could be gained by translating the core routines into C.
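The core of such a bit-pattern test can be sketched in pure Python. This is a simplified illustration of the technique, not the actual fpconst source: pack the float as a big-endian IEEE 754 double and inspect the exponent and fraction fields.

```python
import struct

def is_nan_bits(value):
    # Pack as a big-endian IEEE 754 double, recover the raw 64 bits.
    (bits,) = struct.unpack('>Q', struct.pack('>d', value))
    exponent = (bits >> 52) & 0x7FF    # 11-bit exponent field
    fraction = bits & ((1 << 52) - 1)  # 52-bit fraction field
    # All-ones exponent with a nonzero fraction is NaN.
    return exponent == 0x7FF and fraction != 0

def is_inf_bits(value):
    (bits,) = struct.unpack('>Q', struct.pack('>d', value))
    # All-ones exponent with a zero fraction is +/- infinity.
    return (bits >> 52) & 0x7FF == 0x7FF and bits & ((1 << 52) - 1) == 0
```

Using an explicit byte order ('>') in the struct format is what makes the test independent of the host machine's endianness.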
Patch 1151323 "New fpconst module" [2] on SourceForge adds the fpconst module to the Python standard library.
References
See http://babbage.cs.qc.edu/courses/cs341/IEEE-754references.html for reference material on the IEEE 754 floating point standard.
| [1] | Further information on the reference package is available at http://research.warnes.net/projects/rzope/fpconst/ |
| [2] | http://sourceforge.net/tracker/?func=detail&aid=1151323&group_id=5470&atid=305470 |
Copyright
This document has been placed in the public domain.
pep-3000 Python 3000
| PEP: | 3000 |
|---|---|
| Title: | Python 3000 |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Guido van Rossum <guido at python.org> |
| Status: | Final |
| Type: | Process |
| Content-Type: | text/x-rst |
| Created: | 05-Apr-2006 |
| Post-History: |
Contents
Abstract
This PEP sets guidelines for Python 3000 development. Ideally, we first agree on the process, and start discussing features only after the process has been decided and specified. In practice, we'll be discussing features and process simultaneously; often the debate about a particular feature will prompt a process discussion.
Naming
Python 3000, Python 3.0 and Py3K are all names for the same thing. The project is called Python 3000, or abbreviated to Py3k. The actual Python release will be referred to as Python 3.0, and that's what "python3.0 -V" will print; the actual file names will use the same naming convention we use for Python 2.x. I don't want to pick a new name for the executable or change the suffix for Python source files.
PEP Numbering
Python 3000 PEPs are numbered starting at PEP 3000. PEPs 3000-3099 are meta-PEPs -- these can be either process or informational PEPs. PEPs 3100-3999 are feature PEPs. PEP 3000 itself (this PEP) is special; it is the meta-PEP for Python 3000 meta-PEPs (IOW it describes the process used to define processes). PEP 3100 is also special; it's a laundry list of features that were selected for (hopeful) inclusion in Python 3000 before we started the Python 3000 process for real. PEP 3099, finally, is a list of features that will not change.
Timeline
See PEP 361 [3], which contains the release schedule for Python 2.6 and 3.0. These versions will be released in lockstep.
Note: standard library development is expected to ramp up after 3.0a1 is released.
I expect that there will be parallel Python 2.x and 3.x releases for some time; the Python 2.x releases will continue for a longer time than the traditional 2.x.y bugfix releases. Typically, we stop releasing bugfix versions for 2.x once version 2.(x+1) has been released. But I expect there to be at least one or two new 2.x releases even after 3.0 (final) has been released, probably well into 3.1 or 3.2. This will to some extent depend on community demand for continued 2.x support, acceptance and stability of 3.0, and volunteer stamina.
I expect that Python 3.1 and 3.2 will be released much sooner after 3.0 than has been customary for the 2.x series. The 3.x release pattern will stabilize once the community is happy with 3.x.
Compatibility and Transition
Python 3.0 will break backwards compatibility with Python 2.x.
There is no requirement that Python 2.6 code will run unmodified on Python 3.0. Not even a subset. (Of course there will be a tiny subset, but it will be missing major functionality.)
Python 2.6 will support forward compatibility in the following two ways:
- It will support a "Py3k warnings mode" which will warn dynamically (i.e. at runtime) about features that will stop working in Python 3.0, e.g. assuming that range() returns a list.
- It will contain backported versions of many Py3k features, either enabled through __future__ statements or simply by allowing old and new syntax to be used side-by-side (if the new syntax would be a syntax error in 2.x).
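For example, several Py3k behaviors could be enabled per-module in 2.6 via __future__ imports; the same imports are accepted as no-ops under Python 3, so a module written this way runs on both:

```python
from __future__ import print_function, unicode_literals, division

# Under 2.6 these switch this module to Py3k semantics;
# under 3.x they change nothing.
print(1 / 2)   # true division: 0.5, not 0
text = "abc"   # a unicode (3.x str) string on both lines of development
```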
Instead, and complementary to the forward compatibility features in 2.6, there will be a separate source code conversion tool [1]. This tool can do a context-free source-to-source translation. For example, it can translate apply(f, args) into f(*args). However, the tool cannot do data flow analysis or type inferencing, so it simply assumes that apply in this example refers to the old built-in function.
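Concretely, the translation in question looks like this; the rewrite is purely syntactic:

```python
def f(a, b, c):
    return a + b + c

args = (1, 2, 3)

# Python 2 spelling, removed in 3.0:   apply(f, args)
# 2to3's context-free rewrite, valid in both 2.x and 3.0:
result = f(*args)
assert result == 6
```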
The recommended development model for a project that needs to support Python 2.6 and 3.0 simultaneously is as follows:
- You should have excellent unit tests with close to full coverage.
- Port your project to Python 2.6.
- Turn on the Py3k warnings mode.
- Test and edit until no warnings remain.
- Use the 2to3 tool to convert this source code to 3.0 syntax. Do not manually edit the output!
- Test the converted source code under 3.0.
- If problems are found, make corrections to the 2.6 version of the source code and go back to step 3.
- When it's time to release, release separate 2.6 and 3.0 tarballs (or whatever archive form you use for releases).
It is recommended not to edit the 3.0 source code until you are ready to reduce 2.6 support to pure maintenance (i.e. the moment when you would normally move the 2.6 code to a maintenance branch anyway).
PS. We need a meta-PEP to describe the transitional issues in detail.
Implementation Language
Python 3000 will be implemented in C, and the implementation will be derived as an evolution of the Python 2 code base. This reflects my views (which I share with Joel Spolsky [2]) on the dangers of complete rewrites. Since Python 3000 as a language is a relatively mild improvement on Python 2, we can gain a lot by not attempting to reimplement the language from scratch. I am not against parallel from-scratch implementation efforts, but my own efforts will be directed at the language and implementation that I know best.
Meta-Contributions
Suggestions for additional text for this PEP are gracefully accepted by the author. Draft meta-PEPs for the topics above and additional topics are even more welcome!
References
| [1] | The 2to3 tool, in the subversion sandbox http://svn.python.org/view/sandbox/trunk/2to3/ |
| [2] | Joel on Software: Things You Should Never Do, Part I http://www.joelonsoftware.com/articles/fog0000000069.html |
| [3] | PEP 361 (Python 2.6 and 3.0 Release Schedule) http://www.python.org/dev/peps/pep-0361 |
Copyright
This document has been placed in the public domain.
pep-3001 Procedure for reviewing and improving standard library modules
| PEP: | 3001 |
|---|---|
| Title: | Procedure for reviewing and improving standard library modules |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Georg Brandl <georg at python.org> |
| Status: | Withdrawn |
| Type: | Process |
| Content-Type: | text/x-rst |
| Created: | 05-Apr-2006 |
| Post-History: |
Contents
Abstract
This PEP describes a procedure for reviewing and improving standard library modules, especially those written in Python, making them ready for Python 3000. There can be different steps of refurbishing, each of which is described in a section below. Of course, not every step has to be performed for every module.
Removal of obsolete modules
All modules marked as deprecated in 2.x versions should be removed for Python 3000. The same applies to modules which are seen as obsolete today, but are too widely used to be deprecated or removed. Python 3000 is the big occasion to get rid of them.
There will have to be a document listing all removed modules, together with information on possible substitutes or alternatives. This information will also have to be provided by the python3warn.py porting helper script mentioned in PEP XXX.
Renaming modules
There are proposals for a "great stdlib renaming" introducing a hierarchic library namespace or a top-level package from which to import standard modules. That possibility aside, some modules' names are known to have been chosen unwisely, a mistake which could never be corrected in the 2.x series. Examples are names like "StringIO" or "Cookie". For Python 3000, there will be the possibility to give those modules less confusing and more conforming names.
Of course, each rename will have to be stated in the documentation of the respective module and perhaps in the global document of Step 1. Additionally, the python3warn.py script will recognize the old module names and notify the user accordingly.
If the name change is made in time for another release of the Python 2.x series, it is worth considering introducing the new name in the 2.x branch to ease transition.
Code cleanup
As most library modules written in Python have not been touched except for bug fixes, following the policy of never changing a running system, many of them may contain code that does not take advantage of newer language features and could be rewritten in more concise, modern Python.
PyChecker should run cleanly over the library. With a carefully tuned configuration file, PyLint should also emit as few warnings as possible.
As long as these changes don't change the module's interface and behavior, no documentation updates are necessary.
Enhancement of test and documentation coverage
Code coverage by unit tests varies greatly between modules. Each test suite should be checked for completeness, and the remaining classic tests should be converted to PyUnit (or whatever new shiny testing framework comes with Python 3000, perhaps py.test?).
It should also be verified that each publicly visible function has a meaningful docstring which ideally contains several doctests.
No documentation changes are necessary for enhancing test coverage.
Unification of module metadata
This is a small and probably not very important step. There have been various attempts at providing author, version and similar metadata in modules (such as a "__version__" global). Those could be standardized and used throughout the library.
No documentation changes are necessary for this step either.
Backwards incompatible bug fixes
Over the years, many bug reports have been filed which complained about bugs in standard library modules, but have subsequently been closed as "Won't fix" since a fix would have introduced a major incompatibility which was not acceptable in the Python 2.x series. In Python 3000, the fix can be applied if the interface per se is still acceptable.
Each slight behavioral change caused by such fixes must be mentioned in the documentation, perhaps in a "Changed in Version 3.0" paragraph.
Interface changes
The last and most disruptive change is the overhaul of a module's public interface. If a module's interface is to be changed, a justification should be made beforehand, or a PEP should be written.
The change must be fully documented as "New in Version 3.0", and the python3warn.py script must know about it.
References
None yet.
Copyright
This document has been placed in the public domain.
pep-3002 Procedure for Backwards-Incompatible Changes
| PEP: | 3002 |
|---|---|
| Title: | Procedure for Backwards-Incompatible Changes |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Steven Bethard <steven.bethard at gmail.com> |
| Status: | Final |
| Type: | Process |
| Content-Type: | text/x-rst |
| Created: | 27-Mar-2006 |
| Post-History: | 27-Mar-2006, 13-Apr-2006 |
Contents
Abstract
This PEP describes the procedure for changes to Python that are backwards-incompatible between the Python 2.X series and Python 3000. All such changes must be documented by an appropriate Python 3000 PEP and must be accompanied by code that can identify when pieces of Python 2.X code may be problematic in Python 3000.
Rationale
Python 3000 will introduce a number of backwards-incompatible changes to Python, mainly to streamline the language and to remove some previous design mistakes. But Python 3000 is not intended to be a new and completely different language from the Python 2.X series, and it is expected that much of the Python user community will make the transition to Python 3000 when it becomes available.
To encourage this transition, it is crucial to provide a clear and complete guide on how to upgrade Python 2.X code to Python 3000 code. Thus, for any backwards-incompatible change, two things are required:
- An official Python Enhancement Proposal (PEP)
- Code that can identify pieces of Python 2.X code that may be problematic in Python 3000
Python Enhancement Proposals
Every backwards-incompatible change must be accompanied by a PEP. This PEP should follow the usual PEP guidelines and explain the purpose and reasoning behind the backwards incompatible change. In addition to the usual PEP sections, all PEPs proposing backwards-incompatible changes must include an additional section: Compatibility Issues. This section should describe what is backwards incompatible about the proposed change to Python, and the major sorts of breakage to be expected.
While PEPs must still be evaluated on a case-by-case basis, a PEP may be inappropriate for Python 3000 if its Compatibility Issues section implies any of the following:
Most or all instances of a Python 2.X construct are incorrect in Python 3000, and most or all instances of the Python 3000 construct are incorrect in Python 2.X.
So for example, changing the meaning of the for-loop else-clause from "executed when the loop was not broken out of" to "executed when the loop had zero iterations" would mean that all Python 2.X for-loop else-clauses would be broken, and there would be no way to use a for-loop else-clause in a Python-3000-appropriate manner. Thus a PEP for such an idea would likely be rejected.
Many instances of a Python 2.X construct are incorrect in Python 3000 and the PEP fails to demonstrate real-world use-cases for the changes.
Backwards incompatible changes are allowed in Python 3000, but not to excess. A PEP that proposes backwards-incompatible changes should provide good examples of code that visibly benefits from the changes.
PEP-writing is time-consuming, so when a number of backwards-incompatible changes are closely related, they should be proposed in the same PEP. Such PEPs will likely have longer Compatibility Issues sections, however, since they must now describe the sorts of breakage expected from all the proposed changes.
Identifying Problematic Code
In addition to the PEP requirement, backwards incompatible changes to Python must also be accompanied by code to issue warnings for pieces of Python 2.X code that will behave differently in Python 3000. Such warnings will be enabled in Python 2.X using a new command-line switch: -3. All backwards incompatible changes should be accompanied by a patch for Python 2.X that, when -3 is specified, issues warnings for each construct that is being changed.
For example, if dict.keys() returns an iterator in Python 3000, the patch to the Python 2.X branch should do something like:
If -3 was specified, change dict.keys() to return a subclass of list that issues warnings whenever you use any methods other than __iter__().
Such a patch would mean that warnings are only issued when features that will not be present in Python 3000 are used, and almost all existing code should continue to work. (Code that relies on dict.keys() always returning a list and not a subclass should be pretty much non-existent.)
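The mechanism described can be sketched in pure Python. This is a hypothetical illustration (the real patch lived in CPython's C code, and the class name and warning text here are invented):

```python
import warnings

class WarningKeysList(list):
    """Hypothetical list subclass that warns on non-iteration use."""

    def _warn(self):
        warnings.warn("dict.keys() will return an iterator in Python 3000",
                      DeprecationWarning, stacklevel=3)

    def __getitem__(self, index):
        self._warn()
        return list.__getitem__(self, index)

    def sort(self, *args, **kwargs):
        self._warn()
        return list.sort(self, *args, **kwargs)

    # __iter__ is inherited untouched, so plain iteration stays silent
    # and almost all existing code keeps working without a warning.

keys = WarningKeysList(['a', 'b'])
for key in keys:   # no warning: iteration is the Py3k-compatible usage
    pass
```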
References
TBD
Copyright
This document has been placed in the public domain.
pep-3003 Python Language Moratorium
| PEP: | 3003 |
|---|---|
| Title: | Python Language Moratorium |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Brett Cannon, Jesse Noller, Guido van Rossum |
| Status: | Final |
| Type: | Process |
| Content-Type: | text/x-rst |
| Created: | 21-Oct-2009 |
| Post-History: | 03-Nov-2009 |
Contents
Abstract
This PEP proposes a temporary moratorium (suspension) of all changes to the Python language syntax, semantics, and built-ins for a period of at least two years from the release of Python 3.1. In particular, the moratorium would include Python 3.2 (to be released 18-24 months after 3.1) but allow Python 3.3 (assuming it is not released prematurely) to once again include language changes.
This suspension of features is designed to allow non-CPython implementations to "catch up" to the core implementation of the language, help ease adoption of Python 3.x, and provide a more stable base for the community.
Rationale
This idea was proposed by Guido van Rossum on the python-ideas [1] mailing list. The premise of his email was to slow the alteration of the Python core syntax, builtins and semantics to allow non-CPython implementations to catch up to the current state of Python, both 2.x and 3.x.
Python, as a language, is more than the core implementation -- CPython -- and has a rich, mature and vibrant community of implementations, such as Jython [2], IronPython [3] and PyPy [4], that are a benefit not only to the community, but to the language itself.
Still others, such as Unladen Swallow [5] (a branch of CPython) seek not to create an alternative implementation, but rather they seek to enhance the performance and implementation of CPython itself.
Python 3.x was a large part of the last several years of Python's development. Its release, as well as a bevy of changes to the language introduced by it and the previous 2.6.x releases, puts alternative implementations at a severe disadvantage in "keeping pace" with core Python development.
Additionally, many of the changes put into the recent releases of the language as implemented by CPython have not yet seen widespread usage by the general user population. For example, most users are limited to the version of the interpreter (typically CPython) which comes pre-installed with their operating system. Most OS vendors are just barely beginning to ship Python 2.6 -- even fewer are shipping Python 3.x.
As Python 2.7 is expected to be the effective "end of life" of the Python 2.x code line, with Python 3.x being the future, it is in the best interest of Python core development to temporarily suspend the alteration of the language itself to allow all of these external entities to catch up and to assist in the adoption of, and migration to, Python 3.x.
Finally, the moratorium is intended to free up cycles within core development to focus on other issues, such as the CPython interpreter and improvements therein, the standard library, etc.
This moratorium does not allow for exceptions -- once accepted, any pending changes to the syntax or semantics of the language will be postponed until the moratorium is lifted.
This moratorium does not attempt to apply to any other Python implementation, meaning that, if desired, other implementations may add features which deviate from the standard implementation.
Details
Cannot Change
- New built-ins
- Language syntax
The grammar file essentially becomes immutable apart from ambiguity fixes.
- General language semantics
The language operates as-is with only specific exemptions (see below).
- New __future__ imports
These are explicitly forbidden, as they effectively change the language syntax and/or semantics (albeit using a compiler directive).
Case-by-Case Exemptions
- New methods on built-ins
The case for adding a method to a built-in object can be made.
- Incorrect language semantics
If the language semantics turn out to be ambiguous or improperly implemented based on the intention of the original design then the semantics may change.
- Language semantics that are difficult to implement
Because other VMs have not begun implementing Python 3.x semantics there is a possibility that certain semantics are too difficult to replicate. In those cases they can be changed to ease adoption of Python 3.x by the other VMs.
Allowed to Change
- C API
It is entirely acceptable to change the underlying C code of CPython as long as other restrictions of this moratorium are not broken. E.g. removing the GIL would be fine assuming certain operations that are currently atomic remain atomic.
- The standard library
As the standard library is not directly tied to the language definition it is not covered by this moratorium.
- Backports of 3.x features to 2.x
The moratorium only affects features that would be new in 3.x.
- Import semantics
For example, PEP 382. After all, import semantics vary between Python implementations anyway.
Retroactive
It is important to note that the moratorium covers all changes since the release of Python 3.1. This rule is intended to avoid features being rushed or smuggled into the CPython source tree while the moratorium is being discussed. A review of the NEWS file for the py3k development branch showed no commits would need to be rolled back in order to meet this goal.
Extensions
The time period of the moratorium can only be extended through a new PEP.
Copyright
This document has been placed in the public domain.
References
| [1] | http://mail.python.org/pipermail/python-ideas/2009-October/006305.html |
| [2] | http://www.jython.org/ |
| [3] | http://www.codeplex.com/IronPython |
| [4] | http://codespeak.net/pypy/ |
| [5] | http://code.google.com/p/unladen-swallow/ |
pep-3099 Things that will Not Change in Python 3000
| PEP: | 3099 |
|---|---|
| Title: | Things that will Not Change in Python 3000 |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Georg Brandl <georg at python.org> |
| Status: | Final |
| Type: | Process |
| Content-Type: | text/x-rst |
| Created: | 04-Apr-2006 |
| Post-History: |
Contents
Abstract
Some ideas are just bad. While some thoughts on Python evolution are constructive, some go against the basic tenets of Python so egregiously that it would be like asking someone to run in a circle: it gets you nowhere, even for Python 3000, where extraordinary proposals are allowed. This PEP tries to list all BDFL pronouncements on Python 3000 that refer to changes that will not happen and new features that will not be introduced, sorted by topics, along with a short explanation or a reference to the relevant thread on the python-3000 mailing list.
If you think you should suggest any of the listed ideas it would be better to just step away from the computer, go outside, and enjoy yourself. Being active outdoors by napping in a nice patch of grass is more productive than bringing up a beating-a-dead-horse idea and having people tell you how dead the idea is. Consider yourself warned.
Core language
Python 3000 will not be case-insensitive.
Python 3000 will not be a rewrite from scratch.
It will also not use C++ or another language different from C as implementation language. Rather, there will be a gradual transmogrification of the codebase. There's an excellent essay by Joel Spolsky explaining why: http://www.joelonsoftware.com/articles/fog0000000069.html
self will not become implicit.
Having self be explicit is a good thing. It makes the code clear by removing ambiguity about how a variable resolves. It also makes the difference between functions and methods small.
Thread: "Draft proposal: Implicit self in Python 3.0" http://mail.python.org/pipermail/python-dev/2006-January/059468.html
lambda will not be renamed.
At one point lambda was slated for removal in Python 3000. Unfortunately no one was able to come up with a better way of providing anonymous functions. And so lambda is here to stay.
But it is here to stay as-is. Adding support for statements is a non-starter. It would require allowing multi-line lambda expressions which would mean a multi-line expression could suddenly exist. That would allow for multi-line arguments to function calls, for instance. That is just plain ugly.
Thread: "genexp syntax / lambda", http://mail.python.org/pipermail/python-3000/2006-April/001042.html
Python will not have programmable syntax.
Thread: "It's a statement! It's a function! It's BOTH!", http://mail.python.org/pipermail/python-3000/2006-April/000286.html
There won't be a syntax for zip()-style parallel iteration.
Thread: "Parallel iteration syntax", http://mail.python.org/pipermail/python-3000/2006-March/000210.html
Strings will stay iterable.
Thread: "Making strings non-iterable", http://mail.python.org/pipermail/python-3000/2006-April/000759.html
There will be no syntax to sort the result of a generator expression or list comprehension. sorted() covers all use cases.
Thread: "Adding sorting to generator comprehension", http://mail.python.org/pipermail/python-3000/2006-April/001295.html
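As the pronouncement notes, sorted() already composes with any generator expression, so no extra syntax is required:

```python
words = ['pear', 'fig', 'banana']

# sorted() accepts any iterable, including a generator expression,
# and can take a key function -- covering the proposed use cases.
by_length = sorted((w.upper() for w in words), key=len)
assert by_length == ['FIG', 'PEAR', 'BANANA']
```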
Slices and extended slices won't go away (even if the __getslice__ and __setslice__ APIs may be replaced) nor will they return views for the standard object types.
Thread: Future of slices http://mail.python.org/pipermail/python-3000/2006-May/001563.html
It will not be forbidden to reuse a loop variable inside the loop's suite.
Thread: elimination of scope bleeding of iteration variables http://mail.python.org/pipermail/python-dev/2006-May/064761.html
The parser won't be more complex than LL(1).
Simple is better than complex. This idea extends to the parser. Restricting Python's grammar to an LL(1) parser is a blessing, not a curse. It puts us in handcuffs that prevent us from going overboard and ending up with funky grammar rules like some other dynamic languages that will go unnamed, such as Perl.
No braces.
This is so obvious that it doesn't need a reference to a mailing list. Do from __future__ import braces to get a definitive answer on this subject.
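The suggested import really does give a definitive answer; CPython hard-wires it to fail:

```python
try:
    compile("from __future__ import braces", "<demo>", "exec")
except SyntaxError as exc:
    # CPython's famous easter-egg refusal.
    print(exc.msg)  # -> not a chance
```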
No more backticks.
Backticks (`) will no longer be used as shorthand for repr -- but that doesn't mean they are available for other uses. Even ignoring the backwards compatibility confusion, the character itself causes too many problems (in some fonts, on some keyboards, when typesetting a book, etc).
Thread: "new operators via backquoting", http://mail.python.org/pipermail/python-ideas/2007-January/000054.html
Referencing the global name foo will not be spelled globals.foo. The global statement will stay.
Threads: "replace globals() and global statement with global builtin object", http://mail.python.org/pipermail/python-3000/2006-July/002485.html, "Explicit Lexical Scoping (pre-PEP?)", http://mail.python.org/pipermail/python-dev/2006-July/067111.html
There will be no alternative binding operators such as :=.
Thread: "Explicit Lexical Scoping (pre-PEP?)", http://mail.python.org/pipermail/python-dev/2006-July/066995.html
We won't be removing container literals. That is, {expr: expr, ...}, [expr, ...] and (expr, ...) will stay.
Thread: "No Container Literals", http://mail.python.org/pipermail/python-3000/2006-July/002550.html
The else clause in while and for loops will not change semantics, or be removed.
Thread: "for/except/else syntax" http://mail.python.org/pipermail/python-ideas/2009-October/006083.html
Builtins
zip() won't grow keyword arguments or other mechanisms to prevent it from stopping at the end of the shortest sequence.
Thread: "have zip() raise exception for sequences of different lengths", http://mail.python.org/pipermail/python-3000/2006-August/003338.html
hash() won't become an attribute since attributes should be cheap to compute, which isn't necessarily the case for a hash.
Thread: "hash as attribute/property", http://mail.python.org/pipermail/python-3000/2006-April/000362.html
Standard types
Iterating over a dictionary will continue to yield the keys.
Thread: "Iterating over a dict", http://mail.python.org/pipermail/python-3000/2006-April/000283.html
Thread: have iter(mapping) generate (key, value) pairs http://mail.python.org/pipermail/python-3000/2006-June/002368.html
There will be no frozenlist type.
Thread: "Immutable lists", http://mail.python.org/pipermail/python-3000/2006-May/002219.html
int will not support subscripts yielding a range.
Thread: "xrange vs. int.__getslice__", http://mail.python.org/pipermail/python-3000/2006-June/002450.html
Coding style
The (recommended) maximum line width will remain 80 characters, for both C and Python code.
Thread: "C style guide", http://mail.python.org/pipermail/python-3000/2006-March/000131.html
Interactive Interpreter
The interpreter prompt (>>>) will not change. It gives Guido warm fuzzy feelings.
Thread: "Low-hanging fruit: change interpreter prompt?", http://mail.python.org/pipermail/python-3000/2006-November/004891.html
Copyright
This document has been placed in the public domain.
pep-3100 Miscellaneous Python 3.0 Plans
| PEP: | 3100 |
|---|---|
| Title: | Miscellaneous Python 3.0 Plans |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Brett Cannon <brett at python.org> |
| Status: | Final |
| Type: | Process |
| Content-Type: | text/x-rst |
| Created: | 20-Aug-2004 |
| Post-History: |
Contents
Abstract
This PEP, previously known as PEP 3000, describes smaller scale changes and new features for which no separate PEP is written yet, all targeted for Python 3000.
The list of features included in this document is subject to change and isn't binding on the Python development community; features may be added, removed, and modified at any time. The purpose of this list is to focus our language development effort on changes that are steps to 3.0, and to encourage people to invent ways to smooth the transition.
This document is not a wish-list that anyone can extend. While there are two authors of this PEP, we're just supplying the text; the decisions for which changes are listed in this document are made by Guido van Rossum, who has chosen them as goals for Python 3.0.
Guido's pronouncements on things that will not change in Python 3.0 are recorded in PEP 3099. [43]
General goals
A general goal is to reduce feature duplication by removing old ways of doing things. A general principle of the design will be that one obvious way of doing something is enough. [1]
Style changes
- The C style guide will be updated to use 4-space indents, never tabs. This style should be used for all new files; existing files can be updated only if there is no hope to ever merge a particular file from the Python 2 HEAD. Within a file, the indentation style should be consistent. No other style guide changes are planned ATM.
Core language
True division becomes default behavior [34] [done]
exec as a statement is not worth it -- make it a function [done]
Add optional declarations for static typing [45] [10] [done]
Support only new-style classes; classic classes will be gone [1] [done]
The softspace attribute of files goes away. [done]
Use except E1, E2, E3 as err: if you want the error variable. [3] [done]
None becomes a keyword [4]; also True and False [done]
... to become a general expression element [16] [done]
as becomes a keyword [5] (starting in 2.6 already) [done]
Have list comprehensions be syntactic sugar for passing an equivalent generator expression to list(); as a consequence the loop variable will no longer be exposed [36] [done]
Comparisons other than == and != between disparate types will raise an exception unless explicitly supported by the type [6] [done]
floats will not be acceptable as arguments in place of ints for operations where floats are inadvertantly accepted (PyArg_ParseTuple() i & l formats)
Remove from ... import * at function scope. [done] This means that functions can always be optimized and support for unoptimized functions can go away.
- Imports [39]
- Imports will be absolute by default. [done]
- Relative imports must be explicitly specified. [done]
- Indirection entries in sys.modules (i.e., a value of None for A.string means to use the top-level string module) will not be supported.
__init__.py might become optional in sub-packages? __init__.py will still be required for top-level packages.
Cleanup the Py_InitModule() variants {,3,4} (also import and parser APIs)
Cleanup the APIs exported in pythonrun, etc.
Some expressions will require parentheses that didn't in 2.x:
- List comprehensions will require parentheses around the iterables. This will make list comprehensions more similar to generator comprehensions. [x for x in 1, 2] will need to be: [x for x in (1, 2)] [done]
- Lambdas may have to be parenthesized [38] [NO]
In order to get rid of the confusion between __builtin__ and __builtins__, it was decided to rename __builtin__ (the module) to builtins, and to leave __builtins__ (the sandbox hook) alone. [47] [48] [done]
Attributes on functions of the form func_whatever will be renamed __whatever__ [17] [done]
Set literals and comprehensions [19] [20] [done] {x} means set([x]); {x, y} means set([x, y]). {F(x) for x in S if P(x)} means set(F(x) for x in S if P(x)). NB. {range(x)} means set([range(x)]), NOT set(range(x)). There's no literal for an empty set; use set() (or {1}&{2} :-). There's no frozenset literal; they are too rarely needed.
The __nonzero__ special method will be renamed to __bool__ and have to return a bool. The typeobject slot will be called tp_bool [23] [done]
Dict comprehensions, as first proposed in [35] [done] {K(x): V(x) for x in S if P(x)} means dict((K(x), V(x)) for x in S if P(x)).
To be removed:
String exceptions: use instances of an Exception class [2] [done]
raise Exception, "message": use raise Exception("message") [12] [done]
`x`: use repr(x) [2] [done]
The <> operator: use != instead [3] [done]
The __mod__ and __divmod__ special methods on float. [they should stay] [21]
METH_OLDARGS [done]
WITH_CYCLE_GC [done]
__getslice__, __setslice__, __delslice__ [32]; remove slice opcodes and use slice objects. [done]
__oct__, __hex__: use __index__ in oct() and hex() instead. [done]
__methods__ and __members__ [done]
C APIs (see code): PyFloat_AsString, PyFloat_AsReprString, PyFloat_AsStringEx, PySequence_In, PyEval_EvalFrame, PyEval_CallObject, _PyObject_Del, _PyObject_GC_Del, _PyObject_GC_Track, _PyObject_GC_UnTrack PyString_AsEncodedString, PyString_AsDecodedString PyArg_NoArgs, PyArg_GetInt, intargfunc, intintargfunc
PyImport_ReloadModule ?
Atomic Types
- Remove distinction between int and long types; 'long' built-in type and literals with 'L' or 'l' suffix disappear [1] [done]
- Make all strings be Unicode, and have a separate bytes() type [1] The new string type will be called 'str'. See PEP 3137. [done]
- Return iterable views instead of lists where appropriate for atomic type methods (e.g. dict.keys(), dict.values(), dict.items(), etc.); iter* methods will be removed. [done]
- Make string.join() stringify its arguments? [18] [NO]
- Fix open() so it returns a ValueError if the mode is bad rather than IOError. [done]
To be removed:
- basestring.find() and basestring.rfind(); use basestring.index() or basestring.[r]partition() or or basestring.rindex() in a try/except block??? [13] [UNLIKELY]
- file.xreadlines() method [31] [done]
- dict.setdefault()? [15] [UNLIKELY]
- dict.has_key() method; use in operator [done]
- list.sort() and builtin.sorted() methods: eliminate cmp parameter [27] [done]
Built-in Namespace
- Make built-ins return an iterator where appropriate (e.g. range(), zip(), map(), filter(), etc.) [done]
- Remove input() and rename raw_input() to input(). If you need the old input(), use eval(input()). [done]
- Introduce trunc(), which would call the __trunc__() method on its argument; suggested use is for objects like float where calling __int__() has data loss, but an integral representation is still desired? [8] [done]
- Exception hierarchy changes [41] [done]
- Add a bin() function for a binary representation of integers [done]
To be removed:
apply(): use f(*args, **kw) instead [2] [done]
buffer(): must die (use a bytes() type instead) (?) [2] [done]
callable(): just use isinstance(x, collections.Callable) (?) [2] [done]
compile(): put in sys (or perhaps in a module of its own) [2]
coerce(): no longer needed [2] [done]
execfile(), reload(): use exec() [2] [done]
reduce(): put in functools, a loop is more readable most of the times [2], [9] [done]
xrange(): use range() instead [1] [See range() above] [done]
- StandardError: this is a relic from the original exception hierarchy;
subclass Exception instead. [done]
Standard library
- Reorganize the standard library to not be as shallow?
- Move test code to where it belongs, there will be no more test() functions in the standard library
- Convert all tests to use either doctest or unittest.
- For the procedures of standard library improvement, see PEP 3001 [42]
To be removed:
The sets module. [done]
- stdlib modules to be removed
- see docstrings and comments in the source
- macfs [to do]
- new, reconvert, stringold, xmllib, pcre, pypcre, strop [all done]
- Everything in lib-old [33] [done]
- Para, addpack, cmp, cmpcache, codehack, dircmp, dump, find, fmt, grep, lockfile, newdir, ni, packmail, poly, rand, statcache, tb, tzparse, util, whatsound, whrandom, zmod
sys.exc_type, sys.exc_values, sys.exc_traceback: not thread-safe; use sys.exc_info() or an attribute of the exception [2] [11] [28] [done]
sys.exc_clear: Python 3's except statements provide the same functionality [24] [46] [28] [done]
array.read, array.write [30]
operator.isCallable : callable() built-in is being removed [29] [50] [done]
operator.sequenceIncludes : redundant thanks to operator.contains [29] [50] [done]
In the thread module, the aquire_lock() and release_lock() aliases for the acquire() and release() methods on lock objects. (Probably also just remove the thread module as a public API, in favor of always using threading.py.)
UserXyz classes, in favour of XyzMixins.
Remove the unreliable empty() and full() methods from Queue.py?
Remove jumpahead() from the random API?
Make the primitive for random be something generating random bytes rather than random floats?
Get rid of Cookie.SerialCookie and Cookie.SmartCookie?
Modify the heapq.heapreplace() API to compare the new value to the top of the heap?
Outstanding Issues
- Require C99, so we can use // comments, named initializers, declare variables without introducing a new scope, among other benefits. (Also better support for IEEE floating point issues like NaN and infinities?)
- Remove support for old systems, including: BeOS, RISCOS, (SGI) Irix, Tru64
References
| [1] | (1, 2, 3, 4, 5) PyCon 2003 State of the Union: http://www.python.org/doc/essays/ppt/pycon2003/pycon2003.ppt |
| [2] | (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11) Python Regrets: http://www.python.org/doc/essays/ppt/regrets/PythonRegrets.pdf |
| [3] | (1, 2) Python Wiki: http://www.python.org/moin/Python3.0 |
| [4] | python-dev email ("Constancy of None") http://mail.python.org/pipermail/python-dev/2004-July/046294.html |
| [5] | python-dev email (' "as" to be a keyword?') http://mail.python.org/pipermail/python-dev/2004-July/046316.html |
| [6] | python-dev email ("Comparing heterogeneous types") http://mail.python.org/pipermail/python-dev/2004-June/045111.html |
| [7] | python-dev email ("Let's get rid of unbound methods") http://mail.python.org/pipermail/python-dev/2005-January/050625.html |
| [8] | python-dev email ("Fixing _PyEval_SliceIndex so that integer-like objects can be used") http://mail.python.org/pipermail/python-dev/2005-February/051674.html |
| [9] | Guido's blog ("The fate of reduce() in Python 3000") http://www.artima.com/weblogs/viewpost.jsp?thread=98196 |
| [10] | Guido's blog ("Python Optional Typechecking Redux") http://www.artima.com/weblogs/viewpost.jsp?thread=89161 |
| [11] | python-dev email ("anonymous blocks") http://mail.python.org/pipermail/python-dev/2005-April/053060.html |
| [12] | python-dev email ("PEP 8: exception style") http://mail.python.org/pipermail/python-dev/2005-August/055190.html |
| [13] | python-dev email (Remove str.find in 3.0?) http://mail.python.org/pipermail/python-dev/2005-August/055705.html |
| [14] | python-dev email (Replacement for print in Python 3.0) http://mail.python.org/pipermail/python-dev/2005-September/056154.html |
| [15] | python-dev email ("defaultdict") http://mail.python.org/pipermail/python-dev/2006-February/061261.html |
| [16] | python-3000 email http://mail.python.org/pipermail/python-3000/2006-April/000996.html |
| [17] | python-3000 email ("Pronouncement on parameter lists") http://mail.python.org/pipermail/python-3000/2006-April/001175.html |
| [18] | python-3000 email ("More wishful thinking") http://mail.python.org/pipermail/python-3000/2006-April/000810.html |
| [19] | python-3000 email ("sets in P3K?") http://mail.python.org/pipermail/python-3000/2006-April/001286.html |
| [20] | python-3000 email ("sets in P3K?") http://mail.python.org/pipermail/python-3000/2006-May/001666.html |
| [21] | python-3000 email ("bug in modulus?") http://mail.python.org/pipermail/python-3000/2006-May/001735.html |
| [22] | SF patch "sys.id() and sys.intern()" http://www.python.org/sf/1601678 |
| [23] | python-3000 email ("__nonzero__ vs. __bool__") http://mail.python.org/pipermail/python-3000/2006-November/004524.html |
| [24] | python-3000 email ("Pre-peps on raise and except changes") http://mail.python.org/pipermail/python-3000/2007-February/005672.html |
| [25] | python-3000 email ("Py3.0 Library Ideas") http://mail.python.org/pipermail/python-3000/2007-February/005726.html |
| [26] | python-dev email ("Should we do away with unbound methods in Py3k?") http://mail.python.org/pipermail/python-dev/2007-November/075279.html |
| [27] | python-dev email ("Mutable sequence .sort() signature") http://mail.python.org/pipermail/python-dev/2008-February/076818.html |
| [28] | (1, 2, 3) Python docs (sys -- System-specific parameters and functions) http://docs.python.org/library/sys.html |
| [29] | (1, 2) Python docs (operator -- Standard operators as functions) http://docs.python.org/library/operator.html |
| [30] | Python docs (array -- Efficient arrays of numeric values) http://docs.python.org/library/array.html |
| [31] | Python docs (File objects) http://docs.python.org/library/stdtypes.html |
| [32] | Python docs (Additional methods for emulation of sequence types) http://docs.python.org/reference/datamodel.html#additional-methods-for-emulation-of-sequence-types |
| [33] | (1, 2) PEP 4 ("Deprecation of Standard Modules") http://www.python.org/dev/peps/pep-0004 |
| [34] | (1, 2) PEP 238 (Changing the Division Operator) http://www.python.org/dev/peps/pep-0238 |
| [35] | PEP 274 (Dict Comprehensions) http://www.python.org/dev/peps/pep-0274 |
| [36] | PEP 289 ("Generator Expressions") http://www.python.org/dev/peps/pep-0289 |
| [37] | PEP 299 ("Special __main__() function in modules") http://www.python.org/dev/peps/pep-0299 |
| [38] | PEP 308 ("Conditional Expressions") http://www.python.org/dev/peps/pep-0308 |
| [39] | (1, 2) PEP 328 (Imports: Multi-Line and Absolute/Relative) http://www.python.org/dev/peps/pep-0328 |
| [40] | PEP 343 (The "with" Statement) http://www.python.org/dev/peps/pep-0343 |
| [41] | (1, 2) PEP 352 (Required Superclass for Exceptions) http://www.python.org/dev/peps/pep-0352 |
| [42] | PEP 3001 (Process for reviewing and improving standard library modules) http://www.python.org/dev/peps/pep-3001 |
| [43] | PEP 3099 (Things that will Not Change in Python 3000) http://www.python.org/dev/peps/pep-3099 |
| [44] | PEP 3105 (Make print a function) http://www.python.org/dev/peps/pep-3105 |
| [45] | PEP 3107 (Function Annotations) http://www.python.org/dev/peps/pep-3107 |
| [46] | PEP 3110 (Catching Exceptions in Python 3000) http://www.python.org/dev/peps/pep-3110/#semantic-changes |
| [47] | Approach to resolving __builtin__ vs __builtins__ http://mail.python.org/pipermail/python-3000/2007-March/006161.html |
| [48] | New name for __builtins__ http://mail.python.org/pipermail/python-dev/2007-November/075388.html |
| [49] | Patch to remove sys.exitfunc http://www.python.org/sf/1680961 |
| [50] | (1, 2) Remove deprecated functions from operator http://www.python.org/sf/1516309 |
Copyright
This document has been placed in the public domain.
pep-3101 Advanced String Formatting
| PEP: | 3101 |
|---|---|
| Title: | Advanced String Formatting |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Talin <talin at acm.org> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 16-Apr-2006 |
| Python-Version: | 3.0 |
| Post-History: | 28-Apr-2006, 6-May-2006, 10-Jun-2007, 14-Aug-2007, 14-Sep-2008 |
Abstract
This PEP proposes a new system for built-in string formatting
operations, intended as a replacement for the existing '%' string
formatting operator.
Rationale
Python currently provides two methods of string interpolation:
- The '%' operator for strings. [1]
- The string.Template module. [2]
The primary scope of this PEP concerns proposals for built-in
string formatting operations (in other words, methods of the
built-in string type).
The '%' operator is primarily limited by the fact that it is a
binary operator, and therefore can take at most two arguments.
One of those arguments is already dedicated to the format string,
leaving all other variables to be squeezed into the remaining
argument. The current practice is to use either a dictionary or a
tuple as the second argument, but as many people have commented
[3], this lacks flexibility. The "all or nothing" approach
(meaning that one must choose between only positional arguments,
or only named arguments) is felt to be overly constraining.
While there is some overlap between this proposal and
string.Template, it is felt that each serves a distinct need,
and that one does not obviate the other. This proposal is for
a mechanism which, like '%', is efficient for small strings
which are only used once, so, for example, compilation of a
string into a template is not contemplated in this proposal,
although the proposal does take care to define format strings
and the API in such a way that an efficient template package
could reuse the syntax and even some of the underlying
formatting code.
Specification
The specification will consist of the following parts:
- Specification of a new formatting method to be added to the
built-in string class.
- Specification of functions and flag values to be added to
the string module, so that the underlying formatting engine
can be used with additional options.
- Specification of a new syntax for format strings.
- Specification of a new set of special methods to control the
formatting and conversion of objects.
- Specification of an API for user-defined formatting classes.
- Specification of how formatting errors are handled.
Note on string encodings: When discussing this PEP in the context
of Python 3.0, it is assumed that all strings are unicode strings,
and that the use of the word 'string' in the context of this
document will generally refer to a Python 3.0 string, which is
the same as Python 2.x unicode object.
In the context of Python 2.x, the use of the word 'string' in this
document refers to an object which may either be a regular string
or a unicode object. All of the function call interfaces
described in this PEP can be used for both strings and unicode
objects, and in all cases there is sufficient information
to be able to properly deduce the output string type (in
other words, there is no need for two separate APIs).
In all cases, the type of the format string dominates - that
is, the result of the conversion will always result in an object
that contains the same representation of characters as the
input format string.
String Methods
The built-in string class (and also the unicode class in 2.6) will
gain a new method, 'format', which takes an arbitrary number of
positional and keyword arguments:
"The story of {0}, {1}, and {c}".format(a, b, c=d)
Within a format string, each positional argument is identified
with a number, starting from zero, so in the above example, 'a' is
argument 0 and 'b' is argument 1. Each keyword argument is
identified by its keyword name, so in the above example, 'c' is
used to refer to the third argument.
There is also a global built-in function, 'format' which formats
a single value:
print(format(10.0, "7.3g"))
This function is described in a later section.
Format Strings
Format strings consist of intermingled character data and markup.
Character data is data which is transferred unchanged from the
format string to the output string; markup is not transferred from
the format string directly to the output, but instead is used to
define 'replacement fields' that describe to the format engine
what should be placed in the output string in place of the markup.
Brace characters ('curly braces') are used to indicate a
replacement field within the string:
"My name is {0}".format('Fred')
The result of this is the string:
"My name is Fred"
Braces can be escaped by doubling:
"My name is {0} :-{{}}".format('Fred')
Which would produce:
"My name is Fred :-{}"
The element within the braces is called a 'field'. Fields consist
of a 'field name', which can either be simple or compound, and an
optional 'format specifier'.
Simple and Compound Field Names
Simple field names are either names or numbers. If numbers, they
must be valid base-10 integers; if names, they must be valid
Python identifiers. A number is used to identify a positional
argument, while a name is used to identify a keyword argument.
A compound field name is a combination of multiple simple field
names in an expression:
"My name is {0.name}".format(open('out.txt', 'w'))
This example shows the use of the 'getattr' or 'dot' operator
in a field expression. The dot operator allows an attribute of
an input value to be specified as the field value.
Unlike some other programming languages, you cannot embed arbitrary
expressions in format strings. This is by design - the types of
expressions that you can use is deliberately limited. Only two operators
are supported: the '.' (getattr) operator, and the '[]' (getitem)
operator. The reason for allowing these operators is that they don't
normally have side effects in non-pathological code.
An example of the 'getitem' syntax:
"My name is {0[name]}".format(dict(name='Fred'))
It should be noted that the use of 'getitem' within a format string
is much more limited than its conventional usage. In the above example,
the string 'name' really is the literal string 'name', not a variable
named 'name'. The rules for parsing an item key are very simple.
If it starts with a digit, then it is treated as a number, otherwise
it is used as a string.
Because keys are not quote-delimited, it is not possible to
specify arbitrary dictionary keys (e.g., the strings "10" or
":-]") from within a format string.
Implementation note: The implementation of this proposal is
not required to enforce the rule about a simple or dotted name
being a valid Python identifier. Instead, it will rely on the
getattr function of the underlying object to throw an exception if
the identifier is not legal. The str.format() function will have
a minimalist parser which only attempts to figure out when it is
"done" with an identifier (by finding a '.' or a ']', or '}',
etc.).
Format Specifiers
Each field can also specify an optional set of 'format
specifiers' which can be used to adjust the format of that field.
Format specifiers follow the field name, with a colon (':')
character separating the two:
"My name is {0:8}".format('Fred')
The meaning and syntax of the format specifiers depends on the
type of object that is being formatted, but there is a standard
set of format specifiers used for any object that does not
override them.
Format specifiers can themselves contain replacement fields.
For example, a field whose field width is itself a parameter
could be specified via:
"{0:{1}}".format(a, b)
These 'internal' replacement fields can only occur in the format
specifier part of the replacement field. Internal replacement fields
cannot themselves have format specifiers. This implies also that
replacement fields cannot be nested to arbitrary levels.
Note that the doubled '}' at the end, which would normally be
escaped, is not escaped in this case. The reason is because
the '{{' and '}}' syntax for escapes is only applied when used
*outside* of a format field. Within a format field, the brace
characters always have their normal meaning.
The syntax for format specifiers is open-ended, since a class
can override the standard format specifiers. In such cases,
the str.format() method merely passes all of the characters between
the first colon and the matching brace to the relevant underlying
formatting method.
Standard Format Specifiers
If an object does not define its own format specifiers, a standard
set of format specifiers is used. These are similar in concept to
the format specifiers used by the existing '%' operator, however
there are also a number of differences.
The general form of a standard format specifier is:
[[fill]align][sign][#][0][minimumwidth][.precision][type]
The brackets ([]) indicate an optional element.
Then the optional align flag can be one of the following:
'<' - Forces the field to be left-aligned within the available
space (This is the default.)
'>' - Forces the field to be right-aligned within the
available space.
'=' - Forces the padding to be placed after the sign (if any)
but before the digits. This is used for printing fields
in the form '+000000120'. This alignment option is only
valid for numeric types.
'^' - Forces the field to be centered within the available
space.
Note that unless a minimum field width is defined, the field
width will always be the same size as the data to fill it, so
that the alignment option has no meaning in this case.
The optional 'fill' character defines the character to be used to
pad the field to the minimum width. The fill character, if present,
must be followed by an alignment flag.
The 'sign' option is only valid for numeric types, and can be one
of the following:
'+' - indicates that a sign should be used for both
positive as well as negative numbers
'-' - indicates that a sign should be used only for negative
numbers (this is the default behavior)
' ' - indicates that a leading space should be used on
positive numbers
If the '#' character is present, integers use the 'alternate form'
for formatting. This means that binary, octal, and hexadecimal
output will be prefixed with '0b', '0o', and '0x', respectively.
'width' is a decimal integer defining the minimum field width. If
not specified, then the field width will be determined by the
content.
If the width field is preceded by a zero ('0') character, this enables
zero-padding. This is equivalent to an alignment type of '=' and a
fill character of '0'.
The 'precision' is a decimal number indicating how many digits
should be displayed after the decimal point in a floating point
conversion. For non-numeric types the field indicates the maximum
field size - in other words, how many characters will be used from
the field content. The precision is ignored for integer conversions.
Finally, the 'type' determines how the data should be presented.
The available integer presentation types are:
'b' - Binary. Outputs the number in base 2.
'c' - Character. Converts the integer to the corresponding
Unicode character before printing.
'd' - Decimal Integer. Outputs the number in base 10.
'o' - Octal format. Outputs the number in base 8.
'x' - Hex format. Outputs the number in base 16, using lower-
case letters for the digits above 9.
'X' - Hex format. Outputs the number in base 16, using upper-
case letters for the digits above 9.
'n' - Number. This is the same as 'd', except that it uses the
current locale setting to insert the appropriate
number separator characters.
'' (None) - the same as 'd'
The available floating point presentation types are:
'e' - Exponent notation. Prints the number in scientific
notation using the letter 'e' to indicate the exponent.
'E' - Exponent notation. Same as 'e' except it converts the
number to uppercase.
'f' - Fixed point. Displays the number as a fixed-point
number.
'F' - Fixed point. Same as 'f' except it converts the number
to uppercase.
'g' - General format. This prints the number as a fixed-point
number, unless the number is too large, in which case
it switches to 'e' exponent notation.
'G' - General format. Same as 'g' except switches to 'E'
if the number gets to large.
'n' - Number. This is the same as 'g', except that it uses the
current locale setting to insert the appropriate
number separator characters.
'%' - Percentage. Multiplies the number by 100 and displays
in fixed ('f') format, followed by a percent sign.
'' (None) - similar to 'g', except that it prints at least one
digit after the decimal point.
Objects are able to define their own format specifiers to
replace the standard ones. An example is the 'datetime' class,
whose format specifiers might look something like the
arguments to the strftime() function:
"Today is: {0:%a %b %d %H:%M:%S %Y}".format(datetime.now())
For all built-in types, an empty format specification will produce
the equivalent of str(value). It is recommended that objects
defining their own format specifiers follow this convention as
well.
Explicit Conversion Flag
The explicit conversion flag is used to transform the format field value
before it is formatted. This can be used to override the type-specific
formatting behavior, and format the value as if it were a more
generic type. Currently, two explicit conversion flags are
recognized:
!r - convert the value to a string using repr().
!s - convert the value to a string using str().
These flags are placed before the format specifier:
"{0!r:20}".format("Hello")
In the preceding example, the string "Hello" will be printed, with quotes,
in a field of at least 20 characters width.
A custom Formatter class can define additional conversion flags.
The built-in formatter will raise a ValueError if an invalid
conversion flag is specified.
Controlling Formatting on a Per-Type Basis
Each Python type can control formatting of its instances by defining
a __format__ method. The __format__ method is responsible for
interpreting the format specifier, formatting the value, and
returning the resulting string.
The new, global built-in function 'format' simply calls this special
method, similar to how len() and str() simply call their respective
special methods:
def format(value, format_spec):
return value.__format__(format_spec)
It is safe to call this function with a value of "None" (because the
"None" value in Python is an object and can have methods.)
Several built-in types, including 'str', 'int', 'float', and 'object'
define __format__ methods. This means that if you derive from any of
those types, your class will know how to format itself.
The object.__format__ method is the simplest: It simply converts the
object to a string, and then calls format again:
    class object:
        def __format__(self, format_spec):
            return format(str(self), format_spec)
The __format__ methods for 'int' and 'float' will do numeric formatting
based on the format specifier. In some cases, these formatting
operations may be delegated to other types. So for example, in the case
where the 'int' formatter sees a format type of 'f' (meaning 'float')
it can simply cast the value to a float and call format() again.
Any class can override the __format__ method to provide custom
formatting for that type:
    class AST:
        def __format__(self, format_spec):
            ...
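As a concrete sketch of such an override (the Temperature class and its 'C'/'F' format codes are invented here for illustration; they are not part of the PEP):

```python
class Temperature:
    """Hypothetical class with a tiny custom format mini-language:
    'C' renders Celsius, 'F' renders Fahrenheit."""
    def __init__(self, celsius):
        self.celsius = celsius

    def __format__(self, format_spec):
        if format_spec == 'F':
            return '%.1fF' % (self.celsius * 9.0 / 5.0 + 32)
        if format_spec in ('C', ''):
            # Empty spec falls back to the default rendering, per the
            # convention described earlier.
            return '%.1fC' % self.celsius
        raise ValueError('unknown format code %r' % format_spec)

t = Temperature(20)
assert format(t, 'C') == '20.0C'
assert format(t, 'F') == '68.0F'
assert "{0:F}".format(t) == '68.0F'
```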
Note for Python 2.x: The 'format_spec' argument will be either
a string object or a unicode object, depending on the type of the
original format string. The __format__ method should test the type
of the specifiers parameter to determine whether to return a string or
unicode object. It is the responsibility of the __format__ method
to return an object of the proper type.
Note that the 'explicit conversion' flag mentioned above is not passed
to the __format__ method. Rather, it is expected that the conversion
specified by the flag will be performed before calling __format__.
User-Defined Formatting
There will be times when customizing the formatting of fields
on a per-type basis is not enough. An example might be a
spreadsheet application, which displays hash marks '#' when a value
is too large to fit in the available space.
For more powerful and flexible formatting, access to the underlying
format engine can be obtained through the 'Formatter' class that
lives in the 'string' module. This class takes additional options
which are not accessible via the normal str.format method.
An application can subclass the Formatter class to create its own
customized formatting behavior.
The PEP does not attempt to exactly specify all methods and
properties defined by the Formatter class; instead, those will be
defined and documented in the initial implementation. However, this
PEP will specify the general requirements for the Formatter class,
which are listed below.
Although str.format() does not directly use the Formatter class
to do formatting, both use the same underlying implementation. The
reason that str.format() does not use the Formatter class directly
is because 'str' is a built-in type, which means that all of its
methods must be implemented in C, whereas Formatter is a Python
class. Formatter provides an extensible wrapper around the same
C functions as are used by str.format().
Formatter Methods
The Formatter class takes no initialization arguments:
fmt = Formatter()
The public API methods of class Formatter are as follows:
-- format(format_string, *args, **kwargs)
-- vformat(format_string, args, kwargs)
'format' is the primary API method. It takes a format template,
and an arbitrary set of positional and keyword arguments.
'format' is just a wrapper that calls 'vformat'.
'vformat' is the function that does the actual work of formatting. It
is exposed as a separate function for cases where you want to pass in
a predefined dictionary of arguments, rather than unpacking and
repacking the dictionary as individual arguments using the '*args' and
'**kwds' syntax. 'vformat' does the work of breaking up the format
template string into character data and replacement fields. It calls
the overridable 'get_value', 'check_unused_args' and 'format_field'
methods as appropriate (described below).
Formatter defines the following overridable methods:
-- get_value(key, args, kwargs)
-- check_unused_args(used_args, args, kwargs)
-- format_field(value, format_spec)
'get_value' is used to retrieve a given field value. The 'key' argument
will be either an integer or a string. If it is an integer, it represents
the index of the positional argument in 'args'; If it is a string, then
it represents a named argument in 'kwargs'.
The 'args' parameter is set to the list of positional arguments to
'vformat', and the 'kwargs' parameter is set to the dictionary of
keyword arguments.
For compound field names, these functions are only called for the
first component of the field name; subsequent components are handled
through normal attribute and indexing operations.
So for example, the field expression '0.name' would cause 'get_value'
to be called with a 'key' argument of 0. The 'name' attribute will be
looked up after 'get_value' returns by calling the built-in 'getattr'
function.
If the index or keyword refers to an item that does not exist, then an
IndexError/KeyError should be raised.
'check_unused_args' is used to implement checking for unused arguments
if desired. The arguments to this function are the set of all argument
keys that were actually referred to in the format string (integers for
positional arguments, and strings for named arguments), and a reference
to the args and kwargs that were passed to vformat. The set of unused
args can be calculated from these parameters. 'check_unused_args'
is assumed to raise an exception if the check fails.
'format_field' simply calls the global 'format' built-in. The method
is provided so that subclasses can override it.
To get a better understanding of how these functions relate to each
other, here is pseudocode that explains the general operation of
vformat.
    def vformat(self, format_string, args, kwargs):
        # Output buffer and set of used args
        buffer = StringIO.StringIO()
        used_args = set()
        # Tokens are either format fields or literal strings
        for token in self.parse(format_string):
            if is_format_field(token):
                # Split the token into field value and format spec
                field_spec, _, format_spec = token.partition(":")
                # Check for an explicit conversion flag ('!r' or '!s')
                # following the field name.
                field_spec, sep, explicit = field_spec.rpartition("!")
                if not sep:
                    field_spec, explicit = explicit, ''
                # 'first_part' is the part before the first '.' or '['
                # Assume that 'get_first_part' returns either an int or
                # a string, depending on the syntax.
                first_part = get_first_part(field_spec)
                value = self.get_value(first_part, args, kwargs)
                # Record the fact that we used this arg
                used_args.add(first_part)
                # Handle [subfield] or .subfield. Assume that 'components'
                # returns an iterator of the various subfields, not including
                # the first part.
                for comp in components(field_spec):
                    value = resolve_subfield(value, comp)
                # Handle explicit type conversion
                if explicit == 'r':
                    value = repr(value)
                elif explicit == 's':
                    value = str(value)
                # Call the global 'format' function and write out the
                # converted value.
                buffer.write(self.format_field(value, format_spec))
            else:
                buffer.write(token)
        self.check_unused_args(used_args, args, kwargs)
        return buffer.getvalue()
Note that the actual algorithm of the Formatter class (which will be
implemented in C) may not be the one presented here. (It's likely
that the actual implementation won't be a 'class' at all - rather,
vformat may just call a C function which accepts the other overridable
methods as arguments.) The primary purpose of this code example is to
illustrate the order in which overridable methods are called.
Customizing Formatters
This section describes some typical ways that Formatter objects
can be customized.
To support alternative format-string syntax, the 'vformat' method
can be overridden to alter the way format strings are parsed.
One common desire is to support a 'default' namespace, so that
you don't need to pass in keyword arguments to the format()
method, but can instead use values in a pre-existing namespace.
This can easily be done by overriding get_value() as follows:
    class NamespaceFormatter(Formatter):
        def __init__(self, namespace={}):
            Formatter.__init__(self)
            self.namespace = namespace

        def get_value(self, key, args, kwds):
            if isinstance(key, str):
                try:
                    # Check explicitly passed arguments first
                    return kwds[key]
                except KeyError:
                    return self.namespace[key]
            else:
                return Formatter.get_value(self, key, args, kwds)
One can use this to easily create a formatting function that allows
access to global variables, for example:
    fmt = NamespaceFormatter(globals())

    greeting = "hello"
    print(fmt.format("{greeting}, world!"))
A similar technique can be used with the locals() dictionary to
gain access to the function's local variables.
It would also be possible to create a 'smart' namespace formatter
that could automatically access both locals and globals through
snooping of the calling stack. Due to the need for compatibility
with the different versions of Python, such a capability will not
be included in the standard library; however, it is anticipated
that someone will create and publish a recipe for doing this.
Another type of customization is to change the way that built-in
types are formatted by overriding the 'format_field' method. (For
non-built-in types, you can simply define a __format__ special
method on that type.) So for example, you could override the
formatting of numbers to output scientific notation when needed.
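A minimal sketch of that idea using the string.Formatter API described above (the SciFormatter name and its magnitude thresholds are invented for this example):

```python
from string import Formatter

class SciFormatter(Formatter):
    """Hypothetical subclass: render very large or very small floats
    in scientific notation when no explicit format spec is given."""
    def format_field(self, value, format_spec):
        if isinstance(value, float) and not format_spec:
            if value != 0 and (abs(value) >= 1e6 or abs(value) < 1e-4):
                return '%e' % value
        # Fall back to the default behavior (the global format()).
        return Formatter.format_field(self, value, format_spec)

fmt = SciFormatter()
assert fmt.format("{0}", 2.0) == '2.0'
assert fmt.format("{0}", 2500000.0) == '2.500000e+06'
```

An explicit spec such as `{0:.1f}` still takes precedence, since the override only triggers on an empty format_spec.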
Error handling
There are two classes of exceptions which can occur during formatting:
exceptions generated by the formatter code itself, and exceptions
generated by user code (such as a field object's 'getattr' function).
In general, exceptions generated by the formatter code itself are
of the "ValueError" variety -- there is an error in the actual "value"
of the format string. (This is not always true; for example, the
string.format() function might be passed a non-string as its first
parameter, which would result in a TypeError.)
The text associated with these internally generated ValueError
exceptions will indicate the location of the exception inside
the format string, as well as the nature of the exception.
For exceptions generated by user code, a trace record and
dummy frame will be added to the traceback stack to help
in determining the location in the string where the exception
occurred. The inserted traceback will indicate that the
error occurred at:
    File "<format_string>", line XX, in column_YY
where XX and YY represent the line and character position
information in the string, respectively.
Alternate Syntax
Naturally, one of the most contentious issues is the syntax of the
format strings, and in particular the markup conventions used to
indicate fields.
Rather than attempting to exhaustively list all of the various
proposals, I will cover the ones that are most widely used
already.
- Shell variable syntax: $name and $(name) (or in some variants,
${name}). This is probably the oldest convention out there, and
is used by Perl and many others. When used without the braces,
the length of the variable is determined by lexically scanning
until an invalid character is found.
This scheme is generally used in cases where interpolation is
implicit - that is, in environments where any string can contain
interpolation variables, and no special substitution function
need be invoked. In such cases, it is important to prevent the
interpolation behavior from occurring accidentally, so the '$'
(which is otherwise a relatively uncommonly-used character) is
used to signal when the behavior should occur.
It is the author's opinion, however, that in cases where the
formatting is explicitly invoked, less care needs to be
taken to prevent accidental interpolation, in which case a
lighter and less unwieldy syntax can be used.
- printf and its cousins ('%'), including variations that add a
field index, so that fields can be interpolated out of order.
- Other bracket-only variations. Various MUDs (Multi-User
Dungeons) such as MUSH have used brackets (e.g. [name]) to do
string interpolation. The Microsoft .Net libraries use braces
({}), and a syntax which is very similar to the one in this
proposal, although the syntax for format specifiers is quite
different. [4]
- Backquoting. This method has the benefit of minimal syntactical
clutter, however it lacks many of the benefits of a function
call syntax (such as complex expression arguments, custom
formatters, etc.).
- Other variations include Ruby's #{}, PHP's {$name}, and so
on.
Some specific aspects of the syntax warrant additional comments:
1) Backslash character for escapes. The original version of
this PEP used backslash rather than doubling to escape a bracket.
This worked because backslashes in Python string literals that
don't conform to a standard backslash sequence such as '\n'
are left unmodified. However, this caused a certain amount
of confusion, and led to potential situations of multiple
recursive escapes, i.e. '\\\\{' to place a literal backslash
in front of a bracket.
2) The use of the colon character (':') as a separator for
format specifiers. This was chosen simply because that's
what .Net uses.
Alternate Feature Proposals
Restricting attribute access: An earlier version of the PEP
restricted the ability to access attributes beginning with a
leading underscore, for example "{0}._private". However, this
is a useful ability to have when debugging, so the feature
was dropped.
Some developers suggested that the ability to do 'getattr' and
'getitem' access should be dropped entirely. However, this
is in conflict with the needs of another set of developers who
strongly lobbied for the ability to pass in a large dict as a
single argument (without flattening it into individual keyword
arguments using the **kwargs syntax) and then have the format
string refer to dict entries individually.
There have also been suggestions to expand the set of expressions
that are allowed in a format string. However, this was seen
to go against the spirit of TOOWTDI, since the same effect can
be achieved in most cases by executing the same expression on
the parameter before it's passed in to the formatting function.
For cases where the format string is being used to do arbitrary
formatting in a data-rich environment, it's recommended to use
a template engine specialized for this purpose, such as
Genshi [5] or Cheetah [6].
Many other features were considered and rejected because they
could easily be achieved by subclassing Formatter instead of
building the feature into the base implementation. This includes
alternate syntax, comments in format strings, and many others.
Security Considerations
Historically, string formatting has been a common source of
security holes in web-based applications, particularly if the
string formatting system allows arbitrary expressions to be
embedded in format strings.
The best way to use string formatting in a way that does not
create potential security holes is to never use format strings
that come from an untrusted source.
Barring that, the next best approach is to ensure that string
formatting has no side effects. Because of the open nature of
Python, it is impossible to guarantee that any non-trivial
operation has this property. What this PEP does is limit the
types of expressions in format strings to those in which visible
side effects are both rare and strongly discouraged by the
culture of Python developers. So for example, attribute access
is allowed because it would be considered pathological to write
code where the mere access of an attribute has visible side
effects (whether the code has *invisible* side effects - such
as creating a cache entry for faster lookup - is irrelevant.)
Sample Implementation
An implementation of an earlier version of this PEP was created by
Patrick Maupin and Eric V. Smith, and can be found in the pep3101
sandbox at:
http://svn.python.org/view/sandbox/trunk/pep3101/
Backwards Compatibility
Backwards compatibility can be maintained by leaving the existing
mechanisms in place. The new system does not collide with any of
the method names of the existing string formatting techniques, so
both systems can co-exist until it comes time to deprecate the
older system.
References
[1] Python Library Reference - String formatting operations
http://docs.python.org/library/stdtypes.html#string-formatting-operations
[2] Python Library Reference - Template strings
http://docs.python.org/library/string.html#string.Template
[3] [Python-3000] String formating operations in python 3k
http://mail.python.org/pipermail/python-3000/2006-April/000285.html
[4] Composite Formatting - [.Net Framework Developer's Guide]
http://msdn.microsoft.com/library/en-us/cpguide/html/cpconcompositeformatting.asp?frame=true
[5] Genshi templating engine.
http://genshi.edgewall.org/
[6] Cheetah - The Python-Powered Template Engine.
http://www.cheetahtemplate.org/
Copyright
This document has been placed in the public domain.
pep-3102 Keyword-Only Arguments
| PEP: | 3102 |
|---|---|
| Title: | Keyword-Only Arguments |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Talin <talin at acm.org> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 22-Apr-2006 |
| Python-Version: | 3.0 |
| Post-History: | 28-Apr-2006, 19-May-2006 |
Abstract
This PEP proposes a change to the way that function arguments are
assigned to named parameter slots. In particular, it enables the
declaration of "keyword-only" arguments: arguments that can only
be supplied by keyword and which will never be automatically
filled in by a positional argument.
Rationale
The current Python function-calling paradigm allows arguments to
be specified either by position or by keyword. An argument can be
filled in either explicitly by name, or implicitly by position.
There are often cases where it is desirable for a function to take
a variable number of arguments. The Python language supports this
using the 'varargs' syntax ('*name'), which specifies that any
'left over' arguments be passed into the varargs parameter as a
tuple.
One limitation on this is that currently, all of the regular
argument slots must be filled before the vararg slot can be.
This is not always desirable. One can easily envision a function
which takes a variable number of arguments, but also takes one
or more 'options' in the form of keyword arguments. Currently,
the only way to do this is to define both a varargs argument,
and a 'keywords' argument (**kwargs), and then manually extract
the desired keywords from the dictionary.
Specification
Syntactically, the proposed changes are fairly simple. The first
change is to allow regular arguments to appear after a varargs
argument:
    def sortwords(*wordlist, case_sensitive=False):
        ...
This function accepts any number of positional arguments, and it
also accepts a keyword option called 'case_sensitive'. This
option will never be filled in by a positional argument, but
must be explicitly specified by name.
Keyword-only arguments are not required to have a default value.
Since Python requires that all arguments be bound to a value,
and since the only way to bind a value to a keyword-only argument
is via keyword, such arguments are therefore 'required keyword'
arguments. Such arguments must be supplied by the caller, and
they must be supplied via keyword.
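A sketch of such a 'required keyword' argument under the first syntactic change (the tag() function is invented for illustration):

```python
# 'name' follows the varargs parameter and has no default, so it can
# only be supplied by keyword -- and must be supplied by the caller.
def tag(*children, name):
    return '<%s>%s</%s>' % (name, ''.join(children), name)

assert tag('a', 'b', name='p') == '<p>ab</p>'

# A trailing positional argument is swallowed by *children, leaving
# 'name' unbound, which raises TypeError.
failed = False
try:
    tag('a', 'b', 'p')
except TypeError:
    failed = True
assert failed
```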
The second syntactical change is to allow the argument name to
be omitted for a varargs argument. The meaning of this is to
allow for keyword-only arguments for functions that would not
otherwise take a varargs argument:
    def compare(a, b, *, key=None):
        ...
The reasoning behind this change is as follows. Imagine for a
moment a function which takes several positional arguments, as
well as a keyword argument:
    def compare(a, b, key=None):
        ...
Now, suppose you wanted to have 'key' be a keyword-only argument.
Under the above syntax, you could accomplish this by adding a
varargs argument immediately before the keyword argument:
    def compare(a, b, *ignore, key=None):
        ...
Unfortunately, the 'ignore' argument will also suck up any
erroneous positional arguments that may have been supplied by the
caller. Given that we'd prefer any unwanted arguments to raise an
error, we could do this:
    def compare(a, b, *ignore, key=None):
        if ignore:  # If ignore is not empty
            raise TypeError
As a convenient shortcut, we can simply omit the 'ignore' name,
meaning 'don't allow any positional arguments beyond this point'.
(Note: After much discussion of alternative syntax proposals, the
BDFL has pronounced in favor of this 'single star' syntax for
indicating the end of positional parameters.)
Function Calling Behavior
The previous section describes the difference between the old
behavior and the new. However, it is also useful to have a
description of the new behavior that stands by itself, without
reference to the previous model. So this next section will
attempt to provide such a description.
When a function is called, the input arguments are assigned to
formal parameters as follows:
- For each formal parameter, there is a slot which will be used
to contain the value of the argument assigned to that
parameter.
- Slots which have had values assigned to them are marked as
'filled'. Slots which have no value assigned to them yet are
considered 'empty'.
- Initially, all slots are marked as empty.
- Positional arguments are assigned first, followed by keyword
arguments.
- For each positional argument:
o Attempt to bind the argument to the first unfilled
parameter slot. If the slot is not a vararg slot, then
mark the slot as 'filled'.
o If the next unfilled slot is a vararg slot, and it does
not have a name, then it is an error.
o Otherwise, if the next unfilled slot is a vararg slot then
all remaining non-keyword arguments are placed into the
vararg slot.
- For each keyword argument:
o If there is a parameter with the same name as the keyword,
then the argument value is assigned to that parameter slot.
However, if the parameter slot is already filled, then that
is an error.
o Otherwise, if there is a 'keyword dictionary' argument,
the argument is added to the dictionary using the keyword
name as the dictionary key, unless there is already an
entry with that key, in which case it is an error.
o Otherwise, if there is no keyword dictionary, and no
matching named parameter, then it is an error.
- Finally:
o If the vararg slot is not yet filled, assign an empty tuple
as its value.
o For each remaining empty slot: if there is a default value
for that slot, then fill the slot with the default value.
If there is no default value, then it is an error.
In accordance with the current Python implementation, any errors
encountered will be signaled by raising TypeError. (If you want
something different, that's a subject for a different PEP.)
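The rules above can be exercised with a small example function (the names are invented for illustration):

```python
# One parameter of each kind: positional with and without a default,
# a vararg slot, a keyword-only option, and a keyword dictionary.
def f(a, b=10, *rest, opt=None, **extra):
    return (a, b, rest, opt, extra)

# Positional args fill 'a' and 'b'; the leftovers land in 'rest'.
assert f(1, 2, 3, 4) == (1, 2, (3, 4), None, {})

# An unfilled vararg slot receives the empty tuple, and unfilled
# slots with defaults take their default values.
assert f(1) == (1, 10, (), None, {})

# Keywords with no matching named parameter go into 'extra'.
assert f(1, opt='x', color='red') == (1, 10, (), 'x', {'color': 'red'})

# Assigning by keyword to an already-filled slot is an error.
err = False
try:
    f(1, a=2)
except TypeError:
    err = True
assert err
```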
Backwards Compatibility
The function calling behavior specified in this PEP is a superset
of the existing behavior - that is, it is expected that any
existing programs will continue to work.
Copyright
This document has been placed in the public domain.
pep-3103 A Switch/Case Statement
| PEP: | 3103 |
|---|---|
| Title: | A Switch/Case Statement |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | guido at python.org (Guido van Rossum) |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 25-Jun-2006 |
| Python-Version: | 3.0 |
| Post-History: | 26-Jun-2006 |
Contents
Rejection Notice
A quick poll during my keynote presentation at PyCon 2007 shows this proposal has no popular support. I therefore reject it.
Abstract
Python-dev has recently seen a flurry of discussion on adding a switch statement. In this PEP I'm trying to extract my own preferences from the smorgasbord of proposals, discussing alternatives and explaining my choices where I can. I'll also indicate how strongly I feel about alternatives I discuss.
This PEP should be seen as an alternative to PEP 275. My views are somewhat different from that PEP's author, but I'm grateful for the work done in that PEP.
This PEP introduces canonical names for the many variants that have been discussed for different aspects of the syntax and semantics, such as "alternative 1", "school II", "option 3" and so on. Hopefully these names will help the discussion.
Rationale
A common programming idiom is to consider an expression and do different things depending on its value. This is usually done with a chain of if/elif tests; I'll refer to this form as the "if/elif chain". There are two main motivations to want to introduce new syntax for this idiom:
- It is repetitive: the variable and the test operator, usually '==' or 'in', are repeated in each if/elif branch.
- It is inefficient: when an expression matches the last test value (or no test value at all) it is compared to each of the preceding test values.
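For reference, the if/elif chain idiom in question looks like this (the example function is invented for illustration):

```python
# The variable 'color' and the '==' test are repeated in every branch,
# and a non-matching value is compared against each case in turn.
def describe(color):
    if color == 'red':
        return 'warm'
    elif color == 'green':
        return 'cool'
    elif color == 'blue':
        return 'cool'
    else:
        return 'unknown'

assert describe('red') == 'warm'
assert describe('teal') == 'unknown'
```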
Both of these complaints are relatively mild; there isn't a lot of readability or performance to be gained by writing this differently. Yet, some kind of switch statement is found in many languages and it is not unreasonable to expect that its addition to Python will allow us to write up certain code more cleanly and efficiently than before.
There are forms of dispatch that are not suitable for the proposed switch statement; for example, when the number of cases is not statically known, or when it is desirable to place the code for different cases in different classes or files.
Basic Syntax
I'm considering several variants of the syntax first proposed in PEP 275 here. There are lots of other possibilities, but I don't see that they add anything.
I've recently been converted to alternative 1.
I should note that all alternatives here have the "implicit break" property: at the end of the suite for a particular case, the control flow jumps to the end of the whole switch statement. There is no way to pass control from one case to another. This in contrast to C, where an explicit 'break' statement is required to prevent falling through to the next case.
In all alternatives, the else-suite is optional. It is more Pythonic to use 'else' here rather than introducing a new reserved word, 'default', as in C.
Semantics are discussed in the next top-level section.
Alternative 1
This is the preferred form in PEP 275:
    switch EXPR:
        case EXPR:
            SUITE
        case EXPR:
            SUITE
        ...
        else:
            SUITE
The main downside is that the suites where all the action is are indented two levels deep; this can be remedied by indenting the cases "half a level" (e.g. 2 spaces if the general indentation level is 4).
Alternative 2
This is Fredrik Lundh's preferred form; it differs by not indenting the cases:
    switch EXPR:
    case EXPR:
        SUITE
    case EXPR:
        SUITE
    ....
    else:
        SUITE
Some reasons not to choose this include expected difficulties for auto-indenting editors, folding editors, and the like; and confused users. There are no situations currently in Python where a line ending in a colon is followed by an unindented line.
Alternative 3
This is the same as alternative 2 but leaves out the colon after the switch:
    switch EXPR
    case EXPR:
        SUITE
    case EXPR:
        SUITE
    ....
    else:
        SUITE
The hope of this alternative is that it will upset the auto-indent logic of the average Python-aware text editor less. But it looks strange to me.
Alternative 4
This leaves out the 'case' keyword on the basis that it is redundant:
    switch EXPR:
        EXPR:
            SUITE
        EXPR:
            SUITE
        ...
        else:
            SUITE
Unfortunately now we are forced to indent the case expressions, because otherwise (at least in the absence of an 'else' keyword) the parser would have a hard time distinguishing between an unindented case expression (which continues the switch statement) or an unrelated statement that starts like an expression (such as an assignment or a procedure call). The parser is not smart enough to backtrack once it sees the colon. This is my least favorite alternative.
Extended Syntax
There is one additional concern that needs to be addressed syntactically. Often two or more values need to be treated the same. In C, this is done by writing multiple case labels together without any code between them. The "fall through" semantics then mean that these are all handled by the same code. Since the Python switch will not have fall-through semantics (which have yet to find a champion) we need another solution. Here are some alternatives.
Alternative A
Use:
case EXPR:
to match on a single expression; use:
case EXPR, EXPR, ...:
to match on multiple expressions. This is interpreted so that if EXPR is a parenthesized tuple or another expression whose value is a tuple, the switch expression must equal that tuple, not one of its elements. This means that we cannot use a variable to indicate multiple cases. While this is also true in C's switch statement, it is a relatively common occurrence in Python (see for example sre_compile.py).
Alternative B
Use:
case EXPR:
to match on a single expression; use:
case in EXPR_LIST:
to match on multiple expressions. If EXPR_LIST is a single expression, the 'in' forces its interpretation as an iterable (or something supporting __contains__, in a minority semantics alternative). If it is multiple expressions, each of those is considered for a match.
Alternative C
Use:
case EXPR:
to match on a single expression; use:
case EXPR, EXPR, ...:
to match on multiple expressions (as in alternative A); and use:
case *EXPR:
to match on the elements of an expression whose value is an iterable. The latter two cases can be combined, so that the true syntax is more like this:
case [*]EXPR, [*]EXPR, ...:
The * notation is similar to the use of prefix * already in use for variable-length parameter lists and for passing computed argument lists, and often proposed for value-unpacking (e.g. a, b, *c = X as an alternative to (a, b), c = X[:2], X[2:]).
Alternative D
This is a mixture of alternatives B and C; the syntax is like alternative B but instead of the 'in' keyword it uses '*'. This is more limited, but still allows the same flexibility. It uses:
case EXPR:
to match on a single expression and:
case *EXPR:
to match on the elements of an iterable. If one wants to specify multiple matches in one case, one can write this:
case *(EXPR, EXPR, ...):
or perhaps this (although it's a bit strange because the relative priority of '*' and ',' is different than elsewhere):
case * EXPR, EXPR, ...:
Discussion
Alternatives B, C and D are motivated by the desire to specify multiple cases with the same treatment using a variable representing a set (usually a tuple) rather than spelling them out. The motivation for this is usually that if one has several switches over the same set of cases it's a shame to have to spell out all the alternatives each time. An additional motivation is to be able to specify ranges to be matched easily and efficiently, similar to Pascal's "1..1000:" notation. At the same time we want to prevent the kind of mistake that is common in exception handling (and which will be addressed in Python 3000 by changing the syntax of the except clause): writing "case 1, 2:" where "case (1, 2):" was meant, or vice versa.
The case could be made that the need is insufficient for the added complexity; C doesn't have a way to express ranges either, and it's used a lot more than Pascal these days. Also, if a dispatch method based on dict lookup is chosen as the semantics, large ranges could be inefficient (consider range(1, sys.maxint)).
All in all my preferences are (from most to least favorite) B, A, D', C, where D' is D without the third possibility.
Semantics
There are several issues to review before we can choose the right semantics.
If/Elif Chain vs. Dict-based Dispatch
There are several main schools of thought about the switch statement's semantics:
- School I wants to define the switch statement in term of an equivalent if/elif chain (possibly with some optimization thrown in).
- School II prefers to think of it as a dispatch on a precomputed dict. There are different choices for when the precomputation happens.
- There's also school III, which agrees with school I that the definition of a switch statement should be in terms of an equivalent if/elif chain, but concedes to the optimization camp that all expressions involved must be hashable.
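School II's precomputed-dict idea can be sketched as a helper function (make_switch and its argument layout are invented for this sketch; the actual proposal is statement syntax, not a function):

```python
# Build a dispatch dict once, up front: hash() is called on each case
# value exactly one time. Dispatch is then a single dict lookup; an
# unhashable switch value simply lets the TypeError from the lookup
# propagate, as school II prescribes.
def make_switch(cases, default):
    table = {}
    for values, handler in cases:
        for v in values:
            table[v] = handler
    return lambda x: table.get(x, default)(x)

classify = make_switch(
    [((1, 2, 3), lambda x: 'small'),
     ((100, 200), lambda x: 'big')],
    lambda x: 'other')

assert classify(2) == 'small'
assert classify(200) == 'big'
assert classify(42) == 'other'
```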
We need to further separate school I into school Ia and school Ib:
- School Ia has a simple position: a switch statement is translated to an equivalent if/elif chain, and that's that. It should not be linked to optimization at all. That is also my main objection against this school: without any hint of optimization, the switch statement isn't attractive enough to warrant new syntax.
- School Ib has a more complex position: it agrees with school II that optimization is important, and is willing to concede the compiler certain liberties to allow this. (For example, PEP 275 Solution 1.) In particular, hash() of the switch and case expressions may or may not be called (so it should be side-effect-free); and the case expressions may not be evaluated each time as expected by the if/elif chain behavior, so the case expressions should also be side-effect free. My objection to this (elaborated below) is that if either the hash() or the case expressions aren't side-effect-free, optimized and unoptimized code may behave differently.
School II grew out of the realization that optimization of commonly found cases isn't so easy, and that it's better to face this head on. This will become clear below.
The differences between school I (mostly school Ib) and school II are fourfold:
- When optimizing using a dispatch dict, if either the switch expression or the case expressions are unhashable (in which case hash() raises an exception), school Ib requires catching the hash() failure and falling back to an if/elif chain. School II simply lets the exception happen. The problem with catching an exception in hash() as required by school Ib, is that this may hide a genuine bug. A possible way out is to only use a dispatch dict if all case expressions are ints, strings or other built-ins with known good hash behavior, and to only attempt to hash the switch expression if it is also one of those types. Type objects should probably also be supported here. This is the (only) problem that school III addresses.
- When optimizing using a dispatch dict, if the hash() function of any expression involved returns an incorrect value, under school Ib, optimized code will not behave the same as unoptimized code. This is a well-known problem with optimization-related bugs, and wastes lots of developer time. Under school II, in this situation incorrect results are produced at least consistently, which should make debugging a bit easier. The way out proposed for the previous bullet would also help here.
- School Ib doesn't have a good optimization strategy if the case expressions are named constants. The compiler cannot know their values for sure, and it cannot know whether they are truly constant. As a way out, it has been proposed to re-evaluate the expression corresponding to the case once the dict has identified which case should be taken, to verify that the value of the expression didn't change. But strictly speaking, all the case expressions occurring before that case would also have to be checked, in order to preserve the true if/elif chain semantics, thereby completely killing the optimization. Another proposed solution is to have callbacks notifying the dispatch dict of changes in the value of variables or attributes involved in the case expressions. But this is not likely implementable in the general case, and would require many namespaces to bear the burden of supporting such callbacks, which currently don't exist at all.
- Finally, there's a difference of opinion regarding the treatment of duplicate cases (i.e. two or more cases with match expressions that evaluate to the same value). School I wants to treat this the same as an if/elif chain would treat it (i.e. the first match wins and the code for the second match is silently unreachable); school II wants this to be an error at the time the dispatch dict is frozen (so dead code doesn't go undiagnosed).
School I sees trouble in school II's approach of pre-freezing a dispatch dict because it places a new and unusual burden on programmers to understand exactly what kinds of case values are allowed to be frozen and when the case values will be frozen, or they might be surprised by the switch statement's behavior.
School II doesn't believe that school Ia's unoptimized switch is worth the effort, and it sees trouble in school Ib's proposal for optimization, which can cause optimized and unoptimized code to behave differently.
In addition, school II sees little value in allowing cases involving unhashable values; after all if the user expects such values, they can just as easily write an if/elif chain. School II also doesn't believe that it's right to allow dead code due to overlapping cases to occur unflagged, when the dict-based dispatch implementation makes it so easy to trap this.
However, there are some use cases for overlapping/duplicate cases. Suppose you're switching on some OS-specific constants (e.g. exported by the os module or some module like that). You have a case for each. But on some OS, two different constants have the same value (since on that OS they are implemented the same way -- like O_TEXT and O_BINARY on Unix). If duplicate cases are flagged as errors, your switch wouldn't work at all on that OS. It would be much better if you could arrange the cases so that one case has preference over another.
There's also the (more likely) use case where you have a set of cases to be treated the same, but one member of the set must be treated differently. It would be convenient to put the exception in an earlier case and be done with it.
(Yes, it seems a shame not to be able to diagnose dead code due to accidental case duplication. Maybe that's less important, and pychecker can deal with it? After all we don't diagnose duplicate method definitions either.)
This suggests school IIb: like school II but redundant cases must be resolved by choosing the first match. This is trivial to implement when building the dispatch dict (skip keys already present).
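School IIb's rule really is trivial to implement when building the dispatch dict. A minimal sketch (the constant names and colliding values below are hypothetical stand-ins for OS-level constants that happen to coincide on some platform):

```python
def build_dispatch(cases):
    """Build a dispatch dict from (key, handler) pairs.

    Later duplicates are skipped, so the first matching case wins,
    mirroring if/elif semantics (school IIb).
    """
    table = {}
    for key, handler in cases:
        if key not in table:  # skip keys already present
            table[key] = handler
    return table

# Hypothetical constants that collide on this platform:
O_TEXT, O_BINARY = 16384, 16384

table = build_dispatch([
    (O_TEXT, lambda: "text mode"),
    (O_BINARY, lambda: "binary mode"),  # dead here; the first case wins
])

assert table[16384]() == "text mode"
```

The earlier case taking precedence over the later one is exactly the behavior wanted for the overlapping-OS-constants use case above.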
(An alternative would be to introduce new syntax to indicate "okay to have overlapping cases" or "ok if this case is dead code" but I find that overkill.)
Personally, I'm in school II: I believe that the dict-based dispatch is the one true implementation for switch statements and that we should face the limitations up front, so that we can reap maximal benefits. I'm leaning towards school IIb -- duplicate cases should be resolved by the ordering of the cases instead of flagged as errors.
When to Freeze the Dispatch Dict
For the supporters of school II (dict-based dispatch), the next big dividing issue is when to create the dict used for switching. I call this "freezing the dict".
The main problem that makes this interesting is the observation that Python doesn't have named compile-time constants. What is conceptually a constant, such as re.IGNORECASE, is a variable to the compiler, and there's nothing to stop crooked code from modifying its value.
Option 1
The most limiting option is to freeze the dict in the compiler. This would require that the case expressions are all literals or compile-time expressions involving only literals and operators whose semantics are known to the compiler, since with the current state of Python's dynamic semantics and single-module compilation, there is no hope for the compiler to know with sufficient certainty the values of any variables occurring in such expressions. This is widely though not universally considered too restrictive.
Raymond Hettinger is the main advocate of this approach. He proposes a syntax where only a single literal of certain types is allowed as the case expression. It has the advantage of being unambiguous and easy to implement.
My main complaint about this is that by disallowing "named constants" we force programmers to give up good habits. Named constants are introduced in most languages to solve the problem of "magic numbers" occurring in the source code. For example, sys.maxint is a lot more readable than 2147483647. Raymond proposes to use string literals instead of named "enums", observing that the string literal's content can be the name that the constant would otherwise have. Thus, we could write "case 'IGNORECASE':" instead of "case re.IGNORECASE:". However, if there is a spelling error in the string literal, the case will silently be ignored, and who knows when the bug is detected. If there is a spelling error in a NAME, however, the error will be caught as soon as it is evaluated. Also, sometimes the constants are externally defined (e.g. when parsing a file format like JPEG) and we can't easily choose appropriate string values. Using an explicit mapping dict sounds like a poor hack.
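The failure mode described above is easy to demonstrate with an ordinary dict standing in for the proposed switch; this is only an illustration of the argument, not proposed syntax:

```python
import re

# Dispatch keyed on string literals naming the constants:
handlers = {
    'IGNORECASE': lambda: "case-insensitive",
    'MULTILINE': lambda: "multi-line",
}

def describe(flag_name):
    # A misspelled key such as 'IGNORECSE' simply misses and falls
    # through to the default -- no error is ever raised.
    return handlers.get(flag_name, lambda: "unknown flag")()

assert describe('IGNORECASE') == "case-insensitive"
assert describe('IGNORECSE') == "unknown flag"   # typo goes undiagnosed

# By contrast, a misspelled NAME fails loudly as soon as it is evaluated:
try:
    re.IGNORECSE
except AttributeError:
    pass
```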
Option 2
The oldest proposal to deal with this is to freeze the dispatch dict the first time the switch is executed. At this point we can assume that all the named "constants" (constant in the programmer's mind, though not to the compiler) used as case expressions are defined -- otherwise an if/elif chain would have little chance of success either. Assuming the switch will be executed many times, doing some extra work the first time pays back quickly by very quick dispatch times later.
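For illustration, the first-use rule can be emulated in present-day Python with a closure that freezes its table lazily; `make_switch` and `cases` are hypothetical names, and this sketch ignores the thread-safety concerns raised later in this section:

```python
def make_switch(case_pairs_thunk):
    """Return a dispatcher that freezes its dict on first use (option 2)."""
    frozen = None
    def dispatch(value):
        nonlocal frozen
        if frozen is None:               # first execution: freeze the dict
            frozen = dict(case_pairs_thunk())
        return frozen[value]()
    return dispatch

calls = []
def cases():
    calls.append(1)                      # side effect incurred exactly once
    return [(1, lambda: "one"), (2, lambda: "two")]

switch = make_switch(cases)
assert switch(1) == "one"
assert switch(2) == "two"
assert len(calls) == 1                   # case expressions evaluated only once
```

The extra work happens only on the first dispatch; every later dispatch is a single dict lookup.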
An objection to this option is that there is no obvious object where the dispatch dict can be stored. It can't be stored on the code object, which is supposed to be immutable; it can't be stored on the function object, since many function objects may be created for the same function (e.g. for nested functions). In practice, I'm sure that something can be found; it could be stored in a section of the code object that's not considered when comparing two code objects or when pickling or marshalling a code object; or all switches could be stored in a dict indexed by weak references to code objects. The solution should also be careful not to leak switch dicts between multiple interpreters.
Another objection is that the first-use rule allows obfuscated code like this:
def foo(x, y):
    switch x:
    case y:
        print 42
To the untrained eye (not familiar with Python) this code would be equivalent to this:
def foo(x, y):
    if x == y:
        print 42
but that's not what it does (unless it is always called with the same value as the second argument). This has been addressed by suggesting that the case expressions should not be allowed to reference local variables, but this is somewhat arbitrary.
A final objection is that in a multi-threaded application, the first-use rule requires intricate locking in order to guarantee the correct semantics. (The first-use rule suggests a promise that side effects of case expressions are incurred exactly once.) This may be as tricky as the import lock has proved to be, since the lock has to be held while all the case expressions are being evaluated.
Option 3
A proposal that has been winning support (including mine) is to freeze a switch's dict when the innermost function containing it is defined. The switch dict is stored on the function object, just as parameter defaults are, and in fact the case expressions are evaluated at the same time and in the same scope as the parameter defaults (i.e. in the scope containing the function definition).
This option has the advantage of avoiding many of the finesses needed to make option 2 work: there's no need for locking, no worry about immutable code objects or multiple interpreters. It also provides a clear explanation for why locals can't be referenced in case expressions.
This option works just as well for situations where one would typically use a switch; case expressions involving imported or global named constants work exactly the same way as in option 2, as long as they are imported or defined before the function definition is encountered.
A downside however is that the dispatch dict for a switch inside a nested function must be recomputed each time the nested function is defined. For certain "functional" styles of programming this may make switch unattractive in nested functions. (Unless all case expressions are compile-time constants; then the compiler is of course free to optimize away the switch freezing code and make the dispatch table part of the code object.)
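Option 3's semantics can be approximated today by abusing a default argument, which is likewise evaluated once, in the enclosing scope, when the def statement executes (a hypothetical emulation, not proposed syntax):

```python
RED, GREEN = 1, 2   # hypothetical named constants

def describe(color,
             _table={RED: "warm", GREEN: "cool"}):
    # _table is frozen at def time, exactly like a parameter default:
    # evaluated once, in the scope containing the function definition.
    return _table[color]

assert describe(RED) == "warm"

RED = 99            # later rebinding has no effect on the frozen dict
assert describe(1) == "warm"
```

Note how the second assertion shows both the benefit (fast, stable dispatch) and the cost (the switch no longer tracks the rebound name) of def-time freezing.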
Another downside is that under this option, there's no clear moment when the dispatch dict is frozen for a switch that doesn't occur inside a function. There are a few pragmatic choices for how to treat a switch outside a function:
- (a) Disallow it.
- (b) Translate it into an if/elif chain.
- (c) Allow only compile-time constant expressions.
- (d) Compute the dispatch dict each time the switch is reached.
- (e) Like (b) but tests that all expressions evaluated are hashable.
Of these, (a) seems too restrictive: it's uniformly worse than (c); and (d) has poor performance for little or no benefits compared to (b). It doesn't make sense to have a performance-critical inner loop at the module level, as all local variable references are slow there; hence (b) is my (weak) favorite. Perhaps I should favor (e), which attempts to prevent atypical use of a switch; examples that work interactively but not in a function are annoying. In the end I don't think this issue is all that important (except it must be resolved somehow) and am willing to leave it up to whoever ends up implementing it.
When a switch occurs in a class but not in a function, we can freeze the dispatch dict at the same time the temporary function object representing the class body is created. This means the case expressions can reference module globals but not class variables. Alternatively, if we choose (b) above, we could choose this implementation inside a class definition as well.
Option 4
There are a number of proposals to add a construct to the language that makes the concept of a value pre-computed at function definition time generally available, without tying it either to parameter default values or case expressions. Some keywords proposed include 'const', 'static', 'only' or 'cached'. The associated syntax and semantics vary.
These proposals are out of scope for this PEP, except to suggest that if such a proposal is accepted, there are two ways for the switch to benefit: we could require case expressions to be either compile-time constants or pre-computed values; or we could make pre-computed values the default (and only) evaluation mode for case expressions. The latter would be my preference, since I don't see a use for more dynamic case expressions that isn't addressed adequately by writing an explicit if/elif chain.
Conclusion
It is too early to decide. I'd like to see at least one completed proposal for pre-computed values before deciding. In the meantime, Python is fine without a switch statement, and perhaps those who claim it would be a mistake to add one are right.
Copyright
This document has been placed in the public domain.
pep-3104 Access to Names in Outer Scopes
| PEP: | 3104 |
|---|---|
| Title: | Access to Names in Outer Scopes |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Ka-Ping Yee <ping at zesty.ca> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 12-Oct-2006 |
| Python-Version: | 3.0 |
| Post-History: |
Contents
Abstract
In most languages that support nested scopes, code can refer to or rebind (assign to) any name in the nearest enclosing scope. Currently, Python code can refer to a name in any enclosing scope, but it can only rebind names in two scopes: the local scope (by simple assignment) or the module-global scope (using a global declaration).
This limitation has been raised many times on the Python-Dev mailing list and elsewhere, and has led to extended discussion and many proposals for ways to remove this limitation. This PEP summarizes the various alternatives that have been suggested, together with advantages and disadvantages that have been mentioned for each.
Rationale
Before version 2.1, Python's treatment of scopes resembled that of standard C: within a file there were only two levels of scope, global and local. In C, this is a natural consequence of the fact that function definitions cannot be nested. But in Python, though functions are usually defined at the top level, a function definition can be executed anywhere. This gave Python the syntactic appearance of nested scoping without the semantics, and yielded inconsistencies that were surprising to some programmers -- for example, a recursive function that worked at the top level would cease to work when moved inside another function, because the recursive function's own name would no longer be visible in its body's scope. This violates the intuition that a function should behave consistently when placed in different contexts. Here's an example:
def enclosing_function():
    def factorial(n):
        if n < 2:
            return 1
        return n * factorial(n - 1) # fails with NameError
    print factorial(5)
Python 2.1 moved closer to static nested scoping by making visible the names bound in all enclosing scopes (see PEP 227). This change makes the above code example work as expected. However, because any assignment to a name implicitly declares that name to be local, it is impossible to rebind a name in an outer scope (except when a global declaration forces the name to be global). Thus, the following code, intended to display a number that can be incremented and decremented by clicking buttons, doesn't work as someone familiar with lexical scoping might expect:
def make_scoreboard(frame, score=0):
    label = Label(frame)
    label.pack()
    for i in [-10, -1, 1, 10]:
        def increment(step=i):
            score = score + step # fails with UnboundLocalError
            label['text'] = score
        button = Button(frame, text='%+d' % i, command=increment)
        button.pack()
    return label
Python syntax doesn't provide a way to indicate that the name score mentioned in increment refers to the variable score bound in make_scoreboard, not a local variable in increment. Users and developers of Python have expressed an interest in removing this limitation so that Python can have the full flexibility of the Algol-style scoping model that is now standard in many programming languages, including JavaScript, Perl, Ruby, Scheme, Smalltalk, C with GNU extensions, and C# 2.0.
It has been argued that such a feature isn't necessary, because a rebindable outer variable can be simulated by wrapping it in a mutable object:
class Namespace:
    pass

def make_scoreboard(frame, score=0):
    ns = Namespace()
    ns.score = score
    label = Label(frame)
    label.pack()
    for i in [-10, -1, 1, 10]:
        def increment(step=i):
            ns.score = ns.score + step
            label['text'] = ns.score
        button = Button(frame, text='%+d' % i, command=increment)
        button.pack()
    return label
However, this workaround only highlights the shortcomings of existing scopes: the purpose of a function is to encapsulate code in its own namespace, so it seems unfortunate that the programmer should have to create additional namespaces to make up for missing functionality in the existing local scopes, and then have to decide whether each name should reside in the real scope or the simulated scope.
Another common objection is that the desired functionality can be written as a class instead, albeit somewhat more verbosely. One rebuttal to this objection is that the existence of a different implementation style is not a reason to leave a supported programming construct (nested scopes) functionally incomplete. Python is sometimes called a "multi-paradigm language" because it derives so much strength, practical flexibility, and pedagogical power from its support and graceful integration of multiple programming paradigms.
A proposal for scoping syntax appeared on Python-Dev as far back as 1994 [1], long before PEP 227's support for nested scopes was adopted. At the time, Guido's response was:
This is dangerously close to introducing CSNS [classic static nested scopes]. If you were to do so, your proposed semantics of scoped seem allright. I still think there is not enough need for CSNS to warrant this kind of construct ...
After PEP 227, the "outer name rebinding discussion" has reappeared on Python-Dev enough times that it has become a familiar event, having recurred in its present form since at least 2003 [2]. Although none of the language changes proposed in these discussions have yet been adopted, Guido has acknowledged that a language change is worth considering [12].
Other Languages
To provide some background, this section describes how some other languages handle nested scopes and rebinding.
JavaScript, Perl, Scheme, Smalltalk, GNU C, C# 2.0
These languages use variable declarations to indicate scope. In JavaScript, a lexically scoped variable is declared with the var keyword; undeclared variable names are assumed to be global. In Perl, a lexically scoped variable is declared with the my keyword; undeclared variable names are assumed to be global. In Scheme, all variables must be declared (with define or let, or as formal parameters). In Smalltalk, any block can begin by declaring a list of local variable names between vertical bars. C and C# require type declarations for all variables. For all these cases, the variable belongs to the scope containing the declaration.
Ruby (as of 1.8)
Ruby is an instructive example because it appears to be the only other currently popular language that, like Python, tries to support statically nested scopes without requiring variable declarations, and thus has to come up with an unusual solution. Functions in Ruby can contain other function definitions, and they can also contain code blocks enclosed in curly braces. Blocks have access to outer variables, but nested functions do not. Within a block, an assignment to a name implies a declaration of a local variable only if it would not shadow a name already bound in an outer scope; otherwise assignment is interpreted as rebinding of the outer name. Ruby's scoping syntax and rules have also been debated at great length, and changes seem likely in Ruby 2.0 [28].
Overview of Proposals
There have been many different proposals on Python-Dev for ways to rebind names in outer scopes. They all fall into two categories: new syntax in the scope where the name is bound, or new syntax in the scope where the name is used.
New Syntax in the Binding (Outer) Scope
Scope Override Declaration
The proposals in this category all suggest a new kind of declaration statement similar to JavaScript's var. A few possible keywords have been proposed for this purpose:
In all these proposals, a declaration such as var x in a particular scope S would cause all references to x in scopes nested within S to refer to the x bound in S.
The primary objection to this category of proposals is that the meaning of a function definition would become context-sensitive. Moving a function definition inside some other block could cause any of the local name references in the function to become nonlocal, due to declarations in the enclosing block. For blocks in Ruby 1.8, this is actually the case; in the following example, the two setters have different effects even though they look identical:
setter1 = proc { | x | y = x } # y is local here
y = 13
setter2 = proc { | x | y = x } # y is nonlocal here
setter1.call(99)
puts y # prints 13
setter2.call(77)
puts y # prints 77
Note that although this proposal resembles declarations in JavaScript and Perl, the effect on the language is different because in those languages undeclared variables are global by default, whereas in Python undeclared variables are local by default. Thus, moving a function inside some other block in JavaScript or Perl can only reduce the scope of a previously global name reference, whereas in Python with this proposal, it could expand the scope of a previously local name reference.
Required Variable Declaration
A more radical proposal [21] suggests removing Python's scope-guessing convention altogether and requiring that all names be declared in the scope where they are to be bound, much like Scheme. With this proposal, var x = 3 would both declare x to belong to the local scope and bind it, whereas x = 3 would rebind the existing visible x. In a context without an enclosing scope containing a var x declaration, the statement x = 3 would be statically determined to be illegal.
This proposal yields a simple and consistent model, but it would be incompatible with all existing Python code.
New Syntax in the Referring (Inner) Scope
There are three kinds of proposals in this category.
Outer Reference Expression
This type of proposal suggests a new way of referring to a variable in an outer scope when using the variable in an expression. One syntax that has been suggested for this is .x [7], which would refer to x without creating a local binding for it. A concern with this proposal is that in many contexts x and .x could be used interchangeably, which would confuse the reader. A closely related idea is to use multiple dots to specify the number of scope levels to ascend [8], but most consider this too error-prone [17].
Rebinding Operator
This proposal suggests a new assignment-like operator that rebinds a name without declaring the name to be local [2]. Whereas the statement x = 3 both declares x a local variable and binds it to 3, the statement x := 3 would change the existing binding of x without declaring it local.
This is a simple solution, but according to PEP 3099 it has been rejected (perhaps because it would be too easy to miss or to confuse with =).
Scope Override Declaration
The proposals in this category suggest a new kind of declaration statement in the inner scope that prevents a name from becoming local. This statement would be similar in nature to the global statement, but instead of making the name refer to a binding in the top module-level scope, it would make the name refer to the binding in the nearest enclosing scope.
This approach is attractive due to its parallel with a familiar Python construct, and because it retains context-independence for function definitions.
This approach also has advantages from a security and debugging perspective. The resulting Python would not only match the functionality of other nested-scope languages but would do so with a syntax that is arguably even better for defensive programming. In most other languages, a declaration contracts the scope of an existing name, so inadvertently omitting the declaration could yield farther-reaching (i.e. more dangerous) effects than expected. In Python with this proposal, the extra effort of adding the declaration is aligned with the increased risk of non-local effects (i.e. the path of least resistance is the safer path).
Many spellings have been suggested for such a declaration:
- scoped x [1]
- global x in f [3] (explicitly specify which scope)
- free x [5]
- outer x [6]
- use x [9]
- global x [10] (change the meaning of global)
- nonlocal x [11]
- global x outer [18]
- global in x [18]
- not global x [18]
- extern x [20]
- ref x [22]
- refer x [22]
- share x [22]
- sharing x [22]
- common x [22]
- using x [22]
- borrow x [22]
- reuse x [23]
- scope f x [25] (explicitly specify which scope)
The most commonly discussed choices appear to be outer, global, and nonlocal. outer is already used as both a variable name and an attribute name in the standard library. The word global has a conflicting meaning, because "global variable" is generally understood to mean a variable with top-level scope [27]. In C, the keyword extern means that a name refers to a variable in a different compilation unit. While nonlocal is a bit long and less pleasant-sounding than some of the other options, it does have precisely the correct meaning: it declares a name not local.
Proposed Solution
The solution proposed by this PEP is to add a scope override declaration in the referring (inner) scope. Guido has expressed a preference for this category of solution on Python-Dev [14] and has shown approval for nonlocal as the keyword [19].
The proposed declaration:
nonlocal x
prevents x from becoming a local name in the current scope. All occurrences of x in the current scope will refer to the x bound in an outer enclosing scope. As with global, multiple names are permitted:
nonlocal x, y, z
If there is no pre-existing binding in an enclosing scope, the compiler raises a SyntaxError. (It may be a bit of a stretch to call this a syntax error, but so far SyntaxError is used for all compile-time errors, including, for example, __future__ import with an unknown feature name.) Guido has said that this kind of declaration in the absence of an outer binding should be considered an error [16].
If a nonlocal declaration collides with the name of a formal parameter in the local scope, the compiler raises a SyntaxError.
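As adopted in Python 3, the declaration behaves as described; a small self-contained example (the counter itself is just an illustration):

```python
def make_counter(start=0):
    count = start
    def increment(step=1):
        nonlocal count      # rebind count in make_counter's scope
        count += step
        return count
    return increment

inc = make_counter(10)
assert inc() == 11
assert inc(5) == 16
```

Without the nonlocal line, the assignment to count would declare it local to increment and raise UnboundLocalError, exactly as in the make_scoreboard example above.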
A shorthand form is also permitted, in which nonlocal is prepended to an assignment or augmented assignment:
nonlocal x = 3
The above has exactly the same meaning as nonlocal x; x = 3. (Guido supports a similar form of the global statement [24].)
On the left side of the shorthand form, only identifiers are allowed, not target expressions like x[0]. Otherwise, all forms of assignment are allowed. The proposed grammar of the nonlocal statement is:
nonlocal_stmt ::=
    "nonlocal" identifier ("," identifier)*
        ["=" (target_list "=")+ expression_list]
    | "nonlocal" identifier augop expression_list
The rationale for allowing all these forms of assignment is that it simplifies understanding of the nonlocal statement. Separating the shorthand form into a declaration and an assignment is sufficient to understand what it means and whether it is valid.
Backward Compatibility
This PEP targets Python 3000, as suggested by Guido [19]. However, others have noted that some options considered in this PEP may be small enough changes to be feasible in Python 2.x [26], in which case this PEP could possibly be moved to be a 2.x series PEP.
As a (very rough) measure of the impact of introducing a new keyword, here is the number of times that some of the proposed keywords appear as identifiers in the standard library, according to a scan of the Python SVN repository on November 5, 2006:
| Keyword | Occurrences |
|---|---|
| nonlocal | 0 |
| use | 2 |
| using | 3 |
| reuse | 4 |
| free | 8 |
| outer | 147 |
global appears 214 times as an existing keyword. As a measure of the impact of using global as the outer-scope keyword, there are 18 files in the standard library that would break as a result of such a change (because a function declares a variable global before that variable has been introduced in the global scope):
- cgi.py
- dummy_thread.py
- mhlib.py
- mimetypes.py
- idlelib/PyShell.py
- idlelib/run.py
- msilib/__init__.py
- test/inspect_fodder.py
- test/test_compiler.py
- test/test_decimal.py
- test/test_descr.py
- test/test_dummy_threading.py
- test/test_fileinput.py
- test/test_global.py (not counted: this tests the keyword itself)
- test/test_grammar.py (not counted: this tests the keyword itself)
- test/test_itertools.py
- test/test_multifile.py
- test/test_scope.py (not counted: this tests the keyword itself)
- test/test_threaded_import.py
- test/test_threadsignals.py
- test/test_warnings.py
References
| [1] | (1, 2) Scoping (was Re: Lambda binding solved?) (Rafael Bracho) http://www.python.org/search/hypermail/python-1994q1/0301.html |
| [2] | (1, 2) Extended Function syntax (Just van Rossum) http://mail.python.org/pipermail/python-dev/2003-February/032764.html |
| [3] | Closure semantics (Guido van Rossum) http://mail.python.org/pipermail/python-dev/2003-October/039214.html |
| [4] | (1, 2) Better Control of Nested Lexical Scopes (Almann T. Goo) http://mail.python.org/pipermail/python-dev/2006-February/061568.html |
| [5] | PEP for Better Control of Nested Lexical Scopes (Jeremy Hylton) http://mail.python.org/pipermail/python-dev/2006-February/061602.html |
| [6] | PEP for Better Control of Nested Lexical Scopes (Almann T. Goo) http://mail.python.org/pipermail/python-dev/2006-February/061603.html |
| [7] | Using and binding relative names (Phillip J. Eby) http://mail.python.org/pipermail/python-dev/2006-February/061636.html |
| [8] | Using and binding relative names (Steven Bethard) http://mail.python.org/pipermail/python-dev/2006-February/061749.html |
| [9] | (1, 2) Lexical scoping in Python 3k (Ka-Ping Yee) http://mail.python.org/pipermail/python-dev/2006-July/066862.html |
| [10] | Lexical scoping in Python 3k (Greg Ewing) http://mail.python.org/pipermail/python-dev/2006-July/066889.html |
| [11] | Lexical scoping in Python 3k (Ka-Ping Yee) http://mail.python.org/pipermail/python-dev/2006-July/066942.html |
| [12] | Lexical scoping in Python 3k (Guido van Rossum) http://mail.python.org/pipermail/python-dev/2006-July/066950.html |
| [13] | Explicit Lexical Scoping (pre-PEP?) (Talin) http://mail.python.org/pipermail/python-dev/2006-July/066978.html |
| [14] | Explicit Lexical Scoping (pre-PEP?) (Guido van Rossum) http://mail.python.org/pipermail/python-dev/2006-July/066991.html |
| [15] | Explicit Lexical Scoping (pre-PEP?) (Guido van Rossum) http://mail.python.org/pipermail/python-dev/2006-July/066995.html |
| [16] | Lexical scoping in Python 3k (Guido van Rossum) http://mail.python.org/pipermail/python-dev/2006-July/066968.html |
| [17] | Explicit Lexical Scoping (pre-PEP?) (Guido van Rossum) http://mail.python.org/pipermail/python-dev/2006-July/067004.html |
| [18] | (1, 2, 3) Explicit Lexical Scoping (pre-PEP?) (Andrew Clover) http://mail.python.org/pipermail/python-dev/2006-July/067007.html |
| [19] | (1, 2) Explicit Lexical Scoping (pre-PEP?) (Guido van Rossum) http://mail.python.org/pipermail/python-dev/2006-July/067067.html |
| [20] | Explicit Lexical Scoping (pre-PEP?) (Matthew Barnes) http://mail.python.org/pipermail/python-dev/2006-July/067221.html |
| [21] | Sky pie: a "var" keyword (a thread started by Neil Toronto) http://mail.python.org/pipermail/python-3000/2006-October/003968.html |
| [22] | (1, 2, 3, 4, 5, 6, 7) Alternatives to 'outer' (Talin) http://mail.python.org/pipermail/python-3000/2006-October/004021.html |
| [23] | Alternatives to 'outer' (Jim Jewett) http://mail.python.org/pipermail/python-3000/2006-November/004153.html |
| [24] | Draft PEP for outer scopes (Guido van Rossum) http://mail.python.org/pipermail/python-3000/2006-November/004166.html |
| [25] | Draft PEP for outer scopes (Talin) http://mail.python.org/pipermail/python-3000/2006-November/004190.html |
| [26] | Draft PEP for outer scopes (Nick Coghlan) http://mail.python.org/pipermail/python-3000/2006-November/004237.html |
| [27] | Global variable (version 2006-11-01T01:23:16) http://en.wikipedia.org/wiki/Global_variable |
| [28] | Ruby 2.0 block local variable http://redhanded.hobix.com/inspect/ruby20BlockLocalVariable.html |
Acknowledgements
The ideas and proposals mentioned in this PEP are gleaned from countless Python-Dev postings. Thanks to Jim Jewett, Mike Orr, Jason Orendorff, and Christian Tanzer for suggesting specific edits to this PEP.
Copyright
This document has been placed in the public domain.
pep-3105 Make print a function
| PEP: | 3105 |
|---|---|
| Title: | Make print a function |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Georg Brandl <georg at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 19-Nov-2006 |
| Python-Version: | 3.0 |
| Post-History: |
Contents
Abstract
The title says it all -- this PEP proposes a new print() builtin that replaces the print statement and suggests a specific signature for the new function.
Rationale
The print statement has long appeared on lists of dubious language features that are to be removed in Python 3000, such as Guido's "Python Regrets" presentation [1]. As such, the objective of this PEP is not new, though it may prove hotly disputed among Python developers.
The following arguments for a print() function are distilled from a python-3000 message by Guido himself [2]:
- print is the only application-level functionality that has a statement dedicated to it. Within Python's world, syntax is generally used as a last resort, when something can't be done without help from the compiler. Print doesn't qualify for such an exception.
- At some point in application development one quite often feels the need to replace print output with something more sophisticated, like logging calls or calls into some other I/O library. With a print() function, this is a straightforward string replacement; today it is a mess, requiring the addition of all those parentheses and possibly the conversion of >>stream style syntax.
- Having special syntax for print puts up a much larger barrier for evolution, e.g. a hypothetical new printf() function is not too far fetched when it will coexist with a print() function.
- There's no easy way to convert print statements into another call if one needs a different separator (not spaces, or none at all). Also, there's no convenient way to print objects with some separator other than a space.
- If print() is a function, it would be much easier to replace it within one module (just def print(*args):...) or even throughout a program (e.g. by putting a different function in __builtin__.print). As it is, one can do this by writing a class with a write() method and assigning that to sys.stdout -- that's not bad, but definitely a much larger conceptual leap, and it works at a different level than print.
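In released Python 3 this replacement is indeed trivial. A minimal sketch (the names log_print and messages are illustrative, not part of the proposal):

```python
import builtins

messages = []

def log_print(*args, sep=" ", end="\n", file=None):
    # Capture the would-be output instead of writing it to stdout.
    messages.append(sep.join(map(str, args)) + end)

# Replace print throughout a program by rebinding the builtin:
original = builtins.print
builtins.print = log_print
print("hello", "world")
builtins.print = original
```

After this runs, messages holds "hello world\n" and nothing was written to stdout, illustrating the "one-line replacement throughout a program" argument above.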
Specification
The signature for print(), taken from various mailings and recently posted on the python-3000 list [3] is:
def print(*args, sep=' ', end='\n', file=None)
A call like:
print(a, b, c, file=sys.stderr)
will be equivalent to today's:
print >>sys.stderr, a, b, c
while the optional sep and end arguments specify what is printed between and after the arguments, respectively.
The softspace feature (a semi-secret attribute on files currently used to tell print whether to insert a space before the first item) will be removed. Therefore, there will not be a direct translation for today's:
print "a",
print
which will not print a space between the "a" and the newline.
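The keyword arguments in the proposed signature cover each of the old statement's special forms. A sketch in Python 3 syntax, writing to a StringIO so the output can be inspected:

```python
import io

buf = io.StringIO()

# Old: print >>stream, a, b, c  ->  the file keyword argument
print("a", "b", "c", file=buf)      # writes "a b c\n"

# Old trailing comma (suppress the newline): print "a",
print("a", end="", file=buf)        # writes "a" with no newline

# A custom separator, which the old statement could not express:
print(1, 2, 3, sep=", ", file=buf)  # writes "1, 2, 3\n"
```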
Backwards Compatibility
The changes proposed in this PEP will render most of today's print statements invalid. Only those which incidentally feature parentheses around all of their arguments will continue to be valid Python syntax in version 3.0, and of those, only the ones printing a single parenthesized value will continue to do the same thing. For example, in 2.x:
>>> print ("Hello")
Hello
>>> print ("Hello", "world")
('Hello', 'world')
whereas in 3.0:
>>> print ("Hello")
Hello
>>> print ("Hello", "world")
Hello world
Luckily, as print is a statement in Python 2, it can be detected and replaced reliably and unambiguously by an automated tool, so there should be no major porting problems (provided someone writes the mentioned tool).
Implementation
The proposed changes were implemented in the Python 3000 branch in Subversion revisions 53685 to 53704. Most of the legacy code in the library has been converted too, but it is an ongoing effort to catch every print statement that may be left in the distribution.
References
| [1] | http://legacy.python.org/doc/essays/ppt/regrets/PythonRegrets.pdf |
| [2] | Replacement for print in Python 3.0 (Guido van Rossum) http://mail.python.org/pipermail/python-dev/2005-September/056154.html |
| [3] | print() parameters in py3k (Guido van Rossum) http://mail.python.org/pipermail/python-3000/2006-November/004485.html |
Copyright
This document has been placed in the public domain.
pep-3106 Revamping dict.keys(), .values() and .items()
| PEP: | 3106 |
|---|---|
| Title: | Revamping dict.keys(), .values() and .items() |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Guido van Rossum |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 19-Dec-2006 |
| Post-History: |
Abstract
This PEP proposes to change the .keys(), .values() and .items() methods of the built-in dict type to return a set-like or unordered container object whose contents are derived from the underlying dictionary rather than a list which is a copy of the keys, etc.; and to remove the .iterkeys(), .itervalues() and .iteritems() methods.
The approach is inspired by that taken in the Java Collections Framework [1].
Introduction
It has long been the plan to change the .keys(), .values() and .items() methods of the built-in dict type to return a more lightweight object than a list, and to get rid of .iterkeys(), .itervalues() and .iteritems(). The idea is that code that currently (in 2.x) reads:
for k, v in d.iteritems(): ...
should be rewritten as:
for k, v in d.items(): ...
(and similarly for .itervalues() and .iterkeys(), except the latter is redundant since we can write that loop as for k in d.)
Code that currently reads:
a = d.keys() # assume we really want a list here
(etc.) should be rewritten as
a = list(d.keys())
There are (at least) two ways to accomplish this. The original plan was to simply let .keys(), .values() and .items() return an iterator, i.e. exactly what iterkeys(), itervalues() and iteritems() return in Python 2.x. However, the Java Collections Framework [1] suggests that a better solution is possible: the methods return objects with set behavior (for .keys() and .items()) or multiset (== bag) behavior (for .values()) that do not contain copies of the keys, values or items, but rather reference the underlying dict and pull their values out of the dict as needed.
The advantage of this approach is that one can still write code like this:
a = d.items()
for k, v in a: ...
# And later, again:
for k, v in a: ...
Effectively, iter(d.keys()) (etc.) in Python 3.0 will do what d.iterkeys() (etc.) does in Python 2.x; but in most contexts we don't have to write the iter() call because it is implied by a for-loop.
The objects returned by the .keys() and .items() methods behave like sets. The object returned by the values() method behaves like a much simpler unordered collection -- it cannot be a set because duplicate values are possible.
Because of the set behavior, it will be possible to check whether two dicts have the same keys by simply testing:
if a.keys() == b.keys(): ...
and similarly for .items().
These operations are thread-safe only to the extent that using them in a thread-unsafe way may cause an exception but will not cause corruption of the internal representation.
As in Python 2.x, mutating a dict while iterating over it using an iterator has an undefined effect and will in most cases raise a RuntimeError exception. (This is similar to the guarantees made by the Java Collections Framework.)
The objects returned by .keys() and .items() are fully interoperable with instances of the built-in set and frozenset types; for example:
set(d.keys()) == d.keys()
is guaranteed to be True (except when d is being modified simultaneously by another thread).
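These semantics hold in released Python 3, where the view objects are exposed as dict_keys, dict_values and dict_items. A quick sketch:

```python
a = {"x": 1, "y": 2}
b = {"y": 20, "z": 30}

# Views are live: they reflect later mutation of the dict.
keys = a.keys()
a["w"] = 0          # "w" now appears in the previously obtained view

# Set operations work between views and between views and real sets.
common = a.keys() & b.keys()   # {"y"}
same = set(a.keys()) == a.keys()  # True, as guaranteed above
```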
Specification
I'm using pseudo-code to specify the semantics:
class dict:

    # Omitting all other dict methods for brevity.
    # The .iterkeys(), .itervalues() and .iteritems() methods
    # will be removed.

    def keys(self):
        return d_keys(self)

    def items(self):
        return d_items(self)

    def values(self):
        return d_values(self)

class d_keys:

    def __init__(self, d):
        self.__d = d

    def __len__(self):
        return len(self.__d)

    def __contains__(self, key):
        return key in self.__d

    def __iter__(self):
        for key in self.__d:
            yield key

    # The following operations should be implemented to be
    # compatible with sets; this can be done by exploiting
    # the above primitive operations:
    #
    #   <, <=, ==, !=, >=, >  (returning a bool)
    #   &, |, ^, -            (returning a new, real set object)
    #
    # as well as their method counterparts (.union(), etc.).
    #
    # To specify the semantics, we can specify x == y as:
    #
    #   set(x) == set(y)   if both x and y are d_keys instances
    #   set(x) == y        if x is a d_keys instance
    #   x == set(y)        if y is a d_keys instance
    #
    # and so on for all other operations.

class d_items:

    def __init__(self, d):
        self.__d = d

    def __len__(self):
        return len(self.__d)

    def __contains__(self, (key, value)):
        return key in self.__d and self.__d[key] == value

    def __iter__(self):
        for key in self.__d:
            yield key, self.__d[key]

    # As well as the set operations mentioned for d_keys above.
    # However the specifications suggested there will not work if
    # the values aren't hashable. Fortunately, the operations can
    # still be implemented efficiently. For example, this is how
    # intersection can be specified:

    def __and__(self, other):
        if isinstance(other, (set, frozenset, d_keys)):
            result = set()
            for item in other:
                if item in self:
                    result.add(item)
            return result
        if not isinstance(other, d_items):
            return NotImplemented
        d = {}
        if len(other) < len(self):
            self, other = other, self
        for item in self:
            if item in other:
                key, value = item
                d[key] = value
        return d.items()

    # And here is equality:

    def __eq__(self, other):
        if isinstance(other, (set, frozenset, d_keys)):
            if len(self) != len(other):
                return False
            for item in other:
                if item not in self:
                    return False
            return True
        if not isinstance(other, d_items):
            return NotImplemented
        # XXX We could also just compare the underlying dicts...
        if len(self) != len(other):
            return False
        for item in self:
            if item not in other:
                return False
        return True

    def __ne__(self, other):
        # XXX Perhaps object.__ne__() should be defined this way.
        result = self.__eq__(other)
        if result is not NotImplemented:
            result = not result
        return result

class d_values:

    def __init__(self, d):
        self.__d = d

    def __len__(self):
        return len(self.__d)

    def __contains__(self, value):
        # This is slow, and it's what "x in y" uses as a fallback
        # if __contains__ is not defined; but I'd rather make it
        # explicit that it is supported.
        for v in self:
            if v == value:
                return True
        return False

    def __iter__(self):
        for key in self.__d:
            yield self.__d[key]

    def __eq__(self, other):
        if not isinstance(other, d_values):
            return NotImplemented
        if len(self) != len(other):
            return False
        # XXX Sometimes this could be optimized, but these are the
        # semantics: we can't depend on the values to be hashable
        # or comparable.
        olist = list(other)
        for x in self:
            try:
                olist.remove(x)
            except ValueError:
                return False
        assert olist == []
        return True

    def __ne__(self, other):
        result = self.__eq__(other)
        if result is not NotImplemented:
            result = not result
        return result
Notes:
The view objects are not directly mutable, but don't implement __hash__(); their value can change if the underlying dict is mutated.
The only requirements on the underlying dict are that it implements __getitem__(), __contains__(), __iter__(), and __len__().
We don't implement .copy() -- the presence of a .copy() method suggests that the copy has the same type as the original, but that's not feasible without copying the underlying dict. If you want a copy of a specific type, like list or set, you can just pass one of the above to the list() or set() constructor.
The specification implies that the order in which items are returned by .keys(), .values() and .items() is the same (just as it was in Python 2.x), because the order is all derived from the dict iterator (which is presumably arbitrary but stable as long as a dict isn't modified). This can be expressed by the following invariant:
list(d.items()) == list(zip(d.keys(), d.values()))
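This invariant is easy to check against released Python 3, where dict iteration order is additionally guaranteed to be insertion order:

```python
d = {"b": 2, "a": 1, "c": 3}

# All three views iterate in the same order.
ok_before = list(d.items()) == list(zip(d.keys(), d.values()))

# The invariant survives mutation, since the views are live.
d["z"] = 26
ok_after = list(d.items()) == list(zip(d.keys(), d.values()))
```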
Open Issues
Do we need more of a motivation? I would think that being able to do set operations on keys and items without having to copy them should speak for itself.
I've left out the implementation of various set operations. These could still present small surprises.
It would be okay if multiple calls to d.keys() (etc.) returned the same object, since the object's only state is the dict to which it refers. Is this worth having extra slots in the dict object for? Should that be a weak reference or should the d_keys (etc.) object live forever once created? Strawman: probably not worth the extra slots in every dict.
Should d_keys, d_values and d_items have a public instance variable or method through which one can retrieve the underlying dict? Strawman: yes (but what should it be called?).
I'm soliciting better names than d_keys, d_values and d_items. These classes could be public so that their implementations could be reused by the .keys(), .values() and .items() methods of other mappings. Or should they?
Should the d_keys, d_values and d_items classes be reusable? Strawman: yes.
Should they be subclassable? Strawman: yes (but see below).
A particularly nasty issue is whether operations that are specified in terms of other operations (e.g. .discard()) must really be implemented in terms of those other operations; this may appear irrelevant but it becomes relevant if these classes are ever subclassed. Historically, Python has a really poor track record of specifying the semantics of highly optimized built-in types clearly in such cases; my strawman is to continue that trend. Subclassing may still be useful to add new methods, for example.
I'll leave the decisions (especially about naming) up to whoever submits a working implementation.
References
| [1] | (1, 2) Java Collections Framework http://java.sun.com/docs/books/tutorial/collections/index.html |
pep-3107 Function Annotations
| PEP: | 3107 |
|---|---|
| Title: | Function Annotations |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Collin Winter <collinwinter at google.com>, Tony Lownds <tony at lownds.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 2-Dec-2006 |
| Python-Version: | 3.0 |
| Post-History: |
Contents
Abstract
This PEP introduces a syntax for adding arbitrary metadata annotations to Python functions [1].
Rationale
Because Python's 2.x series lacks a standard way of annotating a function's parameters and return values, a variety of tools and libraries have appeared to fill this gap. Some utilise the decorators introduced in PEP 318, while others parse a function's docstring, looking for annotations there.
This PEP aims to provide a single, standard way of specifying this information, reducing the confusion caused by the wide variation in mechanism and syntax that has existed until this point.
Fundamentals of Function Annotations
Before launching into a discussion of the precise ins and outs of Python 3.0's function annotations, let's first talk broadly about what annotations are and are not:
Function annotations, both for parameters and return values, are completely optional.
Function annotations are nothing more than a way of associating arbitrary Python expressions with various parts of a function at compile-time.
By itself, Python does not attach any particular meaning or significance to annotations. Left to its own, Python simply makes these expressions available as described in Accessing Function Annotations below.
The only way that annotations take on meaning is when they are interpreted by third-party libraries. These annotation consumers can do anything they want with a function's annotations. For example, one library might use string-based annotations to provide improved help messages, like so:
def compile(source: "something compilable",
            filename: "where the compilable thing comes from",
            mode: "is this a single statement or a suite?"):
    ...
Another library might be used to provide typechecking for Python functions and methods. This library could use annotations to indicate the function's expected input and return types, possibly something like:
def haul(item: Haulable,
         *vargs: PackAnimal) -> Distance:
    ...
However, neither the strings in the first example nor the type information in the second example have any meaning on their own; meaning comes from third-party libraries alone.
Following from point 2, this PEP makes no attempt to introduce any kind of standard semantics, even for the built-in types. This work will be left to third-party libraries.
Syntax
Parameters
Annotations for parameters take the form of optional expressions that follow the parameter name:
def foo(a: expression, b: expression = 5):
    ...
In pseudo-grammar, parameters now look like identifier [: expression] [= expression]. That is, annotations always precede a parameter's default value and both annotations and default values are optional. Just like how equal signs are used to indicate a default value, colons are used to mark annotations. All annotation expressions are evaluated when the function definition is executed, just like default values.
Annotations for excess parameters (i.e., *args and **kwargs) are indicated similarly:
def foo(*args: expression, **kwargs: expression):
    ...
Annotations for nested parameters always follow the name of the parameter, not the last parenthesis. Annotating all parameters of a nested parameter is not required:
def foo((x1, y1: expression),
        (x2: expression, y2: expression)=(None, None)):
    ...
Return Values
The examples thus far have omitted examples of how to annotate the type of a function's return value. This is done like so:
def sum() -> expression:
    ...
That is, the parameter list can now be followed by a literal -> and a Python expression. Like the annotations for parameters, this expression will be evaluated when the function definition is executed.
The grammar for function definitions [11] is now:
decorator: '@' dotted_name [ '(' [arglist] ')' ] NEWLINE
decorators: decorator+
funcdef: [decorators] 'def' NAME parameters ['->' test] ':' suite
parameters: '(' [typedargslist] ')'
typedargslist: ((tfpdef ['=' test] ',')*
                ('*' [tname] (',' tname ['=' test])* [',' '**' tname]
                 | '**' tname)
                | tfpdef ['=' test] (',' tfpdef ['=' test])* [','])
tname: NAME [':' test]
tfpdef: tname | '(' tfplist ')'
tfplist: tfpdef (',' tfpdef)* [',']
Lambda
lambda's syntax does not support annotations. The syntax of lambda could be changed to support annotations, by requiring parentheses around the parameter list. However it was decided [12] not to make this change because:
- It would be an incompatible change.
- Lambdas are neutered anyway.
- The lambda can always be changed to a function.
Accessing Function Annotations
Once compiled, a function's annotations are available via the function's func_annotations attribute. This attribute is a mutable dictionary, mapping parameter names to an object representing the evaluated annotation expression.
There is a special key in the func_annotations mapping, "return". This key is present only if an annotation was supplied for the function's return value.
For example, the following annotation:
def foo(a: 'x', b: 5 + 6, c: list) -> max(2, 9):
    ...
would result in a func_annotations mapping of
{'a': 'x',
 'b': 11,
 'c': list,
 'return': 9}
The return key was chosen because it cannot conflict with the name of a parameter; any attempt to use return as a parameter name would result in a SyntaxError.
func_annotations is an empty, mutable dictionary if there are no annotations on the function or if the function was created from a lambda expression.
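As a historical note, the attribute shipped in Python 3.0 under the name __annotations__ rather than func_annotations; with that one substitution, the example above runs exactly as described:

```python
def foo(a: 'x', b: 5 + 6, c: list) -> max(2, 9):
    ...

# Annotation expressions are evaluated once, when the def executes,
# so 5 + 6 and max(2, 9) are stored as 11 and 9.
annotations = foo.__annotations__

# An unannotated function gets an empty, mutable mapping.
def bar(x, y):
    ...
```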
Use Cases
In the course of discussing annotations, a number of use-cases have been raised. Some of these are presented here, grouped by what kind of information they convey. Also included are examples of existing products and packages that could make use of annotations.
- Providing typing information
- Other information
- Documentation for parameters and return values ([24])
Standard Library
pydoc and inspect
The pydoc module should display the function annotations when displaying help for a function. The inspect module should be changed to support annotations.
Relation to Other PEPs
Function Signature Objects [13]
Function Signature Objects should expose the function's annotations. The Parameter object may change or other changes may be warranted.
Implementation
A reference implementation has been checked into the p3yk branch as revision 53170 [10].
Rejected Proposals
- The BDFL rejected the author's idea for a special syntax for adding annotations to generators as being "too ugly" [2].
- Though discussed early on ([5], [6]), including special objects in the stdlib for annotating generator functions and higher-order functions was ultimately rejected as being more appropriate for third-party libraries; including them in the standard library raised too many thorny issues.
- Despite considerable discussion about a standard type parameterisation syntax, it was decided that this should also be left to third-party libraries. ([7], [8], [9]).
- Despite yet more discussion, it was decided not to standardize a mechanism for annotation interoperability. Standardizing interoperability conventions at this point would be premature. We would rather let these conventions develop organically, based on real-world usage and necessity, than try to force all users into some contrived scheme. ([14], [15], [16]).
References and Footnotes
| [1] | Unless specifically stated, "function" is generally used as a synonym for "callable" throughout this document. |
| [2] | http://mail.python.org/pipermail/python-3000/2006-May/002103.html |
| [3] | http://oakwinter.com/code/typecheck/ |
| [4] | http://maxrepo.info/taxonomy/term/3,6/all |
| [5] | http://mail.python.org/pipermail/python-3000/2006-May/002091.html |
| [6] | http://mail.python.org/pipermail/python-3000/2006-May/001972.html |
| [7] | http://mail.python.org/pipermail/python-3000/2006-May/002105.html |
| [8] | http://mail.python.org/pipermail/python-3000/2006-May/002209.html |
| [9] | http://mail.python.org/pipermail/python-3000/2006-June/002438.html |
| [10] | http://svn.python.org/view?rev=53170&view=rev |
| [11] | http://docs.python.org/reference/compound_stmts.html#function-definitions |
| [12] | http://mail.python.org/pipermail/python-3000/2006-May/001613.html |
| [13] | http://www.python.org/dev/peps/pep-0362/ |
| [14] | http://mail.python.org/pipermail/python-3000/2006-August/002895.html |
| [15] | http://mail.python.org/pipermail/python-ideas/2007-January/000032.html |
| [16] | http://mail.python.org/pipermail/python-list/2006-December/420645.html |
| [17] | http://www.python.org/idle/doc/idle2.html#Tips |
| [18] | http://www.jython.org/Project/index.html |
| [19] | http://www.codeplex.com/Wiki/View.aspx?ProjectName=IronPython |
| [20] | http://peak.telecommunity.com/PyProtocols.html |
| [21] | http://www.artima.com/weblogs/viewpost.jsp?thread=155123 |
| [22] | http://www-128.ibm.com/developerworks/library/l-cppeak2/ |
| [23] | http://rpyc.wikispaces.com/ |
| [24] | http://docs.python.org/library/pydoc.html |
Copyright
This document has been placed in the public domain.
pep-3108 Standard Library Reorganization
| PEP: | 3108 |
|---|---|
| Title: | Standard Library Reorganization |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Brett Cannon <brett at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 01-Jan-2007 |
| Python-Version: | 3.0 |
| Post-History: | 28-Apr-2008 |
Contents
Note
The merging of profile/cProfile as of Python 3.3 did not occur, and thus is considered abandoned (although it would be acceptable to do in the future).
Abstract
Just like the language itself, Python's standard library (stdlib) has grown over the years to be very rich. But over time some modules have lost their reason to be included with Python. A naming convention for modules has also been introduced since Python's inception, and not all modules follow it.
Python 3.0 presents a chance to remove modules that do not have long-term usefulness. This chance also allows for the renaming of modules so that they follow the Python style guide [8]. This PEP lists modules that should not be included in Python 3.0 or that need to be renamed.
Modules to Remove
Guido pronounced that "silly old stuff" is to be deleted from the stdlib for Py3K [12]. This is open-ended on purpose. Each module to be removed needs to have a justification as to why it should no longer be distributed with Python. This can range from the module being deprecated in Python 2.x to being for a platform that is no longer widely used.
This section of the PEP lists the various modules to be removed. Each subsection represents a different reason for modules to be removed. Each module must have a specific justification on top of being listed in a specific subsection so as to make sure only modules that truly deserve to be removed are in fact removed.
When a reason mentions how long it has been since a module has been "uniquely edited", it is in reference to how long it has been since a checkin was done specifically for the module and not for a change that applied universally across the entire stdlib. If an edit time is not denoted as "unique" then it is the last time the file was edited, period.
Previously deprecated [done]
PEP 4 lists all modules that have been deprecated in the stdlib [7]. The specified motivations mirror those listed in PEP 4. All modules listed in the PEP at the time of the first alpha release of Python 3.0 will be removed.
The entire contents of lib-old will also be removed. These modules can no longer be imported, but they are kept in the Python distribution for users who rely upon the code.
cfmfile
- Documented as deprecated since Python 2.4 without an explicit reason.
cl
- Documented as obsolete since Python 2.0 or earlier.
- Interface to SGI hardware.
md5
- Supplanted by the hashlib module.
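The migration away from the md5 module is mechanical; a sketch using hashlib (available since Python 2.5), where the old spelling was md5.new(data):

```python
import hashlib

# Old: import md5; md5.new(b"hello world").hexdigest()
digest = hashlib.md5(b"hello world").hexdigest()
# -> "5eb63bbbe01eeed093cb22bb8f5acdc3"
```

The sha module listed below migrates the same way, to hashlib.sha1().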
mimetools
- Documented as obsolete in a previous version.
- Supplanted by the email package.
MimeWriter
- Supplanted by the email package.
mimify
- Supplanted by the email package.
multifile
- Supplanted by the email package.
posixfile
- Locking is better done by fcntl.lockf().
rfc822
- Supplanted by the email package.
sha
- Supplanted by the hashlib package.
sv
- Documented as obsolete since Python 2.0 or earlier.
- Interface to obsolete SGI Indigo hardware.
timing
- Documented as obsolete since Python 2.0 or earlier.
- time.clock() gives better time resolution.
Platform-specific with minimal use [done]
Python supports many platforms, some of which are not widely used or maintained. And on some of these platforms there are modules that have limited use to people on those platforms. Because of their limited usefulness it would be better to no longer burden the Python development team with their maintenance.
The modules mentioned below are documented. All undocumented modules for the specified platforms will also be removed.
IRIX
The IRIX operating system is no longer produced [19]. Removing all modules from the plat-irix[56] directory has been deemed reasonable because of this fact.
- AL/al
- Provides sound support on Indy and Indigo workstations.
- Both workstations are no longer available.
- Code has not been uniquely edited in three years.
- cd/CD
- CD drive control for SGI systems.
- SGI no longer sells machines with IRIX on them.
- Code has not been uniquely edited in 14 years.
- cddb
- Undocumented.
- cdplayer
- Undocumented.
- cl/CL/CL_old
- Compression library for SGI systems.
- SGI no longer sells machines with IRIX on them.
- Code has not been uniquely edited in 14 years.
- DEVICE/GL/gl/cgen/cgensuport
- GL access, which is the predecessor to OpenGL.
- Has not been edited in at least eight years.
- Third-party libraries provide better support (PyOpenGL [16]).
- ERRNO
- Undocumented.
- FILE
- Undocumented.
- FL/fl/flp
- Wrapper for the FORMS library [20]
- FORMS has not been edited in 12 years.
- Library is not widely used.
- First eight hits on Google are for Python docs for fl.
- fm
- Wrapper to the IRIS Font Manager library.
- Only available on SGI machines which no longer come with IRIX.
- GET
- Undocumented.
- GLWS
- Undocumented.
- imgfile
- Wrapper for SGI libimage library for imglib image files (.rgb files).
- Python Imaging Library provides read-only support [17].
- Not uniquely edited in 13 years.
- IN
- Undocumented.
- IOCTL
- Undocumented.
- jpeg
- Wrapper for JPEG (de)compressor.
- Code not uniquely edited in nine years.
- Third-party libraries provide better support (Python Imaging Library [17]).
- panel
- Undocumented.
- panelparser
- Undocumented.
- readcd
- Undocumented.
- SV
- Undocumented.
- torgb
- Undocumented.
- WAIT
- Undocumented.
Mac-specific modules
The Mac-specific modules are not well-maintained (e.g., the bgen tool used to auto-generate many of the modules has never been updated to support UCS-4). It is also not Python's place to maintain such a large number of OS-specific modules. Thus all modules under Lib/plat-mac and Mac are to be removed.
A stub module for proxy access will be provided for use by urllib.
_builtinSuites
- Undocumented.
- Package under lib-scriptpackages.
Audio_mac
- Undocumented.
aepack
- OSA support is better through third-party modules.
  - Appscript [22].
- Hard-coded endianness which breaks on Intel Macs.
- Might need to rename if Carbon package dependent.
aetools
- See aepack.
aetypes
- See aepack.
applesingle
- Undocumented.
- AppleSingle is a binary file format for A/UX.
- A/UX no longer distributed.
appletrawmain
- Undocumented.
appletrunner
- Undocumented.
argvemulator
- Undocumented.
autoGIL
- Very bad model for using Python with the CFRunLoop.
bgenlocations
- Undocumented.
buildtools
- Documented as deprecated since Python 2.3 without an explicit reason.
bundlebuilder
- Undocumented.
Carbon
- Carbon development has stopped.
- Does not support 64-bit systems completely.
- Dependent on bgen which has never been updated to support UCS-4 Unicode builds of Python.
CodeWarrior
- Undocumented.
- Package under lib-scriptpackages.
ColorPicker
- Better to use Cocoa for GUIs.
EasyDialogs
- Better to use Cocoa for GUIs.
Explorer
- Undocumented.
- Package under lib-scriptpackages.
Finder
- Undocumented.
- Package under lib-scriptpackages.
findertools
- No longer useful.
FrameWork
- Poorly documented.
- Not updated to support Carbon Events.
gensuitemodule
- See aepack.
ic
icglue
icopen
- Not needed on OS X.
- Meant to replace 'open' which is usually a bad thing to do.
macerrors
- Undocumented.
MacOS
- Would also mean the removal of binhex.
macostools
macresource
- Undocumented.
MiniAEFrame
- See aepack.
Nav
- Undocumented.
Netscape
- Undocumented.
- Package under lib-scriptpackages.
OSATerminology
pimp
- Undocumented.
PixMapWrapper
- Undocumented.
StdSuites
- Undocumented.
- Package under lib-scriptpackages.
SystemEvents
- Undocumented.
- Package under lib-scriptpackages.
Terminal
- Undocumented.
- Package under lib-scriptpackages.
terminalcommand
- Undocumented.
videoreader
- No longer used.
W
- No longer distributed with Python.
Solaris
- SUNAUDIODEV/sunaudiodev
- Access to the sound card on Sun machines.
- Code not uniquely edited in over eight years.
Hardly used [done]
Some platform-independent modules are rarely used. There are a number of possible explanations for this, including ease of reimplementation, a very small audience, or a lack of adherence to more modern standards.
- audiodev
- Undocumented.
- Not edited in five years.
- imputil
- Undocumented.
- Never updated to support absolute imports.
- mutex
- Easy to implement using a semaphore and a queue.
- Cannot block on a lock attempt.
- Not uniquely edited since its addition 15 years ago.
- Only useful with the 'sched' module.
- Not thread-safe.
- stringold
- Function versions of the methods on string objects.
- Obsolete since Python 1.6.
- Any functionality not in the string object or module will be moved to the string module (mostly constants).
- sunaudio
- Undocumented.
- Not edited in over seven years.
- The sunau module provides similar abilities.
- toaiff
- Undocumented.
- Requires sox library to be installed on the system.
- user
- Easily handled by allowing the application to specify its own module name, check for its existence, and import it if found.
- new
- Just a rebinding of names from the 'types' module.
- Can also call type built-in to get most types easily.
- Docstring states the module is no longer useful as of revision 27241 (2002-06-15).
- pure
- Written before Pure Atria was bought by Rational, which was then bought by IBM (in other words, very old).
- test.testall
- From the days before regrtest.
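The point made for mutex above can be sketched concretely: a threading.Lock covers the same ground while also supporting blocking acquisition, which mutex cannot. This is an illustrative snippet, not stdlib code.

```python
import threading

# threading.Lock covers the mutex module's use case and, unlike mutex,
# can also block until the lock becomes available.
lock = threading.Lock()

acquired_first = lock.acquire(False)  # non-blocking attempt, like mutex.testandset()
acquired_again = lock.acquire(False)  # False: the lock is already held
lock.release()
```

Calling acquire() with no argument would instead block until the lock is free, the capability the mutex module lacks.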
Obsolete
Becoming obsolete signifies that either another module in the stdlib or a widely distributed third-party library provides a better solution for what the module is meant for.
Bastion/rexec [done]
- Restricted execution / security.
- Turned off in Python 2.3.
- Modules deemed unsafe.
bsddb185 [done]
Canvas [done]
- Marked as obsolete in a comment by Guido since 2000 (see http://bugs.python.org/issue210677).
- Better to use the Tkinter.Canvas class.
commands [done]
- subprocess module replaces it [9].
- Remove getstatus(), move rest to subprocess.
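The replacement can be sketched as follows; this getoutput helper is an illustrative stand-in built on subprocess (Python 3 ultimately ships a subprocess.getoutput with similar behavior).

```python
import subprocess

def getoutput(cmd):
    """Rough stand-in for commands.getoutput(), built on subprocess."""
    result = subprocess.run(cmd, shell=True,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    return result.stdout.decode().rstrip("\n")
```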
compiler [done]
dircache [done]
- Negligible use.
- Easily replicated.
dl [done]
- ctypes provides better support for same functionality.
fpformat [done]
- All functionality is supported by string interpolation.
htmllib [done]
- Superseded by HTMLParser.
ihooks [done]
- Undocumented.
- For use with rexec which has been turned off since Python 2.3.
imageop [done]
Better support by third-party libraries (Python Imaging Library [17]).
- Unit tests relied on rgbimg and imgfile.
- rgbimg was removed in Python 2.6.
- imgfile slated for removal in this PEP.
linuxaudiodev [done]
- Replaced by ossaudiodev.
mhlib [done]
- Should be removed as an individual module; use mailbox instead.
popen2 [done]
- subprocess module replaces it [9].
sgmllib [done]
- Does not fully parse SGML.
- In the stdlib for support to htmllib which is slated for removal.
sre [done]
- Previously deprecated; import re instead.
stat [TODO need to move all uses over to os.stat()]
- os.stat() now returns a tuple with attributes.
- Functions in the module should be made into methods for the object returned by os.stat.
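The attribute-based access that motivates this change can be shown in a short, self-contained example; the comparison between index- and attribute-based access is illustrative.

```python
import os
import stat

st = os.stat(".")

# Attribute access on the stat result replaces index-based access
# through the stat module's ST_* constants.
mode_by_attr = st.st_mode
mode_by_index = st[stat.ST_MODE]

# Helper functions in the stat module still operate on the mode value.
is_dir = stat.S_ISDIR(st.st_mode)
```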
statvfs [done]
- os.statvfs now returns a tuple with attributes.
thread [done]
- People should use 'threading' instead.
- Rename 'thread' to _thread.
- Deprecate dummy_thread and rename _dummy_thread.
- Move thread.get_ident over to threading.
- Guido has previously supported the deprecation [13].
urllib [done]
- Superseded by urllib2.
- Functionality unique to urllib will be kept in the urllib package.
UserDict [done: 3.0] [TODO handle 2.6]
- Not as useful since types can be a superclass.
- Useful bits moved to the 'collections' module.
UserList/UserString [done]
- Not useful since types can be a superclass.
- Moved to the 'collections' module.
Maintenance Burden
Over the years, certain modules have become a heavy burden upon python-dev to maintain. In situations like this, it is better for the module to be given to the community to maintain, freeing python-dev to focus on language support and on the standard library modules that do not take up an undue amount of time and effort.
- bsddb3
- Externally maintained at http://www.jcea.es/programacion/pybsddb.htm .
- Consistent testing instability.
- Berkeley DB follows a different release schedule than Python, leading to the bindings not necessarily being in sync with what is available.
Modules to Rename
Many modules existed in the stdlib before PEP 8 came into existence [8]. This has led to some naming inconsistencies and namespace bloat that should be addressed.
PEP 8 violations [done]
PEP 8 specifies that modules "should have short, all-lowercase names" where "underscores can be used ... if it improves readability" [8]. The use of underscores is discouraged in package names. The following modules violate PEP 8 and are not otherwise being renamed by being moved into a package.
| Current Name | Replacement Name |
|---|---|
| _winreg | winreg |
| ConfigParser | configparser |
| copy_reg | copyreg |
| Queue | queue |
| SocketServer | socketserver |
Merging C and Python implementations of the same interface
Several interfaces have both a Python and C implementation. While it is great to have a C implementation for speed with a Python implementation as fallback, there is no need to expose the two implementations independently in the stdlib. For Python 3.0 all interfaces with two implementations will be merged into a single public interface.
The C module is to be given a leading underscore to delineate the fact that it is not the reference implementation (the Python implementation is). This means that any semantic difference between the C and Python versions must be dealt with before Python 3.0 or else the C implementation will be removed until it can be fixed.
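A hedged sketch of the resulting pattern, using a hypothetical module name _mymodule (not a real stdlib module): the Python definitions serve as the reference implementation, and an optional C accelerator transparently overrides them when it can be imported.

```python
# mymodule (sketch): merged public interface with an optional C accelerator.

def heavy_function(x):
    """Reference (Python) implementation."""
    return x * 2

try:
    # Prefer the C version when it has been built; the leading underscore
    # marks it as a non-reference implementation detail.
    from _mymodule import heavy_function
except ImportError:
    pass  # no accelerator available; keep the Python definition above
```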
One interface that is not listed below is xml.etree.ElementTree. This is an externally maintained module and thus is not under the direct control of the Python development team for renaming. See Open Issues for a discussion on this.
- pickle/cPickle [done]
- Rename cPickle to _pickle.
- Semantic completeness of C implementation not verified.
- profile/cProfile [TODO]
- Rename cProfile to _profile.
- Semantic completeness of C implementation not verified.
- StringIO/cStringIO [done]
- Add the class to the 'io' module.
No public, documented interface [done]
There are several modules in the stdlib that have no defined public interface. These modules exist as support code for other modules that are exposed. Because they are not meant to be used directly they should be renamed to reflect this fact.
| Current Name | Replacement Name |
|---|---|
| markupbase | _markupbase |
Poorly chosen names [done]
A few modules have names that were poorly chosen in hindsight. They should be renamed so as to prevent their bad name from perpetuating beyond the 2.x series.
| Current Name | Replacement Name |
|---|---|
| repr | reprlib |
| test.test_support | test.support |
Grouping of modules [done]
As the stdlib has grown, several areas within it have expanded to include multiple modules (e.g., support for database files). It thus makes sense to group related modules into packages.
dbm package
| Current Name | Replacement Name |
|---|---|
| anydbm | dbm.__init__ [1] |
| dbhash | dbm.bsd |
| dbm | dbm.ndbm |
| dumbdbm | dbm.dumb |
| gdbm | dbm.gnu |
| whichdb | dbm.__init__ [1] |
| [1] | (1, 2) dbm.__init__ can combine anydbm and whichdb since the public API for both modules has no name conflict and the two modules have closely related usage. |
html package
| Current Name | Replacement Name |
|---|---|
| HTMLParser | html.parser |
| htmlentitydefs | html.entities |
http package
| Current Name | Replacement Name |
|---|---|
| httplib | http.client |
| BaseHTTPServer | http.server [2] |
| CGIHTTPServer | http.server [2] |
| SimpleHTTPServer | http.server [2] |
| Cookie | http.cookies |
| cookielib | http.cookiejar |
| [2] | (1, 2, 3) The http.server module can combine the specified modules safely as they have no naming conflicts. |
tkinter package
| Current Name | Replacement Name |
|---|---|
| Dialog | tkinter.dialog |
| FileDialog | tkinter.filedialog [4] |
| FixTk | tkinter._fix |
| ScrolledText | tkinter.scrolledtext |
| SimpleDialog | tkinter.simpledialog [5] |
| Tix | tkinter.tix |
| Tkconstants | tkinter.constants |
| Tkdnd | tkinter.dnd |
| Tkinter | tkinter.__init__ |
| tkColorChooser | tkinter.colorchooser |
| tkCommonDialog | tkinter.commondialog |
| tkFileDialog | tkinter.filedialog [4] |
| tkFont | tkinter.font |
| tkMessageBox | tkinter.messagebox |
| tkSimpleDialog | tkinter.simpledialog [5] |
| turtle | tkinter.turtle |
| [4] | (1, 2) tkinter.filedialog can safely combine FileDialog and tkFileDialog as there are no naming conflicts. |
| [5] | (1, 2) tkinter.simpledialog can safely combine SimpleDialog and tkSimpleDialog as they have no naming conflicts. |
urllib package
Originally this new package was to be named url, but because of the common use of the name as a variable, it has been deemed better to keep the name urllib and instead shift existing modules around into a new package.
| Current Name | Replacement Name |
|---|---|
| urllib2 | urllib.request, urllib.error |
| urlparse | urllib.parse |
| urllib | urllib.parse, urllib.request, urllib.error [6] |
| robotparser | urllib.robotparser |
| [6] | The quoting-related functions from urllib will be added to urllib.parse. urllib.URLopener and urllib.FancyURLopener will be added to urllib.request as long as the documentation for both modules is updated. |
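The new layout can be illustrated with the Python 3 import locations; the URL and values below are arbitrary examples.

```python
# Where the old flat modules land in the Python 3 urllib package:
from urllib.parse import urlparse, quote        # was urlparse / urllib quoting
from urllib.request import urlopen, Request     # was urllib2 / urllib openers
from urllib.error import URLError               # was urllib2 exceptions
from urllib.robotparser import RobotFileParser  # was robotparser

parts = urlparse("http://example.com/path?q=1")
encoded = quote("a b")
```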
xmlrpc package
| Current Name | Replacement Name |
|---|---|
| xmlrpclib | xmlrpc.client |
| DocXMLRPCServer | xmlrpc.server [3] |
| SimpleXMLRPCServer | xmlrpc.server [3] |
| [3] | (1, 2) The modules being combined into xmlrpc.server have no naming conflicts and thus can safely be merged. |
Transition Plan
Issues
Issues related to this PEP:
- Issue 2775 [25]
Master tracking issue
- Issue 2828 [26]
clean up undoc.rst
For modules to be removed
For module removals, it is easiest to remove the module first in Python 3.0 to see where dependencies exist. This makes finding code that (possibly) requires the suppression of the DeprecationWarning easier.
In Python 3.0
- Remove the module.
- Remove related tests.
- Remove all documentation (typically the module's documentation file and its entry in a file for the Library Reference).
- Edit Modules/Setup.dist and setup.py if needed.
- Run the regression test suite (using -uall); watch out for tests that are skipped because an import failed for the removed module.
- Check in the change (with an appropriate Misc/NEWS entry).
- Update this PEP noting that the 3.0 step is done.
In Python 2.6
Add the following code to the deprecated module, if it is implemented in Python, as the first piece of executed code (adjusting the module name and the warnings import as needed):
from warnings import warnpy3k
warnpy3k("the XXX module has been removed in Python 3.0", stacklevel=2)
del warnpy3k

or the following if it is an extension module:
if (PyErr_WarnPy3k("the XXX module has been removed in "
                   "Python 3.0", 2) < 0)
    return;

(The Python-Dev TextMate bundle, available from Misc/TextMate, contains a command that will generate all of this for you.)
Update the documentation. For modules with their own documentation file, use the :deprecated: option with the module directive along with the deprecated directive, stating that the deprecation occurs in 2.6 because the module is removed in 3.0:
.. deprecated:: 2.6
   The :mod:`XXX` module has been removed in Python 3.0.
For modules simply listed in a file (e.g., undoc.rst), use the warning directive.
Add the module to the module deletion test in test_py3kwarn.
- Suppress the warning in the module's test code using test.test_support.import_module(name, deprecated=True).
Check in the change with an appropriate Misc/NEWS entry (block this checkin in py3k!).
Update this PEP noting that the 2.6 step is done.
Renaming of modules
Support in the 2to3 refactoring tool for renames will be used to help people transition to new module names [15]. Import statements will be rewritten so that only the import statement, and none of the rest of the code, needs to be touched. This is accomplished by using the as keyword in import statements to bind the module to the old name while importing it under its new name (when the as keyword is not already in use; otherwise the re-assigned name is left alone and only the name of the imported module changes). The fix_imports fixer is an example of how to approach this.
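A sketch of the technique, using the ConfigParser to configparser rename from this PEP: only the import line changes, and the as clause preserves the old binding so the rest of the code is untouched.

```python
# Python 2 source (before):
#     import ConfigParser
#     cfg = ConfigParser.ConfigParser()
#
# After the rename-aware rewrite, binding the new module to the old
# name via 'as' leaves every other line unchanged:
import configparser as ConfigParser

cfg = ConfigParser.ConfigParser()
```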
Python 3.0
- Update 2to3 in the sandbox to support the rename.
- Use svn move to rename the module.
- Update all import statements in the stdlib to use the new name (use 2to3's fix_imports fixer for the easiest solution).
- Rename the module in its own documentation.
- Update all references in the documentation from the old name to the new name.
- Run regrtest.py -uall to verify the rename worked.
- Add an entry in Misc/NEWS.
- Commit the changes.
Python 2.6
In the module's documentation, add a note mentioning that the module is renamed in Python 3.0:
.. note::
   The :mod:`OLDNAME` module has been renamed to :mod:`NEWNAME` in Python 3.0.
Commit the documentation change.
Block the revision in py3k.
Open Issues
Renaming of modules maintained outside of the stdlib
xml.etree.ElementTree not only does not meet PEP 8 naming standards but it also has an exposed C implementation [8]. It is an externally maintained package, though [10]. A request will be made for the maintainer to change the name so that it matches PEP 8 and hides the C implementation.
Rejected Ideas
Modules that were originally suggested for removal
asynchat/asyncore
- Josiah Carlson has said he will maintain the modules.
audioop/sunau/aifc
- Audio modules where the formats are still used.
base64/quopri/uu
- All still widely used.
- 'codecs' module does not provide as nice of an API for basic usage.
fileinput
- Useful when having to work with stdin.
linecache
- Used internally in several places.
nis
- Testimonials from people indicate that new installations of NIS are still occurring.
getopt
- Simpler than optparse.
repr
- Useful as a basis for overriding.
- Used internally.
sched
- Useful for simulations.
symtable/_symtable
- Docs were written.
telnetlib
- Really handy for quick-and-dirty remote access.
- Some hardware supports using telnet for configuration and querying.
Tkinter
- Would prevent IDLE from existing.
- No GUI toolkit would be available out of the box.
Introducing a new top-level package
It has been suggested that the entire stdlib be placed within its own package. This PEP will not address this issue as it has its own design issues (naming, does it deserve special consideration in import semantics, etc.). Everything within this PEP can easily be handled if a new top-level package is introduced.
References
| [7] | PEP 4: Deprecation of Standard Modules (http://www.python.org/dev/peps/pep-0004/) |
| [8] | (1, 2, 3, 4) PEP 8: Style Guide for Python Code (http://www.python.org/dev/peps/pep-0008/) |
| [9] | (1, 2) PEP 324: subprocess -- New process module (http://www.python.org/dev/peps/pep-0324/) |
| [10] | PEP 360: Externally Maintained Packages (http://www.python.org/dev/peps/pep-0360/) |
| [11] | Python Documentation: Global Module Index (http://docs.python.org/modindex.html) |
| [12] | Python-Dev email: "Py3k release schedule worries" (http://mail.python.org/pipermail/python-3000/2006-December/005130.html) |
| [13] | Python-Dev email: Autoloading? (http://mail.python.org/pipermail/python-dev/2005-October/057244.html) |
| [14] | Python-Dev Summary: 2004-11-01 (http://www.python.org/dev/summary/2004-11-01_2004-11-15/#id10) |
| [15] | 2to3 refactoring tool (http://svn.python.org/view/sandbox/trunk/2to3/) |
| [16] | PyOpenGL (http://pyopengl.sourceforge.net/) |
| [17] | (1, 2, 3) Python Imaging Library (PIL) (http://www.pythonware.com/products/pil/) |
| [18] | Twisted (http://twistedmatrix.com/trac/) |
| [19] | SGI Press Release: End of General Availability for MIPS IRIX Products -- December 2006 (http://www.sgi.com/support/mips_irix.html) |
| [20] | FORMS Library by Mark Overmars (ftp://ftp.cs.ruu.nl/pub/SGI/FORMS) |
| [21] | Wikipedia: Au file format (http://en.wikipedia.org/wiki/Au_file_format) |
| [22] | appscript (http://appscript.sourceforge.net/) |
| [23] | _ast module (http://docs.python.org/library/ast.html) |
| [24] | python-dev email: getting compiler package failures (http://mail.python.org/pipermail/python-3000/2007-May/007615.html) |
| [25] | http://bugs.python.org/issue2775 |
| [26] | http://bugs.python.org/issue2828 |
| [27] | http://pypi.python.org/ |
Copyright
This document has been placed in the public domain.
pep-3109 Raising Exceptions in Python 3000
| PEP: | 3109 |
|---|---|
| Title: | Raising Exceptions in Python 3000 |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Collin Winter <collinwinter at google.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 19-Jan-2006 |
| Python-Version: | 3.0 |
| Post-History: |
Contents
Abstract
This PEP introduces changes to Python's mechanisms for raising exceptions intended to reduce both line noise and the size of the language.
Rationale
One of Python's guiding maxims is "there should be one -- and preferably only one -- obvious way to do it" [1]. Python 2.x's raise statement violates this principle, permitting multiple ways of expressing the same thought. For example, these statements are equivalent:
raise E, V
raise E(V)
There is a third form of the raise statement, allowing arbitrary tracebacks to be attached to an exception [2]:
raise E, V, T
where T is a traceback. As specified in PEP 344 [4], exception objects in Python 3.x will possess a __traceback__ attribute, admitting this translation of the three-expression raise statement:
raise E, V, T
is translated to
e = E(V)
e.__traceback__ = T
raise e
Using these translations, we can reduce the raise statement from four forms to two:
raise (with no arguments) is used to re-raise the active exception in an except suite.
raise EXCEPTION is used to raise a new exception. This form has two sub-variants: EXCEPTION may be an exception class or an instance of an exception class; valid exception classes are BaseException and its subclasses [5]. If EXCEPTION is a subclass, it will be called with no arguments to obtain an exception instance.
To raise anything else is an error.
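A minimal sketch of the two surviving forms (the helper names are hypothetical):

```python
def fail_with_instance():
    raise ValueError("bad value")   # EXCEPTION is an exception instance

def fail_with_class():
    raise ValueError                # a class: called with no arguments

def reraise():
    try:
        fail_with_instance()
    except ValueError:
        raise                       # bare raise re-raises the active exception
```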
There is a further, more tangible benefit to be obtained through this consolidation, as noted by A.M. Kuchling [6].
PEP 8 doesn't express any preference between the
two forms of raise statements:
raise ValueError, 'blah'
raise ValueError("blah")
I like the second form better, because if the exception arguments
are long or include string formatting, you don't need to use line
continuation characters because of the containing parens.
The BDFL has concurred [7] and endorsed the consolidation of the several raise forms.
Grammar Changes
In Python 3, the grammar for raise statements will change from [2]
raise_stmt: 'raise' [test [',' test [',' test]]]
to
raise_stmt: 'raise' [test]
Changes to Builtin Types
Because of its relation to exception raising, the signature for the throw() method on generator objects will change, dropping the optional second and third parameters. The signature thus changes from [3]
generator.throw(E, [V, [T]])
to
generator.throw(EXCEPTION)
Where EXCEPTION is either a subclass of BaseException or an instance of a subclass of BaseException.
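A short sketch of the one-argument form; the generator and its values are illustrative.

```python
def responder():
    try:
        yield "running"
    except ValueError as exc:
        yield "caught: %s" % exc

g = responder()
first = next(g)
# Python 3 signature: a single exception class or instance,
# replacing the old three-parameter throw(E, V, T).
second = g.throw(ValueError("boom"))
```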
Semantic Changes
In Python 2, the following raise statement is legal
raise ((E1, (E2, E3)), E4), V
The interpreter will take the tuple's first element as the exception type (recursively), making the above fully equivalent to
raise E1, V
As of Python 3.0, support for raising tuples like this will be dropped. This change will bring raise statements into line with the throw() method on generator objects, which already disallows this.
Compatibility Issues
All two- and three-expression raise statements will require modification, as will all two- and three-expression throw() calls on generators. Fortunately, the translation from Python 2.x to Python 3.x in this case is simple and can be handled mechanically by Guido van Rossum's 2to3 utility [8] using the raise and throw fixers ([9], [10]).
The following translations will be performed:
Zero- and one-expression raise statements will be left intact.
Two-expression raise statements will be converted from
raise E, V
to
raise E(V)
Two-expression throw() calls will be converted from
generator.throw(E, V)
to
generator.throw(E(V))
See point #5 for a caveat to this transformation.
Three-expression raise statements will be converted from
raise E, V, T
to
e = E(V)
e.__traceback__ = T
raise e
Three-expression throw() calls will be converted from
generator.throw(E, V, T)
to
e = E(V)
e.__traceback__ = T
generator.throw(e)
See point #5 for a caveat to this transformation.
Two- and three-expression raise statements where E is a tuple literal can be converted automatically using 2to3's raise fixer. raise statements where E is a non-literal tuple, e.g., the result of a function call, will need to be converted manually.
Two- and three-expression raise statements where E is an exception class and V is an exception instance will need special attention. These cases break down into two camps:
raise E, V as a long-hand version of the zero-argument raise statement. As an example, assuming F is a subclass of E
try:
    something()
except F as V:
    raise F(V)
except E as V:
    handle(V)

This would be better expressed as
try:
    something()
except F:
    raise
except E as V:
    handle(V)

raise E, V as a way of "casting" an exception to another class. Taking an example from distutils.compiler.unixcompiler
try:
    self.spawn(pp_args)
except DistutilsExecError as msg:
    raise CompileError(msg)

This would be better expressed as
try:
    self.spawn(pp_args)
except DistutilsExecError as msg:
    raise CompileError from msg

using the raise ... from ... syntax introduced in PEP 344.
Implementation
This PEP was implemented in revision 57783 [11].
References
| [1] | http://www.python.org/dev/peps/pep-0020/ |
| [2] | (1, 2) http://docs.python.org/reference/simple_stmts.html#raise |
| [3] | http://www.python.org/dev/peps/pep-0342/ |
| [4] | http://www.python.org/dev/peps/pep-0344/ |
| [5] | http://www.python.org/dev/peps/pep-0352/ |
| [6] | http://mail.python.org/pipermail/python-dev/2005-August/055187.html |
| [7] | http://mail.python.org/pipermail/python-dev/2005-August/055190.html |
| [8] | http://svn.python.org/view/sandbox/trunk/2to3/ |
| [9] | http://svn.python.org/view/sandbox/trunk/2to3/fixes/fix_raise.py |
| [10] | http://svn.python.org/view/sandbox/trunk/2to3/fixes/fix_throw.py |
| [11] | http://svn.python.org/view/python/branches/py3k/Include/?rev=57783&view=rev |
Copyright
This document has been placed in the public domain.
pep-3110 Catching Exceptions in Python 3000
| PEP: | 3110 |
|---|---|
| Title: | Catching Exceptions in Python 3000 |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Collin Winter <collinwinter at google.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 16-Jan-2006 |
| Python-Version: | 3.0 |
| Post-History: |
Contents
Abstract
This PEP introduces changes intended to help eliminate ambiguities in Python's grammar, simplify exception classes, simplify garbage collection for exceptions and reduce the size of the language in Python 3.0.
Rationale
except clauses in Python 2.x present a syntactic ambiguity where the parser cannot differentiate whether
except <expression>, <expression>:
should be interpreted as
except <type>, <type>:
or
except <type>, <name>:
Python 2 opts for the latter semantic, at the cost of requiring the former to be parenthesized, like so
except (<type>, <type>):
As specified in PEP 352 [1], the ability to treat exceptions as tuples will be removed, meaning this code will no longer work
except os.error, (errno, errstr):
Because the automatic unpacking will no longer be possible, it is desirable to remove the ability to use tuples as except targets.
As specified in PEP 344 [5], exception instances in Python 3 will possess a __traceback__ attribute. The Open Issues section of that PEP includes a paragraph on garbage collection difficulties caused by this attribute, namely an "exception -> traceback -> stack frame -> exception" reference cycle, whereby all locals are kept in scope until the next GC run. This PEP intends to resolve this issue by adding a cleanup semantic to except clauses in Python 3 whereby the target name is deleted at the end of the except suite.
In the spirit of "there should be one -- and preferably only one -- obvious way to do it" [2], it is desirable to consolidate duplicate functionality. To this end, the exc_value, exc_type and exc_traceback attributes of the sys module [3] will be removed in favor of sys.exc_info(), which provides the same information. These attributes are already listed in PEP 3100 [4] as targeted for removal.
Grammar Changes
In Python 3, the grammar for except statements will change from [8]
except_clause: 'except' [test [',' test]]
to
except_clause: 'except' [test ['as' NAME]]
The use of as in place of the comma token means that
except (AttributeError, os.error):
can be clearly understood as a tuple of exception classes. This new syntax was first proposed by Greg Ewing [6] and endorsed ([6], [7]) by the BDFL.
Further, the restriction of the token following as from test to NAME means that only valid identifiers can be used as except targets.
Note that the grammar above always requires parenthesized tuples as exception classes. That way, the ambiguous
except A, B:
which would mean different things in Python 2.x and 3.x -- leading to hard-to-catch bugs -- cannot legally occur in 3.x code.
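A sketch of the unambiguous Python 3 form; classify is a hypothetical helper.

```python
def classify(exc):
    """Show the Python 3 syntax: a parenthesized tuple of exception
    classes, with the target introduced by 'as'."""
    try:
        raise exc
    except (TypeError, ValueError) as err:
        return "expected: %s" % err
    except Exception as err:
        return "other: %s" % err
```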
Semantic Changes
In order to resolve the garbage collection issue related to PEP 344, except statements in Python 3 will generate additional bytecode to delete the target, thus eliminating the reference cycle. The source-to-source translation, as suggested by Phillip J. Eby [9], is
try:
try_body
except E as N:
except_body
...
gets translated to (in Python 2.5 terms)
try:
try_body
except E, N:
try:
except_body
finally:
N = None
del N
...
An implementation has already been checked into the p3yk branch [10].
Compatibility Issues
Nearly all except clauses will need to be changed. except clauses with identifier targets will be converted from
except E, N:
to
except E as N:
except clauses with non-tuple, non-identifier targets (e.g., a.b.c[d]) will need to be converted from
except E, T:
to
except E as t:
T = t
Both of these cases can be handled by Guido van Rossum's 2to3 utility [11] using the except fixer [12].
except clauses with tuple targets will need to be converted manually, on a case-by-case basis. These changes will usually need to be accompanied by changes to the exception classes themselves. While these changes generally cannot be automated, the 2to3 utility is able to point out cases where the target of an except clause is a tuple, simplifying conversion.
Situations where it is necessary to keep an exception instance around past the end of the except suite can be easily translated like so
try:
...
except E as N:
...
...
becomes
try:
...
except E as N:
n = N
...
...
This way, when N is deleted at the end of the block, n will persist and can be used as normal.
Lastly, all uses of the sys module's exc_type, exc_value and exc_traceback attributes will need to be removed. They can be replaced with sys.exc_info()[0], sys.exc_info()[1] and sys.exc_info()[2] respectively, a transformation that can be performed by 2to3's sysexcattrs fixer.
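A sketch of the replacement; capture_current_exception is a hypothetical helper.

```python
import sys

def capture_current_exception():
    try:
        1 / 0
    except ZeroDivisionError:
        # One call to sys.exc_info() replaces the removed sys.exc_type,
        # sys.exc_value and sys.exc_traceback attributes.
        return sys.exc_info()
```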
2.6 - 3.0 Compatibility
In order to facilitate forwards compatibility between Python 2.6 and 3.0, the except ... as ...: syntax will be backported to the 2.x series. The grammar will thus change from:
except_clause: 'except' [test [',' test]]
to:
except_clause: 'except' [test [('as' | ',') test]]
The end-of-suite cleanup semantic for except statements will not be included in the 2.x series of releases.
Open Issues
Replacing or Dropping "sys.exc_info()"
The idea of dropping sys.exc_info() or replacing it with a sys.exception attribute or a sys.get_exception() function has been raised several times on python-3000 ([13], [14]) and mentioned in PEP 344's "Open Issues" section.
While a 2to3 fixer to replace calls to sys.exc_info() and some attribute accesses would be trivial, it would be far more difficult for static analysis to find and fix functions that expect the values from sys.exc_info() as arguments. Similarly, this does not address the need to rewrite the documentation for all APIs that are defined in terms of sys.exc_info().
Implementation
This PEP was implemented in revisions 53342 [15] and 53349 [16]. Support for the new except syntax in 2.6 was implemented in revision 55446 [17].
References
| [1] | http://www.python.org/dev/peps/pep-0352/ |
| [2] | http://www.python.org/dev/peps/pep-0020/ |
| [3] | http://docs.python.org/library/sys.html |
| [4] | http://www.python.org/dev/peps/pep-3100/ |
| [5] | http://www.python.org/dev/peps/pep-0344/ |
| [6] | (1, 2) http://mail.python.org/pipermail/python-dev/2006-March/062449.html |
| [7] | http://mail.python.org/pipermail/python-dev/2006-March/062640.html |
| [8] | http://docs.python.org/reference/compound_stmts.html#try |
| [9] | http://mail.python.org/pipermail/python-3000/2007-January/005395.html |
| [10] | http://svn.python.org/view?rev=53342&view=rev |
| [11] | http://svn.python.org/view/sandbox/trunk/2to3/ |
| [12] | http://svn.python.org/view/sandbox/trunk/2to3/fixes/fix_except.py |
| [13] | http://mail.python.org/pipermail/python-3000/2007-January/005385.html |
| [14] | http://mail.python.org/pipermail/python-3000/2007-January/005604.html |
| [15] | http://svn.python.org/view/python/branches/p3yk/?view=rev&rev=53342 |
| [16] | http://svn.python.org/view/python/branches/p3yk/?view=rev&rev=53349 |
| [17] | http://svn.python.org/view/python/trunk/?view=rev&rev=55446 |
Copyright
This document has been placed in the public domain.
pep-3111 Simple input built-in in Python 3000
| PEP: | 3111 |
|---|---|
| Title: | Simple input built-in in Python 3000 |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Andre Roberge <andre.roberge at gmail.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 13-Sep-2006 |
| Python-Version: | 3.0 |
| Post-History: | 22-Dec-2006 |
Abstract
Input and output are core features of computer programs. Currently, Python provides a simple means of output through the print keyword and two simple means of interactive input through the input() and raw_input() built-in functions.
Python 3.0 will introduce various incompatible changes with previous Python versions[1]. Among the proposed changes, print will become a built-in function, print(), while input() and raw_input() would be removed completely from the built-in namespace, requiring importing some module to provide even the most basic input capability.
This PEP proposes that Python 3.0 retains some simple interactive user input capability, equivalent to raw_input(), within the built-in namespace.
It was accepted by the BDFL in December 2006 [5].
Motivation
With its easy readability and its support for many programming styles (e.g. procedural, object-oriented, etc.) among others, Python is perhaps the best computer language to use in introductory programming classes. Simple programs often need to provide information to the user (output) and to obtain information from the user (interactive input). Any computer language intended to be used in an educational setting should provide straightforward methods for both output and interactive input.
The current proposals for Python 3.0 [1] include a simple output pathway via a built-in function named print(), but a more complicated method for input [e.g. via sys.stdin.readline()], one that requires importing an external module. Current versions of Python (pre-3.0) include raw_input() as a built-in function. With the availability of such a function, programs that require simple input/output can be written from day one, without requiring discussions of importing modules, streams, etc.
Rationale
Current built-in functions, like input() and raw_input(), are found to be extremely useful in traditional teaching settings. (For more details, see [2] and the discussion that followed.) While the BDFL has clearly stated [3] that input() was not to be kept in Python 3000, he has also stated that he was not against revising the decision of killing raw_input().
raw_input() provides a simple means to ask a question and obtain a response from a user. The proposed plans for Python 3.0 would require the replacement of the single statement:
name = raw_input("What is your name?")
by the more complicated:
import sys
print("What is your name?")
name = sys.stdin.readline()
However, from the point of view of many Python beginners and educators, the use of sys.stdin.readline() presents the following problems:
1. Compared to the name "raw_input", the name "sys.stdin.readline()" is clunky and inelegant.
2. The names "sys" and "stdin" have no meaning for most beginners, who are mainly interested in what the function does, and not where in the package structure it is located. The lack of meaning also makes it difficult to remember: is it "sys.stdin.readline()", or "stdin.sys.readline()"? To a programming novice, there is no obvious reason to prefer one over the other. In contrast, simple and direct function names like print, input, raw_input, and open are easier to remember.
3. The use of "." notation is unmotivated and confusing to many beginners. For example, it may lead some beginners to think "." is a standard character that could be used in any identifier.
4. There is an asymmetry with the print function: why is print not called sys.stdout.print()?
Specification
The existing raw_input() function will be renamed to input().
The Python 2 to 3 conversion tool will replace calls to input() with eval(input()) and raw_input() with input().
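The effect of the renaming can be sketched as follows; the stdin redirection below only simulates a user typing a line so the call is demonstrable non-interactively, and is not part of the PEP.

```python
# Sketch of the renamed built-in (Python 3 semantics).
import io
import sys

def read_name(prompt="What is your name? "):
    # Python 3's input() behaves like Python 2's raw_input(): it writes
    # the prompt and returns the typed line without the trailing newline.
    return input(prompt)

sys.stdin = io.StringIO("Guido\n")   # pretend the user typed "Guido"
name = read_name()
sys.stdin = sys.__stdin__            # restore the real stdin
print(name)  # Guido
```

Code that previously relied on Python 2's input() evaluating the typed text must be rewritten as eval(input()), which is exactly what the conversion tool emits.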
Naming Discussion
With input() effectively removed from the language, the name raw_input() makes much less sense and alternatives should be considered. The various possibilities mentioned in various forums include:
ask()
ask_user()
get_string()
input()        # initially rejected by BDFL, later accepted
prompt()
read()
user_input()
get_response()
While it was initially rejected by the BDFL, it has been suggested that the most direct solution would be to rename "raw_input" to "input" in Python 3000. The main objection is that Python 2.x already has a function named "input", and, even though it is not going to be included in Python 3000, having a built-in function with the same name but different semantics may confuse programmers migrating from 2.x to 3000. Certainly, this is no problem for beginners, and the scope of the problem is unclear for more experienced programmers, since raw_input(), while popular with many, is not in universal use. In this instance, the good it does for beginners could be seen to outweigh the harm it does to experienced programmers - although it could cause confusion for people reading older books or tutorials.
The rationale for accepting the renaming can be found here [4].
References
| [1] | (1, 2) PEP 3100, Miscellaneous Python 3.0 Plans, Kuchling, Cannon http://www.python.org/dev/peps/pep-3100/ |
| [2] | The fate of raw_input() in Python 3000 http://mail.python.org/pipermail/edu-sig/2006-September/006967.html |
| [3] | Educational aspects of Python 3000 http://mail.python.org/pipermail/python-3000/2006-September/003589.html |
| [4] | Rationale for going with the straight renaming http://mail.python.org/pipermail/python-3000/2006-December/005249.html |
| [5] | BDFL acceptance of the PEP http://mail.python.org/pipermail/python-3000/2006-December/005257.html |
Copyright
This document has been placed in the public domain.
pep-3112 Bytes literals in Python 3000
| PEP: | 3112 |
|---|---|
| Title: | Bytes literals in Python 3000 |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Jason Orendorff <jason.orendorff at gmail.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Requires: | 358 |
| Created: | 23-Feb-2007 |
| Python-Version: | 3.0 |
| Post-History: | 23-Feb-2007 |
Contents
Abstract
This PEP proposes a literal syntax for the bytes objects introduced in PEP 358. The purpose is to provide a convenient way to spell ASCII strings and arbitrary binary data.
Motivation
Existing spellings of an ASCII string in Python 3000 include:
bytes('Hello world', 'ascii')
'Hello world'.encode('ascii')
The proposed syntax is:
b'Hello world'
Existing spellings of an 8-bit binary sequence in Python 3000 include:
bytes([0x7f, 0x45, 0x4c, 0x46, 0x01, 0x01, 0x01, 0x00])
bytes('\x7fELF\x01\x01\x01\0', 'latin-1')
'7f454c4601010100'.decode('hex')
The proposed syntax is:
b'\x7f\x45\x4c\x46\x01\x01\x01\x00'
b'\x7fELF\x01\x01\x01\0'
In both cases, the advantages of the new syntax are brevity, some small efficiency gain, and the detection of encoding errors at compile time rather than at runtime. The brevity benefit is especially felt when using the string-like methods of bytes objects:
lines = bdata.split(bytes('\n', 'ascii')) # existing syntax
lines = bdata.split(b'\n') # proposed syntax
And when converting code from Python 2.x to Python 3000:
sok.send('EXIT\r\n') # Python 2.x
sok.send('EXIT\r\n'.encode('ascii')) # Python 3000 existing
sok.send(b'EXIT\r\n') # proposed
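The equivalences claimed above can be checked directly in Python 3; note that the old '...'.decode('hex') spelling does not exist in Python 3, where the equivalent is bytes.fromhex().

```python
# Runnable check of the equivalent spellings (Python 3 semantics).
data = b'Hello world'
assert data == bytes('Hello world', 'ascii')
assert data == 'Hello world'.encode('ascii')

elf = b'\x7f\x45\x4c\x46\x01\x01\x01\x00'
assert elf == b'\x7fELF\x01\x01\x01\0'
assert elf == bytes([0x7f, 0x45, 0x4c, 0x46, 0x01, 0x01, 0x01, 0x00])
assert elf == bytes.fromhex('7f454c4601010100')

# The brevity gain with string-like methods:
bdata = b'one\ntwo'
lines = bdata.split(b'\n')
print(lines)  # [b'one', b'two']
```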
Grammar Changes
The proposed syntax is an extension of the existing string syntax [1].
The new syntax for strings, including the new bytes literal, is:
stringliteral: [stringprefix] (shortstring | longstring)
stringprefix: "b" | "r" | "br" | "B" | "R" | "BR" | "Br" | "bR"
shortstring: "'" shortstringitem* "'" | '"' shortstringitem* '"'
longstring: "'''" longstringitem* "'''" | '"""' longstringitem* '"""'
shortstringitem: shortstringchar | escapeseq
longstringitem: longstringchar | escapeseq
shortstringchar:
<any source character except "\" or newline or the quote>
longstringchar: <any source character except "\">
escapeseq: "\" NL
| "\\" | "\'" | '\"'
| "\a" | "\b" | "\f" | "\n" | "\r" | "\t" | "\v"
| "\ooo" | "\xhh"
| "\uxxxx" | "\Uxxxxxxxx" | "\N{name}"
The following additional restrictions apply only to bytes literals (stringliteral tokens with b or B in the stringprefix):
- Each shortstringchar or longstringchar must be a character between 1 and 127 inclusive, regardless of any encoding declaration [2] in the source file.
- The Unicode-specific escape sequences \uxxxx, \Uxxxxxxxx, and \N{name} are unrecognized in Python 2.x and forbidden in Python 3000.
Adjacent bytes literals are subject to the same concatenation rules as adjacent string literals [3]. A bytes literal adjacent to a string literal is an error.
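Both rules can be observed directly; the compile() probe is just a way to demonstrate the mixing error without breaking the script at parse time.

```python
# Adjacent bytes literals concatenate like adjacent string literals:
msg = b'GET ' b'/index.html'
assert msg == b'GET /index.html'

# Mixing a bytes literal with a string literal is rejected at compile
# time; compile() lets us observe the SyntaxError safely.
try:
    compile("b'a' 'b'", '<demo>', 'eval')
    mixed_allowed = True
except SyntaxError:
    mixed_allowed = False
print(mixed_allowed)  # False
```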
Semantics
Each evaluation of a bytes literal produces a new bytes object. The bytes in the new object are the bytes represented by the shortstringitem or longstringitem parts of the literal, in the same order.
Rationale
The proposed syntax provides a cleaner migration path from Python 2.x to Python 3000 for most code involving 8-bit strings. Preserving the old 8-bit meaning of a string literal is usually as simple as adding a b prefix. The one exception is Python 2.x strings containing bytes >127, which must be rewritten using escape sequences. Transcoding a source file from one encoding to another, and fixing up the encoding declaration, should preserve the meaning of the program. Python 2.x non-Unicode strings violate this principle; Python 3000 bytes literals shouldn't.
A string literal with a b in the prefix is always a syntax error in Python 2.5, so this syntax can be introduced in Python 2.6, along with the bytes type.
A bytes literal produces a new object each time it is evaluated, like list displays and unlike string literals. This is necessary because bytes literals, like lists and unlike strings, are mutable [4].
Reference Implementation
Thomas Wouters has checked an implementation into the Py3K branch, r53872.
References
| [1] | http://docs.python.org/reference/lexical_analysis.html#string-literals |
| [2] | http://docs.python.org/reference/lexical_analysis.html#encoding-declarations |
| [3] | http://docs.python.org/reference/lexical_analysis.html#string-literal-concatenation |
| [4] | http://mail.python.org/pipermail/python-3000/2007-February/005779.html |
Copyright
This document has been placed in the public domain.
pep-3113 Removal of Tuple Parameter Unpacking
| PEP: | 3113 |
|---|---|
| Title: | Removal of Tuple Parameter Unpacking |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Brett Cannon <brett at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 02-Mar-2007 |
| Python-Version: | 3.0 |
| Post-History: |
Contents
Abstract
Tuple parameter unpacking is the use of a tuple as a parameter in a function signature so as to have a sequence argument automatically unpacked. An example is:
def fxn(a, (b, c), d):
pass
The use of (b, c) in the signature requires that the second argument to the function be a sequence of length two (e.g., [42, -13]). When such a sequence is passed, it is unpacked and has its values assigned to the parameters, just as if the statement b, c = [42, -13] had been executed at the start of the function body.
Unfortunately this feature of Python's rich function signature abilities, while handy in some situations, causes more issues than it is worth. Thus this PEP proposes the removal of tuple parameter unpacking from the language in Python 3.0.
Why They Should Go
Introspection Issues
Python has very powerful introspection capabilities. These extend to function signatures. There are no hidden details as to what a function's call signature is. In general it is fairly easy to figure out various details about a function's signature by viewing the function object and various attributes on it (including the function's func_code attribute).
But there is great difficulty when it comes to tuple parameters. The existence of a tuple parameter is denoted by its name being made of a . and a number in the co_varnames attribute of the function's code object. This allows the tuple argument to be bound to a name that only the bytecode is aware of and cannot be typed in Python source. But this does not specify the format of the tuple: its length, whether there are nested tuples, etc.
In order to get all of the details about the tuple from the function one must analyse the bytecode of the function. This is because the first bytecode in the function literally translates into the tuple argument being unpacked. Assuming the tuple parameter is named .1 and is expected to unpack to variables spam and monty (meaning it is the tuple (spam, monty)), the first bytecode in the function will be for the statement spam, monty = .1. This means that to know all of the details of the tuple parameter one must look at the initial bytecode of the function to detect tuple unpacking for parameters formatted as \.\d+ and deduce any and all information about the expected argument. Bytecode analysis is how the inspect.getargspec function is able to provide information on tuple parameters. This is not easy to do and is burdensome on introspection tools as they must know how Python bytecode works (an otherwise unneeded burden as all other types of parameters do not require knowledge of Python bytecode).
The difficulty of analysing bytecode notwithstanding, there is another issue with the dependency on using Python bytecode. IronPython [3] does not use Python's bytecode. Because it is based on the .NET framework, it instead stores MSIL [4] in the func_code.co_code attribute of the function. This fact prevents the inspect.getargspec function from working when run under IronPython. It is unknown whether other Python implementations are affected, but it is reasonable to assume they are if the implementation is not just a re-implementation of the Python virtual machine.
No Loss of Abilities If Removed
As mentioned in Introspection Issues, to handle tuple parameters the function's bytecode starts with the bytecode required to unpack the argument into the proper parameter names. This means that there is no special support required to implement tuple parameters and thus there is no loss of abilities if they were to be removed, only a possible convenience (which is addressed in Why They Should (Supposedly) Stay).
The example function at the beginning of this PEP could easily be rewritten as:
def fxn(a, b_c, d):
b, c = b_c
pass
and in no way lose functionality.
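A concrete Python-3-compatible rewrite, using the Cartesian-point example discussed later in this PEP (the function name here is illustrative, not from the PEP):

```python
# Tuple unpacking moved from the signature into the body.
def distance(point_a, point_b):
    (x1, y1) = point_a   # was: def distance((x1, y1), (x2, y2)):
    (x2, y2) = point_b
    return ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5

print(distance((0, 0), (3, 4)))  # 5.0
```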
Exception To The Rule
When looking at the various types of parameters that a Python function can have, one will notice that tuple parameters tend to be an exception rather than the rule.
Consider PEP 3102 (keyword-only arguments) and PEP 3107 (function annotations) [5] [6]. Both PEPs have been accepted and introduce new functionality within a function's signature. And yet for both PEPs the new feature cannot be applied to tuple parameters as a whole. PEP 3102 has no support for tuple parameters at all (which makes sense as there is no way to reference a tuple parameter by name). PEP 3107 allows annotations for each item within the tuple (e.g., (x:int, y:int)), but not the whole tuple (e.g., (x, y):int).
The existence of tuple parameters also places sequence objects separately from mapping objects in a function signature. There is no way to pass in a mapping object (e.g., a dict) as a parameter and have it unpack in the same fashion as a sequence does into a tuple parameter.
Uninformative Error Messages
Consider the following function:
def fxn((a, b), (c, d)):
pass
If called as fxn(1, (2, 3)) one is given the error message TypeError: unpack non-sequence. This error message in no way tells you which tuple was not unpacked properly, nor does it indicate that the error arose from the arguments at all. Other error messages regarding arguments to functions explicitly state their relation to the signature: TypeError: fxn() takes exactly 2 arguments (0 given), etc.
Little Usage
While an informal poll of the handful of Python programmers I know personally and from the PyCon 2007 sprint indicates that a huge majority of people do not know of this feature and the rest simply do not use it, hard numbers are needed to back up the claim that the feature is not heavily used.
Iterating over every line in the Lib/ directory of Python's code repository, using the regular expression ^\s*def\s*\w+\s*\( to detect function and method definitions, yielded 22,252 matches in the trunk.
Tacking on .*,\s*\( to find def statements that contained a tuple parameter, only 41 matches were found. This means that for def statements, only 0.18% of them seem to use a tuple parameter.
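The counting method can be sketched as follows; the two-line corpus here is a stand-in for the real Lib/ tree, not the PEP's actual data, and the combined pattern simply appends the second regex to the first as described.

```python
# Hedged sketch of the def-statement counting approach.
import re

def_re = re.compile(r'^\s*def\s*\w+\s*\(')        # any def statement
tuple_re = re.compile(r'^\s*def\s*\w+\s*\(.*,\s*\(')  # tuple parameter

source = """\
def plain(a, b):
    pass
def tupled(a, (b, c)):
    pass
"""

defs = [ln for ln in source.splitlines() if def_re.match(ln)]
tupled = [ln for ln in defs if tuple_re.match(ln)]
print(len(defs), len(tupled))  # 2 1
```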
Why They Should (Supposedly) Stay
Practical Use
In certain instances tuple parameters can be useful. A common example is code that expects a two-item tuple that represents a Cartesian point. While it is true that having the x and y coordinates unpacked for you is nice, the argument is that this small amount of practical usefulness is heavily outweighed by the other issues pertaining to tuple parameters. And as shown in No Loss Of Abilities If Removed, their use is purely a convenience and in no way provides a unique ability that cannot be handled very easily in other ways.
Self-Documentation For Parameters
It has been argued that tuple parameters provide a way of self-documentation for parameters that are expected to be of a certain sequence format. Using our Cartesian point example from Practical Use, seeing (x, y) as a parameter in a function makes it obvious that a tuple of length two is expected as an argument for that parameter.
But Python provides several other ways to document what parameters are for. Documentation strings are meant to provide enough information to explain what arguments are expected. A tuple parameter might tell you the expected length of a sequence argument, but it does not tell you what that data will be used for. One must also read the docstring to know what other arguments are expected if not all parameters are tuple parameters.
Function annotations (which do not work with tuple parameters) can also supply documentation. Because annotations can be of any form, what was once a tuple parameter can be a single argument parameter with an annotation of tuple, tuple(2), Cartesian point, (x, y), etc. Annotations provide great flexibility for documenting what an argument is expected to be for a parameter, including being a sequence of a certain length.
Transition Plan
To transition Python 2.x code to 3.x where tuple parameters are removed, two steps are suggested. First, the proper warning is to be emitted when Python's compiler comes across a tuple parameter in Python 2.6. This will be treated like any other syntactic change that is to occur in Python 3.0 compared to Python 2.6.
Second, the 2to3 refactoring tool [1] will gain a fixer [2] for translating tuple parameters to being a single parameter that is unpacked as the first statement in the function. The name of the new parameter will be changed. The new parameter will then be unpacked into the names originally used in the tuple parameter. This means that the following function:
def fxn((a, (b, c))):
pass
will be translated into:
def fxn(a_b_c):
(a, (b, c)) = a_b_c
pass
As tuple parameters are used by lambdas because of the single expression limitation, they must also be supported. This is done by having the expected sequence argument bound to a single parameter and then indexing on that parameter:
lambda (x, y): x + y
will be translated into:
lambda x_y: x_y[0] + x_y[1]
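Both translated forms run as-is in Python 3:

```python
# The 2to3-style translations described above, runnable in Python 3.
def fxn(a_b_c):
    # original Python 2 signature: def fxn((a, (b, c))):
    (a, (b, c)) = a_b_c
    return a + b + c

assert fxn((1, (2, 3))) == 6

# original Python 2 form: lambda (x, y): x + y
add = lambda x_y: x_y[0] + x_y[1]
print(add((2, 3)))  # 5
```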
References
| [1] | 2to3 refactoring tool (http://svn.python.org/view/sandbox/trunk/2to3/) |
| [2] | 2to3 fixer (http://svn.python.org/view/sandbox/trunk/2to3/fixes/fix_tuple_params.py) |
| [3] | IronPython (http://www.codeplex.com/Wiki/View.aspx?ProjectName=IronPython) |
| [4] | Microsoft Intermediate Language (http://msdn.microsoft.com/library/en-us/cpguide/html/cpconmicrosoftintermediatelanguagemsil.asp?frame=true) |
| [5] | PEP 3102 (Keyword-Only Arguments) (http://www.python.org/dev/peps/pep-3102/) |
| [6] | PEP 3107 (Function Annotations) (http://www.python.org/dev/peps/pep-3107/) |
Copyright
This document has been placed in the public domain.
pep-3114 Renaming iterator.next() to iterator.__next__()
| PEP: | 3114 |
|---|---|
| Title: | Renaming iterator.next() to iterator.__next__() |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Ka-Ping Yee <ping at zesty.ca> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 04-Mar-2007 |
| Python-Version: | 3.0 |
| Post-History: |
Contents
Abstract
The iterator protocol in Python 2.x consists of two methods: __iter__() called on an iterable object to yield an iterator, and next() called on an iterator object to yield the next item in the sequence. Using a for loop to iterate over an iterable object implicitly calls both of these methods. This PEP proposes that the next method be renamed to __next__, consistent with all the other protocols in Python in which a method is implicitly called as part of a language-level protocol, and that a built-in function named next be introduced to invoke the __next__ method, consistent with the manner in which other protocols are explicitly invoked.
Names With Double Underscores
In Python, double underscores before and after a name are used to distinguish names that belong to the language itself. Attributes and methods that are implicitly used or created by the interpreter employ this naming convention; some examples are:
- __file__ - an attribute automatically created by the interpreter
- __dict__ - an attribute with special meaning to the interpreter
- __init__ - a method implicitly called by the interpreter
Note that this convention applies to methods such as __init__ that are explicitly defined by the programmer, as well as attributes such as __file__ that can only be accessed by naming them explicitly, so it includes names that are used or created by the interpreter.
(Not all things that are called "protocols" are made of methods with double-underscore names. For example, the __contains__ method has double underscores because the language construct x in y implicitly calls __contains__. But even though the read method is part of the file protocol, it does not have double underscores because there is no language construct that implicitly invokes x.read().)
The use of double underscores creates a separate namespace for names that are part of the Python language definition, so that programmers are free to create variables, attributes, and methods that start with letters, without fear of silently colliding with names that have a language-defined purpose. (Colliding with reserved keywords is still a concern, but at least this will immediately yield a syntax error.)
The naming of the next method on iterators is an exception to this convention. Code that nowhere contains an explicit call to a next method can nonetheless be silently affected by the presence of such a method. Therefore, this PEP proposes that iterators should have a __next__ method instead of a next method (with no change in semantics).
Double-Underscore Methods and Built-In Functions
The Python language defines several protocols that are implemented or customized by defining methods with double-underscore names. In each case, the protocol is provided by an internal method implemented as a C function in the interpreter. For objects defined in Python, this C function supports customization by implicitly invoking a Python method with a double-underscore name (it often does a little bit of additional work beyond just calling the Python method.)
Sometimes the protocol is invoked by a syntactic construct:
- x[y] --> internal tp_getitem --> x.__getitem__(y)
- x + y --> internal nb_add --> x.__add__(y)
- -x --> internal nb_negative --> x.__neg__()
Sometimes there is no syntactic construct, but it is still useful to be able to explicitly invoke the protocol. For such cases Python offers a built-in function of the same name but without the double underscores.
- len(x) --> internal sq_length --> x.__len__()
- hash(x) --> internal tp_hash --> x.__hash__()
- iter(x) --> internal tp_iter --> x.__iter__()
Following this pattern, the natural way to handle next is to add a next built-in function that behaves in exactly the same fashion.
- next(x) --> internal tp_iternext --> x.__next__()
Further, it is proposed that the next built-in function accept a sentinel value as an optional second argument, following the style of the getattr and iter built-in functions. When called with two arguments, next catches the StopIteration exception and returns the sentinel value instead of propagating the exception. This creates a nice duality between iter and next:
iter(function, sentinel) <--> next(iterator, sentinel)
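The renamed protocol and the sentinel duality can be sketched with a minimal Python 3 iterator (the Countdown class is an illustration, not from the PEP):

```python
# A minimal iterator using the renamed protocol method.
class Countdown:
    def __init__(self, n):
        self.n = n
    def __iter__(self):
        return self
    def __next__(self):            # spelled "next" in Python 2
        if self.n <= 0:
            raise StopIteration
        self.n -= 1
        return self.n + 1

assert list(Countdown(3)) == [3, 2, 1]

# next(iterator, sentinel) returns the sentinel instead of raising:
it = Countdown(1)
assert next(it) == 1
assert next(it, 'empty') == 'empty'

# The dual form, iter(function, sentinel), calls the function until
# the sentinel value appears:
vals = iter([0, 1, 2, 3])
print(list(iter(lambda: next(vals), 2)))  # [0, 1]
```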
Previous Proposals
This proposal is not a new idea. The idea proposed here was supported by the BDFL on python-dev [1] and is even mentioned in the original iterator PEP, PEP 234:
(In retrospect, it might have been better to go for __next__() and have a new built-in, next(it), which calls it.__next__(). But alas, it's too late; this has been deployed in Python 2.2 since December 2001.)
Objections
There have been a few objections to the addition of more built-ins. In particular, Martin von Loewis writes [2]:
I dislike the introduction of more builtins unless they have a true generality (i.e. are likely to be needed in many programs). For this one, I think the normal usage of __next__ will be with a for loop, so I don't think one would often need an explicit next() invocation. It is also not true that most protocols are explicitly invoked through builtin functions. Instead, most protocols can be explicitly invoked through methods in the operator module. So following tradition, it should be operator.next. ... As an alternative, I propose that object grows a .next() method, which calls __next__ by default.
Transition Plan
Two additional transformations will be added to the 2to3 translation tool [3]:
- Method definitions named next will be renamed to __next__.
- Explicit calls to the next method will be replaced with calls to the built-in next function. For example, x.next() will become next(x).
Collin Winter looked into the possibility of automatically deciding whether to perform the second transformation depending on the presence of a module-level binding to next [4] and found that it would be "ugly and slow". Instead, the translation tool will emit warnings upon detecting such a binding. Collin has proposed warnings for the following conditions [5]:
- Module-level assignments to next.
- Module-level definitions of a function named next.
- Module-level imports of the name next.
- Assignments to __builtin__.next.
Implementation
A patch with the necessary changes (except the 2to3 tool) was written by Georg Brandl and committed as revision 54910.
References
| [1] | Single- vs. Multi-pass iterability (Guido van Rossum) http://mail.python.org/pipermail/python-dev/2002-July/026814.html |
| [2] | PEP: rename it.next() to it.__next__()... (Martin von Loewis) http://mail.python.org/pipermail/python-3000/2007-March/005965.html |
| [3] | 2to3 refactoring tool http://svn.python.org/view/sandbox/trunk/2to3/ |
| [4] | PEP: rename it.next() to it.__next__()... (Collin Winter) http://mail.python.org/pipermail/python-3000/2007-March/006020.html |
| [5] | (1, 2) PEP 3113 transition plan http://mail.python.org/pipermail/python-3000/2007-March/006044.html |
| [6] | PEP: rename it.next() to it.__next__()... (Guido van Rossum) http://mail.python.org/pipermail/python-3000/2007-March/006027.html |
Copyright
This document has been placed in the public domain.
pep-3115 Metaclasses in Python 3000
| PEP: | 3115 |
|---|---|
| Title: | Metaclasses in Python 3000 |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Talin <talin at acm.org> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 07-Mar-2007 |
| Python-Version: | 3.0 |
| Post-History: | 11-March-2007, 14-March-2007 |
Abstract
This PEP proposes changing the syntax for declaring metaclasses,
and alters the semantics for how classes with metaclasses are
constructed.
Rationale
There are two rationales for this PEP, both of which are somewhat
subtle.
The primary reason for changing the way metaclasses work is that
there are a number of interesting use cases that require the
metaclass to get involved earlier in the class construction process
than is currently possible. Currently, the metaclass mechanism is
essentially a post-processing step. With the advent of class
decorators, many of these post-processing chores can be taken over
by the decorator mechanism.
In particular, there is an important body of use cases where it
would be useful to preserve the order in which class members are
declared. Ordinary Python objects store their members in a
dictionary, in which ordering is unimportant, and members are
accessed strictly by name. However, Python is often used to
interface with external systems in which the members are organized
according to an implicit ordering. Examples include declarations of
C structs; COM objects; and automatic translation of Python classes
into IDL or database schemas, such as is used in an ORM; and so on.
In such cases, it would be useful for a Python programmer to specify
such ordering directly using the declaration order of class members.
Currently, such orderings must be specified explicitly, using some
other mechanism (see the ctypes module for an example.)
Unfortunately, the current method for declaring a metaclass does
not allow for this, since the ordering information has already been
lost by the time the metaclass comes into play. By allowing the
metaclass to get involved in the class construction process earlier,
the new system allows the ordering or other early artifacts of
construction to be preserved and examined.
The proposed metaclass mechanism also supports a number of other
interesting use cases beyond preserving the ordering of declarations.
One use case is to insert symbols into the namespace of the class
body which are only valid during class construction. An example of
this might be "field constructors", small functions that are used in
the creation of class members. Another interesting possibility is
supporting forward references, i.e. references to Python
symbols that are declared further down in the class body.
The other, weaker, rationale is purely cosmetic: The current method
for specifying a metaclass is by assignment to the special variable
__metaclass__, which is considered by some to be aesthetically less
than ideal. Others disagree strongly with that opinion. This PEP
will not address this issue, other than to note it, since aesthetic
debates cannot be resolved via logical proofs.
Specification
In the new model, the syntax for specifying a metaclass is via a
keyword argument in the list of base classes:
class Foo(base1, base2, metaclass=mymeta):
...
Additional keywords will also be allowed here, and will be passed to
the metaclass, as in the following example:
class Foo(base1, base2, metaclass=mymeta, private=True):
...
Note that this PEP makes no attempt to define what these other
keywords might be - that is up to metaclass implementors to
determine.
More generally, the parameter list passed to a class definition will
now support all of the features of a function call, meaning that you
can now use *args and **kwargs-style arguments in the class base
list:
class Foo(*bases, **kwds):
...
Invoking the Metaclass
In the current metaclass system, the metaclass object can be any
callable type. This does not change; however, in order to fully
exploit all of the new features, the metaclass will need to have an
extra attribute which is used during class pre-construction.
This attribute is named __prepare__, which is invoked as a function
before the evaluation of the class body. The __prepare__ function
takes two positional arguments, and an arbitrary number of keyword
arguments. The two positional arguments are:
'name' - the name of the class being created.
'bases' - the list of base classes.
The interpreter always tests for the existence of __prepare__ before
calling it; if it is not present, then a regular dictionary is used,
as illustrated in the following Python snippet.
def prepare_class(name, *bases, metaclass=None, **kwargs):
    if metaclass is None:
        metaclass = compute_default_metaclass(bases)
    prepare = getattr(metaclass, '__prepare__', None)
    if prepare is not None:
        return prepare(name, bases, **kwargs)
    else:
        return dict()
The example above illustrates how the arguments to 'class' are
interpreted. The class name is the first argument, followed by
an arbitrary length list of base classes. After the base classes,
there may be one or more keyword arguments, one of which can be
'metaclass'. Note that the 'metaclass' argument is not included
in kwargs, since it is filtered out by the normal parameter
assignment algorithm. (Note also that 'metaclass' is a keyword-only
argument as per PEP 3102 [6].)
Even though __prepare__ is not required, the default metaclass
('type') implements it, for the convenience of subclasses calling
it via super().
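The delegation mentioned above can be sketched as follows (this example is illustrative, not from the PEP; the names Meta, C, and created_by are hypothetical):

```python
# Sketch: a metaclass delegating to type.__prepare__ via super(),
# which works because the default metaclass ('type') implements it.
class Meta(type):
    @classmethod
    def __prepare__(metacls, name, bases, **kwds):
        ns = super().__prepare__(name, bases, **kwds)  # a plain dict
        ns["created_by"] = "Meta"   # seed the class namespace
        return ns

class C(metaclass=Meta):
    pass

print(C.created_by)  # seeded entries become class attributes
```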
__prepare__ returns a dictionary-like object which is used to store
the class member definitions during evaluation of the class body.
In other words, the class body is evaluated as a function block
(just like it is now), except that the local variables dictionary
is replaced by the dictionary returned from __prepare__. This
dictionary object can be a regular dictionary or a custom mapping
type.
This dictionary-like object is not required to support the full
dictionary interface. A dictionary which supports a limited set of
dictionary operations will restrict what kinds of actions can occur
during evaluation of the class body. A minimal implementation might
only support adding and retrieving values from the dictionary - most
class bodies will do no more than that during evaluation. For some
classes, it may be desirable to support deletion as well. Many
metaclasses will need to make a copy of this dictionary afterwards,
so iteration or other means for reading out the dictionary contents
may also be useful.
The __prepare__ method will most often be implemented as a class
method rather than an instance method because it is called before
the metaclass instance (i.e. the class itself) is created.
Once the class body has finished evaluating, the metaclass will be
called (as a callable) with the class dictionary, which is no
different from the current metaclass mechanism.
Typically, a metaclass will create a custom dictionary - either a
subclass of dict, or a wrapper around it - that will contain
additional properties that are set either before or during the
evaluation of the class body. Then in the second phase, the
metaclass can use these additional properties to further customize
the class.
An example would be a metaclass that uses information about the
ordering of member declarations to create a C struct. The metaclass
would provide a custom dictionary that simply keeps a record of the
order of insertions. This does not need to be a full 'ordered dict'
implementation, but rather just a Python list of (key,value) pairs
that is appended to for each insertion.
Note that in such a case, the metaclass would be required to deal
with the possibility of duplicate keys, but in most cases that is
trivial. The metaclass can use the first declaration, the last,
combine them in some fashion, or simply throw an exception. It's up
to the metaclass to decide how it wants to handle that case.
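One of the duplicate-handling policies mentioned above (raising an exception) can be sketched like this; the names NoDupDict and NoDupMeta are hypothetical, not part of the PEP:

```python
# Illustrative sketch: a class dict that rejects duplicate member
# definitions by raising TypeError on the second assignment.
class NoDupDict(dict):
    def __setitem__(self, key, value):
        # Ignore dunder entries the compiler writes (e.g. __module__).
        if key in self and not (key.startswith("__") and key.endswith("__")):
            raise TypeError("duplicate member: %s" % key)
        dict.__setitem__(self, key, value)

class NoDupMeta(type):
    @classmethod
    def __prepare__(metacls, name, bases):
        return NoDupDict()

class Ok(metaclass=NoDupMeta):
    x = 1

try:
    class Bad(metaclass=NoDupMeta):
        x = 1
        x = 2          # second definition of 'x' -> TypeError
except TypeError as exc:
    print("rejected:", exc)
```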
Example:
Here's a simple example of a metaclass which creates a list of
the names of all class members, in the order that they were
declared:
# The custom dictionary
class member_table(dict):
    def __init__(self):
        self.member_names = []

    def __setitem__(self, key, value):
        # if the key is not already defined, add to the
        # list of keys.
        if key not in self:
            self.member_names.append(key)

        # Call superclass
        dict.__setitem__(self, key, value)

# The metaclass
class OrderedClass(type):

    # The prepare function
    @classmethod
    def __prepare__(metacls, name, bases):  # No keywords in this case
        return member_table()

    # The metaclass invocation
    def __new__(cls, name, bases, classdict):
        # Note that we replace the classdict with a regular
        # dict before passing it to the superclass, so that we
        # don't continue to record member names after the class
        # has been created.
        result = type.__new__(cls, name, bases, dict(classdict))
        result.member_names = classdict.member_names
        return result

class MyClass(metaclass=OrderedClass):
    # method1 goes in array element 0
    def method1(self):
        pass

    # method2 goes in array element 1
    def method2(self):
        pass
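The example can be exercised with a short, self-contained rendering (one detail worth noting, not stated above: under Python 3 the compiler also records entries such as __module__ and __qualname__ in the class dict before the user's members):

```python
# Self-contained, runnable rendering of the example above.
class member_table(dict):
    def __init__(self):
        self.member_names = []

    def __setitem__(self, key, value):
        if key not in self:
            self.member_names.append(key)
        dict.__setitem__(self, key, value)

class OrderedClass(type):
    @classmethod
    def __prepare__(metacls, name, bases):
        return member_table()

    def __new__(cls, name, bases, classdict):
        result = type.__new__(cls, name, bases, dict(classdict))
        result.member_names = classdict.member_names
        return result

class MyClass(metaclass=OrderedClass):
    def method1(self):
        pass

    def method2(self):
        pass

# The compiler-written dunder entries come first; the user-declared
# members follow in declaration order.
print(MyClass.member_names[-2:])
```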
Sample Implementation:
Guido van Rossum has created a patch which implements the new
functionality:
http://python.org/sf/1681101
Alternate Proposals
Josiah Carlson proposed using the name 'type' instead of
'metaclass', on the theory that what is really being specified is
the type of the type. While this is technically correct, it is also
confusing from the point of view of a programmer creating a new
class. From the application programmer's point of view, the 'type'
that they are interested in is the class that they are writing; the
type of that type is the metaclass.
There were some objections in the discussion to the 'two-phase'
creation process, where the metaclass is invoked twice, once to
create the class dictionary and once to 'finish' the class. Some
people felt that these two phases should be completely separate, in
that there ought to be separate syntax for specifying the custom
dict as for specifying the metaclass. However, in most cases, the
two will be intimately tied together, and the metaclass will most
likely have an intimate knowledge of the internal details of the
class dict. Requiring the programmer to ensure that the correct dict
type and the correct metaclass type are used together creates an
additional and unneeded burden on the programmer.
Another good suggestion was to simply use an ordered dict for all
classes, and skip the whole 'custom dict' mechanism. This was based
on the observation that most use cases for a custom dict were for
the purposes of preserving order information. However, this idea has
several drawbacks, first because it means that an ordered dict
implementation would have to be added to the set of built-in types
in Python, and second because it would impose a slight speed (and
complexity) penalty on all class declarations. Later, several people
came up with ideas for use cases for custom dictionaries other
than preserving field orderings, so this idea was dropped.
Backwards Compatibility
It would be possible to leave the existing __metaclass__ syntax in
place. Alternatively, it would not be too difficult to modify the
syntax rules of the Py3K translation tool to convert from the old to
the new syntax.
References
[1] [Python-3000] Metaclasses in Py3K (original proposal)
http://mail.python.org/pipermail/python-3000/2006-December/005030.html
[2] [Python-3000] Metaclasses in Py3K (Guido's suggested syntax)
http://mail.python.org/pipermail/python-3000/2006-December/005033.html
[3] [Python-3000] Metaclasses in Py3K (Objections to two-phase init)
http://mail.python.org/pipermail/python-3000/2006-December/005108.html
[4] [Python-3000] Metaclasses in Py3K (Always use an ordered dict)
http://mail.python.org/pipermail/python-3000/2006-December/005118.html
[5] PEP 359: The 'make' statement -
http://www.python.org/dev/peps/pep-0359/
[6] PEP 3102: Keyword-only arguments -
http://www.python.org/dev/peps/pep-3102/
Copyright
This document has been placed in the public domain.
pep-3116 New I/O
| PEP: | 3116 |
|---|---|
| Title: | New I/O |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Daniel Stutzbach <daniel at stutzbachenterprises.com>, Guido van Rossum <guido at python.org>, Mike Verdone <mike.verdone at gmail.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 26-Feb-2007 |
| Python-Version: | 3.0 |
| Post-History: | 26-Feb-2007 |
Contents
Rationale and Goals
Python allows for a variety of stream-like (a.k.a. file-like) objects that can be used via read() and write() calls. Anything that provides read() and write() is stream-like. However, more exotic and extremely useful functions like readline() or seek() may or may not be available on every stream-like object. Python needs a specification for basic byte-based I/O streams to which we can add buffering and text-handling features.
Once we have a defined raw byte-based I/O interface, we can add buffering and text handling layers on top of any byte-based I/O class. The same buffering and text handling logic can be used for files, sockets, byte arrays, or custom I/O classes developed by Python programmers. Developing a standard definition of a stream lets us separate stream-based operations like read() and write() from implementation specific operations like fileno() and isatty(). It encourages programmers to write code that uses streams as streams and not require that all streams support file-specific or socket-specific operations.
The new I/O spec is intended to be similar to the Java I/O libraries, but generally less confusing. Programmers who don't want to muck about in the new I/O world can expect that the open() factory method will produce an object backwards-compatible with old-style file objects.
Specification
The Python I/O Library will consist of three layers: a raw I/O layer, a buffered I/O layer, and a text I/O layer. Each layer is defined by an abstract base class, which may have multiple implementations. The raw I/O and buffered I/O layers deal with units of bytes, while the text I/O layer deals with units of characters.
Raw I/O
The abstract base class for raw I/O is RawIOBase. It has several methods which are wrappers around the appropriate operating system calls. If one of these functions would not make sense on the object, the implementation must raise an IOError exception. For example, if a file is opened read-only, the .write() method will raise an IOError. As another example, if the object represents a socket, then .seek(), .tell(), and .truncate() will raise an IOError. Generally, a call to one of these functions maps to exactly one operating system call.
.read(n: int) -> bytes
    Read up to n bytes from the object and return them. Fewer than n bytes may be returned if the operating system call returns fewer than n bytes. If 0 bytes are returned, this indicates end of file. If the object is in non-blocking mode and no bytes are available, the call returns None.

.readinto(b: bytes) -> int
    Read up to len(b) bytes from the object and store them in b, returning the number of bytes read. Like .read, fewer than len(b) bytes may be read, and 0 indicates end of file. None is returned if a non-blocking object has no bytes available. The length of b is never changed.

.write(b: bytes) -> int
    Returns number of bytes written, which may be < len(b).

.seek(pos: int, whence: int = 0) -> int
.tell() -> int
.truncate(n: int = None) -> int
.close() -> None
Additionally, it defines a few other methods:
.readable() -> bool
    Returns True if the object was opened for reading, False otherwise. If False, .read() will raise an IOError if called.

.writable() -> bool
    Returns True if the object was opened for writing, False otherwise. If False, .write() and .truncate() will raise an IOError if called.

.seekable() -> bool
    Returns True if the object supports random access (such as disk files), or False if the object only supports sequential access (such as sockets, pipes, and ttys). If False, .seek(), .tell(), and .truncate() will raise an IOError if called.

.__enter__() -> ContextManager
    Context management protocol. Returns self.

.__exit__(...) -> None
    Context management protocol. Same as .close().
If and only if a RawIOBase implementation operates on an underlying file descriptor, it must additionally provide a .fileno() member function. This could be defined specifically by the implementation, or a mix-in class could be used (need to decide about this).
.fileno() -> int
Returns the underlying file descriptor (an integer)
Initially, three implementations will be provided that implement the RawIOBase interface: FileIO, SocketIO (in the socket module), and ByteIO. Each implementation must determine whether the object supports random access, as the information provided by the user may not be sufficient (consider open("/dev/tty", "rw") or open("/tmp/named-pipe", "rw")). As an example, FileIO can determine this by calling the seek() system call; if it returns an error, the object does not support random access. Each implementation may provide additional methods appropriate to its type. The ByteIO object is analogous to Python 2's cStringIO library, but operating on the new bytes type instead of strings.
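The raw layer can be exercised with today's io module, which implements this design (note one naming assumption: the in-memory byte class ended up being called BytesIO rather than ByteIO):

```python
# Sketch: writing and reading through the raw layer via io.FileIO.
import io
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

with io.FileIO(path, "w") as raw:
    n = raw.write(b"hello")      # bytes written; may be < len(b) in general

with io.FileIO(path, "r") as raw:
    assert raw.readable() and raw.seekable()   # a disk file is seekable
    buf = bytearray(16)
    count = raw.readinto(buf)    # bytes read; 0 would mean end of file
    print(bytes(buf[:count]))

os.unlink(path)
```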
Buffered I/O
The next layer is the Buffered I/O layer which provides more efficient access to file-like objects. The abstract base class for all Buffered I/O implementations is BufferedIOBase, which provides similar methods to RawIOBase:
.read(n: int = -1) -> bytes
    Returns the next n bytes from the object. It may return fewer than n bytes if end-of-file is reached or the object is non-blocking. 0 bytes indicates end-of-file. This method may make multiple calls to RawIOBase.read() to gather the bytes, or may make no calls to RawIOBase.read() if all of the needed bytes are already buffered.

.readinto(b: bytes) -> int

.write(b: bytes) -> int
    Write b bytes to the buffer. The bytes are not guaranteed to be written to the Raw I/O object immediately; they may be buffered. Returns len(b).

.seek(pos: int, whence: int = 0) -> int
.tell() -> int
.truncate(pos: int = None) -> int
.flush() -> None
.close() -> None
.readable() -> bool
.writable() -> bool
.seekable() -> bool
.__enter__() -> ContextManager
.__exit__(...) -> None
Additionally, the abstract base class provides one member variable:
.raw
A reference to the underlying RawIOBase object.
The BufferedIOBase method signatures are mostly identical to those of RawIOBase (exceptions: write() returns None, read()'s argument is optional), but may have different semantics. In particular, BufferedIOBase implementations may read more data than requested or delay writing data using buffers. For the most part, this will be transparent to the user (unless, for example, they open the same file through a different descriptor). Also, raw reads may return a short read without any particular reason; buffered reads will only return a short read if EOF is reached; and raw writes may return a short count (even when non-blocking I/O is not enabled!), while buffered writes will raise IOError when not all bytes could be written or buffered.
There are four implementations of the BufferedIOBase abstract base class, described below.
BufferedReader
The BufferedReader implementation is for sequential-access read-only objects. Its .flush() method is a no-op.
BufferedWriter
The BufferedWriter implementation is for sequential-access write-only objects. Its .flush() method forces all cached data to be written to the underlying RawIOBase object.
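The flush semantics described above can be observed with the modern io module; in this sketch a BytesIO stands in for the underlying raw object:

```python
# Sketch: data written to a BufferedWriter stays cached until flush()
# pushes it down to the raw object.
import io

raw = io.BytesIO()
w = io.BufferedWriter(raw, buffer_size=64)
w.write(b"cached")          # fits in the buffer, so it sits there
before = raw.getvalue()     # nothing has reached the raw object yet
w.flush()                   # forces cached data down to raw
after = raw.getvalue()
print(before, after)
```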
BufferedRWPair
The BufferedRWPair implementation is for sequential-access read-write objects such as sockets and ttys. As the read and write streams of these objects are completely independent, it could be implemented by simply incorporating a BufferedReader and BufferedWriter instance. It provides a .flush() method that has the same semantics as a BufferedWriter's .flush() method.
BufferedRandom
The BufferedRandom implementation is for all random-access objects, whether they are read-only, write-only, or read-write. Compared to the previous classes that operate on sequential-access objects, the BufferedRandom class must contend with the user calling .seek() to reposition the stream. Therefore, an instance of BufferedRandom must keep track of both the logical and true position within the object. It provides a .flush() method that forces all cached write data to be written to the underlying RawIOBase object and all cached read data to be forgotten (so that future reads are forced to go back to the disk).
Q: Do we want to mandate in the specification that switching between reading and writing on a read-write object implies a .flush()? Or is that an implementation convenience that users should not rely on?
For a read-only BufferedRandom object, .writable() returns False and the .write() and .truncate() methods throw IOError.
For a write-only BufferedRandom object, .readable() returns False and the .read() method throws IOError.
Text I/O
The text I/O layer provides functions to read and write strings from streams. Some new features include universal newlines and character set encoding and decoding. The Text I/O layer is defined by a TextIOBase abstract base class. It provides several methods that are similar to the BufferedIOBase methods, but operate on a per-character basis instead of a per-byte basis. These methods are:
.read(n: int = -1) -> str
.write(s: str) -> int
.tell() -> object
    Return a cookie describing the current file position. The only supported use for the cookie is with .seek() with whence set to 0 (i.e. absolute seek).

.seek(pos: object, whence: int = 0) -> int
    Seek to position pos. If pos is non-zero, it must be a cookie returned from .tell() and whence must be zero.

.truncate(pos: object = None) -> int
    Like BufferedIOBase.truncate(), except that pos (if not None) must be a cookie previously returned by .tell().
Unlike with raw I/O, the units for .seek() are not specified - some implementations (e.g. StringIO) use characters and others (e.g. TextIOWrapper) use bytes. The special case for zero is to allow going to the start or end of a stream without a prior .tell(). An implementation could include stream encoder state in the cookie returned from .tell().
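The cookie-based protocol looks like this in practice (using io.StringIO, where the cookie happens to be a plain integer, as an assumption for illustration):

```python
# Sketch: .tell() hands back an opaque cookie; the only supported use
# is passing it back to .seek() with whence=0.
import io

t = io.StringIO("one\ntwo\n")
t.readline()           # consume "one\n"
pos = t.tell()         # cookie for the current position
t.readline()           # consume "two\n"
t.seek(pos)            # rewind using the cookie (absolute seek)
second = t.readline()
print(second)
```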
TextIOBase implementations also provide several methods that are pass-throughs to the underlying BufferedIOBase objects:
.flush() -> None
.close() -> None
.readable() -> bool
.writable() -> bool
.seekable() -> bool
TextIOBase class implementations additionally provide the following methods:
.readline() -> str
    Read until newline or EOF and return the line, or "" if EOF hit immediately.

.__iter__() -> Iterator
    Returns an iterator that returns lines from the file (which happens to be self).

.next() -> str
    Same as readline() except raises StopIteration if EOF hit immediately.
Two implementations will be provided by the Python library. The primary implementation, TextIOWrapper, wraps a Buffered I/O object. Each TextIOWrapper object has a property named ".buffer" that provides a reference to the underlying BufferedIOBase object. Its initializer has the following signature:
.__init__(self, buffer, encoding=None, errors=None, newline=None, line_buffering=False)
buffer is a reference to the BufferedIOBase object to be wrapped with the TextIOWrapper.
encoding refers to an encoding to be used for translating between the byte-representation and character-representation. If it is None, then the system's locale setting will be used as the default.
errors is an optional string indicating error handling. It may be set whenever encoding may be set. It defaults to 'strict'.
newline can be None, '', '\n', '\r', or '\r\n'; all other values are illegal. It controls the handling of line endings. It works as follows:
- On input, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newline mode is enabled, but line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated. (In other words, translation to '\n' only occurs if newline is None.)
- On output, if newline is None, any '\n' characters written are translated to the system default line separator, os.linesep. If newline is '', no translation takes place. If newline is any of the other legal values, any '\n' characters written are translated to the given string. (Note that the rules guiding translation are different for output than for input.)
line_buffering, if True, causes write() calls to imply a flush() if the string written contains at least one '\n' or '\r' character. This is set by open() when it detects that the underlying stream is a TTY device, or when a buffering argument of 1 is passed.
Further notes on the newline parameter:
- '\r' support is still needed for some OSX applications that produce files using '\r' line endings; Excel (when exporting to text) and Adobe Illustrator EPS files are the most common examples.
- If translation is enabled, it happens regardless of which method is called for reading or writing. For example, f.read() will always produce the same result as ''.join(f.readlines()).
- If universal newlines without translation are requested on input (i.e. newline=''), if a system read operation returns a buffer ending in '\r', another system read operation is done to determine whether it is followed by '\n' or not. In universal newlines mode with translation, the second system read operation may be postponed until the next read request, and if the following system read operation returns a buffer starting with '\n', that character is simply discarded.
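The output-side newline rules can be demonstrated with the modern io.TextIOWrapper over an in-memory buffer:

```python
# Sketch: with newline="\r\n", every '\n' written is translated to
# the given string on output, per the rules above.
import io

buf = io.BytesIO()
t = io.TextIOWrapper(buf, encoding="utf-8", newline="\r\n")
t.write("a\nb\n")
t.flush()
print(buf.getvalue())  # b'a\r\nb\r\n'
```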
Another implementation, StringIO, creates a file-like TextIO implementation without an underlying Buffered I/O object. While similar functionality could be provided by wrapping a BytesIO object in a TextIOWrapper, the StringIO object allows for much greater efficiency as it does not need to actually perform encoding and decoding. A String I/O object can just store the string as-is. The StringIO object's __init__ signature takes an optional string specifying the initial value; the initial position is always 0. It does not support encodings or newline translations; you always read back exactly the characters you wrote.
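A minimal sketch of the exact-readback property, using today's io.StringIO:

```python
# Sketch: StringIO stores characters directly — no encoding step and
# (with the default newline setting) no translation on this data.
import io

s = io.StringIO()
s.write("exact\ncharacters\n")
s.seek(0)
print(s.read())   # reads back exactly what was written
```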
Unicode encoding/decoding Issues
We should allow changing the encoding and error-handling setting later. The behavior of Text I/O operations in the face of Unicode problems and ambiguities (e.g. diacritics, surrogates, invalid bytes in an encoding) should be the same as that of the unicode encode()/decode() methods. UnicodeError may be raised.
Implementation note: we should be able to reuse much of the infrastructure provided by the codecs module. If it doesn't provide the exact APIs we need, we should refactor it to avoid reinventing the wheel.
Non-blocking I/O
Non-blocking I/O is fully supported on the Raw I/O level only. If a raw object is in non-blocking mode and an operation would block, then .read() and .readinto() return None, while .write() returns 0. In order to put an object in non-blocking mode, the user must extract the fileno and do it by hand.
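The None-on-would-block behavior can be sketched with a POSIX pipe (an assumption: this uses os.set_blocking, the modern helper for clearing O_NONBLOCK, rather than extracting the fileno and using fcntl by hand):

```python
# Sketch: a raw read on a non-blocking descriptor with no data
# available returns None instead of blocking.
import io
import os

r, w = os.pipe()
os.set_blocking(r, False)
raw = io.FileIO(r, "r", closefd=True)

empty = raw.read(16)      # nothing in the pipe yet -> None
os.write(w, b"ping")
data = raw.read(16)       # now the bytes are available

print(empty, data)
raw.close()
os.close(w)
```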
At the Buffered I/O and Text I/O layers, if a read or write fails due to a non-blocking condition, they raise an IOError with errno set to EAGAIN.
Originally, we considered propagating up the Raw I/O behavior, but many corner cases and problems were raised. To address these issues, significant changes would need to have been made to the Buffered I/O and Text I/O layers. For example, what should .flush() do on a Buffered non-blocking object? How would the user instruct the object to "Write as much as you can from your buffer, but don't block"? A non-blocking .flush() that doesn't necessarily flush all available data is counter-intuitive. Since non-blocking and blocking objects would have such different semantics at these layers, it was agreed to abandon efforts to combine them into a single type.
The open() Built-in Function
The open() built-in function is specified by the following pseudo-code:
def open(filename, mode="r", buffering=None, *,
         encoding=None, errors=None, newline=None):
    assert isinstance(filename, (str, int))
    assert isinstance(mode, str)
    assert buffering is None or isinstance(buffering, int)
    assert encoding is None or isinstance(encoding, str)
    assert newline in (None, "", "\n", "\r", "\r\n")
    modes = set(mode)
    if modes - set("arwb+t") or len(mode) > len(modes):
        raise ValueError("invalid mode: %r" % mode)
    reading = "r" in modes
    writing = "w" in modes
    binary = "b" in modes
    appending = "a" in modes
    updating = "+" in modes
    text = "t" in modes or not binary
    if text and binary:
        raise ValueError("can't have text and binary mode at once")
    if reading + writing + appending > 1:
        raise ValueError("can't have read/write/append mode at once")
    if not (reading or writing or appending):
        raise ValueError("must have exactly one of read/write/append mode")
    if binary and encoding is not None:
        raise ValueError("binary mode doesn't take an encoding arg")
    if binary and errors is not None:
        raise ValueError("binary mode doesn't take an errors arg")
    if binary and newline is not None:
        raise ValueError("binary mode doesn't take a newline arg")
    # XXX Need to spec the signature for FileIO()
    raw = FileIO(filename, mode)
    line_buffering = (buffering == 1 or buffering is None and raw.isatty())
    if line_buffering or buffering is None:
        buffering = 8*1024  # International standard buffer size
        # XXX Try setting it to fstat().st_blksize
    if buffering < 0:
        raise ValueError("invalid buffering size")
    if buffering == 0:
        if binary:
            return raw
        raise ValueError("can't have unbuffered text I/O")
    if updating:
        buffer = BufferedRandom(raw, buffering)
    elif writing or appending:
        buffer = BufferedWriter(raw, buffering)
    else:
        assert reading
        buffer = BufferedReader(raw, buffering)
    if binary:
        return buffer
    assert text
    return TextIOWrapper(buffer, encoding, errors, newline, line_buffering)
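The three-layer result of the pseudo-code above can be verified with the built-in open() as shipped today (class names are those of the modern io module, which implements this PEP):

```python
# Sketch: open() in text mode stacks TextIOWrapper over BufferedReader
# over FileIO, exactly the layering the pseudo-code constructs.
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"layered\n")
os.close(fd)

f = open(path, "r", encoding="utf-8")
layers = [type(f).__name__,              # text layer
          type(f.buffer).__name__,       # buffered layer
          type(f.buffer.raw).__name__]   # raw layer
f.close()
os.unlink(path)
print(layers)
```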
Copyright
This document has been placed in the public domain.
pep-3117 Postfix type declarations
| PEP: | 3117 |
|---|---|
| Title: | Postfix type declarations |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Georg Brandl <georg at python.org> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 01-Apr-2007 |
| Python-Version: | 3.0 |
| Post-History: |
Contents
Abstract
This PEP proposes the addition of a postfix type declaration syntax to Python. It also specifies a new typedef statement which is used to create new mappings between types and declarators.
Its acceptance will greatly enhance the Python user experience as well as eliminate one of the warts that deter users of other programming languages from switching to Python.
Rationale
Python has long suffered from the lack of explicit type declarations. Being one of the few aspects in which the language deviates from its Zen, this wart has sparked many a discussion between Python heretics and members of the PSU (for a few examples, see [EX1], [EX2] or [EX3]), and it has also made large-scale enterprise success unlikely.
However, if one wants to put an end to this misery, a decent Pythonic syntax must be found. In almost all languages that have them, type declarations lack this quality: they are verbose, often needing multiple words for a single type, or they are hard to comprehend (e.g., a certain language uses completely unrelated [1] adjectives like dim for type declaration).
Therefore, this PEP combines the move to type declarations with another bold move that will once again prove that Python is not only future-proof but future-embracing: the introduction of Unicode characters as an integral constituent of source code.
Unicode makes it possible to express much more with far fewer characters, which is in accordance with the Zen ("Readability counts.") [ZEN]. Additionally, it eliminates the need for a separate type declaration statement, and last but not least, it makes Python measure up to Perl 6, which already uses Unicode for its operators. [2]
Specification
When the type declaration mode is in operation, the grammar is changed so that each NAME must consist of two parts: a name and a type declarator, which is exactly one Unicode character.
The declarator uniquely specifies the type of the name, and if it occurs on the left hand side of an expression, this type is enforced: an InquisitionError exception is raised if the returned type doesn't match the declared type. [3]
Also, function call result types have to be specified. If the result of the call does not have the declared type, an InquisitionError is raised. Caution: the declarator for the result should not be confused with the declarator for the function object (see the example below).
Type declarators after names that are only read, not assigned to, are not strictly necessary but enforced anyway (see the Python Zen: "Explicit is better than implicit.").
The mapping between types and declarators is not static. It can be completely customized by the programmer, but for convenience there are some predefined mappings for some built-in types:
| Type | Declarator |
|---|---|
| object | � (REPLACEMENT CHARACTER) |
| int | ℕ (DOUBLE-STRUCK CAPITAL N) |
| float | ℮ (ESTIMATED SYMBOL) |
| bool | ✓ (CHECK MARK) |
| complex | ℂ (DOUBLE-STRUCK CAPITAL C) |
| str | ✎ (LOWER RIGHT PENCIL) |
| unicode | ✒ (BLACK NIB) |
| tuple | ⒯ (PARENTHESIZED LATIN SMALL LETTER T) |
| list | ♨ (HOT SPRINGS) |
| dict | ⧟ (DOUBLE-ENDED MULTIMAP) |
| set | ∅ (EMPTY SET) (Note: this is also for full sets) |
| frozenset | ☃ (SNOWMAN) |
| datetime | ⌚ (WATCH) |
| function | ƛ (LATIN SMALL LETTER LAMBDA WITH STROKE) |
| generator | ⚛ (ATOM SYMBOL) |
| Exception | ⌁ (ELECTRIC ARROW) |
The declarator for the None type is a zero-width space.
These characters should be obvious and easy to remember and type for every programmer.
Unicode replacement units
Since even in our modern, globalized world there are still some old-fashioned rebels who can't or don't want to use Unicode in their source code, and since Python is a forgiving language, a fallback is provided for those:
Instead of the single Unicode character, they can type name${UNICODE NAME OF THE DECLARATOR}$. For example, these two function definitions are equivalent:
def fooƛ(xℂ):
    return None
and
def foo${LATIN SMALL LETTER LAMBDA WITH STROKE}$(x${DOUBLE-STRUCK CAPITAL C}$):
    return None${ZERO WIDTH NO-BREAK SPACE}$
This is still easy to read and makes the full power of type-annotated Python available to ASCII believers.
The typedef statement
The mapping between types and declarators can be extended with this new statement.
The syntax is as follows:
typedef_stmt ::= "typedef" expr DECLARATOR
where expr resolves to a type object. For convenience, the typedef statement can also be mixed with the class statement for new classes, like so:
typedef class Foo☺(object�):
    pass
Example
This is the standard os.path.normpath function, converted to type declaration syntax:
def normpathƛ(path✎)✎:
    """Normalize path, eliminating double slashes, etc."""
    if path✎ == '':
        return '.'
    initial_slashes✓ = path✎.startswithƛ('/')✓
    # POSIX allows one or two initial slashes, but treats three or more
    # as single slash.
    if (initial_slashes✓ and
        path✎.startswithƛ('//')✓ and not path✎.startswithƛ('///')✓)✓:
        initial_slashesℕ = 2
    comps♨ = path✎.splitƛ('/')♨
    new_comps♨ = []♨
    for comp✎ in comps♨:
        if comp✎ in ('', '.')⒯:
            continue
        if (comp✎ != '..' or (not initial_slashesℕ and not new_comps♨)✓ or
            (new_comps♨ and new_comps♨[-1]✎ == '..')✓)✓:
            new_comps♨.appendƛ(comp✎)
        elif new_comps♨:
            new_comps♨.popƛ()✎
    comps♨ = new_comps♨
    path✎ = '/'.join(comps♨)✎
    if initial_slashesℕ:
        path✎ = '/'*initial_slashesℕ + path✎
    return path✎ or '.'
As you can clearly see, the type declarations add expressiveness, while at the same time they make the code look much more professional.
Compatibility issues
To enable type declaration mode, one has to write:
from __future__ import type_declarations
which enables Unicode parsing of the source [4], makes typedef a keyword and enforces correct types for all assignments and function calls.
Rejection
After careful consideration, much soul-searching, gnashing of teeth and rending of garments, it has been decided to reject this PEP.
References
| [EX1] | http://mail.python.org/pipermail/python-list/2003-June/210588.html |
| [EX2] | http://mail.python.org/pipermail/python-list/2000-May/034685.html |
| [EX3] | http://groups.google.com/group/comp.lang.python/browse_frm/thread/6ae8c6add913635a/de40d4ffe9bd4304?lnk=gst&q=type+declarations&rnum=6 |
| [1] | Though, if you know the language in question, it may not be that unrelated. |
| [ZEN] | http://www.python.org/dev/peps/pep-0020/ |
| [2] | Well, it would, if there was a Perl 6. |
| [3] | Since the name TypeError is already in use, this name has been chosen for obvious reasons. |
| [4] | The encoding in which the code is written is read from a standard coding cookie. There will also be an autodetection mechanism, invoked by from __future__ import encoding_hell. |
Acknowledgements
Many thanks go to Armin Ronacher, Alexander Schremmer and Marek Kubica who helped find the most suitable and mnemonic declarator for built-in types.
Thanks also to the Unicode Consortium for including all those useful characters in the Unicode standard.
Copyright
This document has been placed in the public domain.
pep-3118 Revising the buffer protocol
| PEP: | 3118 |
|---|---|
| Title: | Revising the buffer protocol |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Travis Oliphant <oliphant at ee.byu.edu>, Carl Banks <pythondev at aerojockey.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 28-Aug-2006 |
| Python-Version: | 3000 |
| Post-History: |
Contents
Abstract
This PEP proposes re-designing the buffer interface (PyBufferProcs function pointers) to improve the way Python allows memory sharing in Python 3.0.
In particular, it is proposed that the character buffer portion of the API be eliminated and the multiple-segment portion be re-designed in conjunction with allowing for strided memory to be shared. In addition, the new buffer interface will allow sharing of the multi-dimensional nature of the memory and of the data format the memory contains.
This interface will allow any extension module to either create objects that share memory or create algorithms that use and manipulate raw memory from arbitrary objects that export the interface.
Rationale
The Python 2.X buffer protocol allows different Python types to exchange a pointer to a sequence of internal buffers. This functionality is extremely useful for sharing large segments of memory between different high-level objects, but it is too limited and has issues:
- There is the little used "sequence-of-segments" option (bf_getsegcount) that is not well motivated.
- There is the apparently redundant character-buffer option (bf_getcharbuffer).
- There is no way for a consumer to tell the buffer-API-exporting object it is "finished" with its view of the memory and therefore no way for the exporting object to be sure that it is safe to reallocate the pointer to the memory that it owns (for example, the array object reallocating its memory after sharing it with the buffer object which held the original pointer led to the infamous buffer-object problem).
- Memory is just a pointer with a length. There is no way to describe what is "in" the memory (float, int, C-structure, etc.).
- There is no shape information provided for the memory. But, several array-like Python types could make use of a standard way to describe the shape-interpretation of the memory (wxPython, GTK, pyQT, CVXOPT, PyVox, Audio and Video Libraries, ctypes, NumPy, data-base interfaces, etc.).
- There is no way to share discontiguous memory (except through the sequence of segments notion).
There are two widely used libraries that use the concept of discontiguous memory: PIL and NumPy. Their view of discontiguous arrays is different, though. The proposed buffer interface allows sharing of either memory model. Exporters will typically use only one approach and consumers may choose to support discontiguous arrays of each type however they choose.
NumPy uses the notion of constant striding in each dimension as its basic concept of an array. With this concept, a simple sub-region of a larger array can be described without copying the data. Thus, stride information is the additional information that must be shared.
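The stride arithmetic behind this model can be sketched in a few lines of Python. This is a hypothetical helper, not part of the proposed API; it only illustrates how a shape/strides pair locates elements in one flat buffer, and how a sub-region can be described without copying:

```python
# Sketch (hypothetical helper): locate an element in a strided view.
# A strided view is described by a shape and per-dimension byte strides
# over one flat buffer -- no data is copied.

def byte_offset(indices, strides):
    """Byte offset of the element at `indices` in a strided array."""
    return sum(i * s for i, s in zip(indices, strides))

# A C-contiguous 3x4 array of 8-byte doubles:
shape, itemsize = (3, 4), 8
strides = (4 * itemsize, itemsize)          # (32, 8)

# Element [2][1] lives 2*32 + 1*8 = 72 bytes into the buffer.
print(byte_offset((2, 1), strides))          # -> 72

# A sub-region (every other column) shares the same buffer simply by
# doubling the column stride -- exactly the "no copy" case above.
sub_strides = (strides[0], strides[1] * 2)   # (32, 16)
print(byte_offset((2, 1), sub_strides))      # -> 80
```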
The PIL uses a more opaque memory representation. Sometimes an image is contained in a contiguous segment of memory, but sometimes it is contained in an array of pointers to the contiguous segments (usually lines) of the image. The PIL is where the idea of multiple buffer segments in the original buffer interface came from.
NumPy's strided memory model is used more often in computational libraries and because it is so simple it makes sense to support memory sharing using this model. The PIL memory model is sometimes used in C-code where a 2-d array can then be accessed using double pointer indirection: e.g. image[i][j].
The buffer interface should allow the object to export either of these memory models. Consumers are free to either require contiguous memory or write code to handle one or both of these memory models.
Proposal Overview
- Eliminate the char-buffer and multiple-segment sections of the buffer-protocol.
- Unify the read/write versions of getting the buffer.
- Add a new function to the interface that should be called when the consumer object is "done" with the memory area.
- Add a new variable to allow the interface to describe what is in memory (unifying what is currently done now in struct and array)
- Add a new variable to allow the protocol to share shape information
- Add a new variable for sharing stride information
- Add a new mechanism for sharing arrays that must be accessed using pointer indirection.
- Fix all objects in the core and the standard library to conform to the new interface
- Extend the struct module to handle more format specifiers
- Extend the buffer object into a new memory object which places a Python veneer around the buffer interface.
- Add a few functions to make it easy to copy contiguous data in and out of object supporting the buffer interface.
Specification
While the new specification allows for complicated memory sharing, simple contiguous buffers of bytes can still be obtained from an object. In fact, the new protocol allows a standard mechanism for doing this even if the original object is not represented as a contiguous chunk of memory.
The easiest way to obtain a simple contiguous chunk of memory is through the provided C-API.
Change the PyBufferProcs structure to
typedef struct {
    getbufferproc bf_getbuffer;
    releasebufferproc bf_releasebuffer;
} PyBufferProcs;
Both of these routines are optional for a type object.
typedef int (*getbufferproc)(PyObject *obj, Py_buffer *view, int flags)
This function returns 0 on success and -1 on failure (and raises an error). The first argument is the "exporting" object. The second argument is the address of a bufferinfo structure. Both arguments must never be NULL.
The third argument indicates what kind of buffer the consumer is prepared to deal with and therefore what kind of buffer the exporter is allowed to return. The new buffer interface allows for much more complicated memory sharing possibilities. Some consumers may not be able to handle all the complexity but may want to see if the exporter will let them take a simpler view of its memory.
In addition, some exporters may not be able to share memory in every possible way and may need to raise errors to signal to some consumers that something is just not possible. These errors should be PyErr_BufferError unless there is another error that is actually causing the problem. The exporter can use flags information to simplify how much of the Py_buffer structure is filled in with non-default values and/or raise an error if the object can't support a simpler view of its memory.
The exporter should always fill in all elements of the buffer structure (with defaults or NULLs if nothing else is requested). The PyBuffer_FillInfo function can be used for simple cases.
Access flags
Some flags are useful for requesting a specific kind of memory segment, while others indicate to the exporter what kind of information the consumer can deal with. If certain information is not asked for by the consumer, but the exporter cannot share its memory without that information, then a PyErr_BufferError should be raised.
PyBUF_SIMPLE
This is the default flag state (0). The returned buffer may or may not have writable memory. The format will be assumed to be unsigned bytes. This is a "stand-alone" flag constant. It never needs to be |'d to the others. The exporter will raise an error if it cannot provide such a contiguous buffer of bytes.
PyBUF_WRITABLE
The returned buffer must be writable. If it is not writable, then raise an error.
PyBUF_FORMAT
The returned buffer must have true format information if this flag is provided. This would be used when the consumer is going to be checking for what 'kind' of data is actually stored. An exporter should always be able to provide this information if requested. If format is not explicitly requested then the format must be returned as NULL (which means "B", or unsigned bytes)
PyBUF_ND
The returned buffer must provide shape information. The memory will be assumed C-style contiguous (last dimension varies the fastest). The exporter may raise an error if it cannot provide this kind of contiguous buffer. If this is not given then shape will be NULL.
PyBUF_STRIDES (implies PyBUF_ND)
The returned buffer must provide strides information (i.e. the strides cannot be NULL). This would be used when the consumer can handle strided, discontiguous arrays. Handling strides automatically assumes you can handle shape. The exporter may raise an error if it cannot provide a strided-only representation of the data (i.e. without the suboffsets).
PyBUF_C_CONTIGUOUS, PyBUF_F_CONTIGUOUS, PyBUF_ANY_CONTIGUOUS (all imply PyBUF_STRIDES)
These flags indicate that the returned buffer must be, respectively, C-contiguous (last dimension varies the fastest), Fortran contiguous (first dimension varies the fastest) or either one. All of these flags imply PyBUF_STRIDES and guarantee that the strides buffer info structure will be filled in correctly.
PyBUF_INDIRECT (implies PyBUF_STRIDES)
The returned buffer must have suboffsets information (which can be NULL if no suboffsets are needed). This would be used when the consumer can handle indirect array referencing implied by these suboffsets.
Specialized combinations of flags for specific kinds of memory sharing:
Multi-dimensional (but contiguous)
    PyBUF_CONTIG (PyBUF_ND | PyBUF_WRITABLE)
    PyBUF_CONTIG_RO (PyBUF_ND)
Multi-dimensional using strides but aligned
    PyBUF_STRIDED (PyBUF_STRIDES | PyBUF_WRITABLE)
    PyBUF_STRIDED_RO (PyBUF_STRIDES)
Multi-dimensional using strides and not necessarily aligned
    PyBUF_RECORDS (PyBUF_STRIDES | PyBUF_WRITABLE | PyBUF_FORMAT)
    PyBUF_RECORDS_RO (PyBUF_STRIDES | PyBUF_FORMAT)
Multi-dimensional using sub-offsets
    PyBUF_FULL (PyBUF_INDIRECT | PyBUF_WRITABLE | PyBUF_FORMAT)
    PyBUF_FULL_RO (PyBUF_INDIRECT | PyBUF_FORMAT)
Thus, the consumer simply wanting a contiguous chunk of bytes from the object would use PyBUF_SIMPLE, while a consumer that understands how to make use of the most complicated cases could use PyBUF_FULL.
The format information is only guaranteed to be non-NULL if PyBUF_FORMAT is in the flag argument, otherwise it is expected the consumer will assume unsigned bytes.
There is a C-API that simple exporting objects can use to fill-in the buffer info structure correctly according to the provided flags if a contiguous chunk of "unsigned bytes" is all that can be exported.
The Py_buffer struct
The bufferinfo structure is:
typedef struct bufferinfo {
    void *buf;
    Py_ssize_t len;
    int readonly;
    const char *format;
    int ndim;
    Py_ssize_t *shape;
    Py_ssize_t *strides;
    Py_ssize_t *suboffsets;
    Py_ssize_t itemsize;
    void *internal;
} Py_buffer;
Before calling the bf_getbuffer function, the bufferinfo structure can be filled with whatever, but the buf field must be NULL when requesting a new buffer. Upon return from bf_getbuffer, the bufferinfo structure is filled in with relevant information about the buffer. This same bufferinfo structure must be passed to bf_releasebuffer (if available) when the consumer is done with the memory. The caller is responsible for keeping a reference to obj until releasebuffer is called (i.e. the call to bf_getbuffer does not alter the reference count of obj).
The members of the bufferinfo structure are:
- buf
- a pointer to the start of the memory for the object
- len
- the total bytes of memory the object uses. This should be the same as the product of the shape array multiplied by the number of bytes per item of memory.
- readonly
- an integer variable to hold whether or not the memory is readonly. 1 means the memory is readonly, zero means the memory is writable.
- format
- a NULL-terminated format-string (following the struct-style syntax including extensions) indicating what is in each element of memory. The number of elements is len / itemsize, where itemsize is the number of bytes implied by the format. This can be NULL which implies standard unsigned bytes ("B").
- ndim
- a variable storing the number of dimensions the memory represents. Must be >=0. A value of 0 means that shape and strides and suboffsets must be NULL (i.e. the memory represents a scalar).
- shape
- an array of Py_ssize_t of length ndims indicating the shape of the memory as an N-D array. Note that ((*shape)[0] * ... * (*shape)[ndims-1])*itemsize = len. If ndims is 0 (indicating a scalar), then this must be NULL.
- strides
- address of a Py_ssize_t* variable that will be filled with a pointer to an array of Py_ssize_t of length ndims (or NULL if ndims is 0), indicating the number of bytes to skip to get to the next element in each dimension. If this is not requested by the caller (PyBUF_STRIDES is not set), then this should be set to NULL which indicates a C-style contiguous array or a PyExc_BufferError raised if this is not possible.
- suboffsets
- address of a Py_ssize_t * variable that will be filled with a pointer to an array of Py_ssize_t of length *ndims. If these suboffset numbers are >=0, then the value stored along the indicated dimension is a pointer and the suboffset value dictates how many bytes to add to the pointer after de-referencing. A suboffset value that is negative indicates that no de-referencing should occur (striding in a contiguous memory block). If all suboffsets are negative (i.e. no de-referencing is needed), then this must be NULL (the default value). If this is not requested by the caller (PyBUF_INDIRECT is not set), then this should be set to NULL or a PyExc_BufferError raised if this is not possible.
For clarity, here is a function that returns a pointer to the element in an N-D array pointed to by an N-dimensional index when there are both non-NULL strides and suboffsets:
void *get_item_pointer(int ndim, void *buf, Py_ssize_t *strides,
                       Py_ssize_t *suboffsets, Py_ssize_t *indices) {
    char *pointer = (char*)buf;
    int i;
    for (i = 0; i < ndim; i++) {
        pointer += strides[i] * indices[i];
        if (suboffsets[i] >= 0) {
            pointer = *((char**)pointer) + suboffsets[i];
        }
    }
    return (void*)pointer;
}
Notice the suboffset is added "after" the dereferencing occurs. Thus slicing in the ith dimension would add to the suboffsets in the (i-1)st dimension. Slicing in the first dimension would change the location of the starting pointer directly (i.e. buf would be modified).
- itemsize
- This is storage for the itemsize (in bytes) of each element of the shared memory. It is technically unnecessary, as it can be obtained using PyBuffer_SizeFromFormat; however, an exporter may know this information without parsing the format string, and it is necessary to know the itemsize for proper interpretation of striding. Therefore, storing it is more convenient and faster.
- internal
- This is for use internally by the exporting object. For example, this might be re-cast as an integer by the exporter and used to store flags about whether or not the shape, strides, and suboffsets arrays must be freed when the buffer is released. The consumer should never alter this value.
The exporter is responsible for making sure that any memory pointed to by buf, format, shape, strides, and suboffsets is valid until releasebuffer is called. If the exporter wants to be able to change an object's shape, strides, and/or suboffsets before releasebuffer is called then it should allocate those arrays when getbuffer is called (pointing to them in the buffer-info structure provided) and free them when releasebuffer is called.
Releasing the buffer
The same bufferinfo struct should be used in the release-buffer interface call. The caller is responsible for the memory of the Py_buffer structure itself.
typedef void (*releasebufferproc)(PyObject *obj, Py_buffer *view)
Callers of getbufferproc must make sure that this function is called when memory previously acquired from the object is no longer needed. The exporter of the interface must make sure that any memory pointed to in the bufferinfo structure remains valid until releasebuffer is called.
If the bf_releasebuffer function is not provided (i.e. it is NULL), then it does not ever need to be called.
Exporters will need to define a bf_releasebuffer function if they can re-allocate their memory, strides, shape, suboffsets, or format variables which they might share through the struct bufferinfo. Several mechanisms could be used to keep track of how many getbuffer calls have been made and shared. Either a single variable could be used to keep track of how many "views" have been exported, or a linked-list of bufferinfo structures filled in could be maintained in each object.
All that is specifically required by the exporter, however, is to ensure that any memory shared through the bufferinfo structure remains valid until releasebuffer is called on the bufferinfo structure exporting that memory.
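At the Python level, this acquire/release pairing is visible on the memoryview object that grew out of this PEP. The behavior sketched here is that of modern CPython, shown for illustration rather than as part of the proposal text:

```python
# The getbuffer/releasebuffer pairing as it surfaces in Python-level
# memoryview (modern CPython behavior).
data = bytearray(b"abcdef")

with memoryview(data) as mv:       # __enter__/__exit__ calls release()
    assert mv[0] == ord("a")
    try:
        data.extend(b"gh")         # exporter may not reallocate while shared
    except BufferError:
        pass                       # bytearray refuses to resize here

mv2 = memoryview(data)
mv2.release()                      # explicit release; the view is now unusable
try:
    mv2[0]
except ValueError:
    print("released")
```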
New C-API calls are proposed
int PyObject_CheckBuffer(PyObject *obj)
Return 1 if the getbuffer function is available; otherwise 0.
int PyObject_GetBuffer(PyObject *obj, Py_buffer *view,
int flags)
This is a C-API version of the getbuffer function call. It checks to make sure the object has the required function pointer and issues the call. Returns -1 and raises an error on failure and returns 0 on success.
void PyBuffer_Release(PyObject *obj, Py_buffer *view)
This is a C-API version of the releasebuffer function call. It checks to make sure the object has the required function pointer and issues the call. This function always succeeds even if there is no releasebuffer function for the object.
PyObject *PyObject_GetMemoryView(PyObject *obj)
Return a memory-view object from an object that defines the buffer interface.
A memory-view object is an extended buffer object that could replace the buffer object (but doesn't have to as that could be kept as a simple 1-d memory-view object). Its C-structure is
typedef struct {
    PyObject_HEAD
    PyObject *base;
    Py_buffer view;
} PyMemoryViewObject;
This is functionally similar to the current buffer object except a reference to base is kept and the memory view is not re-grabbed. Thus, this memory view object holds on to the memory of base until it is deleted.
This memory-view object will support multi-dimensional slicing and be the first object provided with Python to do so. Slices of the memory-view object are other memory-view objects with the same base but with a different view of the base object.
When an "element" from the memory-view is returned it is always a bytes object whose format should be interpreted by the format attribute of the memoryview object. The struct module can be used to "decode" the bytes in Python if desired. Or the contents can be passed to a NumPy array or other object consuming the buffer protocol.
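A sketch of that decoding step, using memoryview and the struct module as they ship in modern CPython (note that, as actually implemented, indexing a simple memoryview returns a native object rather than raw bytes; slicing and tobytes() give the raw form):

```python
# Decoding memoryview contents with the struct module (modern CPython).
import struct
from array import array

a = array("d", [1.5, 2.5, 3.5])      # three C doubles
mv = memoryview(a)

assert mv.format == "d"
item = mv[1:2].tobytes()             # raw bytes of the second element
(value,) = struct.unpack(mv.format, item)
print(value)                          # -> 2.5
```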
The Python name will be
__builtin__.memoryview
Methods:
Attributes (taken from the memory of the base object):
- format
- itemsize
- shape
- strides
- suboffsets
- readonly
- ndim
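For illustration, here is how those attributes look on the memoryview object as it eventually shipped in CPython, for a simple contiguous 1-D example:

```python
# The proposed attributes, on the memoryview that shipped (modern CPython).
from array import array

mv = memoryview(array("i", [10, 20, 30]))
print(mv.format)      # 'i'
print(mv.itemsize)    # usually 4, platform-dependent
print(mv.shape)       # (3,)
print(mv.strides)     # (itemsize,) for a contiguous 1-D array
print(mv.suboffsets)  # () -- no pointer indirection needed
print(mv.readonly)    # False
print(mv.ndim)        # 1
```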
Py_ssize_t PyBuffer_SizeFromFormat(const char *)
Return the implied itemsize of the data-format area from a struct-style description.
PyObject * PyMemoryView_GetContiguous(PyObject *obj, int buffertype,
char fortran)
Return a memoryview object to a contiguous chunk of memory represented by obj. If a copy must be made (because the memory pointed to by obj is not contiguous), then a new bytes object will be created and become the base object for the returned memory view object.
The buffertype argument can be PyBUF_READ, PyBUF_WRITE, or PyBUF_UPDATEIFCOPY to determine whether the returned buffer should be readable, writable, or set to update the original buffer if a copy must be made. If buffertype is PyBUF_WRITE and the buffer is not contiguous an error will be raised. In this circumstance, the user can use PyBUF_UPDATEIFCOPY to ensure that a writable temporary contiguous buffer is returned. The contents of this contiguous buffer will be copied back into the original object after the memoryview object is deleted as long as the original object is writable. If this is not allowed by the original object, then a BufferError is raised.
If the object is multi-dimensional, then if fortran is 'F', the first dimension of the underlying array will vary the fastest in the buffer. If fortran is 'C', then the last dimension will vary the fastest (C-style contiguous). If fortran is 'A', then it does not matter and you will get whatever the object decides is more efficient. If a copy is made, then the memory must be freed by calling PyMem_Free.
You receive a new reference to the memoryview object.
int PyObject_CopyToObject(PyObject *obj, void *buf, Py_ssize_t len,
char fortran)
Copy len bytes of data pointed to by the contiguous chunk of memory pointed to by buf into the buffer exported by obj. Return 0 on success and return -1 and raise an error on failure. If the object does not have a writable buffer, then an error is raised. If fortran is 'F', then if the object is multi-dimensional, then the data will be copied into the array in Fortran-style (first dimension varies the fastest). If fortran is 'C', then the data will be copied into the array in C-style (last dimension varies the fastest). If fortran is 'A', then it does not matter and the copy will be made in whatever way is more efficient.
int PyObject_CopyData(PyObject *dest, PyObject *src)
These last three C-API calls allow a standard way of getting data in and out of Python objects into contiguous memory areas no matter how it is actually stored. These calls use the extended buffer interface to perform their work.
int PyBuffer_IsContiguous(Py_buffer *view, char fortran)
Return 1 if the memory defined by the view object is C-style (fortran = 'C') or Fortran-style (fortran = 'F') contiguous or either one (fortran = 'A'). Return 0 otherwise.
void PyBuffer_FillContiguousStrides(int ndim, Py_ssize_t *shape,
Py_ssize_t *strides, Py_ssize_t itemsize,
char fortran)
Fill the strides array with byte-strides of a contiguous (C-style if fortran is 'C' or Fortran-style if fortran is 'F') array of the given shape with the given number of bytes per element.
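A plain-Python sketch of the computation this function performs (a hypothetical helper, shown only to pin down the two orderings):

```python
# Sketch of what PyBuffer_FillContiguousStrides computes: byte strides
# for a contiguous array of the given shape.

def contiguous_strides(shape, itemsize, fortran):
    """C order ('C'): last dimension varies fastest.
    Fortran order ('F'): first dimension varies fastest."""
    strides = [0] * len(shape)
    sd = itemsize
    dims = range(len(shape) - 1, -1, -1) if fortran == "C" else range(len(shape))
    for i in dims:
        strides[i] = sd
        sd *= shape[i]
    return tuple(strides)

print(contiguous_strides((3, 4), 8, "C"))  # -> (32, 8)
print(contiguous_strides((3, 4), 8, "F"))  # -> (8, 24)
```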
int PyBuffer_FillInfo(Py_buffer *view, void *buf,
Py_ssize_t len, int readonly, int infoflags)
Fills in a buffer-info structure correctly for an exporter that can only share a contiguous chunk of memory of "unsigned bytes" of the given length. Returns 0 on success and -1 (raising an error) on failure.
PyExc_BufferError
A new error object for returning buffer errors which arise because an exporter cannot provide the kind of buffer that a consumer expects. This will also be raised when a consumer requests a buffer from an object that does not provide the protocol.
Additions to the struct string-syntax
The struct string-syntax is missing some characters to fully implement data-format descriptions already available elsewhere (in ctypes and NumPy for example). The Python 2.5 specification is at http://docs.python.org/library/struct.html.
Here are the proposed additions:
| Character | Description |
|---|---|
| 't' | bit (number before states how many bits) |
| '?' | platform _Bool type |
| 'g' | long double |
| 'c' | ucs-1 (latin-1) encoding |
| 'u' | ucs-2 |
| 'w' | ucs-4 |
| 'O' | pointer to Python Object |
| 'Z' | complex (whatever the next specifier is) |
| '&' | specific pointer (prefix before another character) |
| 'T{}' | structure (detailed layout inside {}) |
| '(k1,k2,...,kn)' | multi-dimensional array of whatever follows |
| ':name:' | optional name of the preceding element |
| 'X{}' | pointer to a function (optional function signature inside {}) |
The struct module will be changed to understand these as well and return appropriate Python objects on unpacking. Unpacking a long-double will return a decimal object or a ctypes long-double. Unpacking 'u' or 'w' will return Python unicode. Unpacking a multi-dimensional array will return a list (of lists if >1d). Unpacking a pointer will return a ctypes pointer object. Unpacking a function pointer will return a ctypes call-object (perhaps). Unpacking a bit will return a Python Bool. White-space in the struct-string syntax will be ignored if it isn't already. Unpacking a named-object will return some kind of named-tuple-like object that acts like a tuple but whose entries can also be accessed by name. Unpacking a nested structure will return a nested tuple.
Endian-specification ('!', '@', '=', '>', '<', '^') is also allowed inside the string so that it can change if needed. The previously-specified endian string is in force until changed. The default endian is '@', which means native data-types and alignment. If unaligned native data-types are requested, then the endian specification is '^'.
According to the struct module, a number can precede a character code to specify how many of that type there are. The (k1,k2,...,kn) extension additionally allows specifying whether the data is supposed to be viewed as a (C-style contiguous, last-dimension varies the fastest) multi-dimensional array of a particular format.
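For context, the repeat counts and byte-order characters these extensions build on already work in today's struct module (no proposed codes are used here; standard '=' mode is used to avoid platform padding):

```python
# Existing struct-module behavior the proposed extensions build on.
import struct

assert struct.pack(">H", 1) == b"\x00\x01"   # big-endian
assert struct.pack("<H", 1) == b"\x01\x00"   # little-endian
assert struct.calcsize("=3i") == 12          # repeat count: three 4-byte ints
# '@' (the native default) may insert alignment padding; '=' never does:
print(struct.calcsize("=bi"), struct.calcsize("@bi"))
```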
Functions should be added to ctypes to create a ctypes object from a struct description, and add long-double, and ucs-2 to ctypes.
Examples of Data-Format Descriptions
Here are some examples of C-structures and how they would be represented using the struct-style syntax.
<named> is the constructor for a named-tuple (not specified yet).
- float
- 'd' <--> Python float
- complex double
- 'Zd' <--> Python complex
- RGB Pixel data
- 'BBB' <--> (int, int, int)
- 'B:r: B:g: B:b:' <--> <named>((int, int, int), ('r','g','b'))
- Mixed endian (weird but possible)
- '>i:big: <i:little:' <--> <named>((int, int), ('big', 'little'))
- Nested structure
  struct {
      int ival;
      struct {
          unsigned short sval;
          unsigned char bval;
          unsigned char cval;
      } sub;
  }

  """i:ival: T{ H:sval: B:bval: B:cval: }:sub: """

- Nested array

  struct {
      int ival;
      double data[16*4];
  }

  """i:ival: (16,4)d:data: """
Note that in the last example, the C-structure compared against is intentionally a 1-d array and not a 2-d array data[16][4]. The reason for this is to avoid the confusion between static multi-dimensional arrays in C (which are laid out contiguously) and dynamic multi-dimensional arrays which use the same syntax to access elements, data[0][1], but whose memory is not necessarily contiguous. The struct-syntax always uses contiguous memory and the multi-dimensional character is information about the memory to be communicated by the exporter.
In other words, the struct-syntax description does not have to match the C-syntax exactly as long as it describes the same memory layout. The fact that a C-compiler would think of the memory as a 1-d array of doubles is irrelevant to the fact that the exporter wanted to communicate to the consumer that this field of the memory should be thought of as a 2-d array where a new dimension is considered after every 4 elements.
Code to be affected
All objects and modules in Python that export or consume the old buffer interface will be modified. Here is a partial list.
- buffer object
- bytes object
- string object
- unicode object
- array module
- struct module
- mmap module
- ctypes module
- Anything else using the buffer API.
Issues and Details
It is intended that this PEP will be back-ported to Python 2.6 by adding the C-API and the two functions to the existing buffer protocol.
Previous versions of this PEP proposed a read/write locking scheme, but it was later perceived as a) too complicated for common simple use cases that do not require any locking and b) too simple for use cases that required concurrent read/write access to a buffer with changing, short-lived locks. It is therefore left to users to implement their own specific locking scheme around buffer objects if they require consistent views across concurrent read/write access. A future PEP may be proposed which includes a separate locking API after some experience with these user-schemes is obtained.
The sharing of strided memory and suboffsets is new and can be seen as a modification of the multiple-segment interface. It is motivated by NumPy and the PIL. NumPy objects should be able to share their strided memory with code that understands how to manage strided memory because strided memory is very common when interfacing with compute libraries.
Also, with this approach it should be possible to write generic code that works with both kinds of memory without copying.
Memory management of the format string, the shape array, the strides array, and the suboffsets array in the bufferinfo structure is always the responsibility of the exporting object. The consumer should not set these pointers to any other memory or try to free them.
Several ideas were discussed and rejected:
Having a "releaser" object whose release-buffer was called. This was deemed unacceptable because it caused the protocol to be asymmetric (you called release on something different than you "got" the buffer from). It also complicated the protocol without providing a real benefit.
Passing all the struct variables separately into the function. This had the advantage that it allowed one to set NULL to variables that were not of interest, but it also made the function call more difficult. The flags variable allows the same ability of consumers to be "simple" in how they call the protocol.
Code
The authors of the PEP promise to contribute and maintain the code for this proposal but will welcome any help.
Examples
Ex. 1
This example shows how an image object that uses contiguous lines might expose its buffer:
struct rgba {
unsigned char r, g, b, a;
};
struct ImageObject {
PyObject_HEAD;
...
struct rgba** lines;
Py_ssize_t height;
Py_ssize_t width;
Py_ssize_t shape_array[2];
Py_ssize_t stride_array[2];
Py_ssize_t view_count;
};
"lines" points to a malloced 1-D array of (struct rgba*). Each pointer in THAT block points to a separately malloced array of (struct rgba).
In order to access, say, the red value of the pixel at x=30, y=50, you'd use "lines[50][30].r".
So what does ImageObject's getbuffer do? Leaving error checking out:
int Image_getbuffer(PyObject *self, Py_buffer *view, int flags) {
    struct ImageObject *im = (struct ImageObject *)self;
    static Py_ssize_t suboffsets[2] = { 0, -1 };

    view->buf = im->lines;
    view->len = im->height * im->width * sizeof(struct rgba);
    view->readonly = 0;
    view->ndim = 2;
    im->shape_array[0] = im->height;
    im->shape_array[1] = im->width;
    view->shape = im->shape_array;
    im->stride_array[0] = sizeof(struct rgba*);  /* dim 0 strides over pointers */
    im->stride_array[1] = sizeof(struct rgba);
    view->strides = im->stride_array;
    view->suboffsets = suboffsets;
    im->view_count++;
    return 0;
}
void Image_releasebuffer(PyObject *self, Py_buffer *view) {
    ((struct ImageObject *)self)->view_count--;
}
Ex. 2
This example shows how an object that wants to expose a contiguous chunk of memory (which will never be re-allocated while the object is alive) would do that.
int myobject_getbuffer(PyObject *self, Py_buffer *view, int flags) {
    void *buf;
    Py_ssize_t len;
    int readonly = 0;

    buf = /* Point to buffer */
    len = /* Set to size of buffer */
    readonly = /* Set to 1 if readonly */

    return PyBuffer_FillInfo(view, buf, len, readonly, flags);
}

/* No releasebuffer is necessary because the memory will never
   be re-allocated
*/
Ex. 3
A consumer that wants to only get a simple contiguous chunk of bytes from a Python object, obj would do the following:
Py_buffer view;

if (PyObject_GetBuffer(obj, &view, PyBUF_SIMPLE) < 0) {
    /* error return */
}

/* Now, view.buf is the pointer to memory,
   view.len is the length, and
   view.readonly is whether or not the memory is read-only.
*/

/* After using the information and you don't need it anymore */
PyBuffer_Release(obj, &view);
Ex. 4
A consumer that wants to be able to use any object's memory but is writing an algorithm that only handles contiguous memory could do the following:
void *buf;
Py_ssize_t len;
char *format;
int copy;

copy = PyObject_GetContiguous(obj, &buf, &len, &format, 0, 'A');
if (copy < 0) {
    /* error return */
}

/* Process memory pointed to by buf if format is correct. */

/* Optional: if, after processing, we want to copy data from buf back
   into the object, we could do:
*/
if (PyObject_CopyToObject(obj, buf, len, 'A') < 0) {
    /* error return */
}

/* Make sure that if a copy was made, the memory is freed. */
if (copy == 1)
    PyMem_Free(buf);
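For comparison, the consumer pattern of Ex. 3 is also available at the Python level through the memoryview object this PEP proposes, which wraps the same getbuffer/releasebuffer machinery. A minimal sketch, using the API as it shipped in Python 3:

```python
# Python-level counterpart of Ex. 3: memoryview acquires the exporter's
# buffer (like PyObject_GetBuffer) and releases it on .release() or when
# garbage-collected (like PyBuffer_Release).
data = bytearray(b"hello")
view = memoryview(data)   # acquires the buffer
assert view.readonly is False
assert len(view) == 5
view[0] = ord("H")        # writes through to the exporting object
assert data == bytearray(b"Hello")
view.release()            # explicit release
```

While the view is held, the exporter must keep the memory valid; for example, the bytearray above will refuse operations that would resize it.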
Copyright
This PEP is placed in the public domain.
pep-3119 Introducing Abstract Base Classes
| PEP: | 3119 |
|---|---|
| Title: | Introducing Abstract Base Classes |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Guido van Rossum <guido at python.org>, Talin <talin at acm.org> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 18-Apr-2007 |
| Post-History: | 26-Apr-2007, 11-May-2007 |
Abstract
This is a proposal to add Abstract Base Class (ABC) support to Python 3000. It proposes:
- A way to overload isinstance() and issubclass().
- A new module abc which serves as an "ABC support framework". It defines a metaclass for use with ABCs and a decorator that can be used to define abstract methods.
- Specific ABCs for containers and iterators, to be added to the collections module.
Much of the thinking that went into the proposal is not about the specific mechanism of ABCs, as contrasted with Interfaces or Generic Functions (GFs), but about clarifying philosophical issues like "what makes a set", "what makes a mapping" and "what makes a sequence".
There's also a companion PEP 3141, which defines ABCs for numeric types.
Acknowledgements
Talin wrote the Rationale below [1] as well as most of the section on ABCs vs. Interfaces. For that alone he deserves co-authorship. The rest of the PEP uses "I" to refer to the first author.
Rationale
In the domain of object-oriented programming, the usage patterns for interacting with an object can be divided into two basic categories, which are 'invocation' and 'inspection'.
Invocation means interacting with an object by invoking its methods. Usually this is combined with polymorphism, so that invoking a given method may run different code depending on the type of an object.
Inspection means the ability for external code (outside of the object's methods) to examine the type or properties of that object, and make decisions on how to treat that object based on that information.
Both usage patterns serve the same general end, which is to be able to support the processing of diverse and potentially novel objects in a uniform way, but at the same time allowing processing decisions to be customized for each different type of object.
In classical OOP theory, invocation is the preferred usage pattern, and inspection is actively discouraged, being considered a relic of an earlier, procedural programming style. However, in practice this view is simply too dogmatic and inflexible, and leads to a kind of design rigidity that is very much at odds with the dynamic nature of a language like Python.
In particular, there is often a need to process objects in a way that wasn't anticipated by the creator of the object class. It is not always the best solution to build in to every object methods that satisfy the needs of every possible user of that object. Moreover, there are many powerful dispatch philosophies that are in direct contrast to the classic OOP requirement of behavior being strictly encapsulated within an object, examples being rule or pattern-match driven logic.
On the other hand, one of the criticisms of inspection by classic OOP theorists is the lack of formalisms and the ad hoc nature of what is being inspected. In a language such as Python, in which almost any aspect of an object can be reflected and directly accessed by external code, there are many different ways to test whether an object conforms to a particular protocol or not. For example, if asking 'is this object a mutable sequence container?', one can look for a base class of 'list', or one can look for a method named '__getitem__'. But note that although these tests may seem obvious, neither of them is correct, as one generates false negatives, and the other false positives.
The generally agreed-upon remedy is to standardize the tests, and group them into a formal arrangement. This is most easily done by associating with each class a set of standard testable properties, either via the inheritance mechanism or some other means. Each test carries with it a set of promises: it contains a promise about the general behavior of the class, and a promise as to what other class methods will be available.
This PEP proposes a particular strategy for organizing these tests known as Abstract Base Classes, or ABC. ABCs are simply Python classes that are added into an object's inheritance tree to signal certain features of that object to an external inspector. Tests are done using isinstance(), and the presence of a particular ABC means that the test has passed.
In addition, the ABCs define a minimal set of methods that establish the characteristic behavior of the type. Code that discriminates objects based on their ABC type can trust that those methods will always be present. Each of these methods is accompanied by a generalized abstract semantic definition that is described in the documentation for the ABC. These standard semantic definitions are not enforced, but are strongly recommended.
Like all other things in Python, these promises are in the nature of a gentlemen's agreement, which in this case means that while the language does enforce some of the promises made in the ABC, it is up to the implementer of the concrete class to ensure that the remaining ones are kept.
Specification
The specification follows the categories listed in the abstract:
- A way to overload isinstance() and issubclass().
- A new module abc which serves as an "ABC support framework". It defines a metaclass for use with ABCs and a decorator that can be used to define abstract methods.
- Specific ABCs for containers and iterators, to be added to the collections module.
Overloading isinstance() and issubclass()
During the development of this PEP and of its companion, PEP 3141, we repeatedly faced the choice between standardizing more, fine-grained ABCs or fewer, coarse-grained ones. For example, at one stage, PEP 3141 introduced the following stack of base classes used for complex numbers: MonoidUnderPlus, AdditiveGroup, Ring, Field, Complex (each derived from the previous). And the discussion mentioned several other algebraic categorizations that were left out: Algebraic, Transcendental, IntegralDomain, and PrincipalIdealDomain. In earlier versions of the current PEP, we considered the use cases for separate classes like Set, ComposableSet, MutableSet, HashableSet, MutableComposableSet, HashableComposableSet.
The dilemma here is that we'd rather have fewer ABCs, but then what should a user do who needs a less refined ABC? Consider e.g. the plight of a mathematician who wants to define his own kind of Transcendental numbers, but also wants float and int to be considered Transcendental. PEP 3141 originally proposed to patch float.__bases__ for that purpose, but there are some good reasons to keep the built-in types immutable (for one, they are shared between all Python interpreters running in the same address space, as is used by mod_python [16]).
Another example would be someone who wants to define a generic function (PEP 3124) for any sequence that has an append() method. The Sequence ABC (see below) doesn't promise the append() method, while MutableSequence requires not only append() but also various other mutating methods.
To solve these and similar dilemmas, the next section will propose a metaclass for use with ABCs that will allow us to add an ABC as a "virtual base class" (not the same concept as in C++) to any class, including to another ABC. This allows the standard library to define ABCs Sequence and MutableSequence and register these as virtual base classes for built-in types like basestring, tuple and list, so that for example the following conditions are all true:
isinstance([], Sequence)
issubclass(list, Sequence)
issubclass(list, MutableSequence)
isinstance((), Sequence)
not issubclass(tuple, MutableSequence)
isinstance("", Sequence)
issubclass(bytearray, MutableSequence)
The primary mechanism proposed here is to allow overloading the built-in functions isinstance() and issubclass(). The overloading works as follows: The call isinstance(x, C) first checks whether C.__instancecheck__ exists, and if so, calls C.__instancecheck__(x) instead of its normal implementation. Similarly, the call issubclass(D, C) first checks whether C.__subclasscheck__ exists, and if so, calls C.__subclasscheck__(D) instead of its normal implementation.
Note that the magic names are not __isinstance__ and __issubclass__; this is because the reversal of the arguments could cause confusion, especially for the issubclass() overloader.
A prototype implementation of this is given in [12].
Here is an example with (naively simple) implementations of __instancecheck__ and __subclasscheck__:
class ABCMeta(type):

    def __instancecheck__(cls, inst):
        """Implement isinstance(inst, cls)."""
        return any(cls.__subclasscheck__(c)
                   for c in {type(inst), inst.__class__})

    def __subclasscheck__(cls, sub):
        """Implement issubclass(sub, cls)."""
        candidates = cls.__dict__.get("__subclass__", set()) | {cls}
        return any(c in candidates for c in sub.mro())

class Sequence(metaclass=ABCMeta):
    __subclass__ = {list, tuple}

assert issubclass(list, Sequence)
assert issubclass(tuple, Sequence)

class AppendableSequence(Sequence):
    __subclass__ = {list}

assert issubclass(list, AppendableSequence)
assert isinstance([], AppendableSequence)
assert not issubclass(tuple, AppendableSequence)
assert not isinstance((), AppendableSequence)
The next section proposes a full-fledged implementation.
The abc Module: an ABC Support Framework
The new standard library module abc, written in pure Python, serves as an ABC support framework. It defines a metaclass ABCMeta and decorators @abstractmethod and @abstractproperty. A sample implementation is given by [13].
The ABCMeta class overrides __instancecheck__ and __subclasscheck__ and defines a register method. The register method takes one argument, which must be a class; after the call B.register(C), the call issubclass(C, B) will return True, by virtue of B.__subclasscheck__(C) returning True. Also, isinstance(x, B) is equivalent to issubclass(x.__class__, B) or issubclass(type(x), B). (It is possible type(x) and x.__class__ are not the same object, e.g. when x is a proxy object.)
These methods are intended to be called on classes whose metaclass is (derived from) ABCMeta; for example:
from abc import ABCMeta

class MyABC(metaclass=ABCMeta):
    pass

MyABC.register(tuple)

assert issubclass(tuple, MyABC)
assert isinstance((), MyABC)
The last two asserts are equivalent to the following two:
assert MyABC.__subclasscheck__(tuple)
assert MyABC.__instancecheck__(())
Of course, you can also directly subclass MyABC:
class MyClass(MyABC):
    pass

assert issubclass(MyClass, MyABC)
assert isinstance(MyClass(), MyABC)
Also, of course, a tuple is not a MyClass:
assert not issubclass(tuple, MyClass)
assert not isinstance((), MyClass)
You can register another class as a subclass of MyClass:
MyClass.register(list)

assert issubclass(list, MyClass)
assert issubclass(list, MyABC)
You can also register another ABC:
class AnotherClass(metaclass=ABCMeta):
    pass

AnotherClass.register(basestring)

MyClass.register(AnotherClass)
assert issubclass(str, MyABC)
That last assert requires tracing the following superclass-subclass relationships:
MyABC -> MyClass (using regular subclassing)
MyClass -> AnotherClass (using registration)
AnotherClass -> basestring (using registration)
basestring -> str (using regular subclassing)
The abc module also defines a new decorator, @abstractmethod, to be used to declare abstract methods. A class containing at least one method declared with this decorator that hasn't been overridden yet cannot be instantiated. Such methods may be called from the overriding method in the subclass (using super or direct invocation). For example:
from abc import ABCMeta, abstractmethod

class A(metaclass=ABCMeta):
    @abstractmethod
    def foo(self): pass

A()  # raises TypeError

class B(A):
    pass

B()  # raises TypeError

class C(A):
    def foo(self): print(42)

C()  # works
Note: The @abstractmethod decorator should only be used inside a class body, and only for classes whose metaclass is (derived from) ABCMeta. Dynamically adding abstract methods to a class, or attempting to modify the abstraction status of a method or class once it is created, are not supported. The @abstractmethod only affects subclasses derived using regular inheritance; "virtual subclasses" registered with the register() method are not affected.
Implementation: The @abstractmethod decorator sets the function attribute __isabstractmethod__ to the value True. The ABCMeta.__new__ method computes the type attribute __abstractmethods__ as the set of all method names that have an __isabstractmethod__ attribute whose value is true. It does this by combining the __abstractmethods__ attributes of the base classes, adding the names of all methods in the new class dict that have a true __isabstractmethod__ attribute, and removing the names of all methods in the new class dict that don't have a true __isabstractmethod__ attribute. If the resulting __abstractmethods__ set is non-empty, the class is considered abstract, and attempts to instantiate it will raise TypeError. (If this were implemented in CPython, an internal flag Py_TPFLAGS_ABSTRACT could be used to speed up this check [6].)
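The bookkeeping just described can be sketched in pure Python. This is an illustration of the rule, not the actual abc module source; the metaclass raises the TypeError from __call__ as a stand-in for the Py_TPFLAGS_ABSTRACT fast path:

```python
def abstractmethod(func):
    # Mark the function; ABCMeta.__new__ looks for this attribute.
    func.__isabstractmethod__ = True
    return func

class ABCMeta(type):
    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        # Names declared abstract in this class dict:
        abstracts = {n for n, v in namespace.items()
                     if getattr(v, "__isabstractmethod__", False)}
        # Names inherited from the bases that are still abstract here
        # (a concrete redefinition in this class dict clears the flag):
        for base in bases:
            for n in getattr(base, "__abstractmethods__", set()):
                if getattr(getattr(cls, n, None),
                           "__isabstractmethod__", False):
                    abstracts.add(n)
        cls.__abstractmethods__ = frozenset(abstracts)
        return cls

    def __call__(cls, *args, **kwds):
        # Non-empty __abstractmethods__ means the class is abstract.
        if cls.__abstractmethods__:
            raise TypeError("Can't instantiate abstract class %s"
                            % cls.__name__)
        return super().__call__(*args, **kwds)

class A(metaclass=ABCMeta):
    @abstractmethod
    def foo(self): pass

class B(A):
    pass

class C(A):
    def foo(self): return 42

for klass in (A, B):
    try:
        klass()
        raised = False
    except TypeError:
        raised = True
    assert raised            # A and B are still abstract
assert C().foo() == 42       # C overrides foo, so it is concrete
```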
Discussion: Unlike Java's abstract methods or C++'s pure abstract methods, abstract methods as defined here may have an implementation. This implementation can be called via the super mechanism from the class that overrides it. This could be useful as an end-point for a super-call in a framework using cooperative multiple inheritance [7], [8].
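A brief sketch of that end-point pattern (the class names are invented for illustration):

```python
from abc import ABCMeta, abstractmethod

class Saver(metaclass=ABCMeta):
    @abstractmethod
    def save(self, data):
        # Abstract, yet with a body: a natural end-point for super()
        # calls in cooperative multiple inheritance.
        return len(data)

class DiskSaver(Saver):
    def save(self, data):
        n = super().save(data)   # invokes the abstract implementation
        return ("disk", n)

assert DiskSaver().save("abc") == ("disk", 3)
```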
A second decorator, @abstractproperty, is defined in order to define abstract data attributes. Its implementation is a subclass of the built-in property class that adds an __isabstractmethod__ attribute:
class abstractproperty(property):
    __isabstractmethod__ = True
It can be used in two ways:
class C(metaclass=ABCMeta):

    # A read-only property:
    @abstractproperty
    def readonly(self):
        return self.__x

    # A read-write property (cannot use decorator syntax):
    def getx(self):
        return self.__x
    def setx(self, value):
        self.__x = value
    x = abstractproperty(getx, setx)
Similar to abstract methods, a subclass inheriting an abstract property (declared using either the decorator syntax or the longer form) cannot be instantiated unless it overrides that abstract property with a concrete property.
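For example, a subclass becomes instantiable once the abstract property is replaced by a concrete one (a sketch; the Shape/Square classes are invented for illustration):

```python
from abc import ABCMeta, abstractproperty

class Shape(metaclass=ABCMeta):
    @abstractproperty
    def area(self):
        return 0

class Square(Shape):
    def __init__(self, side):
        self._side = side

    @property
    def area(self):              # concrete property overrides the abstract one
        return self._side ** 2

try:
    Shape()                      # abstract property not overridden
    raised = False
except TypeError:
    raised = True
assert raised
assert Square(3).area == 9
```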
ABCs for Containers and Iterators
The collections module will define ABCs necessary and sufficient to work with sets, mappings, sequences, and some helper types such as iterators and dictionary views. All ABCs have the above-mentioned ABCMeta as their metaclass.
The ABCs provide implementations of their abstract methods that are technically valid but fairly useless; e.g. __hash__ returns 0, and __iter__ returns an empty iterator. In general, the abstract methods represent the behavior of an empty container of the indicated type.
Some ABCs also provide concrete (i.e. non-abstract) methods; for example, the Iterator class has an __iter__ method returning itself, fulfilling an important invariant of iterators (which in Python 2 has to be implemented anew by each iterator class). These ABCs can be considered "mix-in" classes.
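The mix-in effect can be seen with Iterator: a subclass supplies only __next__ and inherits the concrete __iter__ that returns self. A sketch, using the ABCs as they ended up in the collections.abc module (Countdown is invented for illustration):

```python
from collections.abc import Iterator

class Countdown(Iterator):
    # Only __next__ is written out; the concrete __iter__ mix-in method
    # (returning self) comes from the Iterator ABC.
    def __init__(self, n):
        self.n = n

    def __next__(self):
        if self.n <= 0:
            raise StopIteration
        self.n -= 1
        return self.n + 1

assert list(Countdown(3)) == [3, 2, 1]
assert isinstance(iter(Countdown(1)), Countdown)   # __iter__ returns self
```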
No ABCs defined in the PEP override __init__, __new__, __str__ or __repr__. Defining a standard constructor signature would unnecessarily constrain custom container types, for example Patricia trees or gdbm files. Defining a specific string representation for a collection is similarly left up to individual implementations.
Note: There are no ABCs for ordering operations (__lt__, __le__, __ge__, __gt__). Defining these in a base class (abstract or not) runs into problems with the accepted type for the second operand. For example, if class Ordering defined __lt__, one would assume that for any Ordering instances x and y, x < y would be defined (even if it just defines a partial ordering). But this cannot be the case: If both list and str derived from Ordering, this would imply that [1, 2] < (1, 2) should be defined (and presumably return False), while in fact (in Python 3000!) such "mixed-mode comparisons" operations are explicitly forbidden and raise TypeError. See PEP 3100 and [14] for more information. (This is a special case of a more general issue with operations that take another argument of the same type).
One Trick Ponies
These abstract classes represent single methods like __iter__ or __len__.
- Hashable
The base class for classes defining __hash__. The __hash__ method should return an integer. The abstract __hash__ method always returns 0, which is a valid (albeit inefficient) implementation. Invariant: If classes C1 and C2 both derive from Hashable, the condition o1 == o2 must imply hash(o1) == hash(o2) for all instances o1 of C1 and all instances o2 of C2. In other words, two objects should never compare equal if they have different hash values.
Another constraint is that hashable objects, once created, should never change their value (as compared by ==) or their hash value. If a class cannot guarantee this, it should not derive from Hashable; if it cannot guarantee this for certain instances, __hash__ for those instances should raise a TypeError exception.
Note: being an instance of this class does not imply that an object is immutable; e.g. a tuple containing a list as a member is not immutable; its __hash__ method raises TypeError. (This is because it recursively tries to compute the hash of each member; if a member is unhashable it raises TypeError.)
- Iterable
- The base class for classes defining __iter__. The __iter__ method should always return an instance of Iterator (see below). The abstract __iter__ method returns an empty iterator.
- Iterator
- The base class for classes defining __next__. This derives from Iterable. The abstract __next__ method raises StopIteration. The concrete __iter__ method returns self. Note the distinction between Iterable and Iterator: an Iterable can be iterated over, i.e. supports the __iter__ methods; an Iterator is what the built-in function iter() returns, i.e. supports the __next__ method.
- Sized
- The base class for classes defining __len__. The __len__ method should return an Integer (see "Numbers" below) >= 0. The abstract __len__ method returns 0. Invariant: If a class C derives from Sized as well as from Iterable, the invariant sum(1 for x in c) == len(c) should hold for any instance c of C.
- Container
- The base class for classes defining __contains__. The __contains__ method should return a bool. The abstract __contains__ method returns False. Invariant: If a class C derives from Container as well as from Iterable, then (x in c for x in c) should be a generator yielding only True values for any instance c of C.
Open issues: Conceivably, instead of using the ABCMeta metaclass, these classes could override __instancecheck__ and __subclasscheck__ to check for the presence of the applicable special method; for example:
class Sized(metaclass=ABCMeta):
    @abstractmethod
    def __len__(self):
        return 0
    @classmethod
    def __instancecheck__(cls, x):
        return hasattr(x, "__len__")
    @classmethod
    def __subclasscheck__(cls, C):
        return hasattr(C, "__bases__") and hasattr(C, "__len__")
This has the advantage of not requiring explicit registration. However, the semantics are hard to get exactly right given the confusing semantics of instance attributes vs. class attributes, and that a class is an instance of its metaclass; the check for __bases__ is only an approximation of the desired semantics. Strawman: Let's do it, but let's arrange it in such a way that the registration API also works.
Sets
These abstract classes represent read-only sets and mutable sets. The most fundamental set operation is the membership test, written as x in s and implemented by s.__contains__(x). This operation is already defined by the Container class defined above. Therefore, we define a set as a sized, iterable container for which certain invariants from mathematical set theory hold.
The built-in type set derives from MutableSet. The built-in type frozenset derives from Set and Hashable.
- Set
This is a sized, iterable container, i.e., a subclass of Sized, Iterable and Container. Not every subclass of those three classes is a set though! Sets have the additional invariant that each element occurs only once (as can be determined by iteration), and in addition sets define concrete operators that implement the inequality operations as subclass/superclass tests. In general, the invariants for finite sets in mathematics hold. [11]
Sets with different implementations can be compared safely, (usually) efficiently and correctly using the mathematical definitions of the subclass/superclass operations for finite sets. The ordering operations have concrete implementations; subclasses may override these for speed but should maintain the semantics. Because Set derives from Sized, __eq__ may take a shortcut and return False immediately if two sets of unequal length are compared. Similarly, __le__ may return False immediately if the first set has more members than the second set. Note that set inclusion implements only a partial ordering; e.g. {1, 2} and {1, 3} are not ordered (all three of <, == and > return False for these arguments). Sets cannot be ordered relative to mappings or sequences, but they can be compared to those for equality (and then they always compare unequal).
This class also defines concrete operators to compute union, intersection, symmetric and asymmetric difference, respectively __or__, __and__, __xor__ and __sub__. These operators should return instances of Set. The default implementations call the overridable class method _from_iterable() with an iterable argument. This factory method's default implementation returns a frozenset instance; it may be overridden to return another appropriate Set subclass.
Finally, this class defines a concrete method _hash which computes the hash value from the elements. Hashable subclasses of Set can implement __hash__ by calling _hash or they can reimplement the same algorithm more efficiently; but the algorithm implemented should be the same. Currently the algorithm is fully specified only by the source code [15].
Note: the issubset and issuperset methods found on the set type in Python 2 are not supported, as these are mostly just aliases for __le__ and __ge__.
- MutableSet
This is a subclass of Set implementing additional operations to add and remove elements. The supported methods have the semantics known from the set type in Python 2 (except for discard, which is modeled after Java):
- .add(x)
- Abstract method returning a bool that adds the element x if it isn't already in the set. It should return True if x was added, False if it was already there. The abstract implementation raises NotImplementedError.
- .discard(x)
- Abstract method returning a bool that removes the element x if present. It should return True if the element was present and False if it wasn't. The abstract implementation raises NotImplementedError.
- .pop()
- Concrete method that removes and returns an arbitrary item. If the set is empty, it raises KeyError. The default implementation removes the first item returned by the set's iterator.
- .toggle(x)
- Concrete method returning a bool that adds x to the set if it wasn't there, but removes it if it was there. It should return True if x was added, False if it was removed.
- .clear()
- Concrete method that empties the set. The default implementation repeatedly calls self.pop() until KeyError is caught. (Note: this is likely much slower than simply creating a new set, even if an implementation overrides it with a faster approach; but in some cases object identity is important.)
This also supports the in-place mutating operations |=, &=, ^=, -=. These are concrete methods whose right operand can be an arbitrary Iterable, except for &=, whose right operand must be a Container. This ABC does not provide the named methods present on the built-in concrete set type that perform (almost) the same operations.
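Both the concrete operators described under Set and the mutating operations above fall out of a handful of methods. A sketch against collections.abc as it shipped (note that in the shipped abc, add() and discard() return None rather than the bool proposed here, and toggle() was dropped; the ListSet class is invented for illustration):

```python
from collections.abc import MutableSet

class ListSet(MutableSet):
    # A deliberately naive set backed by a duplicate-free list.  Only the
    # five abstract methods are written out; comparisons, |, &, ^, -, and
    # the in-place operators all come from the Set/MutableSet mix-ins.
    def __init__(self, iterable=()):
        self._items = []
        for x in iterable:
            if x not in self._items:
                self._items.append(x)

    def __contains__(self, x):
        return x in self._items

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)

    def add(self, x):
        if x not in self._items:
            self._items.append(x)

    def discard(self, x):
        if x in self._items:
            self._items.remove(x)

    @classmethod
    def _from_iterable(cls, it):
        # Factory used by the concrete operators; overriding it makes
        # a & b, a | b, etc. return ListSet.
        return cls(it)

a = ListSet([1, 2, 3])
b = ListSet([2, 3, 4])
assert sorted(a & b) == [2, 3]       # concrete __and__ from Set
assert a == {1, 2, 3}                # cross-implementation equality
assert not a <= b                    # set inclusion: a partial ordering
a |= [5]                             # in-place |=, any iterable right operand
assert sorted(a) == [1, 2, 3, 5]
```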
Mappings
These abstract classes represent read-only mappings and mutable mappings. The Mapping class represents the most common read-only mapping API.
The built-in type dict derives from MutableMapping.
- Mapping
A subclass of Container, Iterable and Sized. The keys of a mapping naturally form a set. The (key, value) pairs (which must be tuples) are also referred to as items. The items also form a set. Methods:
- .__getitem__(key)
- Abstract method that returns the value corresponding to key, or raises KeyError. The implementation always raises KeyError.
- .get(key, default=None)
- Concrete method returning self[key] if this does not raise KeyError, and the default value if it does.
- .__contains__(key)
- Concrete method returning True if self[key] does not raise KeyError, and False if it does.
- .__len__()
- Abstract method returning the number of distinct keys (i.e., the length of the key set).
- .__iter__()
- Abstract method returning each key in the key set exactly once.
- .keys()
- Concrete method returning the key set as a Set. The default concrete implementation returns a "view" on the key set (meaning if the underlying mapping is modified, the view's value changes correspondingly); subclasses are not required to return a view but they should return a Set.
- .items()
- Concrete method returning the items as a Set. The default concrete implementation returns a "view" on the item set; subclasses are not required to return a view but they should return a Set.
- .values()
- Concrete method returning the values as a sized, iterable container (not a set!). The default concrete implementation returns a "view" on the values of the mapping; subclasses are not required to return a view but they should return a sized, iterable container.
The following invariants should hold for any mapping m:
len(m.values()) == len(m.keys()) == len(m.items()) == len(m)
[value for value in m.values()] == [m[key] for key in m.keys()]
[item for item in m.items()] == [(key, m[key]) for key in m.keys()]
i.e. iterating over the items, keys and values should return results in the same order.
- MutableMapping
- A subclass of Mapping that also implements some standard mutating methods. Abstract methods include __setitem__, __delitem__. Concrete methods include pop, popitem, clear, update. Note: setdefault is not included. Open issues: Write out the specs for the methods.
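The division of labor between the abstract and concrete Mapping methods can be exercised with a minimal concrete class (a sketch using the final collections.abc module; PairMapping is invented for illustration):

```python
from collections.abc import Mapping

class PairMapping(Mapping):
    # Only the three abstract methods are written out; get(),
    # __contains__, keys(), items(), values() and __eq__ are supplied
    # by the mix-ins.
    def __init__(self, pairs):
        self._pairs = list(pairs)

    def __getitem__(self, key):
        for k, v in self._pairs:
            if k == key:
                return v
        raise KeyError(key)

    def __len__(self):
        return len(self._pairs)

    def __iter__(self):
        return (k for k, _ in self._pairs)

m = PairMapping([("a", 1), ("b", 2)])
assert m.get("a") == 1 and m.get("c", 0) == 0
assert "b" in m and "c" not in m
assert list(m.values()) == [m[k] for k in m.keys()]   # same iteration order
assert m == {"a": 1, "b": 2}                          # mix-in __eq__
```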
Sequences
These abstract classes represent read-only sequences and mutable sequences.
The built-in list and bytes types derive from MutableSequence. The built-in tuple and str types derive from Sequence and Hashable.
- Sequence
A subclass of Iterable, Sized, Container. It defines a new abstract method __getitem__ that has a somewhat complicated signature: when called with an integer, it returns an element of the sequence or raises IndexError; when called with a slice object, it returns another Sequence. The concrete __iter__ method iterates over the elements using __getitem__ with integer arguments 0, 1, and so on, until IndexError is raised. The length should be equal to the number of values returned by the iterator.
Open issues: Other candidate methods, which can all have default concrete implementations that only depend on __len__ and __getitem__ with an integer argument: __reversed__, index, count, __add__, __mul__.
- MutableSequence
- A subclass of Sequence adding some standard mutating methods. Abstract mutating methods: __setitem__ (for integer indices as well as slices), __delitem__ (ditto), insert. Concrete mutating methods: append, reverse, extend, pop, remove. Concrete mutating operators: +=, *= (these mutate the object in place). Note: this does not define sort() -- that is only required to exist on genuine list instances.
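The two-part __getitem__ signature and the derived iteration behavior can be sketched with a virtual sequence (using the final collections.abc module; Squares is invented for illustration, and for brevity its slice case returns a plain list, which is itself a registered Sequence):

```python
from collections.abc import Sequence

class Squares(Sequence):
    # __len__ plus an integer/slice __getitem__ is enough; iteration,
    # __contains__, __reversed__, index() and count() are mix-ins.
    def __init__(self, n):
        self._n = n

    def __len__(self):
        return self._n

    def __getitem__(self, i):
        if isinstance(i, slice):     # slice argument: return a sequence
            return [self[j] for j in range(*i.indices(self._n))]
        if i < 0:
            i += self._n
        if not 0 <= i < self._n:
            raise IndexError(i)      # ends the derived iteration
        return i * i

s = Squares(5)
assert list(s) == [0, 1, 4, 9, 16]   # mix-in __iter__ via __getitem__
assert 9 in s and s.index(4) == 2    # mix-in __contains__ and index()
assert s[1:4] == [1, 4, 9]
```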
Strings
Python 3000 will likely have at least two built-in string types: byte strings (bytes), deriving from MutableSequence, and (Unicode) character strings (str), deriving from Sequence and Hashable.
Open issues: define the base interfaces for these so alternative implementations and subclasses know what they are in for. This may be the subject of a new PEP or PEPs (PEP 358 should be co-opted for the bytes type).
ABCs vs. Alternatives
In this section I will attempt to compare and contrast ABCs to other approaches that have been proposed.
ABCs vs. Duck Typing
Does the introduction of ABCs mean the end of Duck Typing? I don't think so. Python will not require that a class derives from BasicMapping or Sequence when it defines a __getitem__ method, nor will the x[y] syntax require that x is an instance of either ABC. You will still be able to assign any "file-like" object to sys.stdout, as long as it has a write method.
Of course, there will be some carrots to encourage users to derive from the appropriate base classes; these vary from default implementations for certain functionality to an improved ability to distinguish between mappings and sequences. But there are no sticks. If hasattr(x, "__len__") works for you, great! ABCs are intended to solve problems that don't have a good solution at all in Python 2, such as distinguishing between mappings and sequences.
ABCs vs. Generic Functions
ABCs are compatible with Generic Functions (GFs). For example, my own Generic Functions implementation [4] uses the classes (types) of the arguments as the dispatch key, allowing derived classes to override base classes. Since (from Python's perspective) ABCs are quite ordinary classes, using an ABC in the default implementation for a GF can be quite appropriate. For example, if I have an overloaded prettyprint function, it would make total sense to define pretty-printing of sets like this:
@prettyprint.register(Set)
def pp_set(s):
    return "{" + ... + "}"  # Details left as an exercise
and implementations for specific subclasses of Set could be added easily.
I believe ABCs also won't present any problems for RuleDispatch, Phillip Eby's GF implementation in PEAK [5].
Of course, GF proponents might claim that GFs (and concrete, or implementation, classes) are all you need. But even they will not deny the usefulness of inheritance; and one can easily consider the ABCs proposed in this PEP as optional implementation base classes; there is no requirement that all user-defined mappings derive from BasicMapping.
ABCs vs. Interfaces
ABCs are not intrinsically incompatible with Interfaces, but there is considerable overlap. For now, I'll leave it to proponents of Interfaces to explain why Interfaces are better. I expect that much of the work that went into e.g. defining the various shades of "mapping-ness" and the nomenclature could easily be adapted for a proposal to use Interfaces instead of ABCs.
"Interfaces" in this context refers to a set of proposals for additional metadata elements attached to a class which are not part of the regular class hierarchy, but do allow for certain types of inheritance testing.
Such metadata would be designed, at least in some proposals, so as to be easily mutable by an application, allowing application writers to override the normal classification of an object.
The drawback to this idea of attaching mutable metadata to a class is that classes are shared state, and mutating them may lead to conflicts of intent. Additionally, the need to override the classification of an object can be done more cleanly using generic functions: In the simplest case, one can define a "category membership" generic function that simply returns False in the base implementation, and then provide overrides that return True for any classes of interest.
References
| [1] | An Introduction to ABC's, by Talin (http://mail.python.org/pipermail/python-3000/2007-April/006614.html) |
| [2] | Incomplete implementation prototype, by GvR (http://svn.python.org/view/sandbox/trunk/abc/) |
| [3] | Possible Python 3K Class Tree?, wiki page created by Bill Janssen (http://wiki.python.org/moin/AbstractBaseClasses) |
| [4] | Generic Functions implementation, by GvR (http://svn.python.org/view/sandbox/trunk/overload/) |
| [5] | Charming Python: Scaling a new PEAK, by David Mertz (http://www-128.ibm.com/developerworks/library/l-cppeak2/) |
| [6] | Implementation of @abstractmethod (http://python.org/sf/1706989) |
| [7] | Unifying types and classes in Python 2.2, by GvR (http://www.python.org/download/releases/2.2.3/descrintro/) |
| [8] | Putting Metaclasses to Work: A New Dimension in Object-Oriented Programming, by Ira R. Forman and Scott H. Danforth (http://www.amazon.com/gp/product/0201433052) |
| [9] | Partial order, in Wikipedia (http://en.wikipedia.org/wiki/Partial_order) |
| [10] | Total order, in Wikipedia (http://en.wikipedia.org/wiki/Total_order) |
| [11] | Finite set, in Wikipedia (http://en.wikipedia.org/wiki/Finite_set) |
| [12] | Make isinstance/issubclass overloadable (http://python.org/sf/1708353) |
| [13] | ABCMeta sample implementation (http://svn.python.org/view/sandbox/trunk/abc/xyz.py) |
| [14] | python-dev email ("Comparing heterogeneous types") http://mail.python.org/pipermail/python-dev/2004-June/045111.html |
| [15] | Function frozenset_hash() in Object/setobject.c (http://svn.python.org/view/python/trunk/Objects/setobject.c) |
| [16] | Multiple interpreters in mod_python (http://www.modpython.org/live/current/doc-html/pyapi-interps.html) |
Copyright
This document has been placed in the public domain.
pep-3120 Using UTF-8 as the default source encoding
| PEP: | 3120 |
|---|---|
| Title: | Using UTF-8 as the default source encoding |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Martin von Löwis <martin at v.loewis.de> |

| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 15-Apr-2007 |
| Python-Version: | 3.0 |
| Post-History: | |
Specification
This PEP proposes to change the default source encoding from ASCII to UTF-8. Support for alternative source encodings [1] continues to exist; an explicit encoding declaration takes precedence over the default.
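For instance, a file like the following needs an explicit encoding declaration under the old ASCII default, but compiles as-is once UTF-8 is the default (a hypothetical snippet):

```python
# With UTF-8 as the default, this file needs no declaration such as
#   # -*- coding: utf-8 -*-
# at the top; under the PEP 263 ASCII default it would be an error.
greeting = "Grüße"          # non-ASCII text, read as UTF-8
assert len(greeting) == 5   # five code points, not seven bytes
```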
A Bit of History
In Python 1, the source encoding was unspecified, except that it had to be a superset of the system's basic execution character set (i.e. an ASCII superset, on most systems). The source encoding was only relevant for the lexis itself (bytes representing letters for keywords, identifiers, punctuation, line breaks, etc.). The contents of a string literal were copied literally from the source file.
In Python 2.0, the source encoding changed to Latin-1 as a side effect of introducing Unicode. For Unicode string literals, the characters were still copied literally from the source file, but widened on a character-by-character basis. As Unicode gives a fixed interpretation to code points, this algorithm effectively fixed a source encoding, at least for files containing non-ASCII characters in Unicode literals.
PEP 263 identified the problem that you can use only those Unicode characters in a Unicode literal which are also in Latin-1, and introduced a syntax for declaring the source encoding. If no source encoding was given, the default should be ASCII. For compatibility with Python 2.0 and 2.1, files were interpreted as Latin-1 for a transitional period. This transition ended with Python 2.5, which gives an error if non-ASCII characters are encountered and no source encoding is declared.
Rationale
With PEP 263, using arbitrary non-ASCII characters in a Python file is possible, but tedious. One has to explicitly add an encoding declaration. Even though some editors (like IDLE and Emacs) support the declarations of PEP 263, many editors still do not (and never will); users have to explicitly adjust the encoding which the editor assumes on a file-by-file basis.
When the default encoding is changed to UTF-8, adding non-ASCII text to Python files becomes easier and more portable: On some systems, editors will automatically choose UTF-8 when saving text (e.g. on Unix systems where the locale uses UTF-8). On other systems, editors will guess the encoding when reading the file, and UTF-8 is easy to guess. Yet other editors support associating a default encoding with a file extension, allowing users to associate .py with UTF-8.
For Python 2, an important reason for using non-UTF-8 encodings was that byte string literals would be in the source encoding at run-time, allowing one to output them to a file or render them to the user as-is. With Python 3, all strings will be Unicode strings, so the original encoding of the source will have no impact at run-time.
Implementation
The parser needs to be changed to accept bytes > 127 if no source encoding is specified; instead of giving an error, it needs to check that the bytes are well-formed UTF-8 (decoding is not necessary, as the parser converts all source code to UTF-8, anyway).
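The required validity check can be sketched in Python terms (the function name is hypothetical; CPython performs the check in C during tokenization):

```python
def is_wellformed_utf8(source: bytes) -> bool:
    # The parser only needs to know the bytes are valid UTF-8;
    # it does not need to keep the decoded text around.
    try:
        source.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

assert is_wellformed_utf8("Grüße".encode("utf-8"))
assert not is_wellformed_utf8(b"\xff\xfe")  # not valid UTF-8
```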
IDLE needs to be changed to use UTF-8 as the default encoding.
Copyright
This document has been placed in the public domain.
pep-3121 Extension Module Initialization and Finalization
| PEP: | 3121 |
|---|---|
| Title: | Extension Module Initialization and Finalization |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Martin von Löwis <martin at v.loewis.de> |
| Status: | Accepted |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 27-Apr-2007 |
| Python-Version: | 3.0 |
| Post-History: | |
Contents
Abstract
Extension module initialization currently has a few deficiencies. There is no cleanup for modules, the entry point name might give naming conflicts, the entry functions don't follow the usual calling convention, and multiple interpreters are not supported well. This PEP addresses these issues.
Problems
Module Finalization
Currently, extension modules are initialized usually once and then "live" forever. The only exception is when Py_Finalize() is called: then the initialization routine is invoked a second time. This is bad from a resource management point of view: memory and other resources might get allocated each time initialization is called, but there is no way to reclaim them. As a result, there is currently no way to completely release all resources Python has allocated.
Entry point name conflicts
The entry point is currently called init<module>. This might conflict with other symbols also called init<something>. In particular, initsocket is known to have conflicted in the past (this specific problem got resolved as a side effect of renaming the module to _socket).
Entry point signature
The entry point is currently a procedure (returning void). This deviates from the usual calling conventions; callers can find out whether there was an error during initialization only by checking PyErr_Occurred. The entry point should return a PyObject*, which will be the module created, or NULL in case of an exception.
Multiple Interpreters
Currently, extension modules share their state across all interpreters. This allows for undesirable information leakage across interpreters: one script could permanently corrupt objects in an extension module, possibly breaking all scripts in other interpreters.
Specification
The module initialization routines change their signature to:
PyObject *PyInit_<modulename>()
The initialization routine will be invoked once per interpreter, when the module is imported. It should return a new module object each time.
In order to store per-module state in C variables, each module object will contain a block of memory that is interpreted only by the module. The amount of memory used for the module is specified at the point of creation of the module.
In addition to the initialization function, a module may implement a number of additional callback functions, which are invoked when the module's tp_traverse, tp_clear, and tp_free functions are invoked, and when the module is reloaded.
The entire module definition is combined in a struct PyModuleDef:
struct PyModuleDef{
PyModuleDef_Base m_base; /* To be filled out by the interpreter */
Py_ssize_t m_size; /* Size of per-module data */
PyMethodDef *m_methods;
inquiry m_reload;
traverseproc m_traverse;
inquiry m_clear;
freefunc m_free;
};
Creation of a module is changed to expect an optional PyModuleDef*. The module state will be null-initialized.
Each module method will be passed the module object as the first parameter. To access the module data, a function:
void* PyModule_GetState(PyObject*);
will be provided. In addition, to lookup a module more efficiently than going through sys.modules, a function:
PyObject* PyState_FindModule(struct PyModuleDef*);
will be provided. This lookup function will use an index located in the m_base field, to find the module by index, not by name.
As all Python objects should be controlled through the Python memory management, usage of "static" type objects is discouraged, unless the type object itself has no memory-managed state. To simplify definition of heap types, a new method:
PyTypeObject* PyType_Copy(PyTypeObject*);
is added.
Example
xxmodule.c would be changed to remove the initxx function, and add the following code instead:
struct xxstate{
PyObject *ErrorObject;
PyObject *Xxo_Type;
};
#define xxstate(o) ((struct xxstate*)PyModule_GetState(o))
static int xx_traverse(PyObject *m, visitproc v,
void *arg)
{
Py_VISIT(xxstate(m)->ErrorObject);
Py_VISIT(xxstate(m)->Xxo_Type);
return 0;
}
static int xx_clear(PyObject *m)
{
Py_CLEAR(xxstate(m)->ErrorObject);
Py_CLEAR(xxstate(m)->Xxo_Type);
return 0;
}
static struct PyModuleDef xxmodule = {
{}, /* m_base */
sizeof(struct xxstate),
xx_methods,
0, /* m_reload */
xx_traverse,
xx_clear,
0, /* m_free - not needed, since all is done in m_clear */
};
PyObject*
PyInit_xx()
{
PyObject *res = PyModule_New("xx", &xxmodule);
if (!res) return NULL;
xxstate(res)->ErrorObject = PyErr_NewException("xx.error", NULL, NULL);
if (!xxstate(res)->ErrorObject) {
Py_DECREF(res);
return NULL;
}
xxstate(res)->Xxo_Type = PyType_Copy(&Xxo_Type);
if (!xxstate(res)->Xxo_Type) {
Py_DECREF(res);
return NULL;
}
return res;
}
Discussion
Tim Peters reports in [1] that PythonLabs considered such a feature at one point, and lists the following additional hooks which aren't currently supported in this PEP:
- when the module object is deleted from sys.modules
- when Py_Finalize is called
- when Python exits
- when the Python DLL is unloaded (Windows only)
References
| [1] | Tim Peters, reporting earlier conversation about such a feature http://mail.python.org/pipermail/python-3000/2006-April/000726.html |
Copyright
This document has been placed in the public domain.
pep-3122 Delineation of the main module
| PEP: | 3122 |
|---|---|
| Title: | Delineation of the main module |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Brett Cannon |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 27-Apr-2007 |
| Post-History: | |
Contents
Attention!
This PEP has been rejected. Guido views running scripts within a package as an anti-pattern [3].
Abstract
Because of how name resolution works for relative imports in a world where PEP 328 is implemented, the ability to execute modules within a package ceases being possible. This failing stems from the fact that the module being executed as the "main" module replaces its __name__ attribute with "__main__" instead of leaving it as the absolute name of the module. This breaks import's ability to resolve relative imports from the main module into absolute names.
In order to resolve this issue, this PEP proposes to change how the main module is delineated. By leaving the __name__ attribute in a module alone and setting sys.main to the name of the main module this will allow at least some instances of executing a module within a package that uses relative imports.
This PEP does not address the idea of introducing a module-level function that is automatically executed like PEP 299 proposes.
The Problem
With the introduction of PEP 328, relative imports became dependent on the __name__ attribute of the module performing the import. This is because the use of dots in a relative import are used to strip away parts of the calling module's name to calculate where in the package hierarchy an import should fall (prior to PEP 328 relative imports could fail and would fall back on absolute imports which had a chance of succeeding).
For instance, consider the import from .. import spam made from the bacon.ham.beans module (bacon.ham.beans is not a package itself, i.e., does not define __path__). Name resolution of the relative import takes the caller's name (bacon.ham.beans), splits on dots, and then slices off the last n parts based on the level (which is 2). In this example both ham and beans are dropped and spam is joined with what is left (bacon). This leads to the proper import of the module bacon.spam.
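The splitting described above can be sketched as follows (a toy model; `resolve_relative` is not a real API):

```python
def resolve_relative(caller, level, target):
    # Strip `level` trailing components from the caller's dotted
    # name, then append the imported name -- PEP 328-style
    # resolution for a module that is not itself a package.
    base = caller.rsplit(".", level)[0]
    return base + "." + target if target else base

# `from .. import spam` inside bacon.ham.beans:
assert resolve_relative("bacon.ham.beans", 2, "spam") == "bacon.spam"
```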
This reliance on the __name__ attribute of a module when handling relative imports becomes an issue when executing a script within a package. Because the executing script has its name set to '__main__', import cannot resolve any relative imports, leading to an ImportError.
For example, assume we have a package named bacon with an __init__.py file containing:
from . import spam
Also create a module named spam within the bacon package (it can be an empty file). Now if you try to execute the bacon package (either through python bacon/__init__.py or python -m bacon) you will get an ImportError about trying to do a relative import from within a non-package. Obviously the import is valid, but because of the setting of __name__ to '__main__' import thinks that bacon/__init__.py is not in a package since no dots exist in __name__. To see how the algorithm works in more detail, see importlib.Import._resolve_name() in the sandbox [2].
Currently a work-around is to remove all relative imports in the module being executed and make them absolute. This is unfortunate, though, as one should not be required to use a specific type of resource in order to make a module in a package be able to be executed.
The Solution
The solution to the problem is to not change the value of __name__ in modules. But there still needs to be a way to let executing code know it is being executed as a script. This is handled with a new attribute in the sys module named main.
When a module is being executed as a script, sys.main will be set to the name of the module. This changes the current idiom of:
if __name__ == '__main__':
...
to:
import sys
if __name__ == sys.main:
...
The newly proposed solution does introduce an added line of boilerplate which is a module import. But as the solution does not introduce a new built-in or module attribute (as discussed in Rejected Ideas) it has been deemed worth the extra line.
Another issue with the proposed solution (which applies to all the rejected ideas as well) is that it does not directly solve the problem of discovering the name of a file. Consider python bacon/spam.py. From the file name alone it is not obvious whether bacon is a package. To find out for certain, the current directory must be on sys.path and bacon/__init__.py must exist.
But this is the simple example. Consider python ../spam.py. From the file name alone it is not at all clear whether spam.py is in a package. One possible solution is to find the absolute path of .., check whether a file named __init__.py exists there, and then see whether that directory is on sys.path. If it is not, continue to walk up the directory tree until no more __init__.py files are found or a directory on sys.path is reached.
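The directory walk described above might be sketched like this (`package_root` is a hypothetical helper, not proposed API):

```python
import os

def package_root(path):
    # Climb the directory tree while __init__.py files are present,
    # collecting package names as we go.
    d = os.path.dirname(os.path.abspath(path))
    parts = [os.path.splitext(os.path.basename(path))[0]]
    while os.path.exists(os.path.join(d, "__init__.py")):
        parts.insert(0, os.path.basename(d))
        d = os.path.dirname(d)
    # d is where the package is anchored (and would need to be on
    # sys.path); the dotted name is relative to that anchor.
    return d, ".".join(parts)
```

Each loop iteration costs at least one stat call, which is where the expense discussed below comes from.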
This could potentially be an expensive process. If the package depth happens to be deep then it could require a large amount of disk access to discover where the package is anchored on sys.path, if at all. The stat calls alone can be expensive if the file system the executed script is on is something like NFS.
Because of these issues, only when the -m command-line argument (introduced by PEP 338) is used will __name__ be left set to the module's real name. Otherwise the fallback semantics of setting __name__ to "__main__" will apply. sys.main will still be set to the proper value, regardless of what __name__ is set to.
Implementation
When the -m option is used, sys.main will be set to the argument passed in. sys.argv will be adjusted as it is currently. Then the equivalent of __import__(self.main) will occur. This differs from current semantics as the runpy module fetches the code object for the file specified by the module name in order to explicitly set __name__ and other attributes. This is no longer needed as import can perform its normal operation in this situation.
If a file name is specified, then sys.main will be set to "__main__". The specified file will then be read and have a code object created and then be executed with __name__ set to "__main__". This mirrors current semantics.
Transition Plan
In order for Python 2.6 to be able to support both the current semantics and the proposed semantics, sys.main will always be set to "__main__". Otherwise no change will occur for Python 2.6. This unfortunately means that no benefit from this change will occur in Python 2.6, but it maximizes compatibility for code that is to work as much as possible with 2.6 and 3.0.
To help transition to the new idiom, 2to3 [1] will gain a rule to transform the current if __name__ == '__main__': ... idiom to the new one. This will not help with code that checks __name__ outside of the idiom, though.
Rejected Ideas
__main__ built-in
A counter-proposal to introduce a built-in named __main__. The value of the built-in would be the name of the module being executed (just like the proposed sys.main). This would lead to a new idiom of:
if __name__ == __main__:
...
A drawback is that the syntactic difference is subtle: the dropping of the quotation marks around "__main__". Some believe that existing Python programmers will introduce bugs by putting the quotation marks back on by accident. But one could argue that such a bug would be discovered quickly through testing, as it is a very shallow bug.
While the name of built-in could obviously be different (e.g., main) the other drawback is that it introduces a new built-in. With a simple solution such as sys.main being possible without adding another built-in to Python, this proposal was rejected.
__main__ module attribute
Another proposal was to add a __main__ attribute to every module. For the one that was executing as the main module, the attribute would have a true value while all other modules had a false value. This has the nice consequence of simplifying the main-module idiom to:
if __main__:
...
The drawback was the introduction of a new module attribute. It also required more integration with the import machinery than the proposed solution.
Use __file__ instead of __name__
Any of the proposals could be changed to use the __file__ attribute on modules instead of __name__, including the current semantics. The problem with this is that with the proposed solutions there is the issue of modules having no __file__ attribute defined or having the same value as other modules.
The problem that comes up with the current semantics is you still have to try to resolve the file path to a module name for the import to work.
Special string subclass for __name__ that overrides __eq__
One proposal was to define a subclass of str that overrode the __eq__ method so that it would compare equal to "__main__" as well as the actual name of the module. In all other respects the subclass would be the same as str.
This was rejected as it seemed like too much of a hack.
References
| [1] | 2to3 tool (http://svn.python.org/view/sandbox/trunk/2to3/) [ViewVC] |
| [2] | importlib (http://svn.python.org/view/sandbox/trunk/import_in_py/importlib.py?view=markup) [ViewVC] |
| [3] | Python-Dev email: "PEP to change how the main module is delineated" (http://mail.python.org/pipermail/python-3000/2007-April/006793.html) |
Copyright
This document has been placed in the public domain.
pep-3123 Making PyObject_HEAD conform to standard C
| PEP: | 3123 |
|---|---|
| Title: | Making PyObject_HEAD conform to standard C |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Martin von Löwis <martin at v.loewis.de> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 27-Apr-2007 |
| Python-Version: | 3.0 |
| Post-History: | |
Abstract
Python currently relies on undefined C behavior, with its usage of PyObject_HEAD. This PEP proposes to change that into standard C.
Rationale
Standard C defines that an object must be accessed only through a pointer of its type, and that all other accesses are undefined behavior, with a few exceptions. In particular, the following code has undefined behavior:
struct FooObject{
PyObject_HEAD
int data;
};
PyObject *foo(struct FooObject*f){
return (PyObject*)f;
}
int bar(){
struct FooObject *f = malloc(sizeof(struct FooObject));
PyObject *o = foo(f);
f->ob_refcnt = 0;
o->ob_refcnt = 1;
return f->ob_refcnt;
}
The problem here is that the storage is accessed both as if it were struct PyObject and as struct FooObject.
Historically, compilers did not have any problems with this code. However, modern compilers use that clause as an optimization opportunity, finding that f->ob_refcnt and o->ob_refcnt cannot possibly refer to the same memory, and that therefore the function should return 0, without having to fetch the value of ob_refcnt at all in the return statement. For GCC, Python now uses -fno-strict-aliasing to work around that problem; with other compilers, the code may simply exhibit undefined behavior. Even with GCC, using -fno-strict-aliasing may pessimize the generated code unnecessarily.
Specification
Standard C has one specific exception to its aliasing rules precisely designed to support the case of Python: a value of a struct type may also be accessed through a pointer to the first field. E.g. if a struct starts with an int, the struct * may also be cast to an int *, allowing one to write int values into the first field.
For Python, PyObject_HEAD and PyObject_VAR_HEAD will be changed to not list all fields anymore, but list a single field of type PyObject/PyVarObject:
typedef struct _object {
_PyObject_HEAD_EXTRA
Py_ssize_t ob_refcnt;
struct _typeobject *ob_type;
} PyObject;
typedef struct {
PyObject ob_base;
Py_ssize_t ob_size;
} PyVarObject;
#define PyObject_HEAD PyObject ob_base;
#define PyObject_VAR_HEAD PyVarObject ob_base;
Types defined as fixed-size structures will then include PyObject as their first field; variable-sized objects include PyVarObject instead. E.g.:
typedef struct {
PyObject ob_base;
PyObject *start, *stop, *step;
} PySliceObject;
typedef struct {
PyVarObject ob_base;
PyObject **ob_item;
Py_ssize_t allocated;
} PyListObject;
The above definitions of PyObject_HEAD are normative, so extension authors MAY either use the macro, or put the ob_base field explicitly into their structs.
As a convention, the base field SHOULD be called ob_base. However, all accesses to ob_refcnt and ob_type MUST cast the object pointer to PyObject* (unless the pointer is already known to have that type), and SHOULD use the respective accessor macros. To simplify access to ob_type, ob_refcnt, and ob_size, macros:
#define Py_TYPE(o) (((PyObject*)(o))->ob_type)
#define Py_REFCNT(o) (((PyObject*)(o))->ob_refcnt)
#define Py_SIZE(o) (((PyVarObject*)(o))->ob_size)
are added. E.g. the code blocks
#define PyList_CheckExact(op) ((op)->ob_type == &PyList_Type)
return func->ob_type->tp_name;
needs to be changed to:
#define PyList_CheckExact(op) (Py_TYPE(op) == &PyList_Type)
return Py_TYPE(func)->tp_name;
For initialization of type objects, the current sequence
PyObject_HEAD_INIT(NULL)
0, /* ob_size */
becomes incorrect, and must be replaced with
PyVarObject_HEAD_INIT(NULL, 0)
Compatibility with Python 2.6
To support modules that compile with both Python 2.6 and Python 3.0, the Py_* macros are added to Python 2.6. The macros Py_INCREF and Py_DECREF will be changed to cast their argument to PyObject *, so that module authors can also explicitly declare the ob_base field in modules designed for Python 2.6.
Copyright
This document has been placed in the public domain.
pep-3124 Overloading, Generic Functions, Interfaces, and Adaptation
| PEP: | 3124 |
|---|---|
| Title: | Overloading, Generic Functions, Interfaces, and Adaptation |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Phillip J. Eby <pje at telecommunity.com> |
| Discussions-To: | Python 3000 List <python-3000 at python.org> |
| Status: | Deferred |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Requires: | 3107 3115 3119 |
| Created: | 28-Apr-2007 |
| Post-History: | 30-Apr-2007 |
| Replaces: | 245 246 |
Contents
Abstract
This PEP proposes a new standard library module, overloading, to provide generic programming features including dynamic overloading (aka generic functions), interfaces, adaptation, method combining (à la CLOS and AspectJ), and simple forms of aspect-oriented programming (AOP).
The proposed API is also open to extension; that is, it will be possible for library developers to implement their own specialized interface types, generic function dispatchers, method combination algorithms, etc., and those extensions will be treated as first-class citizens by the proposed API.
The API will be implemented in pure Python with no C, but may have some dependency on CPython-specific features such as sys._getframe and the func_code attribute of functions. It is expected that e.g. Jython and IronPython will have other ways of implementing similar functionality (perhaps using Java or C#).
Rationale and Goals
Python has always provided a variety of built-in and standard-library generic functions, such as len(), iter(), pprint.pprint(), and most of the functions in the operator module. However, it currently:
- does not have a simple or straightforward way for developers to create new generic functions,
- does not have a standard way for methods to be added to existing generic functions (i.e., some are added using registration functions, others require defining __special__ methods, possibly by monkeypatching), and
- does not allow dispatching on multiple argument types (except in a limited form for arithmetic operators, where "right-hand" (__r*__) methods can be used to do two-argument dispatch).
In addition, it is currently a common anti-pattern for Python code to inspect the types of received arguments, in order to decide what to do with the objects. For example, code may wish to accept either an object of some type, or a sequence of objects of that type.
Currently, the "obvious way" to do this is by type inspection, but this is brittle and closed to extension. A developer using an already-written library may be unable to change how their objects are treated by such code, especially if the objects they are using were created by a third party.
Therefore, this PEP proposes a standard library module to address these, and related issues, using decorators and argument annotations (PEP 3107). The primary features to be provided are:
- a dynamic overloading facility, similar to the static overloading found in languages such as Java and C++, but including optional method combination features as found in CLOS and AspectJ.
- a simple "interfaces and adaptation" library inspired by Haskell's typeclasses (but more dynamic, and without any static type-checking), with an extension API to allow registering user-defined interface types such as those found in PyProtocols and Zope.
- a simple "aspect" implementation to make it easy to create stateful adapters and to do other stateful AOP.
These features are to be provided in such a way that extended implementations can be created and used. For example, it should be possible for libraries to define new dispatching criteria for generic functions, and new kinds of interfaces, and use them in place of the predefined features. In particular, it should be possible to use a zope.interface interface object to specify the desired type of a function argument, as long as the zope.interface package registered itself correctly (or a third party did the registration).
In this way, the proposed API simply offers a uniform way of accessing the functionality within its scope, rather than prescribing a single implementation to be used for all libraries, frameworks, and applications.
User API
The overloading API will be implemented as a single module, named overloading, providing the following features:
Overloading/Generic Functions
The @overload decorator allows you to define alternate implementations of a function, specialized by argument type(s). A function with the same name must already exist in the local namespace. The existing function is modified in-place by the decorator to add the new implementation, and the modified function is returned by the decorator. Thus, the following code:
from overloading import overload
from collections import Iterable
def flatten(ob):
"""Flatten an object to its component iterables"""
yield ob
@overload
def flatten(ob: Iterable):
for o in ob:
for ob in flatten(o):
yield ob
@overload
def flatten(ob: basestring):
yield ob
creates a single flatten() function whose implementation roughly equates to:
def flatten(ob):
if isinstance(ob, basestring) or not isinstance(ob, Iterable):
yield ob
else:
for o in ob:
for ob in flatten(o):
yield ob
except that the flatten() function defined by overloading remains open to extension by adding more overloads, while the hardcoded version cannot be extended.
For example, if someone wants to use flatten() with a string-like type that doesn't subclass basestring, they would be out of luck with the second implementation. With the overloaded implementation, however, they can either write this:
@overload
def flatten(ob: MyString):
yield ob
or this (to avoid copying the implementation):
from overloading import RuleSet
RuleSet(flatten).copy_rules((basestring,), (MyString,))
(Note also that, although PEP 3119 proposes that it should be possible for abstract base classes like Iterable to allow classes like MyString to claim subclass-hood, such a claim is global, throughout the application. In contrast, adding a specific overload or copying a rule is specific to an individual function, and therefore less likely to have undesired side effects.)
@overload vs. @when
The @overload decorator is a common-case shorthand for the more general @when decorator. It allows you to leave out the name of the function you are overloading, at the expense of requiring the target function to be in the local namespace. It also doesn't support adding additional criteria besides the ones specified via argument annotations. The following function definitions have identical effects, except for name binding side-effects (which will be described below):
from overloading import when
@overload
def flatten(ob: basestring):
yield ob
@when(flatten)
def flatten(ob: basestring):
yield ob
@when(flatten)
def flatten_basestring(ob: basestring):
yield ob
@when(flatten, (basestring,))
def flatten_basestring(ob):
yield ob
The first definition above will bind flatten to whatever it was previously bound to. The second will do the same, if it was already bound to the when decorator's first argument. If flatten is unbound or bound to something else, it will be rebound to the function definition as given. The last two definitions above will always bind flatten_basestring to the function definition as given.
Using this approach allows you to both give a method a descriptive name (often useful in tracebacks!) and to reuse the method later.
Except as otherwise specified, all overloading decorators have the same signature and binding rules as @when. They accept a function and an optional "predicate" object.
The default predicate implementation is a tuple of types with positional matching to the overloaded function's arguments. However, an arbitrary number of other kinds of predicates can be created and registered using the Extension API, and will then be usable with @when and other decorators created by this module (like @before, @after, and @around).
Method Combination and Overriding
When an overloaded function is invoked, the implementation with the signature that most specifically matches the calling arguments is the one used. If no implementation matches, a NoApplicableMethods error is raised. If more than one implementation matches, but none of the signatures are more specific than the others, an AmbiguousMethods error is raised.
For example, the following pair of implementations are ambiguous, if the foo() function is ever called with two integer arguments, because both signatures would apply, but neither signature is more specific than the other (i.e., neither implies the other):
def foo(bar:int, baz:object):
pass
@overload
def foo(bar:object, baz:int):
pass
In contrast, the following pair of implementations can never be ambiguous, because one signature always implies the other; the int/int signature is more specific than the object/object signature:
def foo(bar:object, baz:object):
pass
@overload
def foo(bar:int, baz:int):
pass
A signature S1 implies another signature S2, if whenever S1 would apply, S2 would also. A signature S1 is "more specific" than another signature S2, if S1 implies S2, but S2 does not imply S1.
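For the default predicate (a tuple of types matched positionally), the implication and specificity relations above can be sketched with issubclass checks per position. This is a hypothetical illustration, not the proposed module's actual code; the function names implies and more_specific are chosen here for clarity:

```python
def implies(s1, s2):
    """True if every call matching signature s1 would also match s2."""
    return len(s1) == len(s2) and all(
        issubclass(t1, t2) for t1, t2 in zip(s1, s2)
    )

def more_specific(s1, s2):
    """s1 is more specific than s2: s1 implies s2, but not vice versa."""
    return implies(s1, s2) and not implies(s2, s1)

# int/int is more specific than object/object...
more_specific((int, int), (object, object))   # True
# ...but int/object and object/int are mutually ambiguous
more_specific((int, object), (object, int))   # False
```

This also explains the examples above: the int/object and object/int pair is ambiguous precisely because neither signature implies the other.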
Although the examples above have all used concrete or abstract types as argument annotations, there is no requirement that the annotations be such. They can also be "interface" objects (discussed in the Interfaces and Adaptation section), including user-defined interface types. (They can also be other objects whose types are appropriately registered via the Extension API.)
Proceeding to the "Next" Method
If the first parameter of an overloaded function is named __proceed__, it will be passed a callable representing the next most-specific method. For example, this code:
def foo(bar:object, baz:object):
print("got objects!")
@overload
def foo(__proceed__, bar:int, baz:int):
print("got integers!")
return __proceed__(bar, baz)
will print "got integers!" followed by "got objects!".
If there is no next most-specific method, __proceed__ will be bound to a NoApplicableMethods instance. When called, a new NoApplicableMethods instance will be raised, with the arguments passed to the first instance.
Similarly, if the next most-specific methods have ambiguous precedence with respect to each other, __proceed__ will be bound to an AmbiguousMethods instance, and if called, it will raise a new instance.
Thus, a method can either check if __proceed__ is an error instance, or simply invoke it. The NoApplicableMethods and AmbiguousMethods error classes have a common DispatchError base class, so isinstance(__proceed__, overloading.DispatchError) is sufficient to identify whether __proceed__ can be safely called.
(Implementation note: using a magic argument name like __proceed__ could potentially be replaced by a magic function that would be called to obtain the next method. A magic function, however, would degrade performance and might be more difficult to implement on non-CPython platforms. Method chaining via magic argument names, however, can be efficiently implemented on any Python platform that supports creating bound methods from functions -- one simply recursively binds each function to be chained, using the following function or error as the im_self of the bound method.)
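The bound-method chaining described in the note can be sketched in a few lines: each implementation is bound to the next-most-specific callable, which then arrives as the function's first argument. This is a minimal illustration of the technique, not the proposed implementation; the names chain, foo_ints, and foo_objects are assumptions:

```python
import types

def chain(funcs, tail):
    """funcs are ordered most-specific first; each takes __proceed__ as
    its first parameter. tail handles the end of the chain (the real
    implementation would raise NoApplicableMethods there)."""
    nxt = tail
    for f in reversed(funcs):
        # Binding f to nxt makes nxt the implicit first argument,
        # i.e. f's __proceed__.
        nxt = types.MethodType(f, nxt)
    return nxt

def foo_ints(__proceed__, bar, baz):
    return "got integers! " + __proceed__(bar, baz)

def foo_objects(__proceed__, bar, baz):
    return "got objects!"

foo = chain([foo_ints, foo_objects], tail=lambda *args: "")
foo(1, 2)  # 'got integers! got objects!'
```

Because the binding happens once, ahead of time, each call pays only the cost of ordinary bound-method invocation, which is the efficiency argument made in the note.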
"Before" and "After" Methods
In addition to the simple next-method chaining shown above, it is sometimes useful to have other ways of combining methods. For example, the "observer pattern" can sometimes be implemented by adding extra methods to a function that execute before or after the normal implementation.
To support these use cases, the overloading module will supply @before, @after, and @around decorators, that roughly correspond to the same types of methods in the Common Lisp Object System (CLOS), or the corresponding "advice" types in AspectJ.
Like @when, all of these decorators must be passed the function to be overloaded, and can optionally accept a predicate as well:
from overloading import before, after
def begin_transaction(db):
print("Beginning the actual transaction")
@before(begin_transaction)
def check_single_access(db: SingletonDB):
if db.inuse:
raise TransactionError("Database already in use")
@after(begin_transaction)
def start_logging(db: LoggableDB):
db.set_log_level(VERBOSE)
@before and @after methods are invoked either before or after the main function body, and are never considered ambiguous. That is, it will not cause any errors to have multiple "before" or "after" methods with identical or overlapping signatures. Ambiguities are resolved using the order in which the methods were added to the target function.
"Before" methods are invoked most-specific method first, with ambiguous methods being executed in the order they were added. All "before" methods are called before any of the function's "primary" methods (i.e. normal @overload methods) are executed.
"After" methods are invoked in the reverse order, after all of the function's "primary" methods are executed. That is, they are executed least-specific methods first, with ambiguous methods being executed in the reverse of the order in which they were added.
The return values of both "before" and "after" methods are ignored, and any uncaught exceptions raised by any methods (primary or other) immediately end the dispatching process. "Before" and "after" methods cannot have __proceed__ arguments, as they are not responsible for calling any other methods. They are simply called as a notification before or after the primary methods.
Thus, "before" and "after" methods can be used to check or establish preconditions (e.g. by raising an error if the conditions aren't met) or to ensure postconditions, without needing to duplicate any existing functionality.
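The ordering rules above can be sketched with a plain-Python stand-in for the proposed decorators. This is a hypothetical, dispatch-free illustration (it ignores signatures and specificity entirely); the name Generic is an assumption, not part of the proposed API:

```python
class Generic:
    """Minimal sketch of before/primary/after combination."""
    def __init__(self, primary):
        self.primary = primary
        self.befores = []  # run in the order added, before the primary
        self.afters = []   # run in reverse of the order added, after it

    def before(self, func):
        self.befores.append(func)
        return func

    def after(self, func):
        self.afters.append(func)
        return func

    def __call__(self, *args, **kw):
        for f in self.befores:           # return values are ignored
            f(*args, **kw)
        result = self.primary(*args, **kw)
        for f in reversed(self.afters):  # reverse add order, per the PEP
            f(*args, **kw)
        return result

events = []
begin = Generic(lambda db: events.append("begin"))

@begin.before
def check(db):
    events.append("check")

@begin.after
def log(db):
    events.append("log")

begin("db")
# events is now ["check", "begin", "log"]
```

Note how the "before" method fires first, the primary body second, and the "after" method last, and how the primary's return value is the one the caller sees.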
"Around" Methods
The @around decorator declares a method as an "around" method. "Around" methods are much like primary methods, except that the least-specific "around" method has higher precedence than the most-specific "before" method.
Unlike "before" and "after" methods, however, "Around" methods are responsible for calling their __proceed__ argument, in order to continue the invocation process. "Around" methods are usually used to transform input arguments or return values, or to wrap specific cases with special error handling or try/finally conditions, e.g.:
from overloading import around
@around(commit_transaction)
def lock_while_committing(__proceed__, db: SingletonDB):
with db.global_lock:
return __proceed__(db)
They can also be used to replace the normal handling for a specific case, by not invoking the __proceed__ function.
The __proceed__ given to an "around" method will either be the next applicable "around" method, a DispatchError instance, or a synthetic method object that will call all the "before" methods, followed by the primary method chain, followed by all the "after" methods, and return the result from the primary method chain.
Thus, just as with normal methods, __proceed__ can be checked for DispatchError-ness, or simply invoked. The "around" method should return the value returned by __proceed__, unless of course it wishes to modify or replace it with a different return value for the function as a whole.
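Stripped of the dispatch machinery, the "around" pattern reduces to a callable that receives the rest of the invocation and decides whether and how to run it. The following sketch is purely illustrative; the names commit and lock_while_committing are assumptions echoing the earlier example:

```python
import threading

_lock = threading.Lock()

def commit(db):
    """Stand-in for the primary method chain."""
    return "committed %s" % db

def lock_while_committing(proceed, db):
    """An 'around' method: wraps the remaining dispatch in a lock and
    returns whatever the inner chain returns."""
    with _lock:
        return proceed(db)

result = lock_while_committing(commit, "mydb")  # 'committed mydb'
```

An around method that declined to call proceed would replace the normal handling outright, which is the other use case described above.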
Custom Combinations
The decorators described above (@overload, @when, @before, @after, and @around) collectively implement what in CLOS is called the "standard method combination" -- the most common patterns used in combining methods.
Sometimes, however, an application or library may have use for a more sophisticated type of method combination. For example, if you would like to have "discount" methods that return a percentage off, to be subtracted from the value returned by the primary method(s), you might write something like this:
from overloading import always_overrides, merge_by_default
from overloading import Around, Before, After, Method, MethodList
class Discount(MethodList):
"""Apply return values as discounts"""
def __call__(self, *args, **kw):
retval = self.tail(*args, **kw)
for sig, body in self.sorted():
retval -= retval * body(*args, **kw)
return retval
# merge discounts by priority
merge_by_default(Discount)
# discounts have precedence over before/after/primary methods
always_overrides(Discount, Before)
always_overrides(Discount, After)
always_overrides(Discount, Method)
# but not over "around" methods
always_overrides(Around, Discount)
# Make a decorator called "discount" that works just like the
# standard decorators...
discount = Discount.make_decorator('discount')
# and now let's use it...
def price(product):
return product.list_price
@discount(price)
def ten_percent_off_shoes(product: Shoe):
return Decimal('0.1')
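The behavior of the Discount combination can be approximated in plain Python, without the overloading module, to make the semantics concrete. Everything here (make_discountable, the predicate-based registration, the Shoe class) is a hypothetical sketch, not the proposed API:

```python
from decimal import Decimal

def make_discountable(primary):
    """Each registered discount returns a fraction off, which is
    subtracted from the value produced by the primary function."""
    discounts = []  # (predicate, body) pairs, applied in add order

    def call(*args, **kw):
        retval = primary(*args, **kw)
        for applies, body in discounts:
            if applies(*args, **kw):
                retval -= retval * body(*args, **kw)
        return retval

    def discount(applies):
        def register(body):
            discounts.append((applies, body))
            return body
        return register

    call.discount = discount
    return call

class Shoe:
    list_price = Decimal("100")

@make_discountable
def price(product):
    return product.list_price

@price.discount(lambda product: isinstance(product, Shoe))
def ten_percent_off_shoes(product):
    return Decimal("0.1")

price(Shoe())  # Decimal('90.0')
```

The real proposal goes further: Discount methods participate in the same precedence machinery as before/after/around methods, which is what the always_overrides() declarations above configure.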
Similar techniques can be used to implement a wide variety of CLOS-style method qualifiers and combination rules. The process of creating custom method combination objects and their corresponding decorators is described in more detail under the Extension API section.
Note, by the way, that the @discount decorator shown will work correctly with any new predicates defined by other code. For example, if zope.interface were to register its interface types to work correctly as argument annotations, you would be able to specify discounts on the basis of its interface types, not just classes or overloading-defined interface types.
Similarly, if a library like RuleDispatch or PEAK-Rules were to register an appropriate predicate implementation and dispatch engine, one would then be able to use those predicates for discounts as well, e.g.:
from somewhere import Pred # some predicate implementation
@discount(
price,
Pred("isinstance(product,Shoe) and"
" product.material.name=='Blue Suede'")
)
def forty_off_blue_suede_shoes(product):
return Decimal('0.4')
The process of defining custom predicate types and dispatching engines is also described in more detail under the Extension API section.
Overloading Inside Classes
All of the decorators above have a special additional behavior when they are directly invoked within a class body: the first parameter (other than __proceed__, if present) of the decorated function will be treated as though it had an annotation equal to the class in which it was defined.
That is, this code:
class And(object):
# ...
@when(get_conjuncts)
def __conjuncts(self):
return self.conjuncts
produces the same effect as this (apart from the existence of a private method):
class And(object):
# ...
@when(get_conjuncts)
def get_conjuncts_of_and(ob: And):
return ob.conjuncts
This behavior is both a convenience enhancement when defining lots of methods, and a requirement for safely distinguishing multi-argument overloads in subclasses. Consider, for example, the following code:
class A(object):
def foo(self, ob):
print("got an object")
@overload
def foo(__proceed__, self, ob:Iterable):
print("it's iterable!")
return __proceed__(self, ob)
class B(A):
foo = A.foo # foo must be defined in local namespace
@overload
def foo(__proceed__, self, ob:Iterable):
print("B got an iterable!")
return __proceed__(self, ob)
Due to the implicit class rule, calling B().foo([]) will print "B got an iterable!" followed by "it's iterable!", and finally, "got an object", while A().foo([]) would print only the messages defined in A.
Conversely, without the implicit class rule, the two "Iterable" methods would have the exact same applicability conditions, so calling either A().foo([]) or B().foo([]) would result in an AmbiguousMethods error.
It is currently an open issue to determine the best way to implement this rule in Python 3.0. Under Python 2.x, a class' metaclass was not chosen until the end of the class body, which means that decorators could insert a custom metaclass to do processing of this sort. (This is how RuleDispatch, for example, implements the implicit class rule.)
PEP 3115, however, requires that a class' metaclass be determined before the class body has executed, making it impossible to use this technique for class decoration any more.
At this writing, discussion on this issue is ongoing.
Interfaces and Adaptation
The overloading module provides a simple implementation of interfaces and adaptation. The following example defines an IStack interface, and declares that list objects support it:
from overloading import abstract, Interface
class IStack(Interface):
@abstract
def push(self, ob):
"""Push 'ob' onto the stack"""
@abstract
def pop(self):
"""Pop a value and return it"""
when(IStack.push, (list, object))(list.append)
when(IStack.pop, (list,))(list.pop)
mylist = []
mystack = IStack(mylist)
mystack.push(42)
assert mystack.pop()==42
The Interface class is a kind of "universal adapter". It accepts a single argument: an object to adapt. It then binds all its methods to the target object, in place of itself. Thus, calling mystack.push(42) is the same as calling IStack.push(mylist, 42).
The @abstract decorator marks a function as being abstract: i.e., having no implementation. If an @abstract function is called, it raises NoApplicableMethods. To become executable, overloaded methods must be added using the techniques previously described. (That is, methods can be added using @when, @before, @after, @around, or any custom method combination decorators.)
In the example above, the list.append method is added as a method for IStack.push() when its arguments are a list and an arbitrary object. Thus, IStack.push(mylist, 42) is translated to list.append(mylist, 42), thereby implementing the desired operation.
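The "universal adapter" behavior can be sketched by binding each interface method to the adapted object at construction time. This simplified stand-in skips the overload dispatch entirely (the methods below are ordinary functions); it only illustrates the binding described above:

```python
class Interface:
    """Sketch: bind every public method of the interface so that 'self'
    becomes the adapted object rather than the adapter."""
    def __init__(self, ob):
        for name in dir(type(self)):
            if name.startswith("_"):
                continue
            attr = getattr(type(self), name)
            if callable(attr):
                # function.__get__(ob) produces a method bound to ob
                setattr(self, name, attr.__get__(ob))

class IStack(Interface):
    def push(self, ob):
        self.append(ob)       # delegates to list.append for lists

    def pop(self):
        return list.pop(self)

mylist = []
mystack = IStack(mylist)
mystack.push(42)
mystack.pop()  # 42
```

As in the PEP's description, calling mystack.push(42) ends up invoking the interface's push with mylist as the first argument.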
Abstract and Concrete Methods
Note, by the way, that the @abstract decorator is not limited to use in interface definitions; it can be used anywhere that you wish to create an "empty" generic function that initially has no methods. In particular, it need not be used inside a class.
Also note that interface methods need not be abstract; one could, for example, write an interface like this:
class IWriteMapping(Interface):
@abstract
def __setitem__(self, key, value):
"""This has to be implemented"""
def update(self, other:IReadMapping):
for k, v in IReadMapping(other).items():
self[k] = v
As long as __setitem__ is defined for some type, the above interface will provide a usable update() implementation. However, if some specific type (or pair of types) has a more efficient way of handling update() operations, an appropriate overload can still be registered for use in that case.
Subclassing and Re-assembly
Interfaces can be subclassed:
class ISizedStack(IStack):
@abstract
def __len__(self):
"""Return the number of items on the stack"""
# define __len__ support for ISizedStack
when(ISizedStack.__len__, (list,))(list.__len__)
Or assembled by combining functions from existing interfaces:
class Sizable(Interface):
__len__ = ISizedStack.__len__
# list now implements Sizable as well as ISizedStack, without
# making any new declarations!
A class can be considered to "adapt to" an interface at a given point in time, if no method defined in the interface is guaranteed to raise a NoApplicableMethods error if invoked on an instance of that class at that point in time.
In normal usage, however, it is "easier to ask forgiveness than permission". That is, it is easier to simply use an interface on an object by adapting it to the interface (e.g. IStack(mylist)) or invoking interface methods directly (e.g. IStack.push(mylist, 42)), than to try to figure out whether the object is adaptable to (or directly implements) the interface.
Implementing an Interface in a Class
It is possible to declare that a class directly implements an interface, using the declare_implementation() function:
from overloading import declare_implementation
class Stack(object):
def __init__(self):
self.data = []
def push(self, ob):
self.data.append(ob)
def pop(self):
return self.data.pop()
declare_implementation(IStack, Stack)
The declare_implementation() call above is roughly equivalent to the following steps:
when(IStack.push, (Stack, object))(lambda self, ob: self.push(ob))
when(IStack.pop, (Stack,))(lambda self: self.pop())
That is, calling IStack.push() or IStack.pop() on an instance of any subclass of Stack, will simply delegate to the actual push() or pop() methods thereof.
For the sake of efficiency, calling IStack(s) where s is an instance of Stack, may return s rather than an IStack adapter. (Note that calling IStack(x) where x is already an IStack adapter will always return x unchanged; this is an additional optimization allowed in cases where the adaptee is known to directly implement the interface, without adaptation.)
For convenience, it may be useful to declare implementations in the class header, e.g.:
class Stack(metaclass=Implementer, implements=IStack):
...
Instead of calling declare_implementation() after the end of the suite.
Interfaces as Type Specifiers
Interface subclasses can be used as argument annotations to indicate what type of objects are acceptable to an overload, e.g.:
@overload
def traverse(g: IGraph, s: IStack):
g = IGraph(g)
s = IStack(s)
# etc....
Note, however, that the actual arguments are not changed or adapted in any way by the mere use of an interface as a type specifier. You must explicitly cast the objects to the appropriate interface, as shown above.
Note also that other patterns of interface use are possible. For example, other interface implementations might not support adaptation, or might require that function arguments already be adapted to the specified interface. So the exact semantics of using an interface as a type specifier are dependent on the interface objects you actually use.
For the interface objects defined by this PEP, however, the semantics are as described above. An interface I1 is considered "more specific" than another interface I2, if the set of descriptors in I1's inheritance hierarchy are a proper superset of the descriptors in I2's inheritance hierarchy.
So, for example, ISizedStack is more specific than both Sizable and IStack, irrespective of the inheritance relationships between these interfaces. It is purely a question of what operations are included within those interfaces -- and the names of the operations are unimportant.
Interfaces (at least the ones provided by overloading) are always considered less-specific than concrete classes. Other interface implementations can decide on their own specificity rules, both between interfaces and other interfaces, and between interfaces and classes.
Non-Method Attributes in Interfaces
The Interface implementation actually treats all attributes and methods (i.e. descriptors) in the same way: their __get__ (and __set__ and __delete__, if present) methods are called with the wrapped (adapted) object as "self". For functions, this has the effect of creating a bound method linking the generic function to the wrapped object.
For non-function attributes, it may be easiest to specify them using the property built-in, and the corresponding fget, fset, and fdel attributes:
class ILength(Interface):
@property
@abstract
def length(self):
"""Read-only length attribute"""
# ILength(aList).length == list.__len__(aList)
when(ILength.length.fget, (list,))(list.__len__)
Alternatively, methods such as _get_foo() and _set_foo() may be defined as part of the interface, and the property defined in terms of those methods, but this is a bit more difficult for users to implement correctly when creating a class that directly implements the interface, as they would then need to match all the individual method names, not just the name of the property or attribute.
Aspects
The adaptation system described above assumes that adapters are "stateless", which is to say that adapters have no attributes or state apart from that of the adapted object. This follows the "typeclass/instance" model of Haskell, and the concept of "pure" (i.e., transitively composable) adapters.
However, there are occasionally cases where, to provide a complete implementation of some interface, some sort of additional state is required.
One possibility of course, would be to attach monkeypatched "private" attributes to the adaptee. But this is subject to name collisions, and complicates the process of initialization (since any code using these attributes has to check for their existence and initialize them if necessary). It also doesn't work on objects that don't have a __dict__ attribute.
So the Aspect class is provided to make it easy to attach extra information to objects that either:
- have a __dict__ attribute (so aspect instances can be stored in it, keyed by aspect class),
- support weak referencing (so aspect instances can be managed using a global but thread-safe weak-reference dictionary), or
- implement or can be adapted to the overloading.IAspectOwner interface (technically, either of the first two conditions implies this).
Subclassing Aspect creates an adapter class whose state is tied to the life of the adapted object.
For example, suppose you would like to count all the times a certain method is called on instances of Target (a classic AOP example). You might do something like:
from overloading import Aspect
class Count(Aspect):
count = 0
@after(Target.some_method)
def count_after_call(self:Target, *args, **kw):
Count(self).count += 1
The above code will keep track of the number of times that Target.some_method() is successfully called on an instance of Target (i.e., it will not count errors unless they occur in a more-specific "after" method). Other code can then access the count using Count(someTarget).count.
Aspect instances can of course have __init__ methods, to initialize any data structures. They can use either __slots__ or dictionary-based attributes for storage.
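The weak-reference storage strategy described above can be sketched with a WeakKeyDictionary mapping each target object to its per-class aspect instances, so the extra state dies with the target. This is a hypothetical illustration; the classmethod spelling Count.of(ob) stands in for the Aspect(ob) constructor call used in the PEP's example:

```python
from weakref import WeakKeyDictionary

class Aspect:
    # target object -> {aspect class: aspect instance}
    _registry = WeakKeyDictionary()

    @classmethod
    def of(cls, ob):
        aspects = Aspect._registry.setdefault(ob, {})
        if cls not in aspects:
            aspects[cls] = cls()
        return aspects[cls]

class Count(Aspect):
    count = 0  # class default; incrementing creates an instance attribute

class Target:
    def some_method(self):
        # stand-in for an @after method incrementing the counter
        Count.of(self).count += 1

t = Target()
t.some_method()
t.some_method()
Count.of(t).count  # 2
```

Because the registry holds only weak references to targets, the aspect state is reclaimed automatically when the target is garbage-collected, avoiding the monkeypatching problems noted above.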
While this facility is rather primitive compared to a full-featured AOP tool like AspectJ, persons who wish to build pointcut libraries or other AspectJ-like features can certainly use Aspect objects and method-combination decorators as a base for building more expressive AOP tools.
- XXX spec out full aspect API, including keys, N-to-1 aspects, manual attach/detach/delete of aspect instances, and the IAspectOwner interface.
Extension API
TODO: explain how all of these work
implies(o1, o2)
declare_implementation(iface, class)
predicate_signatures(ob)
parse_rule(ruleset, body, predicate, actiontype, localdict, globaldict)
combine_actions(a1, a2)
rules_for(f)
Rule objects
ActionDef objects
RuleSet objects
Method objects
MethodList objects
IAspectOwner
Overloading Usage Patterns
In discussion on the Python-3000 list, the proposed feature of allowing arbitrary functions to be overloaded has been somewhat controversial, with some people expressing concern that this would make programs more difficult to understand.
The general thrust of this argument is that one cannot rely on what a function does, if it can be changed from anywhere in the program at any time. Even though in principle this can already happen through monkeypatching or code substitution, it is considered poor practice to do so.
However, providing support for overloading any function (or so the argument goes), is implicitly blessing such changes as being an acceptable practice.
This argument appears to make sense in theory, but it is almost entirely mooted in practice for two reasons.
First, people are generally not perverse, defining a function to do one thing in one place, and then summarily defining it to do the opposite somewhere else! The principal reasons to extend the behavior of a function that has not been specifically made generic are to:
- Add special cases not contemplated by the original function's author, such as support for additional types.
- Be notified of an action in order to cause some related operation to be performed, either before the original operation is performed, after it, or both. This can include general-purpose operations like adding logging, timing, or tracing, as well as application-specific behavior.
None of these reasons for adding overloads imply any change to the intended default or overall behavior of the existing function, however. Just as a base class method may be overridden by a subclass for these same two reasons, so too may a function be overloaded to provide for such enhancements.
In other words, universal overloading does not equal arbitrary overloading, in the sense that we need not expect people to randomly redefine the behavior of existing functions in illogical or unpredictable ways. If they did so, it would be no less of a bad practice than any other way of writing illogical or unpredictable code!
However, to distinguish bad practice from good, it is perhaps necessary to clarify further what good practice for defining overloads is. And that brings us to the second reason why generic functions do not necessarily make programs harder to understand: overloading patterns in actual programs tend to follow very predictable patterns. (Both in Python and in languages that have no non-generic functions.)
If a module is defining a new generic operation, it will usually also define any required overloads for existing types in the same place. Likewise, if a module is defining a new type, then it will usually define overloads there for any generic functions that it knows or cares about.
As a result, the vast majority of overloads can be found adjacent to either the function being overloaded, or to a newly-defined type for which the overload is adding support. Thus, overloads are highly discoverable in the common case, as you are either looking at the function or the type, or both.
It is only in rather infrequent cases that one will have overloads in a module that contains neither the function nor the type(s) for which the overload is added. This would be the case if, say, a third-party created a bridge of support between one library's types and another library's generic function(s). In such a case, however, best practice suggests prominently advertising this, especially by way of the module name.
For example, PyProtocols defines such bridge support for working with Zope interfaces and legacy Twisted interfaces, using modules called protocols.twisted_support and protocols.zope_support. (These bridges are done with interface adapters, rather than generic functions, but the basic principle is the same.)
In short, understanding programs in the presence of universal overloading need not be any more difficult, given that the vast majority of overloads will either be adjacent to a function, or the definition of a type that is passed to that function.
And, in the absence of incompetence or deliberate intention to be obscure, the few overloads that are not adjacent to the relevant type(s) or function(s), will generally not need to be understood or known about outside the scope where those overloads are defined. (Except in the "support modules" case, where best practice suggests naming them accordingly.)
Implementation Notes
Most of the functionality described in this PEP is already implemented in the in-development version of the PEAK-Rules framework. In particular, the basic overloading and method combination framework (minus the @overload decorator) already exists there. The implementation of all of these features in peak.rules.core is 656 lines of Python at this writing.
peak.rules.core currently relies on the DecoratorTools and BytecodeAssembler modules, but both of these dependencies can be replaced, as DecoratorTools is used mainly for Python 2.3 compatibility and to implement structure types (which can be done with named tuples in later versions of Python). The use of BytecodeAssembler can be replaced using an "exec" or "compile" workaround, given a reasonable effort. (It would be easier to do this if the func_closure attribute of function objects was writable.)
The Interface class has been previously prototyped, but is not included in PEAK-Rules at the present time.
The "implicit class rule" has previously been implemented in the RuleDispatch library. However, it relies on the __metaclass__ hook that is currently eliminated in PEP 3115.
I don't currently know how to make @overload play nicely with classmethod and staticmethod in class bodies. It's not really clear if it needs to, however.
Copyright
This document has been placed in the public domain.
pep-3125 Remove Backslash Continuation
| PEP: | 3125 |
|---|---|
| Title: | Remove Backslash Continuation |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Jim J. Jewett <JimJJewett at gmail.com> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 29-Apr-2007 |
| Post-History: | 29-Apr-2007, 30-Apr-2007, 04-May-2007 |
Contents
Rejection Notice
This PEP is rejected. There wasn't enough support in favor, the feature to be removed isn't all that harmful, and there are some use cases that would become harder.
Abstract
Python initially inherited its parsing from C. While this has been generally useful, there are some remnants which have been less useful for Python, and should be eliminated.
This PEP proposes elimination of terminal \ as a marker for line continuation.
Motivation
One goal for Python 3000 should be to simplify the language by removing unnecessary or duplicated features. There are currently several ways to indicate that a logical line is continued on the following physical line.
The other continuation methods are easily explained as a logical consequence of the semantics they provide; \ is simply an escape character that needs to be memorized.
Existing Line Continuation Methods
Parenthetical Expression - ([{}])
Open a parenthetical expression. It doesn't matter whether people view the "line" as continuing; they do immediately recognize that the expression needs to be closed before the statement can end.
Examples using each of (), [], and {}:
def fn(long_argname1,
long_argname2):
settings = {"background": "random noise",
"volume": "barely audible"}
restrictions = ["Warrantee void if used",
"Notice must be received by yesterday",
"Not responsible for sales pitch"]
Note that it is always possible to parenthesize an expression, but it can seem odd to parenthesize an expression that needs parentheses only for the line break:
assert val>4, (
"val is too small")
Triple-Quoted Strings
Open a triple-quoted string; again, people recognize that the string needs to finish before the next statement starts.
banner_message = """
Satisfaction Guaranteed,
or DOUBLE YOUR MONEY BACK!!!
some minor restrictions apply"""
Terminal \ in the general case
A terminal \ indicates that the logical line is continued on the following physical line (after whitespace). There are no particular semantics associated with this. This form is never required, although it may look better (particularly for people with a C language background) in some cases:
>>> assert val>4, \
"val is too small"
Also note that the \ must be the final character in the line. If your editor navigation can add whitespace to the end of a line, that invisible change will alter the semantics of the program. Fortunately, the typical result is only a syntax error, rather than a runtime bug:
>>> assert val>4, \
"val is too small"
SyntaxError: unexpected character after line continuation character
This PEP proposes to eliminate this redundant and potentially confusing alternative.
Terminal \ within a string
A terminal \ within a single-quoted string, at the end of the line. This is arguably a special case of the terminal \, but it is a special case that may be worth keeping.
>>> "abd\
 def"
'abd def'
- Pro: Many of the objections to removing \ termination were really just objections to removing it within literal strings; several people clarified that they want to keep this literal-string usage, but don't mind losing the general case.
- Pro: The use of \ for an escape character within strings is well known.
- Contra: But note that this particular usage is odd, because the escaped character (the newline) is invisible, and the special treatment is to delete the character. That said, the \ of \(newline) is still an escape which changes the meaning of the following character.
Alternate Proposals
Several people have suggested alternative ways of marking the line end. Most of these were rejected for not actually simplifying things.
The one exception was to let any unfinished expression signify a line continuation, possibly in conjunction with increased indentation.
This is attractive because it is a generalization of the rule for parentheses.
The initial objections to this were:
The amount of whitespace may be contentious; expression continuation should not be confused with opening a new suite.
The "expression continuation" markers are not as clearly marked in Python as the grouping punctuation "(), [], {}" marks are:
# Plus needs another operand, so the line continues
"abc" +
"def"

# String ends an expression, so the line does not
# continue.  The next line is a syntax error because
# unary plus does not apply to strings.
"abc"
+ "def"

Guido objected for technical reasons. [1] The most obvious implementation would require allowing INDENT or DEDENT tokens anywhere, or at least in a widely expanded (and ill-defined) set of locations. While this is of concern only for the internal parsing mechanism (rather than for users), it would be a major new source of complexity.
Andrew Koenig then pointed out [2] a better implementation strategy, and said that it had worked quite well in other languages. [3] The improved suggestion boiled down to:
The whitespace that follows an (operator or) open bracket or parenthesis can include newline characters.
It would be implemented at a very low lexical level -- even before the decision is made to turn a newline followed by spaces into an INDENT or DEDENT token.
There is still some concern that it could mask bugs, as in this example [4]:
# Used to be y+1, the 1 got dropped.  Syntax Error (today)
# would become nonsense.
x = y+
f(x)
Requiring that the continuation be indented more than the initial line would add both safety and complexity.
Open Issues
- Should \-continuation be removed even inside strings?
- Should the continuation markers be expanded from just ([{}]) to include lines ending with an operator?
- As a safety measure, should the continuation line be required to be more indented than the initial line?
References
| [1] | (email subject) PEP 30XZ: Simplified Parsing, van Rossum http://mail.python.org/pipermail/python-3000/2007-April/007063.html |
| [2] | (email subject) PEP-3125 -- remove backslash continuation, Koenig http://mail.python.org/pipermail/python-3000/2007-May/007237.html |
| [3] | The Snocone Programming Language, Koenig http://www.snobol4.com/report.htm |
| [4] | (email subject) PEP-3125 -- remove backslash continuation, van Rossum http://mail.python.org/pipermail/python-3000/2007-May/007244.html |
Copyright
This document has been placed in the public domain.
pep-3126 Remove Implicit String Concatenation
| PEP: | 3126 |
|---|---|
| Title: | Remove Implicit String Concatenation |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Jim J. Jewett <JimJJewett at gmail.com>, Raymond Hettinger <python at rcn.com> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 29-Apr-2007 |
| Post-History: | 29-Apr-2007, 30-Apr-2007, 07-May-2007 |
Contents
Rejection Notice
This PEP is rejected. There wasn't enough support in favor, the feature to be removed isn't all that harmful, and there are some use cases that would become harder.
Abstract
Python inherited many of its parsing rules from C. While this has been generally useful, there are some individual rules which are less useful for Python, and should be eliminated.
This PEP proposes to eliminate implicit string concatenation based only on the adjacency of literals.
Instead of:
"abc" "def" == "abcdef"
authors will need to be explicit, and either add the strings:
"abc" + "def" == "abcdef"
or join them:
"".join(["abc", "def"]) == "abcdef"
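A quick sanity check that all three spellings produce the same string:

```python
# Implicit adjacency, explicit +, and join all build the same object:
implicit = "abc" "def"
explicit = "abc" + "def"
joined = "".join(["abc", "def"])
assert implicit == explicit == joined == "abcdef"
```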
Motivation
One goal for Python 3000 should be to simplify the language by removing unnecessary features. Implicit string concatenation should be dropped in favor of existing techniques. This will simplify the grammar and simplify a user's mental picture of Python. The latter is important for letting the language "fit in your head". A large group of current users do not even know about implicit concatenation. Of those who do know about it, a large portion never use it or habitually avoid it. Of those who both know about it and use it, very few could state with confidence the implicit operator precedence and under what circumstances it is computed when the definition is compiled versus when it is run.
History or Future
Many Python parsing rules are intentionally compatible with C. This is a useful default, but special cases need to be justified based on their utility in Python. We should no longer assume that Python programmers will also be familiar with C, so compatibility between languages should be treated as a tie-breaker, rather than a justification.
In C, implicit concatenation is the only way to join strings without using a (run-time) function call to store into a variable. In Python, the strings can be joined (and still recognized as immutable) using more standard Python idioms, such as + or "".join.
Problem
Implicit string concatenation leads to tuples and lists which are shorter than they appear; this in turn can lead to confusing, or even silent, errors. For example, given a function which accepts several parameters, but offers a default value for some of them:
def f(fmt, *args):
print fmt % args
This looks like a valid call, but isn't:
>>> f("User %s got a message %s",
"Bob"
"Time for dinner")
Traceback (most recent call last):
File "<pyshell#8>", line 2, in <module>
"Bob"
File "<pyshell#3>", line 2, in f
print fmt % args
TypeError: not enough arguments for format string
Calls to this function can silently do the wrong thing:
def g(arg1, arg2=None):
...
# silently transformed into the possibly very different
# g("arg1 on this linearg2 on this line", None)
g("arg1 on this line"
"arg2 on this line")
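A minimal sketch of this silent failure, using a hypothetical g that returns its arguments so the merged string is visible:

```python
def g(arg1, arg2=None):
    # Return the arguments so we can inspect what was actually passed.
    return (arg1, arg2)

# The missing comma merges the two literals into a single argument,
# and arg2 silently keeps its default value:
result = g("arg1 on this line"
           "arg2 on this line")
assert result == ("arg1 on this linearg2 on this line", None)
```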
To quote Jason Orendorff [1]:
Oh. I just realized this happens a lot out here. Where I work, we use scons, and each SConscript has a long list of filenames:
sourceFiles = [
    'foo.c'
    'bar.c',
    #...many lines omitted...
    'q1000x.c']

It's a common mistake to leave off a comma, and then scons complains that it can't find 'foo.cbar.c'. This is pretty bewildering behavior even if you are a Python programmer, and not everyone here is.
Solution
In Python, strings are objects and they support the __add__ operator, so it is possible to write:
"abc" + "def"
Because these are literals, this addition can still be optimized away by the compiler; the CPython compiler already does so. [2]
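This constant folding can be observed by compiling the expression and inspecting the resulting code object's constants (behavior of current CPython, which the paragraph above notes):

```python
# CPython's compiler folds the literal addition into a single
# constant, so no concatenation happens at run time:
code = compile('s = "abc" + "def"', "<example>", "exec")
assert "abcdef" in code.co_consts
```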
Other existing alternatives include multiline (triple-quoted) strings, and the join method:
"""This string
extends across
multiple lines, but you may want to use something like
textwrap.dedent
to clear out the leading spaces
and/or reformat.
"""

>>> "".join(["empty", "string", "joiner"]) == "emptystringjoiner"
True
>>> " ".join(["space", "string", "joiner"]) == "space string joiner"
True
>>> "\n".join(["multiple", "lines"]) == "multiple\nlines" == (
... """multiple
... lines""")
True
Concerns
Operator Precedence
Guido indicated [2] that this change should be handled by PEP, because there were a few edge cases with other string operators, such as the %. (Assuming that str % stays -- it may be eliminated in favor of PEP 3101 -- Advanced String Formatting. [3] [4])
The resolution is to use parentheses to enforce precedence -- the same solution that can be used today:
# Clearest, works today, continues to work, optimization is
# already possible.
("abc %s def" + "ghi") % var
# Already works today; precedence makes the optimization more
# difficult to recognize, but does not change the semantics.
"abc" + "def %s ghi" % var
as opposed to:
# Already fails because modulus (%) is higher precedence than
# addition (+)
("abc %s def" + "ghi" % var)
# Works today only because adjacency is higher precedence than
# modulus. This will no longer be available.
"abc %s" "def" % var
# So the 2to3 translator can automatically replace it with the
# (already valid):
("abc %s" + "def") % var
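Since this PEP was ultimately rejected, both spellings remain available, and the precedence relationships above can be checked directly (a minimal sketch):

```python
var = "X"
# Adjacency binds more tightly than %, so these two agree today:
assert "abc %s" "def" % var == "abc Xdef"
assert ("abc %s" + "def") % var == "abc Xdef"
# Plain + binds less tightly than %, which is the failing case above:
assert "abc" + "def %s ghi" % var == "abcdef X ghi"
```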
Long Commands
... build up (what I consider to be) readable SQL queries [5]:
rows = self.executesql("select cities.city, state, country"
                       " from cities, venues, events, addresses"
                       " where cities.city like %s"
                       " and events.active = 1"
                       " and venues.address = addresses.id"
                       " and addresses.city = cities.id"
                       " and events.venue = venues.id",
                       (city,))
Alternatives again include triple-quoted strings, +, and .join:
query="""select cities.city, state, country
from cities, venues, events, addresses
where cities.city like %s
and events.active = 1
and venues.address = addresses.id
and addresses.city = cities.id
and events.venue = venues.id"""
query=( "select cities.city, state, country"
+ " from cities, venues, events, addresses"
+ " where cities.city like %s"
+ " and events.active = 1"
+ " and venues.address = addresses.id"
+ " and addresses.city = cities.id"
+ " and events.venue = venues.id"
)
query="\n".join(["select cities.city, state, country",
" from cities, venues, events, addresses",
" where cities.city like %s",
" and events.active = 1",
" and venues.address = addresses.id",
" and addresses.city = cities.id",
" and events.venue = venues.id"])
# And yes, you *could* inline any of the above querystrings
# the same way the original was inlined.
rows = self.executesql(query, (city,))
Regular Expressions
Complex regular expressions are sometimes stated in terms of several implicitly concatenated strings with each regex component on a different line and followed by a comment. The plus operator can be inserted here but it does make the regex harder to read. One alternative is to use the re.VERBOSE option. Another alternative is to build-up the regex with a series of += lines:
# Existing idiom which relies on implicit concatenation
r = ('a{20}' # Twenty A's
'b{5}' # Followed by Five B's
)
# Mechanical replacement
r = ('a{20}' +   # Twenty A's
'b{5}' # Followed by Five B's
)
# already works today
r = '''a{20} # Twenty A's
b{5} # Followed by Five B's
''' # Compiled with the re.VERBOSE flag
# already works today
r = 'a{20}' # Twenty A's
r += 'b{5}' # Followed by Five B's
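The re.VERBOSE alternative mentioned above works today; a minimal sketch:

```python
import re

# re.VERBOSE lets the pattern itself carry whitespace and comments,
# so no string concatenation of any kind is needed:
pattern = re.compile(r'''
    a{20}   # Twenty A's
    b{5}    # Followed by five B's
''', re.VERBOSE)
assert pattern.match("a" * 20 + "b" * 5)
```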
Internationalization
Some internationalization tools -- notably xgettext -- have already been special-cased for implicit concatenation, but not for Python's explicit concatenation. [6]
These tools will fail to extract the (already legal):
_("some string" +
" and more of it")
but often have a special case for:
_("some string"
" and more of it")
It should also be possible to just use an overly long line (xgettext limits messages to 2048 characters [8], which is less than Python's enforced limit) or triple-quoted strings, but these solutions sacrifice some readability in the code:
# Lines over a certain length are unpleasant.
_("some string and more of it")
# Changing whitespace is not ideal.
_("""Some string
and more of it""")
_("Some string \
and more of it")
I do not see a good short-term resolution for this.
Transition
The proposed new constructs are already legal in current Python, and can be used immediately.
The 2 to 3 translator can be made to mechanically change:
"str1" "str2"
("line1" #comment
"line2")
into:
("str1" + "str2")
("line1" + #comment
"line2")
If users want to use one of the other idioms, they can; as these idioms are all already legal in python 2, the edits can be made to the original source, rather than patching up the translator.
Open Issues
Is there a better way to support external text extraction tools, or at least xgettext [7] in particular?
References
| [1] | Implicit String Concatenation, Orendorff http://mail.python.org/pipermail/python-ideas/2007-April/000397.html |
| [2] | (1, 2) Reminder: Py3k PEPs due by April, Hettinger, van Rossum http://mail.python.org/pipermail/python-3000/2007-April/006563.html |
| [3] | PEP 3101, Advanced String Formatting, Talin http://www.python.org/dev/peps/pep-3101/ |
| [4] | ps to question Re: Need help completing ABC pep, van Rossum http://mail.python.org/pipermail/python-3000/2007-April/006737.html |
| [5] | (email Subject) PEP 30XZ: Simplified Parsing, Skip, http://mail.python.org/pipermail/python-3000/2007-May/007261.html |
| [6] | (email Subject) PEP 30XZ: Simplified Parsing http://mail.python.org/pipermail/python-3000/2007-May/007305.html |
| [7] | GNU gettext manual http://www.gnu.org/software/gettext/ |
| [8] | Unix man page for xgettext -- Notes section http://www.scit.wlv.ac.uk/cgi-bin/mansec?1+xgettext |
Copyright
This document has been placed in the public domain.
pep-3127 Integer Literal Support and Syntax
| PEP: | 3127 |
|---|---|
| Title: | Integer Literal Support and Syntax |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Patrick Maupin <pmaupin at gmail.com> |
| Discussions-To: | Python-3000 at python.org |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 14-Mar-2007 |
| Python-Version: | 3.0 |
| Post-History: | 18-Mar-2007 |
Abstract
This PEP proposes changes to the Python core to rationalize the treatment of string literal representations of integers in different radices (bases). These changes are targeted at Python 3.0, but the backward-compatible parts of the changes should be added to Python 2.6, so that all valid 3.0 integer literals will also be valid in 2.6.
The proposal is that:
- octal literals must now be specified with a leading "0o" or "0O" instead of "0";
- binary literals are now supported via a leading "0b" or "0B"; and
- provision will be made for binary numbers in string formatting.
Motivation
This PEP was motivated by two different issues:
- The default octal representation of integers is silently confusing to people unfamiliar with C-like languages. It is extremely easy to inadvertently create an integer object with the wrong value, because '013' means 'decimal 11', not 'decimal 13', to the Python language itself, which is not the meaning that most humans would assign to this literal.
- Some Python users have a strong desire for binary support in the language.
Specification
Grammar specification
The grammar will be changed. For Python 2.6, the changed and new token definitions will be:
integer ::= decimalinteger | octinteger | hexinteger |
bininteger | oldoctinteger
octinteger ::= "0" ("o" | "O") octdigit+
bininteger ::= "0" ("b" | "B") bindigit+
oldoctinteger ::= "0" octdigit+
bindigit ::= "0" | "1"
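A few spot checks of the literal forms this grammar admits (valid from Python 2.6 onward):

```python
# Both cases of the prefix letter are accepted, per the grammar above:
assert 0o17 == 0O17 == 15
assert 0b101 == 0B101 == 5
assert 0b0 == 0 and 0b1 == 1
```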
For Python 3.0, "oldoctinteger" will not be supported, and an exception will be raised if a literal has a leading "0" and a second character which is a digit.
For both versions, this will require changes to PyLong_FromString as well as the grammar.
The documentation will have to be changed as well: grammar.txt, as well as the integer literal section of the reference manual.
PEP 306 should be checked for other issues, and that PEP should be updated if the procedure described therein is insufficient.
int() specification
int(s, 0) will also match the new grammar definition.
This should happen automatically with the changes to PyLong_FromString required for the grammar change.
Also the documentation for int() should be changed to explain that int(s) operates identically to int(s, 10), and the word "guess" should be removed from the description of int(s, 0).
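The intended int() behavior can be sketched as follows (as it was in fact implemented in 2.6/3.0):

```python
# int(s, 0) follows literal syntax; int(s) is always base 10:
assert int("0o17", 0) == 15
assert int("0b101", 0) == 5
assert int("017") == 17        # base 10; leading zeros are harmless here
try:
    int("017", 0)              # old-style octal literal: rejected
    rejected = False
except ValueError:
    rejected = True
assert rejected
```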
long() specification
For Python 2.6, the long() implementation and documentation should be changed to reflect the new grammar.
Tokenizer exception handling
If an invalid token contains a leading "0", the exception error message should be more informative than the current "SyntaxError: invalid token". It should explain that decimal numbers may not have a leading zero, and that octal numbers require an "o" after the leading zero.
int() exception handling
The ValueError raised for any call to int() with a string should at least explicitly contain the base in the error message, e.g.:
ValueError: invalid literal for base 8 int(): 09
oct() function
oct() should be updated to output '0o' in front of the octal digits (for 3.0, and 2.6 compatibility mode).
Output formatting
In 3.0, the string % operator alternate syntax for the 'o' option will need to be updated to add '0o' in front, instead of '0'. In 2.6, alternate octal formatting will continue to add only '0'. In neither 2.6 nor 3.0 will the % operator support binary output. This is because binary output is already supported by PEP 3101 (str.format), which is the preferred string formatting method.
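Under Python 3, the resulting output behavior looks like this (a sketch; 2.6 kept the old '0'-only alternate form for %):

```python
# oct() and bin() emit the new prefixes:
assert oct(511) == "0o777"
assert bin(10) == "0b1010"
# str.format (PEP 3101) handles binary; % does not:
assert "{0:b}".format(10) == "1010"
# The % operator's 'o' alternate form emits '0o' in Python 3:
assert "%#o" % 8 == "0o10"
assert "%o" % 8 == "10"
```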
Transition from 2.6 to 3.0
The 2to3 translator will have to insert 'o' into any octal string literal.
The Py3K compatible option to Python 2.6 should cause attempts to use oldoctinteger literals to raise an exception.
Rationale
Most of the discussion on these issues occurred on the Python-3000 mailing list starting 14-Mar-2007, prompted by an observation that the average human being would be completely mystified upon finding that prepending a "0" to a string of digits changes the meaning of that digit string entirely.
It was pointed out during this discussion that a similar, but shorter, discussion on the subject occurred in January of 2006, prompted by a discovery of the same issue.
Background
For historical reasons, Python's string representation of integers in different bases (radices), for string formatting and token literals, borrows heavily from C. [1] [2] Usage has shown that the historical method of specifying an octal number is confusing, and also that it would be nice to have additional support for binary literals.
Throughout this document, unless otherwise noted, discussions about the string representation of integers relate to these features:
Literal integer tokens, as used by normal module compilation, by eval(), and by int(token, 0). (int(token) and int(token, 2-36) are not modified by this proposal.)
- Under 2.6, long() is treated the same as int()
Formatting of integers into strings, either via the % string operator or the new PEP 3101 advanced string formatting method.
It is presumed that:
- All of these features should have an identical set of supported radices, for consistency.
- Python source code syntax and int(mystring, 0) should continue to share identical behavior.
Removal of old octal syntax
This PEP proposes that the ability to specify an octal number by using a leading zero will be removed from the language in Python 3.0 (and the Python 3.0 preview mode of 2.6), and that a SyntaxError will be raised whenever a leading "0" is immediately followed by another digit.
During the present discussion, it was almost universally agreed that:
eval('010') == 8
should no longer be true, because that is confusing to new users. It was also proposed that:
eval('0010') == 10
should become true, but that is much more contentious, because it is so inconsistent with usage in other computer languages that mistakes are likely to be made.
Almost all currently popular computer languages, including C/C++, Java, Perl, and JavaScript, treat a sequence of digits with a leading zero as an octal number. Proponents of treating these numbers as decimal instead have a very valid point -- as discussed in Supported radices, below, the entire non-computer world uses decimal numbers almost exclusively. There is ample anecdotal evidence that many people are dismayed and confused if they are confronted with non-decimal radices.
However, in most situations, most people do not write gratuitous zeros in front of their decimal numbers. The primary exception is when an attempt is being made to line up columns of numbers. But since PEP 8 specifically discourages the use of spaces to try to align Python code, one would suspect the same argument should apply to the use of leading zeros for the same purpose.
Finally, although the email discussion often focused on whether anybody actually uses octal any more, and whether we should cater to those old-timers in any case, that is almost entirely beside the point.
Assume the rare complete newcomer to computing who does, either occasionally or as a matter of habit, use leading zeros for decimal numbers. Python could either:
- silently do the wrong thing with his numbers, as it does now;
- immediately disabuse him of the notion that this is viable syntax (and yes, the SyntaxWarning should be more gentle than it currently is, but that is a subject for a different PEP); or
- let him continue to think that computers are happy with multi-digit decimal integers which start with "0".
Some people passionately believe that (c) is the correct answer, and they would be absolutely right if we could be sure that new users will never blossom and grow and start writing AJAX applications.
So while a new Python user may (currently) be mystified at the delayed discovery that his numbers don't work properly, we can fix it by explaining to him immediately that Python doesn't like leading zeros (hopefully with a reasonable message!), or we can delegate this teaching experience to the JavaScript interpreter in the Internet Explorer browser, and let him try to debug his issue there.
Supported radices
This PEP proposes that the supported radices for the Python language will be 2, 8, 10, and 16.
Once it is agreed that the old syntax for octal (radix 8) representation of integers must be removed from the language, the next obvious question is "Do we actually need a way to specify (and display) numbers in octal?"
This question is quickly followed by "What radices does the language need to support?" Because computers are so adept at doing what you tell them to, a tempting answer in the discussion was "all of them." This answer has obviously been given before -- the int() constructor will accept an explicit radix with a value between 2 and 36, inclusive, with the latter number bearing a suspicious arithmetic similarity to the sum of the number of numeric digits and the number of same-case letters in the ASCII alphabet.
But the best argument for inclusion will have a use-case to back it up, so the idea of supporting all radices was quickly rejected, and the only radices left with any real support were decimal, hexadecimal, octal, and binary.
Just because a particular radix has a vocal supporter on the mailing list does not mean that it really should be in the language, so the rest of this section is a treatise on the utility of these particular radices, vs. other possible choices.
Humans use other numeric bases constantly. If I tell you that it is 12:30 PM, I have communicated quantitative information arguably composed of three separate bases (12, 60, and 2), only one of which is in the "agreed" list above. But the communication of that information used two decimal digits each for the base 12 and base 60 information, and, perversely, two letters for information which could have fit in a single decimal digit.
So, in general, humans communicate "normal" (non-computer) numerical information either via names (AM, PM, January, ...) or via use of decimal notation. Obviously, names are seldom used for large sets of items, so decimal is used for everything else. There are studies which attempt to explain why this is so, typically reaching the expected conclusion that the Arabic numeral system is well-suited to human cognition. [3]
There is even support in the history of the design of computers to indicate that decimal notation is the correct way for computers to communicate with humans. One of the first modern computers, ENIAC [4] computed in decimal, even though there were already existing computers which operated in binary.
Decimal computer operation was important enough that many computers, including the ubiquitous PC, have instructions designed to operate on "binary coded decimal" (BCD) [5], a representation which devotes 4 bits to each decimal digit. These instructions date from a time when the most strenuous calculations ever performed on many numbers were the calculations actually required to perform textual I/O with them. It is possible to display BCD without having to perform a divide/remainder operation on every displayed digit, and this was a huge computational win when most hardware didn't have fast divide capability. Another factor contributing to the use of BCD is that, with BCD calculations, rounding will happen exactly the same way that a human would do it, so BCD is still sometimes used in fields like finance, despite the computational and storage superiority of binary.
So, if it weren't for the fact that computers themselves normally use binary for efficient computation and data storage, string representations of integers would probably always be in decimal.
Unfortunately, computer hardware doesn't think like humans, so programmers and hardware engineers must often resort to thinking like the computer, which means that it is important for Python to have the ability to communicate binary data in a form that is understandable to humans.
The requirement that the binary data notation must be cognitively easy for humans to process means that it should contain an integral number of binary digits (bits) per symbol, while otherwise conforming quite closely to the standard tried-and-true decimal notation (position indicates power, larger magnitude on the left, not too many symbols in the alphabet, etc.).
The obvious "sweet spot" for this binary data notation is thus octal, which packs the largest integral number of bits possible into a single symbol chosen from the Arabic numeral alphabet.
In fact, some computer architectures, such as the PDP8 and the 8080/Z80, were defined in terms of octal, in the sense of arranging the bitfields of instructions in groups of three, and using octal representations to describe the instruction set.
Even today, octal is important because of bit-packed structures which consist of 3 bits per field, such as Unix file permission masks.
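For example, a Unix permission mask decodes naturally in octal, one digit per three-bit field (a sketch using the stdlib stat module):

```python
import stat

# 0o754 reads off directly as rwx r-x r--, one octal digit per class:
mode = 0o754
assert mode & stat.S_IRWXU == stat.S_IRWXU                 # owner: rwx
assert mode & stat.S_IRWXG == stat.S_IRGRP | stat.S_IXGRP  # group: r-x
assert mode & stat.S_IRWXO == stat.S_IROTH                 # other: r--
```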
But octal has a drawback when used for larger numbers. The number of bits per symbol, while integral, is not itself a power of two. This limitation (given that the word size of most computers these days is a power of two) has resulted in hexadecimal, which is more popular than octal despite the fact that it requires a 60% larger alphabet than decimal, because each symbol contains 4 bits.
Some numbers, such as Unix file permission masks, are easily decoded by humans when represented in octal, but difficult to decode in hexadecimal, while other numbers are much easier for humans to handle in hexadecimal.
Unfortunately, there are also binary numbers used in computers which are not very well communicated in either hexadecimal or octal. Thankfully, fewer people have to deal with these on a regular basis, but on the other hand, this means that several people on the discussion list questioned the wisdom of adding a straight binary representation to Python.
One example of where these numbers is very useful is in reading and writing hardware registers. Sometimes hardware designers will eschew human readability and opt for address space efficiency, by packing multiple bit fields into a single hardware register at unaligned bit locations, and it is tedious and error-prone for a human to reconstruct a 5 bit field which consists of the upper 3 bits of one hex digit, and the lower 2 bits of the next hex digit.
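A sketch of such a field extraction, using a hypothetical register value, where a binary literal makes the mask and the field boundaries obvious in a way hex digits do not:

```python
reg = 0b11011011       # hypothetical 8-bit register value (0xDB)
# A 5-bit field at bit positions 2..6 straddles the hex-digit
# boundary, but is trivial to express with a binary mask:
field = (reg >> 2) & 0b11111
assert field == 0b10110
```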
Even if the ability of Python to communicate binary information to humans is only useful for a small technical subset of the population, it is exactly that population subset which contains most, if not all, members of the Python core team, so even straight binary, the least useful of these notations, has several enthusiastic supporters and few, if any, staunch opponents, among the Python community.
Syntax for supported radices
This proposal is to use a "0o" prefix with either uppercase or lowercase "o" for octal, and a "0b" prefix with either uppercase or lowercase "b" for binary.
There was strong support for not supporting uppercase, but this is a separate subject for a different PEP, as 'j' for complex numbers, 'e' for exponent, and 'r' for raw string (to name a few) already support uppercase.
The syntax for delimiting the different radices received a lot of attention in the discussion on Python-3000. There are several (sometimes conflicting) requirements and "nice-to-haves" for this syntax:
- It should be as compatible with other languages and previous versions of Python as is reasonable, both for the input syntax and for the output (e.g. string % operator) syntax.
- It should be as obvious to the casual observer as possible.
- It should be easy to visually distinguish integers formatted in the different bases.
Proposed syntaxes included things like arbitrary radix prefixes, such as 16r100 (256 in hexadecimal), and radix suffixes, similar to the 100h assembler-style suffix. The debate on whether the letter "O" could be used for octal was intense -- an uppercase "O" looks suspiciously similar to a zero in some fonts. Suggestions were made to use a "c" (the second letter of "oCtal"), or even to use a "t" for "ocTal" and an "n" for "biNary" to go along with the "x" for "heXadecimal".
For the string % operator, "o" was already being used to denote octal. Binary formatting is not being added to the % operator because PEP 3101 (Advanced String Formatting) already supports binary, and % formatting will be deprecated in the future.
At the end of the day, since uppercase "O" can look like a zero and uppercase "B" can look like an 8, it was decided that these prefixes should be lowercase only, but, like 'r' for raw string, that can be a preference or style-guide issue.
Open Issues
It was suggested in the discussion that lowercase should be used for all numeric and string special modifiers, such as 'x' for hexadecimal, 'r' for raw strings, 'e' for exponentiation, and 'j' for complex numbers. This is an issue for a separate PEP.
This PEP takes no position on uppercase or lowercase for input, just noting that, for consistency, if uppercase is not to be removed from input parsing for other letters, it should be added for octal and binary, and documenting the changes under this assumption, as there is not yet a PEP about the case issue.
Output formatting may be a different story -- there is already ample precedence for case sensitivity in the output format string, and there would need to be a consensus that there is a valid use-case for the "alternate form" of the string % operator to support uppercase 'B' or 'O' characters for binary or octal output. Currently, PEP 3101 does not even support this alternate capability, and the hex() function does not allow the programmer to specify the case of the 'x' character.
There are still some strong feelings that '0123' should be allowed as a literal decimal in Python 3.0. If this is the right thing to do, this can easily be covered in an additional PEP. This proposal only takes the first step of making '0123' not be a valid octal number, for reasons covered in the rationale.
Is there (or should there be) an option for the 2to3 translator which only makes the 2.6 compatible changes? Should this be run on 2.6 library code before the 2.6 release?
Should a bin() function which matches hex() and oct() be added?
Is hex() really that useful once we have advanced string formatting?
References
| [1] | GNU libc manual printf integer format conversions (http://www.gnu.org/software/libc/manual/html_node/Integer-Conversions.html) |
| [2] | Python string formatting operations (http://docs.python.org/library/stdtypes.html#string-formatting-operations) |
| [3] | The Representation of Numbers, Jiajie Zhang and Donald A. Norman (http://acad88.sahs.uth.tmc.edu/research/publications/Number-Representation.pdf) |
| [4] | ENIAC page at wikipedia (http://en.wikipedia.org/wiki/ENIAC) |
| [5] | BCD page at wikipedia (http://en.wikipedia.org/wiki/Binary-coded_decimal) |
Copyright
This document has been placed in the public domain.
pep-3128 BList: A Faster List-like Type
| PEP: | 3128 |
|---|---|
| Title: | BList: A Faster List-like Type |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Daniel Stutzbach <daniel at stutzbachenterprises.com> |
| Discussions-To: | Python 3000 List <python-3000 at python.org> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 30-Apr-2007 |
| Python-Version: | 2.6 and/or 3.0 |
| Post-History: | 30-Apr-2007 |
Rejection Notice
Rejected based on Raymond Hettinger's sage advice [4]:
After looking at the source, I think this has almost zero chance for replacing list(). There is too much value in a simple C API, low space overhead for small lists, good performance in common use cases, and having performance that is easily understood. The BList implementation lacks these virtues and it trades off a little performance in common cases for much better performance in uncommon cases. As a Py3.0 PEP, I think it can be rejected.
Depending on its success as a third-party module, it still has a chance for inclusion in the collections module. The essential criteria for that is whether it is a superior choice for some real-world use cases. I've scanned my own code and found no instances where BList would have been preferable to a regular list. However, that scan has a selection bias because it doesn't reflect what I would have written had BList been available. So, after a few months, I intend to poll comp.lang.python for BList success stories. If they exist, then I have no problem with inclusion in the collections module. After all, its learning curve is near zero -- the only cost is the clutter factor stemming from indecision about the most appropriate data structure for a given task.
Abstract
The common case for list operations is on small lists. The current array-based list implementation excels at small lists due to the strong locality of reference and infrequency of memory allocation operations. However, an array takes O(n) time to insert and delete elements, which can become problematic as the list gets large.
This PEP introduces a new data type, the BList, that has array-like and tree-like aspects. It enjoys the same good performance on small lists as the existing array-based implementation, but offers superior asymptotic performance for most operations. This PEP makes two mutually exclusive proposals for including the BList type in Python:
- Add it to the collections module, or
- Replace the existing list type
Motivation
The BList grew out of the frustration of needing to rewrite intuitive algorithms that worked fine for small inputs but took O(n**2) time for large inputs due to the underlying O(n) behavior of array-based lists. The deque type, introduced in Python 2.4, solved the most common problem of needing a fast FIFO queue. However, the deque type doesn't help if we need to repeatedly insert or delete elements from the middle of a long list.
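The failure mode described above is easy to reproduce with today's array-based list: each `insert(0, ...)` shifts every existing element, so a loop of front-inserts does quadratic work in total. The following is an illustrative sketch, not one of the PEP's own benchmarks:

```python
import timeit

def build_by_front_insert(n):
    """Insert each item at index 0 of an array-based list.

    Every insert shifts all existing elements, so each call is O(n)
    and the whole loop is O(n**2).
    """
    items = []
    for i in range(n):
        items.insert(0, i)
    return items

# Doubling n roughly quadruples the total time -- the signature of
# quadratic behavior (actual timings are machine-dependent).
t_small = timeit.timeit(lambda: build_by_front_insert(1000), number=3)
t_large = timeit.timeit(lambda: build_by_front_insert(2000), number=3)
```

A BList keeps every insert at O(log n), so the same loop would be O(n log n) overall.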
A wide variety of data structures provide good asymptotic performance for insertions and deletions, but they either have O(n) performance for other operations (e.g., linked lists) or inferior performance for small lists (e.g., binary trees and skip lists).
The BList type proposed in this PEP is based on the principles of B+Trees, which have array-like and tree-like aspects. The BList offers array-like performance on small lists, while offering O(log n) asymptotic performance for all insert and delete operations. Additionally, the BList implements copy-on-write under-the-hood, so even operations like getslice take O(log n) time. The table below compares the asymptotic performance of the current array-based list implementation with the asymptotic performance of the BList.
| Operation | Array-based list | BList |
|---|---|---|
| Copy | O(n) | O(1) |
| Append | O(1) | O(log n) |
| Insert | O(n) | O(log n) |
| Get Item | O(1) | O(log n) |
| Set Item | O(1) | O(log n) |
| Del Item | O(n) | O(log n) |
| Iteration | O(n) | O(n) |
| Get Slice | O(k) | O(log n) |
| Del Slice | O(n) | O(log n) |
| Set Slice | O(n+k) | O(log k + log n) |
| Extend | O(k) | O(log k + log n) |
| Sort | O(n log n) | O(n log n) |
| Multiply | O(nk) | O(log k) |
An extensive empirical comparison of Python's array-based list and the BList is available at [2].
Use Case Trade-offs
The BList offers superior performance for many, but not all, operations. Choosing the correct data type for a particular use case depends on which operations are used. Choosing the correct data type as a built-in depends on balancing the importance of different use cases and the magnitude of the performance differences.
For the common use cases of small lists, the array-based list and the BList have similar performance characteristics.
For the slightly less common case of large lists, there are two common use cases where the existing array-based list outperforms the existing BList reference implementation. These are:
- A large LIFO stack, where there are many .append() and .pop(-1) operations. Each operation is O(1) for an array-based list, but O(log n) for the BList.
- A large list that does not change size. The getitem and setitem calls are O(1) for an array-based list, but O(log n) for the BList.
In performance tests on a 10,000 element list, BLists exhibited a 50% and 5% increase in execution time for these two use cases, respectively.
The performance for the LIFO use case could be improved to O(n) total time for n operations (amortized O(1) per operation) by caching a pointer to the right-most leaf within the root node. For lists that do not change size, the common case of sequential access could also be improved to O(n) total time via caching in the root node. However, the performance of these approaches has not been empirically tested.
Many operations exhibit a tremendous speed-up (O(n) to O(log n)) when switching from the array-based list to BLists. In performance tests on a 10,000 element list, operations such as getslice, setslice, and FIFO-style insert and deletes on a BList take only 1% of the time needed on array-based lists.
In light of the large performance speed-ups for many operations, the small performance costs for some operations will be worthwhile for many (but not all) applications.
Implementation
The BList is based on the B+Tree data structure. The BList is a wide, bushy tree where each node contains an array of up to 128 pointers to its children. If the node is a leaf, its children are the user-visible objects that the user has placed in the list. If a node is not a leaf, its children are other BList nodes that are not user-visible. If the list contains only a few elements, they will all be children of a single node that is both the root and a leaf. Since a node is little more than an array of pointers, small lists operate in effectively the same way as an array-based data type and share the same good performance characteristics.
The BList maintains a few invariants to ensure good (O(log n)) asymptotic performance regardless of the sequence of insert and delete operations. The principal invariants are as follows:
- Each node has at most 128 children.
- Each non-root node has at least 64 children.
- The root node has at least 2 children, unless the list contains fewer than 2 elements.
- The tree is of uniform depth.
If an insert would cause a node to exceed 128 children, the node spawns a sibling and transfers half of its children to the sibling. The sibling is inserted into the node's parent. If the node is the root node (and thus has no parent), a new parent is created and the depth of the tree increases by one.
If a deletion would cause a node to have fewer than 64 children, the node moves elements from one of its siblings if possible. If both of its siblings also only have 64 children, then two of the nodes merge and the empty one is removed from its parent. If the root node is reduced to only one child, its single child becomes the new root (i.e., the depth of the tree is reduced by one).
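The split rule can be sketched in a few lines of Python. This only illustrates the invariant-preserving logic described above; the names Node, LIMIT, and split_if_overfull are made up here and do not come from the reference implementation:

```python
LIMIT = 128        # maximum number of children per node
HALF = LIMIT // 2  # minimum number of children per non-root node

class Node:
    def __init__(self, children, leaf=True):
        self.children = list(children)
        self.leaf = leaf

def split_if_overfull(node):
    """If an insert pushed node past LIMIT children, move the upper half
    into a new sibling and return it, so the caller can insert the
    sibling into node's parent; return None if no split is needed."""
    if len(node.children) <= LIMIT:
        return None
    sibling = Node(node.children[HALF:], leaf=node.leaf)
    del node.children[HALF:]
    return sibling
```

After a split, both nodes satisfy the at-least-64-children invariant; if the split node was the root, the caller creates a new root holding the two halves, increasing the tree depth by one.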
In addition to tree-like asymptotic performance and array-like performance on small-lists, BLists support transparent copy-on-write. If a non-root node needs to be copied (as part of a getslice, copy, setslice, etc.), the node is shared between multiple parents instead of being copied. If it needs to be modified later, it will be copied at that time. This is completely behind-the-scenes; from the user's point of view, the BList works just like a regular Python list.
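The copy-on-write behavior can likewise be sketched with a share count. This is a simplified illustration of the idea, not the reference implementation's actual mechanism; SharedNode, share, and prepare_write are hypothetical names:

```python
class SharedNode:
    """Illustrative tree node that tracks how many parents point at it."""
    def __init__(self, children):
        self.children = list(children)
        self.shares = 1  # number of parents currently sharing this node

def share(node):
    """A getslice/copy does not duplicate the node; it just shares it."""
    node.shares += 1
    return node

def prepare_write(node):
    """Before mutating, obtain a privately owned node.

    Copy only if the node is shared with another parent; otherwise the
    caller already owns it exclusively and may mutate in place."""
    if node.shares == 1:
        return node
    node.shares -= 1
    return SharedNode(node.children)
```

This is why operations like getslice can run in O(log n): only the O(log n) nodes along the slice boundaries are touched, and interior subtrees are shared rather than copied.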
Memory Usage
In the worst case, the leaf nodes of a BList have only 64 children each, rather than a full 128, meaning that memory usage is around twice that of a best-case array implementation. Non-leaf nodes use up a negligible amount of additional memory, since there are at least 63 times as many leaf nodes as non-leaf nodes.
The existing array-based list implementation must grow and shrink as items are added and removed. To be efficient, it grows and shrinks only when the list has grown or shrunk exponentially. In the worst case, it, too, uses twice as much memory as the best case.
In summary, the BList's memory footprint is not significantly different from the existing array-based implementation.
Backwards Compatibility
If the BList is added to the collections module, backwards compatibility is not an issue. This section focuses on the option of replacing the existing array-based list with the BList. For users of the Python interpreter, a BList has an identical interface to the current list implementation. For virtually all operations, the behavior is identical, aside from execution speed.
For the C API, BList has a different interface than the existing list implementation. Due to its more complex structure, the BList does not lend itself well to poking and prodding by external sources. Thankfully, the existing list implementation defines an API of functions and macros for accessing data from list objects. Google Code Search suggests that the majority of third-party modules use the well-defined API rather than relying on the list's structure directly. The table below summarizes the search queries and results:
| Search String | Number of Results |
|---|---|
| PyList_GetItem | 2,000 |
| PySequence_GetItem | 800 |
| PySequence_Fast_GET_ITEM | 100 |
| PyList_GET_ITEM | 400 |
| [^a-zA-Z_]ob_item | 100 |
Supporting the existing C API could be achieved in one of two ways:

1. Redefine the various accessor functions and macros in listobject.h to access a BList instead. The interface would be unchanged. The functions can easily be redefined; the macros need a bit more care and would have to resort to function calls for large lists.

   The macros would need to evaluate their arguments more than once, which could be a problem if the arguments have side effects. A Google Code Search for "PyList_GET_ITEM([^)]+(" found only a handful of cases where this occurs, so the impact appears to be low.

   The few extension modules that use the list's undocumented structure directly, instead of using the API, would break. The core code itself uses the accessor macros fairly consistently and should be easy to port.

2. Deprecate the existing list type, but continue to include it. Extension modules wishing to use the new BList type must do so explicitly. The BList C interface can be changed to match the existing PyList interface so that a simple search-and-replace will be sufficient for 99% of module writers.

   Existing modules would continue to compile and work without change, but they would need to make a deliberate (though small) effort to migrate to the BList.

   The downside of this approach is that mixing modules that use BLists and array-based lists might lead to slowdowns if conversions are frequently necessary.
Reference Implementation
A reference implementation of the BList is available for CPython at [1].
The source package also includes a pure Python implementation, originally developed as a prototype for the CPython version. Naturally, the pure Python version is rather slow and the asymptotic improvements don't win out until the list is quite large.
When compiled with Py_DEBUG, the C implementation checks the BList invariants when entering and exiting most functions.
An extensive set of test cases is also included in the source package. The test cases include the existing Python sequence and list test cases as a subset. When the interpreter is built with Py_DEBUG, the test cases also check for reference leaks.
Porting to Other Python Variants
If the BList is added to the collections module, other Python variants can support it in one of three ways:
- Make blist an alias for list. The asymptotic performance won't be as good, but it'll work.
- Use the pure Python reference implementation. The performance for small lists won't be as good, but it'll work.
- Port the reference implementation.
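The first option can be sketched as an import-time fallback. This assumes the third-party package keeps the PyPI name blist and exports a blist class, as the reference implementation at [1] does; when the extension is absent, the alias preserves correctness while giving up the asymptotic gains:

```python
try:
    from blist import blist  # third-party C extension, if installed
except ImportError:
    blist = list  # alias for list: same semantics, O(n) mid-list inserts

# Code written against the list API works with either binding:
xs = blist([0, 1, 2])
xs.insert(1, "new")
```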
Discussion
This proposal has been discussed briefly on the Python-3000 mailing list [3]. Although a number of people favored the proposal, there were also some objections. The following summarizes the pros and cons as observed by posters to the thread.
General comments:
- Pro: Will outperform the array-based list in most cases
- Pro: "I've implemented variants of this ... a few different times"
- Con: Desirability and performance in actual applications is unproven
Comments on adding BList to the collections module:
- Pro: Matching the list-API reduces the learning curve to near-zero
- Pro: Useful for intermediate-level users; won't get in the way of beginners
- Con: Proliferation of data types makes the choices for developers harder.
Comments on replacing the array-based list with the BList:
- Con: Impact on extension modules (addressed in Backwards Compatibility)
- Con: The use cases where BLists are slower are important (see Use Case Trade-Offs for how these might be addressed).
- Con: The array-based list code is simple and easy to maintain
To assess the desirability and performance in actual applications, Raymond Hettinger suggested releasing the BList as an extension module (now available at [1]). If it proves useful, he felt it would be a strong candidate for inclusion in 2.6 as part of the collections module. If widely popular, then it could be considered for replacing the array-based list, but not otherwise.
Guido van Rossum commented that he opposed the proliferation of data types, but favored replacing the array-based list if backwards compatibility could be addressed and the BList's performance was uniformly better.
On-going Tasks
- Reduce the memory footprint of small lists
- Implement TimSort for BLists, so that best-case sorting is O(n) instead of O(n log n).
- Implement __reversed__
- Cache a pointer in the root to the rightmost leaf, to make a sequence of n LIFO operations take O(n) total time (amortized O(1) each).
References
| [1] | (1, 2) Reference Implementations for C and Python: http://www.python.org/pypi/blist/ |
| [2] | Empirical performance comparison between Python's array-based list and the blist: http://stutzbachenterprises.com/blist/ |
| [3] | Discussion on python-3000 starting at post: http://mail.python.org/pipermail/python-3000/2007-April/006757.html |
| [4] | Raymond Hettinger's feedback on python-3000: http://mail.python.org/pipermail/python-3000/2007-May/007491.html |
Copyright
This document has been placed in the public domain.
pep-3129 Class Decorators
| PEP: | 3129 |
|---|---|
| Title: | Class Decorators |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Collin Winter <collinwinter at google.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 1-May-2007 |
| Python-Version: | 3.0 |
| Post-History: | 7-May-2007 |
Abstract
This PEP proposes class decorators, an extension to the function and method decorators introduced in PEP 318.
Rationale
When function decorators were originally debated for inclusion in Python 2.4, class decorators were seen as obscure and unnecessary [1] thanks to metaclasses. After several years' experience with the Python 2.4.x series of releases and an increasing familiarity with function decorators and their uses, the BDFL and the community re-evaluated class decorators and recommended their inclusion in Python 3.0 [2].
The motivating use-case was to make certain constructs more easily expressed and less reliant on implementation details of the CPython interpreter. While it is possible to express class decorator-like functionality using metaclasses, the results are generally unpleasant and the implementation highly fragile [3]. In addition, metaclasses are inherited, whereas class decorators are not, making metaclasses unsuitable for some, single class-specific uses of class decorators. The fact that large-scale Python projects like Zope were going through these wild contortions to achieve something like class decorators won over the BDFL.
Semantics
The semantics and design goals of class decorators are the same as for function decorators ([4], [5]); the only difference is that you're decorating a class instead of a function. The following two snippets are semantically identical:
class A:
    pass
A = foo(bar(A))

@foo
@bar
class A:
    pass
For a detailed examination of decorators, please refer to PEP 318.
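The equivalence above can be exercised with any callable. The following runnable sketch (register and REGISTRY are illustrative names, not part of the PEP) applies a decorator both with the new syntax and with the pre-decorator spelling:

```python
REGISTRY = {}

def register(cls):
    """A minimal class decorator: record the class by name, return it."""
    REGISTRY[cls.__name__] = cls
    return cls

@register
class Widget:
    pass

class Gadget:
    pass
Gadget = register(Gadget)  # the pre-decorator spelling of the same thing
```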
Implementation
Adapting Python's grammar to support class decorators requires modifying two rules and adding a new rule:
funcdef: [decorators] 'def' NAME parameters ['->' test] ':' suite
compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt |
with_stmt | funcdef | classdef
need to be changed to
decorated: decorators (classdef | funcdef)
funcdef: 'def' NAME parameters ['->' test] ':' suite
compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt |
with_stmt | funcdef | classdef | decorated
Adding decorated is necessary to avoid an ambiguity in the grammar.
The Python AST and bytecode must be modified accordingly.
A reference implementation [6] has been provided by Jack Diederich.
Acceptance
There was virtually no discussion following the posting of this PEP, meaning that everyone agreed it should be accepted.
The patch was committed to Subversion as revision 55430.
References
| [1] | http://www.python.org/dev/peps/pep-0318/#motivation |
| [2] | http://mail.python.org/pipermail/python-dev/2006-March/062942.html |
| [3] | http://mail.python.org/pipermail/python-dev/2006-March/062888.html |
| [4] | http://www.python.org/dev/peps/pep-0318/#current-syntax |
| [5] | http://www.python.org/dev/peps/pep-0318/#design-goals |
| [6] | http://python.org/sf/1671208 |
Copyright
This document has been placed in the public domain.
pep-3130 Access to Current Module/Class/Function
| PEP: | 3130 |
|---|---|
| Title: | Access to Current Module/Class/Function |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Jim J. Jewett <jimjjewett at gmail.com> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 22-Apr-2007 |
| Python-Version: | 3.0 |
| Post-History: | 22-Apr-2007 |
Rejection Notice
This PEP is rejected. It is not clear how it should be
implemented or what the precise semantics should be in edge cases,
and there aren't enough important use cases given. Response has
been lukewarm at best.
Abstract
It is common to need a reference to the current module, class,
or function, but there is currently no entirely correct way to
do this. This PEP proposes adding the keywords __module__,
__class__, and __function__.
Rationale for __module__
Many modules export various functions, classes, and other objects,
but will perform additional activities (such as running unit
tests) when run as a script. The current idiom is to test whether
the module's name has been set to a magic value.
if __name__ == "__main__": ...
More complicated introspection requires a module to (attempt to)
import itself. If importing the expected name actually produces
a different module, there is no good workaround.
# __import__ lets you use a variable, but... it gets more
# complicated if the module is in a package.
__import__(__name__)
# So just go to sys modules... and hope that the module wasn't
# hidden/removed (perhaps for security), that __name__ wasn't
# changed, and definitely hope that no other module with the
# same name is now available.
class X(object):
pass
import sys
mod = sys.modules[__name__]
mod = sys.modules[X.__module__]
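For reference, the sys.modules lookup above can be wrapped in a helper that works without any new keyword. This is an illustrative sketch, and it is CPython-specific because it relies on sys._getframe:

```python
import sys

def current_module():
    """Return the module object the caller is executing in.

    Reads the caller's globals via the frame stack (CPython-specific),
    then resolves the module through sys.modules as in the idiom above.
    """
    caller_globals = sys._getframe(1).f_globals
    return sys.modules[caller_globals["__name__"]]
```

This still inherits the caveats listed in the comments above: it fails if the module was removed from sys.modules or if __name__ was rebound.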
Proposal: Add a __module__ keyword which refers to the module
currently being defined (executed). (But see open issues.)
# XXX sys.main is still changing as draft progresses. May
# really need sys.modules[sys.main]
if __module__ is sys.main: # assumes PEP (3122), Cannon
...
Rationale for __class__
Class methods are passed the current instance; from this they can
determine self.__class__ (or cls, for class methods).
Unfortunately, this reference is to the object's actual class,
which may be a subclass of the defining class. The current
workaround is to repeat the name of the class, and assume that the
name will not be rebound.
class C(B):
def meth(self):
super(C, self).meth() # Hope C is never rebound.
class D(C):
def meth(self):
# ?!? issubclass(D,C), so it "works":
super(C, self).meth()
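The fragility of repeating the class name can be demonstrated directly (an illustrative sketch, not part of the PEP):

```python
class B:
    def meth(self):
        return "B"

class C(B):
    def meth(self):
        # Repeats the class name, as the workaround above must:
        return "C->" + super(C, self).meth()

obj = C()
assert obj.meth() == "C->B"  # works while the name C is intact

Orig, C = C, None   # someone rebinds the module-level name C...
try:
    Orig().meth()   # ...and the super() call inside meth now fails
    broke = False
except TypeError:
    broke = True    # super(None, self) raises TypeError
```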
Proposal: Add a __class__ keyword which refers to the class
currently being defined (executed). (But see open issues.)
class C(B):
def meth(self):
super(__class__, self).meth()
Note that super calls may be further simplified by the "New Super"
PEP (Spealman). The __class__ (or __this_class__) attribute came
up in attempts to simplify the explanation and/or implementation
of that PEP, but was separated out as an independent decision.
Note that __class__ (or __this_class__) is not quite the same as
the __thisclass__ property on bound super objects. The existing
super.__thisclass__ property refers to the class from which the
Method Resolution Order search begins. In the above class D, it
would refer to (the current reference of name) C.
Rationale for __function__
Functions (including methods) often want access to themselves,
usually for a private storage location or true recursion. While
there are several workarounds, all have their drawbacks.
def counter(_total=[0]):
# _total shouldn't really appear in the
# signature at all; the list wrapping and
# [0] unwrapping obscure the code
_total[0] += 1
return _total[0]
@annotate(total=0)
def counter():
# Assume name counter is never rebound:
counter.total += 1
return counter.total
# class exists only to provide storage:
class _wrap(object):
__total = 0
def f(self):
self.__total += 1
return self.__total
# set module attribute to a bound method:
accum = _wrap().f
# This function calls "factorial", which should be itself --
# but the same programming styles that use heavy recursion
# often have a greater willingness to rebind function names.
def factorial(n):
return (n * factorial(n-1) if n else 1)
Proposal: Add a __function__ keyword which refers to the function
(or method) currently being defined (executed). (But see open
issues.)
@annotate(total=0)
def counter():
# Always refers to this function obj:
__function__.total += 1
return __function__.total
def factorial(n):
return (n * __function__(n-1) if n else 1)
Backwards Compatibility
While a user could be using these names already, double-underscore
names ( __anything__ ) are explicitly reserved to the interpreter.
It is therefore acceptable to introduce special meaning to these
names within a single feature release.
Implementation
Ideally, these names would be keywords treated specially by the
bytecode compiler.
Guido has suggested [1] using a cell variable filled in by the
metaclass.
Michele Simionato has provided a prototype using bytecode hacks
[2]. This does not require any new bytecode operators; it just
modifies which specific sequence of existing operators gets
run.
Open Issues
- Are __module__, __class__, and __function__ the right names? In
particular, should the names include the word "this", either as
__this_module__, __this_class__, and __this_function__, (format
discussed on the python-3000 and python-ideas lists) or as
__thismodule__, __thisclass__, and __thisfunction__ (inspired
by, but conflicting with, current usage of super.__thisclass__).
- Are all three keywords needed, or should this enhancement be
limited to a subset of the objects? Should methods be treated
separately from other functions?
References
[1] Fixing super anyone? Guido van Rossum
http://mail.python.org/pipermail/python-3000/2007-April/006671.html
[2] Descriptor/Decorator challenge, Michele Simionato
http://groups.google.com/group/comp.lang.python/browse_frm/thread/a6010c7494871bb1/62a2da68961caeb6?lnk=gst&q=simionato+challenge&rnum=1&hl=en#62a2da68961caeb6
Copyright
This document has been placed in the public domain.
pep-3131 Supporting Non-ASCII Identifiers
| PEP: | 3131 |
|---|---|
| Title: | Supporting Non-ASCII Identifiers |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Martin von Löwis <martin at v.loewis.de> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 1-May-2007 |
| Python-Version: | 3.0 |
| Post-History: |
Abstract
This PEP suggests supporting non-ASCII letters (such as accented characters, Cyrillic, Greek, Kanji, etc.) in Python identifiers.
Rationale
Python code is written by many people in the world who are not familiar with the English language, or even well-acquainted with the Latin writing system. Such developers often desire to define classes and functions with names in their native languages, rather than having to come up with an (often incorrect) English translation of the concept they want to name. Using identifiers in their native language improves the clarity and maintainability of the code among speakers of that language.
For some languages, common transliteration systems exist (in particular, for the Latin-based writing systems). For other languages, users have greater difficulty using Latin to write their native words.
Common Objections
Some objections are often raised against proposals similar to this one.
People claim that they will not be able to use a library if to do so they have to use characters they cannot type on their keyboards. However, it is the choice of the designer of the library to decide on various constraints for using the library: people may not be able to use the library because they cannot get physical access to the source code (because it is not published), or because licensing prohibits usage, or because the documentation is in a language they cannot understand. A developer wishing to make a library widely available needs to make a number of explicit choices (such as publication, licensing, language of documentation, and language of identifiers). It should always be the choice of the author to make these decisions - not the choice of the language designers.
In particular, projects wishing to have wide usage may want to establish a policy that all identifiers, comments, and documentation are written in English (see the GNU coding style guide for an example of such a policy). Restricting the language to ASCII-only identifiers does not enforce that comments and documentation be in English, or that the identifiers actually be English words, so an additional policy is necessary anyway.
Specification of Language Changes
The syntax of identifiers in Python will be based on the Unicode standard annex UAX-31 [1], with elaboration and changes as defined below.
Within the ASCII range (U+0001..U+007F), the valid characters for identifiers are the same as in Python 2.5. This specification only introduces additional characters from outside the ASCII range. For other characters, the classification uses the version of the Unicode Character Database as included in the unicodedata module.
The identifier syntax is <XID_Start> <XID_Continue>*.
The exact specification of what characters have the XID_Start or XID_Continue properties can be found in the DerivedCoreProperties file of the Unicode data in use by Python (4.1 at the time this PEP was written), see [6]. For reference, the construction rules for these sets are given below. The XID_* properties are derived from ID_Start/ID_Continue, which are derived themselves.
ID_Start is defined as all characters having one of the general categories uppercase letters (Lu), lowercase letters (Ll), titlecase letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers (Nl), the underscore, and characters carrying the Other_ID_Start property. XID_Start then closes this set under normalization, by removing all characters whose NFKC normalization is not of the form ID_Start ID_Continue* anymore.
ID_Continue is defined as all characters in ID_Start, plus nonspacing marks (Mn), spacing combining marks (Mc), decimal number (Nd), connector punctuations (Pc), and characters carrying the Other_ID_Continue property. Again, XID_Continue closes this set under NFKC-normalization; it also adds U+00B7 to support Catalan.
All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC.
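The effect of NFKC normalization on identifier comparison can be observed directly with the standard unicodedata module (a standalone illustration):

```python
import unicodedata

decomposed = "e\u0301cole"   # 'e' followed by a combining acute accent
precomposed = "\u00e9cole"   # the single precomposed code point for 'é'

# Distinct code point sequences...
assert decomposed != precomposed
# ...but the same identifier once both are normalized to NFKC:
assert (unicodedata.normalize("NFKC", decomposed)
        == unicodedata.normalize("NFKC", precomposed))
```

Because the parser normalizes identifiers while parsing, both spellings above would name the same variable.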
A non-normative HTML file listing all valid identifier characters for Unicode 4.1 can be found at http://www.dcl.hpi.uni-potsdam.de/home/loewis/table-3131.html.
Policy Specification
As an addition to the Python coding style, the following policy is prescribed: all identifiers in the Python standard library MUST use ASCII-only characters, and SHOULD use English words wherever feasible (in many cases, abbreviations and technical terms are used which aren't English). In addition, string literals and comments must also be in ASCII. The only exceptions are (a) test cases testing the non-ASCII features, and (b) names of authors. Authors whose names are not based on the Latin alphabet MUST provide a Latin transliteration of their names.
As an option, this specification can be applied to Python 2.x. In that case, ASCII-only identifiers would continue to be represented as byte string objects in namespace dictionaries; identifiers with non-ASCII characters would be represented as Unicode strings.
Implementation
The following changes will need to be made to the parser:
- If a non-ASCII character is found in the UTF-8 representation of the source code, a forward scan is made to find the first ASCII non-identifier character (e.g. a space or punctuation character)
- The entire UTF-8 string is passed to a function to normalize the string to NFKC, and then verify that it follows the identifier syntax. No such callout is made for pure-ASCII identifiers, which continue to be parsed the way they are today. The Unicode database must start including the Other_ID_{Start|Continue} property.
- If this specification is implemented for 2.x, reflective libraries (such as pydoc) must be verified to continue to work when Unicode strings appear in __dict__ slots as keys.
Open Issues
John Nagle suggested consideration of Unicode Technical Standard #39, [2], which discusses security mechanisms for Unicode identifiers. It's not clear how that can precisely apply to this PEP; possible consequences are
- warn about characters listed as "restricted" in xidmodifications.txt
- warn about identifiers using mixed scripts
- somehow perform Confusable Detection
In the latter two approaches, it's not clear how precisely the algorithm should work. For mixed scripts, certain kinds of mixing should probably be allowed - are these the "Common" and "Inherited" scripts mentioned in section 5? For Confusable Detection, it seems one needs two identifiers to compare them for confusion - is it possible to somehow apply it to a single identifier only, and warn?
In follow-up discussion, it turns out that John Nagle actually meant to suggest UTR#36, level "Highly Restrictive", [3].
Several people suggested allowing and ignoring formatting control characters (general category Cf), as is done in Java, JavaScript, and C#. It's not clear whether this would improve things (it might for RTL languages); if there is a need, these can be added later.
Some people would like to see an option on selecting support for this PEP at run-time; opinions vary on what precisely that option should be, and what precisely its default value should be. Guido van Rossum commented in [5] that a global flag passed to the interpreter is not acceptable, as it would apply to all modules.
Discussion
Ka-Ping Yee summarizes discussion and further objection in [4] as follows:
Should identifiers be allowed to contain any Unicode letter?
Drawbacks of allowing non-ASCII identifiers wholesale:
- Python will lose the ability to make a reliable round trip to a human-readable display on screen or on paper.
- Python will become vulnerable to a new class of security exploits; code and submitted patches will be much harder to inspect.
- Humans will no longer be able to validate Python syntax.
- Unicode is young; its problems are not yet well understood and solved; tool support is weak.
- Languages with non-ASCII identifiers use different character sets and normalization schemes; PEP 3131's choices are non-obvious.
- The Unicode bidi algorithm yields an extremely confusing display order for RTL text when digits or operators are nearby.
Should the default behaviour accept only ASCII identifiers, or should it accept identifiers containing non-ASCII characters?
Arguments for ASCII only by default:
- Non-ASCII identifiers by default makes common practice/assumptions subtly/unknowingly wrong; rarely wrong is worse than obviously wrong.
- Better to raise a warning than to fail silently when encountering a probably unexpected situation.
- All of current usage is ASCII-only; the vast majority of future usage will be ASCII-only.
- It is the pockets of Unicode adoption that are parochial, not the ASCII advocates.
- Python should audit for ASCII-only identifiers for the same reasons that it audits for tab-space consistency.
- Incremental change is safer.
- An ASCII-only default favors open-source development and sharing of source code.
- Existing projects won't have to waste any brainpower worrying about the implications of Unicode identifiers.
Should non-ASCII identifiers be optional?
Various voices in support of a flag (although there's been debate over which should be the default, no one seems to be saying that there shouldn't be an off switch)
Should the identifier character set be configurable?
Various voices proposing and supporting a selectable character set, so that users can get all the benefits of using their own language without the drawbacks of confusable/unfamiliar characters
Which identifier characters should be allowed?
- What to do about bidi format control characters?
- What about other ID_Continue characters? What about characters that look like punctuation? What about other recommendations in UTS #39? What about mixed-script identifiers?
Which normalization form should be used, NFC or NFKC?
Should source code be required to be in normalized form?
References
| [1] | http://www.unicode.org/reports/tr31/ |
| [2] | http://www.unicode.org/reports/tr39/ |
| [3] | http://www.unicode.org/reports/tr36/ |
| [4] | http://mail.python.org/pipermail/python-3000/2007-June/008161.html |
| [5] | http://mail.python.org/pipermail/python-3000/2007-May/007925.html |
| [6] | http://www.unicode.org/Public/4.1.0/ucd/DerivedCoreProperties.txt |
Copyright
This document has been placed in the public domain.
pep-3132 Extended Iterable Unpacking
| PEP: | 3132 |
|---|---|
| Title: | Extended Iterable Unpacking |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Georg Brandl <georg at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 30-Apr-2007 |
| Python-Version: | 3.0 |
| Post-History: |
Contents
Abstract
This PEP proposes a change to iterable unpacking syntax, allowing the specification of a "catch-all" name which will be assigned a list of all items not assigned to a "regular" name.
An example says more than a thousand words:
>>> a, *b, c = range(5)
>>> a
0
>>> c
4
>>> b
[1, 2, 3]
Rationale
Many algorithms require splitting a sequence in a "first, rest" pair. With the new syntax,
first, rest = seq[0], seq[1:]
is replaced by the cleaner and probably more efficient:
first, *rest = seq
For more complex unpacking patterns, the new syntax looks even cleaner, and the clumsy index handling is not necessary anymore.
Also, if the right-hand value is not a list, but an iterable, it has to be converted to a list before being able to do slicing; to avoid creating this temporary list, one has to resort to
it = iter(seq)
first = it.next()
rest = list(it)
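With the new syntax, none of that iterator handling is needed; a quick check under Python 3:

```python
# Star-unpacking consumes any iterable directly -- no slicing and no
# temporary-list management by the programmer.
seq = iter(range(5))        # an iterator: slicing is not available
first, *rest = seq
assert first == 0
assert rest == [1, 2, 3, 4]
```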
Specification
A tuple (or list) on the left side of a simple assignment (unpacking is not defined for augmented assignment) may contain at most one expression prepended with a single asterisk (which is henceforth called a "starred" expression, while the other expressions in the list are called "mandatory"). This designates a subexpression that will be assigned a list of all items from the iterable being unpacked that are not assigned to any of the mandatory expressions, or an empty list if there are no such items.
For example, if seq is a sliceable sequence, all the following assignments are equivalent if seq has at least three elements:
a, b, c = seq[0], list(seq[1:-1]), seq[-1]
a, *b, c = seq
[a, *b, c] = seq
It is an error (as it is currently) if the iterable doesn't contain enough items to assign to all the mandatory expressions.
It is also an error to use the starred expression as a lone assignment target, as in
*a = range(5)
This, however, is valid syntax:
*a, = range(5)
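Both the error case and the trailing-comma form can be verified directly in Python 3:

```python
# A lone starred name is rejected at compile time...
try:
    compile("*a = range(5)", "<demo>", "exec")
except SyntaxError:
    pass
else:
    raise AssertionError("expected a SyntaxError")

# ...but the trailing comma makes it a one-element target list:
*a, = range(5)
assert a == [0, 1, 2, 3, 4]
```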
Note that this proposal also applies to tuples in implicit assignment context, such as in a for statement:
for a, *b in [(1, 2, 3), (4, 5, 6, 7)]:
print(b)
would print out
[2, 3]
[5, 6, 7]
Starred expressions are only allowed as assignment targets; using them anywhere else (except for star-args in function calls, of course) is an error.
Implementation
Grammar change
This feature requires a new grammar rule:
star_expr: ['*'] expr
In these two rules, expr is changed to star_expr:
comparison: star_expr (comp_op star_expr)*
exprlist: star_expr (',' star_expr)* [',']
Changes to the Compiler
A new ASDL expression type Starred is added which represents a starred expression. Note that the starred expression element introduced here is universal and could later be used for other purposes in non-assignment context, such as the yield *iterable proposal.
The compiler is changed to recognize all cases where a starred expression is invalid and flag them with syntax errors.
A new bytecode instruction, UNPACK_EX, is added, whose argument has the number of mandatory targets before the starred target in the lower 8 bits and the number of mandatory targets after the starred target in the upper 8 bits. For unpacking sequences without starred expressions, the old UNPACK_ITERABLE opcode is kept.
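The operand encoding can be observed with the dis module on a current CPython (a quick check, not part of the PEP's text):

```python
import dis

# Compile an extended-unpacking statement and find UNPACK_EX; its
# operand packs the target counts as: before | (after << 8).
code = compile("a, *b, c = seq", "<demo>", "exec")
args = {ins.opname: ins.arg for ins in dis.get_instructions(code)}
assert "UNPACK_EX" in args
assert args["UNPACK_EX"] == 1 | (1 << 8)  # one mandatory target on each side
```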
Changes to the Bytecode Interpreter
The function unpack_iterable() in ceval.c is changed to handle the extended unpacking, via an argcntafter parameter. In the UNPACK_EX case, the function will do the following:
- collect all items for mandatory targets before the starred one
- collect all remaining items from the iterable in a list
- pop items for mandatory targets after the starred one from the list
- push the single items and the resized list on the stack
Shortcuts for unpacking iterables of known types, such as lists or tuples, can be added.
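The four steps above can be sketched in pure Python (the real implementation is C code in ceval.c's unpack_iterable(); names here are illustrative):

```python
def unpack_ex(iterable, before, after):
    it = iter(iterable)
    front = [next(it) for _ in range(before)]   # step 1: targets before the star
    rest = list(it)                             # step 2: everything remaining
    if len(rest) < after:
        raise ValueError("not enough values to unpack")
    split = len(rest) - after
    starred, back = rest[:split], rest[split:]  # step 3: pop trailing targets
    return front, starred, back                 # step 4: values for the stack

assert unpack_ex(range(5), 1, 1) == ([0], [1, 2, 3], [4])
```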
The current implementation can be found at the SourceForge Patch tracker [SFPATCH]. It now includes a minimal test case.
Acceptance
After a short discussion on the python-3000 list [1], the PEP was accepted by Guido in its current form. Possible changes discussed were:
- Only allow a starred expression as the last item in the exprlist. This would simplify the unpacking code a bit and allow for the starred expression to be assigned an iterator. This behavior was rejected because it would be too surprising.
- Try to give the starred target the same type as the source iterable, for example, b in a, *b = 'hello' would be assigned the string 'ello'. This may seem nice, but is impossible to get right consistently with all iterables.
- Make the starred target a tuple instead of a list. This would be consistent with a function's *args, but make further processing of the result harder.
References
| [SFPATCH] | http://python.org/sf/1711529 |
| [1] | http://mail.python.org/pipermail/python-3000/2007-May/007198.html |
Copyright
This document has been placed in the public domain.
pep-3133 Introducing Roles
| PEP: | 3133 |
|---|---|
| Title: | Introducing Roles |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Collin Winter <collinwinter at google.com> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Requires: | 3115 3129 |
| Created: | 1-May-2007 |
| Python-Version: | 3.0 |
| Post-History: | 13-May-2007 |
Contents
Rejection Notice
This PEP has helped push PEP 3119 towards a saner, more minimalistic approach. But given the latest version of PEP 3119 I much prefer that. GvR.
Abstract
Python's existing object model organizes objects according to their implementation. It is often desirable -- especially in a duck typing-based language like Python -- to organize objects by the part they play in a larger system (their intent), rather than by how they fulfill that part (their implementation). This PEP introduces the concept of roles, a mechanism for organizing objects according to their intent rather than their implementation.
Rationale
In the beginning were objects. They allowed programmers to marry function and state, and to increase code reusability through concepts like polymorphism and inheritance, and lo, it was good. There came a time, however, when inheritance and polymorphism weren't enough. With the invention of both dogs and trees, we were no longer able to be content with knowing merely, "Does it understand 'bark'?" We now needed to know what a given object thought that "bark" meant.
One solution, the one detailed here, is that of roles, a mechanism orthogonal and complementary to the traditional class/instance system. Whereas classes concern themselves with state and implementation, the roles mechanism deals exclusively with the behaviours embodied in a given class.
This system was originally called "traits" and implemented for Squeak Smalltalk [4]. It has since been adapted for use in Perl 6 [3] where it is called "roles", and it is primarily from there that the concept is now being interpreted for Python 3. Python 3 will preserve the name "roles".
In a nutshell: roles tell you what an object does, classes tell you how an object does it.
In this PEP, I will outline a system for Python 3 that will make it possible to easily determine whether a given object's understanding of "bark" is tree-like or dog-like. (There might also be more serious examples.)
A Note on Syntax
Any syntax proposals in this PEP are tentative and should be considered to be strawmen. The necessary bits that this PEP depends on -- namely PEP 3115's class definition syntax and PEP 3129's class decorators -- are still being formalized and may change. Function names will, of course, be subject to lengthy bikeshedding debates.
Performing Your Role
Static Role Assignment
Let's start out by defining Tree and Dog classes
class Tree(Vegetable):
def bark(self):
return self.is_rough()
class Dog(Animal):
def bark(self):
return self.goes_ruff()
While both implement a bark() method with the same signature, they do wildly different things. We need some way of differentiating what we're expecting. Relying on inheritance and a simple isinstance() test will limit code reuse and/or force any dog-like classes to inherit from Dog, whether or not that makes sense. Let's see if roles can help.
@perform_role(Doglike)
class Dog(Animal):
    ...

@perform_role(Treelike)
class Tree(Vegetable):
    ...

@perform_role(SitThere)
class Rock(Mineral):
    ...
We use class decorators from PEP 3129 to associate a particular role or roles with a class. Client code can now verify that an incoming object performs the Doglike role, allowing it to handle Wolf, LaughingHyena and Aibo [1] instances, too.
Roles can be composed via normal inheritance:
@perform_role(Guard, MummysLittleDarling)
class GermanShepherd(Dog):
def guard(self, the_precious):
while True:
if intruder_near(the_precious):
self.growl()
def get_petted(self):
self.swallow_pride()
Here, GermanShepherd instances perform three roles: Guard and MummysLittleDarling are applied directly, whereas Doglike is inherited from Dog.
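A minimal sketch of how perform_role() and performs() might work as a plain library follows (hypothetical -- this PEP was rejected and never implemented, so the names and mechanism are illustrative):

```python
def perform_role(*roles):
    # Record the roles on the class, merging any inherited ones.
    def decorate(cls):
        inherited = getattr(cls, "_roles", frozenset())
        cls._roles = inherited | frozenset(roles)
        return cls
    return decorate

def performs(obj, role):
    # Instances perform whatever roles their class was tagged with.
    return role in getattr(type(obj), "_roles", frozenset())

class Doglike: pass
class Guard: pass

@perform_role(Doglike)
class Dog: pass

@perform_role(Guard)
class GermanShepherd(Dog): pass

assert performs(GermanShepherd(), Guard)
assert performs(GermanShepherd(), Doglike)  # inherited from Dog
assert not performs(object(), Guard)
```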
Assigning Roles at Runtime
Roles can be assigned at runtime, too, by unpacking the syntactic sugar provided by decorators.
Say we import a Robot class from another module. Since we know that Robot already implements our Guard interface, we'd like it to play nicely with guard-related code, too.
>>> perform_role(Guard)(Robot)
This takes effect immediately and impacts all instances of Robot.
Asking Questions About Roles
Just because we've told our robot army that they're guards, we'd like to check in on them occasionally and make sure they're still at their task.
>>> performs(our_robot, Guard)
True
What about that one robot over there?
>>> performs(that_robot_over_there, Guard)
True
The performs() function is used to ask if a given object fulfills a given role. It cannot be used, however, to ask a class if its instances fulfill a role:
>>> performs(Robot, Guard)
False
This is because the Robot class is not interchangeable with a Robot instance.
Defining New Roles
Empty Roles
Roles are defined like a normal class, but use the Role metaclass.
class Doglike(metaclass=Role): ...
Metaclasses are used to indicate that Doglike is a Role in the same way 5 is an int and tuple is a type.
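That relationship can be sketched with a trivial (hypothetical) Role metaclass:

```python
# A role is a class whose type is Role, just as an ordinary
# class's type is `type`.
class Role(type):
    pass

class Doglike(metaclass=Role):
    pass

assert isinstance(Doglike, Role)   # Doglike is a role
assert not isinstance(int, Role)   # ordinary classes are not
```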
Composing Roles via Inheritance
Roles may inherit from other roles; this has the effect of composing them. Here, instances of Dog will perform both the Doglike and FourLegs roles.
class FourLegs(metaclass=Role):
    pass

class Doglike(FourLegs, Carnivor):
    pass

@perform_role(Doglike)
class Dog(Mammal):
    pass
Requiring Concrete Methods
So far we've only defined empty roles -- not very useful things. Let's now require that all classes that claim to fulfill the Doglike role define a bark() method:
class Doglike(FourLegs):
def bark(self):
pass
No decorators are required to flag the method as "abstract", and the method will never be called, meaning whatever code it contains (if any) is irrelevant. Roles provide only abstract methods; concrete default implementations are left to other, better-suited mechanisms like mixins.
Once you have defined a role, and a class has claimed to perform that role, it is essential that that claim be verified. Here, the programmer has misspelled one of the methods required by the role.
@perform_role(FourLegs)
class Horse(Mammal):
def run_like_teh_wind(self):
...
This will cause the role system to raise an exception, complaining that you're missing a run_like_the_wind() method. The role system carries out these checks as soon as a class is flagged as performing a given role.
Concrete methods are required to match exactly the signature demanded by the role. Here, we've attempted to fulfill our role by defining a concrete version of bark(), but we've missed the mark a bit.
@perform_role(Doglike)
class Coyote(Mammal):
def bark(self, target=moon):
pass
This method's signature doesn't match exactly with what the Doglike role was expecting, so the role system will throw a bit of a tantrum.
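The exact-signature check could be sketched with the inspect module (a hypothetical helper; the PEP does not specify an implementation):

```python
import inspect

def signatures_match(role_method, impl_method):
    # Signature objects compare parameter names, kinds, and defaults.
    return inspect.signature(role_method) == inspect.signature(impl_method)

def role_bark(self):                 # what the Doglike role demands
    pass

def coyote_bark(self, target=None):  # Coyote's near miss
    pass

assert not signatures_match(role_bark, coyote_bark)
assert signatures_match(role_bark, lambda self: None)
```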
Mechanism
The following are strawman proposals for how roles might be expressed in Python. The examples here are phrased in a way that the roles mechanism may be implemented without changing the Python interpreter. (Examples adapted from an article on Perl 6 roles by Curtis Poe [2].)
Static class role assignment
@perform_role(Thieving)
class Elf(Character):
    ...
perform_role() accepts multiple arguments, such that this is also legal:
@perform_role(Thieving, Spying, Archer)
class Elf(Character):
    ...
The Elf class now performs the Thieving, Spying, and Archer roles.
Querying instances
if performs(my_elf, Thieving):
    ...
The second argument to performs() may also be anything with a __contains__() method, meaning the following is legal:
if performs(my_elf, set([Thieving, Spying, BoyScout])):
    ...
Like isinstance(), the object needs only to perform a single role out of the set in order for the expression to be true.
Relationship to Abstract Base Classes
Early drafts of this PEP [5] envisioned roles as competing with the abstract base classes proposed in PEP 3119. After further discussion and deliberation, a compromise and a delegation of responsibilities and use-cases has been worked out as follows:
Roles provide a way of indicating an object's semantics and abstract capabilities. A role may define abstract methods, but only as a way of delineating an interface through which a particular set of semantics are accessed. An Ordering role might require that some set of ordering operators be defined.
class Ordering(metaclass=Role):
    def __ge__(self, other): pass
    def __le__(self, other): pass
    def __ne__(self, other): pass
    # ...and so on

In this way, we're able to indicate an object's role or function within a larger system without constraining or concerning ourselves with a particular implementation.
Abstract base classes, by contrast, are a way of reusing common, discrete units of implementation. For example, one might define an OrderingMixin that implements several ordering operators in terms of other operators.
class OrderingMixin:
    def __ge__(self, other):
        return self > other or self == other
    def __le__(self, other):
        return self < other or self == other
    def __ne__(self, other):
        return not self == other
    # ...and so on

Using this abstract base class - more properly, a concrete mixin - allows a programmer to define a limited set of operators and let the mixin in effect "derive" the others.
By combining these two orthogonal systems, we're able to both a) provide functionality, and b) alert consumer systems to the presence and availability of this functionality. For example, since the OrderingMixin class above satisfies the interface and semantics expressed in the Ordering role, we say the mixin performs the role:
@perform_role(Ordering)
class OrderingMixin:
def __ge__(self, other):
return self > other or self == other
def __le__(self, other):
return self < other or self == other
def __ne__(self, other):
return not self == other
# ...and so on
Now, any class that uses the mixin will automatically -- that is, without further programmer effort -- be tagged as performing the Ordering role.
The separation of concerns into two distinct, orthogonal systems is desirable because it allows us to use each one separately. Take, for example, a third-party package providing a RecursiveHash role that indicates a container takes its contents into account when determining its hash value. Since Python's built-in tuple and frozenset classes follow this semantic, the RecursiveHash role can be applied to them.
>>> perform_role(RecursiveHash)(tuple)
>>> perform_role(RecursiveHash)(frozenset)
Any code that consumes RecursiveHash objects will now be able to consume tuples and frozensets.
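Built-in types reject attribute assignment, so a stand-alone library would likely record roles in an external registry rather than on the class itself (a hypothetical sketch; names are illustrative):

```python
_role_registry = {}

def perform_role(*roles):
    def decorate(cls):
        _role_registry.setdefault(cls, set()).update(roles)
        return cls
    return decorate

def performs(obj, role):
    # Walk the MRO so roles applied to a base class are inherited.
    return any(role in _role_registry.get(c, ()) for c in type(obj).__mro__)

class RecursiveHash: pass

perform_role(RecursiveHash)(tuple)
perform_role(RecursiveHash)(frozenset)

assert performs((1, 2), RecursiveHash)
assert performs(frozenset([1]), RecursiveHash)
assert not performs([1, 2], RecursiveHash)
```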
Open Issues
Allowing Instances to Perform Different Roles Than Their Class
Perl 6 allows instances to perform different roles than their class. These changes are local to the single instance and do not affect other instances of the class. For example:
my_elf = Elf()
my_elf.goes_on_quest()
my_elf.becomes_evil()
now_performs(my_elf, Thieving)  # Only this one elf is a thief
my_elf.steals(["purses", "candy", "kisses"])
In Perl 6, this is done by creating an anonymous class that inherits from the instance's original parent and performs the additional role(s). This is possible in Python 3, though whether it is desirable is another matter.
Inclusion of this feature would, of course, make it much easier to express the works of Charles Dickens in Python:
>>> from literature import role, BildungsRoman
>>> from dickens import Urchin, Gentleman
>>>
>>> with BildungsRoman() as OliverTwist:
...     mr_brownlow = Gentleman()
...     oliver, artful_dodger = Urchin(), Urchin()
...     now_performs(artful_dodger, [role.Thief, role.Scoundrel])
...
...     oliver.has_adventures_with(ArtfulDodger)
...     mr_brownlow.adopt_orphan(oliver)
...     now_performs(oliver, role.RichWard)
Requiring Attributes
Neal Norwitz has requested the ability to make assertions about the presence of attributes using the same mechanism used to require methods. Since roles take effect at class definition-time, and since the vast majority of attributes are defined at runtime by a class's __init__() method, there doesn't seem to be a good way to check for attributes at the same time as methods.
It may still be desirable to include non-enforced attributes in the role definition, if only for documentation purposes.
Roles of Roles
Under the proposed semantics, it is possible for roles to have roles of their own.
@perform_role(Y)
class X(metaclass=Role):
    ...
While this is possible, it is meaningless, since roles are generally not instantiated. There has been some off-line discussion about giving meaning to this expression, but so far no good ideas have emerged.
class_performs()
It is currently not possible to ask a class if its instances perform a given role. It may be desirable to provide an analogue to performs() such that
>>> isinstance(my_dwarf, Dwarf)
True
>>> performs(my_dwarf, Surly)
True
>>> performs(Dwarf, Surly)
False
>>> class_performs(Dwarf, Surly)
True
Prettier Dynamic Role Assignment
An early draft of this PEP included a separate mechanism for dynamically assigning a role to a class. This was spelled
>>> now_perform(Dwarf, GoldMiner)
This same functionality already exists by unpacking the syntactic sugar provided by decorators:
>>> perform_role(GoldMiner)(Dwarf)
At issue is whether dynamic role assignment is sufficiently important to warrant a dedicated spelling.
Syntax Support
Though the phrasings laid out in this PEP are designed so that the roles system could be shipped as a stand-alone package, it may be desirable to add special syntax for defining, assigning and querying roles. One example might be a role keyword, which would translate
class MyRole(metaclass=Role): ...
into
role MyRole: ...
Assigning a role could take advantage of the class definition arguments proposed in PEP 3115:
class MyClass(performs=MyRole): ...
Implementation
A reference implementation is forthcoming.
Acknowledgements
Thanks to Jeffery Yasskin, Talin and Guido van Rossum for several hours of in-person discussion to iron out the differences, overlap and finer points of roles and abstract base classes.
References
| [1] | http://en.wikipedia.org/wiki/AIBO |
| [2] | http://www.perlmonks.org/?node_id=384858 |
| [3] | http://dev.perl.org/perl6/doc/design/syn/S12.html |
| [4] | http://www.iam.unibe.ch/~scg/Archive/Papers/Scha03aTraits.pdf |
| [5] | http://mail.python.org/pipermail/python-3000/2007-April/007026.html |
Copyright
This document has been placed in the public domain.
pep-3134 Exception Chaining and Embedded Tracebacks
| PEP: | 3134 |
|---|---|
| Title: | Exception Chaining and Embedded Tracebacks |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Ka-Ping Yee |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 12-May-2005 |
| Python-Version: | 3.0 |
| Post-History: |
Numbering Note
This PEP started its life as PEP 344. Since it is now targeted
for Python 3000, it has been moved into the 3xxx space.
Abstract
This PEP proposes three standard attributes on exception instances:
the '__context__' attribute for implicitly chained exceptions, the
'__cause__' attribute for explicitly chained exceptions, and the
'__traceback__' attribute for the traceback. A new "raise ... from"
statement sets the '__cause__' attribute.
Motivation
During the handling of one exception (exception A), it is possible
that another exception (exception B) may occur. In today's Python
(version 2.4), if this happens, exception B is propagated outward
and exception A is lost. In order to debug the problem, it is
useful to know about both exceptions. The '__context__' attribute
retains this information automatically.
Sometimes it can be useful for an exception handler to intentionally
re-raise an exception, either to provide extra information or to
translate an exception to another type. The '__cause__' attribute
provides an explicit way to record the direct cause of an exception.
In today's Python implementation, exceptions are composed of three
parts: the type, the value, and the traceback. The 'sys' module
exposes the current exception in three parallel variables, exc_type,
exc_value, and exc_traceback, the sys.exc_info() function returns a
tuple of these three parts, and the 'raise' statement has a
three-argument form accepting these three parts. Manipulating
exceptions often requires passing these three things in parallel,
which can be tedious and error-prone. Additionally, the 'except'
statement can only provide access to the value, not the traceback.
Adding the '__traceback__' attribute to exception values makes all
the exception information accessible from a single place.
History
Raymond Hettinger [1] raised the issue of masked exceptions on
Python-Dev in January 2003 and proposed a PyErr_FormatAppend()
function that C modules could use to augment the currently active
exception with more information. Brett Cannon [2] brought up
chained exceptions again in June 2003, prompting a long discussion.
Greg Ewing [3] identified the case of an exception occurring in a
'finally' block during unwinding triggered by an original exception,
as distinct from the case of an exception occurring in an 'except'
block that is handling the original exception.
Greg Ewing [4] and Guido van Rossum [5], and probably others, have
previously mentioned adding a traceback attribute to Exception
instances. This is noted in PEP 3000.
This PEP was motivated by yet another recent Python-Dev reposting
of the same ideas [6] [7].
Rationale
The Python-Dev discussions revealed interest in exception chaining
for two quite different purposes. To handle the unexpected raising
of a secondary exception, the exception must be retained implicitly.
To support intentional translation of an exception, there must be a
way to chain exceptions explicitly. This PEP addresses both.
Several attribute names for chained exceptions have been suggested
on Python-Dev [2], including 'cause', 'antecedent', 'reason',
'original', 'chain', 'chainedexc', 'exc_chain', 'excprev',
'previous', and 'precursor'. For an explicitly chained exception,
this PEP suggests '__cause__' because of its specific meaning. For
an implicitly chained exception, this PEP proposes the name
'__context__' because the intended meaning is more specific than
temporal precedence but less specific than causation: an exception
occurs in the context of handling another exception.
This PEP suggests names with leading and trailing double-underscores
for these three attributes because they are set by the Python VM.
Only in very special cases should they be set by normal assignment.
This PEP handles exceptions that occur during 'except' blocks and
'finally' blocks in the same way. Reading the traceback makes it
clear where the exceptions occurred, so additional mechanisms for
distinguishing the two cases would only add unnecessary complexity.
This PEP proposes that the outermost exception object (the one
exposed for matching by 'except' clauses) be the most recently
raised exception for compatibility with current behaviour.
This PEP proposes that tracebacks display the outermost exception
last, because this would be consistent with the chronological order
of tracebacks (from oldest to most recent frame) and because the
actual thrown exception is easier to find on the last line.
To keep things simpler, the C API calls for setting an exception
will not automatically set the exception's '__context__'. Guido
van Rossum has expressed concerns with making such changes [8].
As for other languages, Java and Ruby both discard the original
exception when another exception occurs in a 'catch'/'rescue' or
'finally'/'ensure' clause. Perl 5 lacks built-in structured
exception handling. For Perl 6, RFC number 88 [9] proposes an exception
mechanism that implicitly retains chained exceptions in an array
named @@. In that RFC, the most recently raised exception is
exposed for matching, as in this PEP; also, arbitrary expressions
(possibly involving @@) can be evaluated for exception matching.
Exceptions in C# contain a read-only 'InnerException' property that
may point to another exception. Its documentation [10] says that
"When an exception X is thrown as a direct result of a previous
exception Y, the InnerException property of X should contain a
reference to Y." This property is not set by the VM automatically;
rather, all exception constructors take an optional 'innerException'
argument to set it explicitly. The '__cause__' attribute fulfills
the same purpose as InnerException, but this PEP proposes a new form
of 'raise' rather than extending the constructors of all exceptions.
C# also provides a GetBaseException method that jumps directly to
the end of the InnerException chain; this PEP proposes no analog.
The reason all three of these attributes are presented together in
one proposal is that the '__traceback__' attribute provides
convenient access to the traceback on chained exceptions.
Implicit Exception Chaining
Here is an example to illustrate the '__context__' attribute.
def compute(a, b):
try:
a/b
except Exception, exc:
log(exc)
def log(exc):
file = open('logfile.txt') # oops, forgot the 'w'
print >>file, exc
file.close()
Calling compute(0, 0) causes a ZeroDivisionError. The compute()
function catches this exception and calls log(exc), but the log()
function also raises an exception when it tries to write to a
file that wasn't opened for writing.
In today's Python, the caller of compute() gets thrown an IOError.
The ZeroDivisionError is lost. With the proposed change, the
instance of IOError has an additional '__context__' attribute that
retains the ZeroDivisionError.
The following more elaborate example demonstrates the handling of a
mixture of 'finally' and 'except' clauses:
def main(filename):
file = open(filename) # oops, forgot the 'w'
try:
try:
compute()
except Exception, exc:
log(file, exc)
finally:
file.clos() # oops, misspelled 'close'
def compute():
1/0
def log(file, exc):
try:
print >>file, exc # oops, file is not writable
except:
display(exc)
def display(exc):
print ex # oops, misspelled 'exc'
Calling main() with the name of an existing file will trigger four
exceptions. The ultimate result will be an AttributeError due to
the misspelling of 'clos', whose __context__ points to a NameError
due to the misspelling of 'ex', whose __context__ points to an
IOError due to the file being read-only, whose __context__ points to
a ZeroDivisionError, whose __context__ attribute is None.
The proposed semantics are as follows:
1. Each thread has an exception context initially set to None.
2. Whenever an exception is raised, if the exception instance does
not already have a '__context__' attribute, the interpreter sets
it equal to the thread's exception context.
3. Immediately after an exception is raised, the thread's exception
context is set to the exception.
4. Whenever the interpreter exits an 'except' block by reaching the
end or executing a 'return', 'yield', 'continue', or 'break'
statement, the thread's exception context is set to None.
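These semantics were adopted, and the implicit chain can be observed in any Python 3 interpreter:

```python
# The secondary exception implicitly retains the original one
# via its '__context__' attribute.
try:
    try:
        1 / 0
    except ZeroDivisionError:
        raise KeyError("secondary failure")
except KeyError as exc:
    chained = exc.__context__

assert isinstance(chained, ZeroDivisionError)
assert chained.__context__ is None  # the chain ends at the first exception
```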
Explicit Exception Chaining
The '__cause__' attribute on exception objects is always initialized
to None. It is set by a new form of the 'raise' statement:
raise EXCEPTION from CAUSE
which is equivalent to:
exc = EXCEPTION
exc.__cause__ = CAUSE
raise exc
In the following example, a database provides implementations for a
few different kinds of storage, with file storage as one kind. The
database designer wants errors to propagate as DatabaseError objects
so that the client doesn't have to be aware of the storage-specific
details, but doesn't want to lose the underlying error information.
class DatabaseError(Exception):
    pass

class FileDatabase(Database):
    def __init__(self, filename):
        try:
            self.file = open(filename)
        except IOError, exc:
            raise DatabaseError('failed to open') from exc
If the call to open() raises an exception, the problem will be
reported as a DatabaseError, with a __cause__ attribute that reveals
the IOError as the original cause.
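A runnable restatement of this example in the syntax Python 3 later adopted; for self-containment the Database base class is omitted and FileDatabase is defined standalone (note that in Python 3, IOError became an alias of OSError):

```python
class DatabaseError(Exception):
    pass

class FileDatabase:                          # stand-in for the PEP's Database subclass
    def __init__(self, filename):
        try:
            self.file = open(filename)
        except OSError as exc:               # Python 3: IOError is an alias of OSError
            raise DatabaseError('failed to open') from exc

try:
    FileDatabase('/nonexistent-path/db')
    err = None
except DatabaseError as e:
    err = e

assert isinstance(err.__cause__, OSError)    # the original cause survives as __cause__
```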
Traceback Attribute
The following example illustrates the '__traceback__' attribute.
def do_logged(file, work):
    try:
        work()
    except Exception, exc:
        write_exception(file, exc)
        raise exc

from traceback import format_tb

def write_exception(file, exc):
    ...
    type = exc.__class__
    message = str(exc)
    lines = format_tb(exc.__traceback__)
    file.write(... type ... message ... lines ...)
    ...
In today's Python, the do_logged() function would have to extract
the traceback from sys.exc_traceback or sys.exc_info()[2] and pass
both the value and the traceback to write_exception(). With the
proposed change, write_exception() simply gets one argument and
obtains the exception using the '__traceback__' attribute.
The proposed semantics are as follows:
1. Whenever an exception is caught, if the exception instance does
not already have a '__traceback__' attribute, the interpreter
sets it to the newly caught traceback.
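In the form Python 3 adopted, the attribute can be exercised directly (a minimal sketch; the fail() helper is a hypothetical example):

```python
import traceback

def fail():
    raise ValueError('boom')

try:
    fail()
except ValueError as exc:
    caught = exc

# The caught exception carries its own traceback; no sys.exc_info() needed.
lines = traceback.format_tb(caught.__traceback__)
assert any('fail' in line for line in lines)   # the failing frame is recorded
```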
Enhanced Reporting
The default exception handler will be modified to report chained
exceptions. The chain of exceptions is traversed by following the
'__cause__' and '__context__' attributes, with '__cause__' taking
priority. In keeping with the chronological order of tracebacks,
the most recently raised exception is displayed last; that is, the
display begins with the description of the innermost exception and
backs up the chain to the outermost exception. The tracebacks are
formatted as usual, with one of the lines:
The above exception was the direct cause of the following exception:
or
During handling of the above exception, another exception occurred:
between tracebacks, depending whether they are linked by __cause__
or __context__ respectively. Here is a sketch of the procedure:
def print_chain(exc):
    if exc.__cause__:
        print_chain(exc.__cause__)
        print '\nThe above exception was the direct cause...'
    elif exc.__context__:
        print_chain(exc.__context__)
        print '\nDuring handling of the above exception, ...'
    print_exc(exc)
In the 'traceback' module, the format_exception, print_exception,
print_exc, and print_last functions will be updated to accept an
optional 'chain' argument, True by default. When this argument is
True, these functions will format or display the entire chain of
exceptions as just described. When it is False, these functions
will format or display only the outermost exception.
The 'cgitb' module should also be updated to display the entire
chain of exceptions.
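The 'chain' argument described above did ship in Python 3's traceback module; a minimal sketch (in Python 3 syntax) contrasting chain=True and chain=False:

```python
import traceback

try:
    try:
        1 / 0
    except ZeroDivisionError:
        raise KeyError('k')
except KeyError as exc:
    err = exc

# format_exception accepts the optional chain argument, True by default.
full = ''.join(traceback.format_exception(type(err), err, err.__traceback__, chain=True))
only = ''.join(traceback.format_exception(type(err), err, err.__traceback__, chain=False))

assert 'ZeroDivisionError' in full        # the chained context is formatted...
assert 'ZeroDivisionError' not in only    # ...unless chain=False
assert 'During handling of the above exception' in full
```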
C API
The PyErr_Set* calls for setting exceptions will not set the
'__context__' attribute on exceptions. PyErr_NormalizeException
will always set the 'traceback' attribute to its 'tb' argument and
the '__context__' and '__cause__' attributes to None.
A new API function, PyErr_SetContext(context), will help C
programmers provide chained exception information. This function
will first normalize the current exception so it is an instance,
then set its '__context__' attribute. A similar API function,
PyErr_SetCause(cause), will set the '__cause__' attribute.
Compatibility
Chained exceptions expose the type of the most recent exception, so
they will still match the same 'except' clauses as they do now.
The proposed changes should not break any code unless it sets or
uses attributes named '__context__', '__cause__', or '__traceback__'
on exception instances. As of 2005-05-12, the Python standard
library contains no mention of such attributes.
Open Issue: Extra Information
Walter Dörwald [11] expressed a desire to attach extra information
to an exception during its upward propagation without changing its
type. This could be a useful feature, but it is not addressed by
this PEP. It could conceivably be addressed by a separate PEP
establishing conventions for other informational attributes on
exceptions.
Open Issue: Suppressing Context
As written, this PEP makes it impossible to suppress '__context__',
since setting exc.__context__ to None in an 'except' or 'finally'
clause will only result in it being set again when exc is raised.
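Historically, this open issue was later resolved by PEP 409 and PEP 415, which added the 'raise ... from None' form; a minimal sketch of the eventual Python 3 behavior:

```python
def convert():
    try:
        1 / 0
    except ZeroDivisionError:
        raise KeyError('k') from None   # suppress the implicit context

try:
    convert()
except KeyError as exc:
    err = exc

assert isinstance(err.__context__, ZeroDivisionError)  # still recorded...
assert err.__suppress_context__                        # ...but flagged for suppression
```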
Open Issue: Limiting Exception Types
To improve encapsulation, library implementors may want to wrap all
implementation-level exceptions with an application-level exception.
One could try to wrap exceptions by writing this:
try:
    ... implementation may raise an exception ...
except:
    import sys
    raise ApplicationError from sys.exc_value
or this:
try:
    ... implementation may raise an exception ...
except Exception, exc:
    raise ApplicationError from exc
but both are somewhat flawed. It would be nice to be able to name
the current exception in a catch-all 'except' clause, but that isn't
addressed here. Such a feature would allow something like this:
try:
    ... implementation may raise an exception ...
except *, exc:
    raise ApplicationError from exc
Open Issue: yield
The exception context is lost when a 'yield' statement is executed;
resuming the frame after the 'yield' does not restore the context.
Addressing this problem is out of the scope of this PEP; it is not a
new problem, as demonstrated by the following example:
>>> def gen():
...     try:
...         1/0
...     except:
...         yield 3
...         raise
...
>>> g = gen()
>>> g.next()
3
>>> g.next()
TypeError: exceptions must be classes, instances, or strings (deprecated), not NoneType
Open Issue: Garbage Collection
The strongest objection to this proposal has been that it creates
cycles between exceptions and stack frames [12]. Collection of
cyclic garbage (and therefore resource release) can be greatly
delayed.
>>> try:
...     1/0
... except Exception, err:
...     pass
will introduce a cycle from err -> traceback -> stack frame -> err,
keeping all locals in the same scope alive until the next GC happens.
Today, these locals would go out of scope. There is lots of code
which assumes that "local" resources -- particularly open files -- will
be closed quickly. If closure has to wait for the next GC, a program
(which runs fine today) may run out of file handles.
Making the __traceback__ attribute a weak reference would avoid the
problems with cyclic garbage. Unfortunately, it would make saving
the Exception for later (as unittest does) more awkward, and it would
not allow as much cleanup of the sys module.
A possible alternate solution, suggested by Adam Olsen, would be to
instead turn the reference from the stack frame to the 'err' variable
into a weak reference when the variable goes out of scope [13].
Possible Future Compatible Changes
These changes are consistent with the appearance of exceptions as
a single object rather than a triple at the interpreter level.
- If PEP 340 or PEP 343 is accepted, replace the three (type, value,
traceback) arguments to __exit__ with a single exception argument.
- Deprecate sys.exc_type, sys.exc_value, sys.exc_traceback, and
sys.exc_info() in favour of a single member, sys.exception.
- Deprecate sys.last_type, sys.last_value, and sys.last_traceback
in favour of a single member, sys.last_exception.
- Deprecate the three-argument form of the 'raise' statement in
favour of the one-argument form.
- Upgrade cgitb.html() to accept a single value as its first
argument as an alternative to a (type, value, traceback) tuple.
Possible Future Incompatible Changes
These changes might be worth considering for Python 3000.
- Remove sys.exc_type, sys.exc_value, sys.exc_traceback, and
sys.exc_info().
- Remove sys.last_type, sys.last_value, and sys.last_traceback.
- Replace the three-argument sys.excepthook with a one-argument
API, and changing the 'cgitb' module to match.
- Remove the three-argument form of the 'raise' statement.
- Upgrade traceback.print_exception to accept an 'exception'
argument instead of the type, value, and traceback arguments.
Implementation
The __traceback__ and __cause__ attributes and the new raise syntax were
implemented in revision 57783 [14].
Acknowledgements
Brett Cannon, Greg Ewing, Guido van Rossum, Jeremy Hylton, Phillip
J. Eby, Raymond Hettinger, Walter Dörwald, and others.
References
[1] Raymond Hettinger, "Idea for avoiding exception masking"
http://mail.python.org/pipermail/python-dev/2003-January/032492.html
[2] Brett Cannon explains chained exceptions
http://mail.python.org/pipermail/python-dev/2003-June/036063.html
[3] Greg Ewing points out masking caused by exceptions during finally
http://mail.python.org/pipermail/python-dev/2003-June/036290.html
[4] Greg Ewing suggests storing the traceback in the exception object
http://mail.python.org/pipermail/python-dev/2003-June/036092.html
[5] Guido van Rossum mentions exceptions having a traceback attribute
http://mail.python.org/pipermail/python-dev/2005-April/053060.html
[6] Ka-Ping Yee, "Tidier Exceptions"
http://mail.python.org/pipermail/python-dev/2005-May/053671.html
[7] Ka-Ping Yee, "Chained Exceptions"
http://mail.python.org/pipermail/python-dev/2005-May/053672.html
[8] Guido van Rossum discusses automatic chaining in PyErr_Set*
http://mail.python.org/pipermail/python-dev/2003-June/036180.html
[9] Tony Olensky, "Omnibus Structured Exception/Error Handling Mechanism"
http://dev.perl.org/perl6/rfc/88.html
[10] MSDN .NET Framework Library, "Exception.InnerException Property"
http://msdn.microsoft.com/library/en-us/cpref/html/frlrfsystemexceptionclassinnerexceptiontopic.asp
[11] Walter Dörwald suggests wrapping exceptions to add details
http://mail.python.org/pipermail/python-dev/2003-June/036148.html
[12] Guido van Rossum restates the objection to cyclic trash
http://mail.python.org/pipermail/python-3000/2007-January/005322.html
[13] Adam Olsen suggests using a weakref from stack frame to exception
http://mail.python.org/pipermail/python-3000/2007-January/005363.html
[14] Patch to implement the bulk of the PEP
http://svn.python.org/view/python/branches/py3k/Include/?rev=57783&view=rev
Copyright
This document has been placed in the public domain.
pep-3135 New Super
| PEP: | 3135 |
|---|---|
| Title: | New Super |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Calvin Spealman <ironfroggy at gmail.com>, Tim Delaney <timothy.c.delaney at gmail.com>, Lie Ryan <lie.1296 at gmail.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 28-Apr-2007 |
| Python-Version: | 3.0 |
| Post-History: | 28-Apr-2007, 29-Apr-2007 (1), 29-Apr-2007 (2), 14-May-2007, 12-Mar-2009 |
Contents
Numbering Note
This PEP started its life as PEP 367. Since it is now targeted for Python 3000, it has been moved into the 3xxx space.
Abstract
This PEP proposes syntactic sugar for use of the super type to automatically construct instances of the super type binding to the class that a method was defined in, and the instance (or class object for classmethods) that the method is currently acting upon.
The premise of the new super usage suggested is as follows:
super().foo(1, 2)
to replace the old:
super(Foo, self).foo(1, 2)
Rationale
The current usage of super requires an explicit passing of both the class and instance it must operate from, requiring a breaking of the DRY (Don't Repeat Yourself) rule. This hinders any change in class name, and is often considered a wart by many.
Specification
Within the specification section, some special terminology will be used to distinguish similar and closely related concepts. "super class" will refer to the actual builtin class named "super". A "super instance" is simply an instance of the super class, which is associated with another class and possibly with an instance of that class.
The new super semantics are only available in Python 3.0.
Replacing the old usage of super, calls to the next class in the MRO (method resolution order) can be made without explicitly passing the class object (although doing so will still be supported). Every function will have a cell named __class__ that contains the class object that the function is defined in.
The new syntax:
super()
is equivalent to:
super(__class__, <firstarg>)
where __class__ is the class that the method was defined in, and <firstarg> is the first parameter of the method (normally self for instance methods, and cls for class methods). For functions defined outside a class body, __class__ is not defined, and using super() there will result in a runtime SystemError.
While super is not a reserved word, the parser recognizes the use of super in a method definition and only passes in the __class__ cell when this is found. Thus, calling a global alias of super without arguments will not necessarily work.
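A minimal runnable sketch of the adopted Python 3 behavior (the class names here are hypothetical examples):

```python
class Base:
    def greet(self):
        return 'base'

class Child(Base):
    def greet(self):
        # Zero-argument form: the compiler supplies the __class__ cell and
        # the first argument, so this is equivalent to super(Child, self).greet().
        return 'child/' + super().greet()

assert Child().greet() == 'child/base'
assert Child.greet.__closure__ is not None   # the __class__ cell exists
```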
Closed Issues
Determining the class object to use
The class object is taken from a cell named __class__.
Should super actually become a keyword?
No. It is not necessary for super to become a keyword.
super used with __call__ attributes
It was considered that instantiating super instances the classic way might be a problem, because calling the resulting instance would look up the __call__ attribute and thus try to perform an automatic super lookup to the next class in the MRO. However, this concern was found to be unfounded, because calling an object only looks up the __call__ method directly on the object's type. The following example shows this in action.
class A(object):
    def __call__(self):
        return '__call__'
    def __getattribute__(self, attr):
        if attr == '__call__':
            return lambda: '__getattribute__'

a = A()
assert a() == '__call__'
assert a.__call__() == '__getattribute__'
In any case, this issue goes away entirely because classic calls to super(<class>, <instance>) are still supported with the same meaning.
Alternative Proposals
No Changes
Although it is always attractive to keep things as they are, people have sought a change in the usage of super for some time, for the good reasons mentioned previously.
- Decoupling from the class name (which might not even be bound to the right class anymore!)
- Simpler looking, cleaner super calls would be better
Dynamic attribute on super type
The proposal adds a dynamic attribute lookup to the super type, which will automatically determine the proper class and instance parameters. Each super attribute lookup identifies these parameters and performs the super lookup on the instance, as the current super implementation does with the explicit invocation of a super instance upon a class and instance.
This proposal relies on sys._getframe(), which is not appropriate for anything except a prototype implementation.
self.__super__.foo(*args)
The __super__ attribute is mentioned in this PEP in several places, and could be a candidate for the complete solution, used explicitly instead of any direct super usage. However, double-underscore names are usually an internal detail and are generally kept out of everyday code.
super(self, *args) or __super__(self, *args)
This solution only solves the problem of the type indication, does not handle differently named super methods, and is explicit about the name of the instance. It is less flexible because it cannot act on other method names in cases where that is needed. One use case it fails is where a base class has a factory classmethod and a subclass has two factory classmethods, both of which need to make proper super calls to the one in the base class.
super.foo(self, *args)
This variation actually eliminates the problems with locating the proper instance, and if any of the alternatives were pushed into the spotlight, I would want it to be this one.
super(*p, **kw)
There has been the proposal that directly calling super(*p, **kw) would be equivalent to calling the method on the super object with the same name as the method currently being executed i.e. the following two methods would be equivalent:
def f(self, *p, **kw):
    super.f(*p, **kw)

def f(self, *p, **kw):
    super(*p, **kw)
There is strong sentiment for and against this, but implementation and style concerns are obvious. Guido has suggested that this should be excluded from this PEP on the principle of KISS (Keep It Simple Stupid).
History
- 12-Mar-2009 - Updated to reflect the current state of implementation.
- 29-Apr-2007 - Changed title from "Super As A Keyword" to "New Super".
  - Updated much of the language and added a terminology section for clarification in confusing places.
  - Added reference implementation and history sections.
- 06-May-2007 - Updated by Tim Delaney to reflect discussions on the python-3000 and python-dev mailing lists.
References
| [1] | Fixing super anyone? (http://mail.python.org/pipermail/python-3000/2007-April/006667.html) |
| [2] | PEP 3130: Access to Module/Class/Function Currently Being Defined (this) (http://mail.python.org/pipermail/python-ideas/2007-April/000542.html) |
Copyright
This document has been placed in the public domain.
pep-3136 Labeled break and continue
| PEP: | 3136 |
|---|---|
| Title: | Labeled break and continue |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Matt Chisholm <matt-python at theory.org> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 30-Jun-2007 |
| Python-Version: | 3.1 |
| Post-History: |
Contents
Rejection Notice
This PEP is rejected. See http://mail.python.org/pipermail/python-3000/2007-July/008663.html.
Abstract
This PEP proposes support for labels in Python's break and continue statements. It is inspired by labeled break and continue in other languages, and the author's own infrequent but persistent need for such a feature.
Introduction
The break statement allows the programmer to terminate a loop early, and the continue statement allows the programmer to move to the next iteration of a loop early. In Python currently, break and continue can apply only to the innermost enclosing loop.
Adding support for labels to the break and continue statements is a logical extension to the existing behavior of the break and continue statements. Labeled break and continue can improve the readability and flexibility of complex code which uses nested loops.
For brevity's sake, the examples and discussion in this PEP usually refer to the break statement. However, all of the examples and motivations apply equally to labeled continue.
Motivation
If the programmer wishes to move to the next iteration of an outer enclosing loop, or to terminate multiple loops at once, he or she has a few less-than-elegant options.
Here's one common way of imitating labeled break in Python (For this and future examples, ... denotes an arbitrary number of intervening lines of code):
for a in a_list:
    time_to_break_out_of_a = False
    ...
    for b in b_list:
        ...
        if condition_one(a, b):
            break
        ...
        if condition_two(a, b):
            time_to_break_out_of_a = True
            break
        ...
    if time_to_break_out_of_a:
        break
    ...
This requires five lines and an extra variable, time_to_break_out_of_a, to keep track of when to break out of the outer (a) loop. And those five lines are spread across many lines of code, making the control flow difficult to understand.
This technique is also error-prone. A programmer modifying this code might inadvertently put new code after the end of the inner (b) loop but before the test for time_to_break_out_of_a, instead of after the test. This means that code which should have been skipped by breaking out of the outer loop gets executed incorrectly.
This could also be written with an exception. The programmer would declare a special exception, wrap the inner loop in a try, and catch the exception and break when you see it:
class BreakOutOfALoop(Exception): pass

for a in a_list:
    ...
    try:
        for b in b_list:
            ...
            if condition_one(a, b):
                break
            ...
            if condition_two(a, b):
                raise BreakOutOfALoop
            ...
    except BreakOutOfALoop:
        break
    ...
Again, though, this requires five lines and a new, single-purpose exception class (instead of a new variable), and spreads basic control flow out over many lines. And it breaks out of the inner loop with break and out of the outer loop with an exception, which is inelegant. [1]
This next strategy might be the most elegant solution, assuming condition_two() is inexpensive to compute:
for a in a_list:
    ...
    for b in b_list:
        ...
        if condition_one(a, b):
            break
        ...
        if condition_two(a, b):
            break
        ...
    if condition_two(a, b):
        break
    ...
Breaking twice is still inelegant. This implementation also relies on the fact that the inner (b) loop bleeds b into the outer for loop, which (although explicitly supported) is both surprising to novices, and in my opinion counter-intuitive and poor practice.
The programmer must also still remember to put in both breaks on condition two and not insert code before the second break. A single conceptual action, breaking out of both loops on condition_two(), requires four lines of code at two indentation levels, possibly separated by many intervening lines at the end of the inner (b) loop.
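A further workaround, not among the PEP's examples but common in practice, is to wrap the nested loops in a function and use return as a de facto labeled break; a runnable sketch, where the arithmetic test stands in for the PEP's hypothetical condition_two(a, b):

```python
def first_match(a_list, b_list):
    for a in a_list:
        for b in b_list:
            if a + b == 7:        # stand-in for condition_two(a, b)
                return (a, b)     # exits both loops in one step
    return None

assert first_match([1, 2, 3], [4, 5, 6]) == (1, 6)
assert first_match([1], [1]) is None
```

This keeps the control flow in one place, at the cost of introducing a function boundary.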
Other languages
Now, put aside whatever dislike you may have for other programming languages, and consider the syntax of labeled break and continue. In Perl:
ALOOP: foreach $a (@a_array){
    ...
    BLOOP: foreach $b (@b_array){
        ...
        if (condition_one($a,$b)){
            last BLOOP; # same as plain old last;
        }
        ...
        if (condition_two($a,$b)){
            last ALOOP;
        }
        ...
    }
    ...
}
(Notes: Perl uses last instead of break. The BLOOP labels could be omitted; last and continue apply to the innermost loop by default.)
PHP uses a number denoting the number of loops to break out of, rather than a label:
foreach ($a_array as $a){
    ....
    foreach ($b_array as $b){
        ....
        if (condition_one($a, $b)){
            break 1; # same as plain old break
        }
        ....
        if (condition_two($a, $b)){
            break 2;
        }
        ....
    }
    ...
}
C/C++, Java, and Ruby all have similar constructions.
The control flow regarding when to break out of the outer (a) loop is fully encapsulated in the break statement which gets executed when the break condition is satisfied. The depth of the break statement does not matter. Control flow is not spread out. No extra variables, exceptions, or re-checking or storing of control conditions is required. There is no danger that code will get inadvertently inserted after the end of the inner (b) loop and before the break condition is re-checked inside the outer (a) loop. These are the benefits that labeled break and continue would bring to Python.
What this PEP is not
This PEP is not a proposal to add GOTO to Python. GOTO allows a programmer to jump to an arbitrary block or line of code, and generally makes control flow more difficult to follow. Although break and continue (with or without support for labels) can be considered a type of GOTO, it is much more restricted. Another Python construct, yield, could also be considered a form of GOTO -- an even less restrictive one. The goal of this PEP is to propose an extension to the existing control flow tools break and continue, to make control flow easier to understand, not more difficult.
Labeled break and continue cannot transfer control to another function or method. They cannot even transfer control to an arbitrary line of code in the current scope. Currently, they can only affect the behavior of a loop, and are quite different and much more restricted than GOTO. This extension allows them to affect any enclosing loop in the current name-space, but it does not change their behavior to that of GOTO.
Specification
Under all of these proposals, break and continue by themselves will continue to behave as they currently do, applying to the innermost loop by default.
Proposal A - Explicit labels
The for and while loop syntax will be followed by an optional as or label (contextual) keyword [2] and then an identifier, which may be used to identify the loop out of which to break (or which should be continued).
The break (and continue) statements will be followed by an optional identifier that refers to the loop out of which to break (or which should be continued). Here is an example using the as keyword:
for a in a_list as a_loop:
    ...
    for b in b_list as b_loop:
        ...
        if condition_one(a, b):
            break b_loop # same as plain old break
        ...
        if condition_two(a, b):
            break a_loop
        ...
    ...
Or, with label instead of as:
for a in a_list label a_loop:
    ...
    for b in b_list label b_loop:
        ...
        if condition_one(a, b):
            break b_loop # same as plain old break
        ...
        if condition_two(a, b):
            break a_loop
        ...
    ...
This has all the benefits outlined above. It requires modifications to the language syntax: to the break and continue statements and to the for and while statements. It requires either a new conditional keyword label or an extension to the conditional keyword as. [3] It is unlikely to require any changes to existing Python programs. Passing an identifier not defined in the local scope to break or continue would raise a NameError.
Proposal B - Numeric break & continue
Rather than altering the syntax of for and while loops, break and continue would take a numeric argument denoting the enclosing loop which is being controlled, similar to PHP.
It seems more Pythonic to me for break and continue to refer to loops indexing from zero, as opposed to indexing from one as PHP does.
for a in a_list:
    ...
    for b in b_list:
        ...
        if condition_one(a,b):
            break 0 # same as plain old break
        ...
        if condition_two(a,b):
            break 1
        ...
    ...
Passing a number that was too large, or less than zero, or non-integer to break or continue would (probably) raise an IndexError.
This proposal would not require any changes to existing Python programs.
Proposal C - The reduplicative method
The syntax of break and continue would be altered to allow multiple break and continue statements on the same line. Thus, break break would break out of the first and second enclosing loops.
for a in a_list:
    ...
    for b in b_list:
        ...
        if condition_one(a,b):
            break # plain old break
        ...
        if condition_two(a,b):
            break break
        ...
    ...
This would also allow the programmer to break out of the inner loop and continue the next outermost simply by writing break continue, [4] and so on. I'm not sure what exception would be raised if the programmer used more break or continue statements than existing loops (perhaps a SyntaxError?).
I expect this proposal to get rejected because it will be judged too difficult to understand.
This proposal would not require any changes to existing Python programs.
Proposal D - Explicit iterators
Rather than embellishing for and while loop syntax with labels, the programmer wishing to use labeled breaks would be required to create the iterator explicitly and assign it to an identifier if he or she wanted to break out of or continue that loop from within a deeper loop.
a_iter = iter(a_list)
for a in a_iter:
    ...
    b_iter = iter(b_list)
    for b in b_iter:
        ...
        if condition_one(a,b):
            break b_iter # same as plain old break
        ...
        if condition_two(a,b):
            break a_iter
        ...
    ...
Passing a non-iterator object to break or continue would raise a TypeError; and a nonexistent identifier would raise a NameError. This proposal requires only one extra line to create a labeled loop, and no extra lines to break out of a containing loop, and no changes to existing Python programs.
Proposal E - Explicit iterators and iterator methods
This is a variant of Proposal D. Iterators would need to be created explicitly if anything other than the most basic use of break and continue was required. Instead of modifying the syntax of break and continue, .break() and .continue() methods could be added to the Iterator type.
a_iter = iter(a_list)
for a in a_iter:
    ...
    b_iter = iter(b_list)
    for b in b_iter:
        ...
        if condition_one(a,b):
            b_iter.break() # same as plain old break
        ...
        if condition_two(a,b):
            a_iter.break()
        ...
    ...
I expect that this proposal will get rejected on the grounds of sheer ugliness. However, it requires no changes to the language syntax whatsoever, nor does it require any changes to existing Python programs.
Implementation
I have never looked at the Python language implementation itself, so I have no idea how difficult this would be to implement. If this PEP is accepted, but no one is available to write the feature, I will try to implement it myself.
Footnotes
| [1] | Breaking some loops with exceptions is inelegant because it's a violation of There's Only One Way To Do It. |
| [2] | Or really any new contextual keyword that the community likes: as, label, labeled, loop, name, named, walrus, whatever. |
| [3] | The use of as in a similar context has been proposed here, http://sourceforge.net/tracker/index.php?func=detail&aid=1714448&group_id=5470&atid=355470 but to my knowledge this idea has not been written up as a PEP. |
| [4] | To continue the Nth outer loop, you would write break N-1 times and then continue. Only one continue would be allowed, and only at the end of a sequence of breaks. continue break or continue continue makes no sense. |
Resources
This issue has come up before, although it has never been resolved, to my knowledge.
- labeled breaks [5], on comp.lang.python, in the context of do...while loops
- break LABEL vs. exceptions + PROPOSAL [6], on python-list, as compared to using Exceptions for flow control
- Named code blocks [7] on python-list, a suggestion motivated by the desire for labeled break / continue
- mod_python bug fix [8] An example of someone setting a flag inside an inner loop that triggers a continue in the containing loop, to work around the absence of labeled break and continue
References
| [5] | http://groups.google.com/group/comp.lang.python/browse_thread/thread/6da848f762c9cf58/979ca3cd42633b52?lnk=gst&q=labeled+break&rnum=3#979ca3cd42633b52 |
| [6] | http://mail.python.org/pipermail/python-list/1999-September/#11080 |
| [7] | http://mail.python.org/pipermail/python-list/2001-April/#78439 |
| [8] | http://mail-archives.apache.org/mod_mbox/httpd-python-cvs/200511.mbox/%3C20051112204322.4010.qmail@minotaur.apache.org%3E |
Copyright
This document has been placed in the public domain.
pep-3137 Immutable Bytes and Mutable Buffer
| PEP: | 3137 |
|---|---|
| Title: | Immutable Bytes and Mutable Buffer |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Guido van Rossum <guido at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 26-Sep-2007 |
| Python-Version: | 3.0 |
| Post-History: | 26-Sep-2007, 30-Sep-2007 |
Contents
Introduction
After releasing Python 3.0a1 with a mutable bytes type, pressure mounted to add a way to represent immutable bytes. Gregory P. Smith proposed a patch that would allow making a bytes object temporarily immutable by requesting that the data be locked using the new buffer API from PEP 3118. This did not seem the right approach to me.
Jeffrey Yasskin, with the help of Adam Hupp, then prepared a patch to make the bytes type immutable (by crudely removing all mutating APIs) and fix the fall-out in the test suite. This showed that there aren't all that many places that depend on the mutability of bytes, with the exception of code that builds up a return value from small pieces.
Thinking through the consequences, and noticing that using the array module as an ersatz mutable bytes type is far from ideal, and recalling a proposal put forward earlier by Talin, I floated the suggestion to have both a mutable and an immutable bytes type. (This had been brought up before, but until seeing the evidence of Jeffrey's patch I wasn't open to the suggestion.)
Moreover, a possible implementation strategy became clear: use the old PyString implementation, stripped down to remove locale support and implicit conversions to/from Unicode, for the immutable bytes type, and keep the new PyBytes implementation as the mutable bytes type.
The ensuing discussion made it clear that the idea is welcome but needs to be specified more precisely. Hence this PEP.
Advantages
One advantage of having an immutable bytes type is that code objects can use these. It also makes it possible to efficiently create hash tables using bytes for keys; this may be useful when parsing protocols like HTTP or SMTP which are based on bytes representing text.
Porting code that manipulates binary data (or encoded text) in Python 2.x will be easier using the new design than using the original 3.0 design with mutable bytes; simply replace str with bytes and change '...' literals into b'...' literals.
Naming
I propose the following type names at the Python level:
- bytes is an immutable array of bytes (PyString)
- bytearray is a mutable array of bytes (PyBytes)
- memoryview is a bytes view on another object (PyMemory)
The old type named buffer is so similar to the new type memoryview, introduced by PEP 3118, that it is redundant. The rest of this PEP doesn't discuss the functionality of memoryview; it is just mentioned here to justify getting rid of the old buffer type. (An earlier version of this PEP proposed buffer as the new name for PyBytes; in the end this name was deemed too confusing given the many other uses of the word buffer.)
While eventually it makes sense to change the C API names, this PEP maintains the old C API names, which should be familiar to all.
Summary
Here's a simple ASCII-art table summarizing the type names in various Python versions:
| C name | 2.x repr | 3.0a1 repr | 3.0a2 repr |
|---|---|---|---|
| PyUnicode | unicode u'' | str '' | str '' |
| PyString | str '' | str8 s'' | bytes b'' |
| PyBytes | N/A | bytes b'' | bytearray bytearray(b'') |
| PyBuffer | buffer | buffer | N/A |
| PyMemoryView | N/A | memoryview | memoryview <...> |
Literal Notations
The b'...' notation introduced in Python 3.0a1 returns an immutable bytes object, whatever variation is used. To create a mutable array of bytes, use bytearray(b'...') or bytearray([...]). The latter form takes a list of integers in range(256).
Functionality
PEP 3118 Buffer API
Both bytes and bytearray implement the PEP 3118 buffer API. The bytes type only implements read-only requests; the bytearray type allows writable and data-locked requests as well. The element data type is always 'B' (i.e. unsigned byte).
Constructors
There are five forms of constructors, applicable to both bytes and bytearray:
- bytes(<bytes>), bytes(<bytearray>), bytearray(<bytes>), bytearray(<bytearray>): simple copying constructors, with the note that bytes(<bytes>) might return its (immutable) argument, but bytearray(<bytearray>) always makes a copy.
- bytes(<str>, <encoding>[, <errors>]), bytearray(<str>, <encoding>[, <errors>]): encode a text string. Note that the str.encode() method returns an immutable bytes object. The <encoding> argument is mandatory; <errors> is optional. <encoding> and <errors>, if given, must be str instances.
- bytes(<memory view>), bytearray(<memory view>): construct a bytes or bytearray object from anything that implements the PEP 3118 buffer API.
- bytes(<iterable of ints>), bytearray(<iterable of ints>): construct a bytes or bytearray object from a stream of integers in range(256).
- bytes(<int>), bytearray(<int>): construct a zero-initialized bytes or bytearray object of a given length.
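The constructor forms can be sketched like this (behavior as in released Python 3, which follows this PEP):

```python
# Copying constructors: bytes(<bytes>) may return its argument unchanged,
# bytearray(<bytearray>) always copies.
assert bytes(bytearray(b'hi')) == b'hi'
assert bytearray(bytearray(b'hi')) == bytearray(b'hi')

# Encoding a text string; the encoding argument is mandatory here.
assert bytes('héllo', 'utf-8') == b'h\xc3\xa9llo'

# From anything supporting the PEP 3118 buffer API, e.g. a memoryview.
assert bytearray(memoryview(b'ab')) == bytearray(b'ab')

# From an iterable of ints in range(256), and from a length.
assert bytes([1, 2, 3]) == b'\x01\x02\x03'
assert bytes(4) == b'\x00\x00\x00\x00'   # zero-initialized
```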
Comparisons
The bytes and bytearray types are comparable with each other and orderable, so that e.g. b'abc' == bytearray(b'abc') < b'abd'.
Comparing either type to a str object for equality returns False regardless of the contents of either operand. Ordering comparisons with str raise TypeError. This is all conformant to the standard rules for comparison and ordering between objects of incompatible types.
(Note: in Python 3.0a1, comparing a bytes instance with a str instance would raise TypeError, on the premise that this would catch the occasional mistake quicker, especially in code ported from Python 2.x. However, a long discussion on the python-3000 list pointed out so many problems with this that it is clearly a bad idea, to be rolled back in 3.0a2 regardless of the fate of the rest of this PEP.)
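These comparison rules are easy to verify in released Python 3, which adopted them:

```python
# bytes and bytearray compare by contents, across the two types.
assert b'abc' == bytearray(b'abc')
assert bytearray(b'abc') < b'abd'

# Equality with str is always False; ordering with str raises TypeError.
assert (b'abc' == 'abc') is False
try:
    b'abc' < 'abd'
    raised = False
except TypeError:
    raised = True
assert raised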
Slicing
Slicing a bytes object returns a bytes object. Slicing a bytearray object returns a bytearray object.
Slice assignment to a bytearray object accepts anything that implements the PEP 3118 buffer API, or an iterable of integers in range(256).
Indexing
Indexing bytes and bytearray returns small ints (like the bytes type in 3.0a1, and like lists or array.array('B')).
Assignment to an item of a bytearray object accepts an int in range(256). (To assign from a bytes sequence, use a slice assignment.)
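Indexing, item assignment and slice assignment then fit together as follows (a sketch of the rules above):

```python
ba = bytearray(b'hello')
assert ba[0] == 104                  # indexing returns a small int (ord('h'))
assert ba[1:3] == bytearray(b'el')   # slicing preserves the type
assert b'hello'[1:3] == b'el'

ba[0] = 72                           # item assignment takes an int in range(256)
ba[1:2] = b'E'                       # slice assignment accepts a bytes-like object
assert ba == bytearray(b'HEllo')
```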
Str() and Repr()
The str() and repr() functions return the same thing for these objects. The repr() of a bytes object returns a b'...' style literal. The repr() of a bytearray returns a string of the form "bytearray(b'...')".
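In released Python 3 this behaves as described:

```python
assert repr(b'hi') == "b'hi'"
assert repr(bytearray(b'hi')) == "bytearray(b'hi')"
assert str(b'hi') == repr(b'hi')   # str() of a bytes object is its repr()
```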
Operators
The following operators are implemented by the bytes and bytearray types, except where mentioned:
- b1 + b2: concatenation. With mixed bytes/bytearray operands, the return type is that of the first argument (this seems arbitrary until you consider how += works).
- b1 += b2: mutates b1 if it is a bytearray object.
- b * n, n * b: repetition; n must be an integer.
- b *= n: mutates b if it is a bytearray object.
- b1 in b2, b1 not in b2: substring test; b1 can be any object implementing the PEP 3118 buffer API.
- i in b, i not in b: single-byte membership test; i must be an integer (if it is a length-1 bytes array, it is considered to be a substring test, with the same outcome).
- len(b): the number of bytes.
- hash(b): the hash value; only implemented by the bytes type.
Note that the % operator is not implemented. It does not appear worth the complexity.
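A sketch of the listed operators as they behave in released Python 3:

```python
# Concatenation: the result type follows the first operand.
assert type(b'ab' + bytearray(b'cd')) is bytes
assert type(bytearray(b'ab') + b'cd') is bytearray

ba = bytearray(b'ab')
ba += b'cd'                      # += mutates the bytearray in place
assert ba == bytearray(b'abcd')

assert b'ab' * 2 == b'abab'      # repetition
assert b'bc' in b'abcd'          # substring test
assert 97 in b'abc'              # single-byte membership; 97 == ord('a')
assert len(b'abc') == 3
assert isinstance(hash(b'abc'), int)   # only bytes is hashable
```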
Methods
The following methods are implemented by bytes as well as bytearray, with similar semantics. They accept anything that implements the PEP 3118 buffer API for bytes arguments, and return the same type as the object whose method is called ("self"):
.capitalize(), .center(), .count(), .decode(), .endswith(), .expandtabs(), .find(), .index(), .isalnum(), .isalpha(), .isdigit(), .islower(), .isspace(), .istitle(), .isupper(), .join(), .ljust(), .lower(), .lstrip(), .partition(), .replace(), .rfind(), .rindex(), .rjust(), .rpartition(), .rsplit(), .rstrip(), .split(), .splitlines(), .startswith(), .strip(), .swapcase(), .title(), .translate(), .upper(), .zfill()
This is exactly the set of methods present on the str type in Python 2.x, with the exclusion of .encode(). The signatures and semantics are the same too. However, whenever character classes like letter, whitespace, lower case are used, the ASCII definitions of these classes are used. (The Python 2.x str type uses the definitions from the current locale, settable through the locale module.) The .encode() method is left out because of the more strict definitions of encoding and decoding in Python 3000: encoding always takes a Unicode string and returns a bytes sequence, and decoding always takes a bytes sequence and returns a Unicode string.
In addition, both types implement the class method .fromhex(), which constructs an object from a string containing hexadecimal values (with or without spaces between the bytes).
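For example:

```python
# .fromhex() accepts optional whitespace between byte pairs.
assert bytes.fromhex('de ad be ef') == b'\xde\xad\xbe\xef'
assert bytearray.fromhex('deadbeef') == bytearray(b'\xde\xad\xbe\xef')
```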
The bytearray type implements these additional methods from the MutableSequence ABC (see PEP 3119):
.extend(), .insert(), .append(), .reverse(), .pop(), .remove().
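These mutating methods operate on small ints, matching the indexing rules above:

```python
ba = bytearray(b'ac')
ba.insert(1, ord('b'))            # items are ints in range(256)
ba.append(ord('d'))
assert ba == bytearray(b'abcd')
assert ba.pop() == ord('d')       # pop returns the removed int
ba.reverse()
assert ba == bytearray(b'cba')
```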
Bytes and the Str Type
Like the bytes type in Python 3.0a1, and unlike the relationship between str and unicode in Python 2.x, attempts to mix bytes (or bytearray) objects and str objects without specifying an encoding will raise a TypeError exception. (However, comparing bytes/bytearray and str objects for equality will simply return False; see the section on Comparisons above.)
Conversions between bytes or bytearray objects and str objects must always be explicit, using an encoding. There are two equivalent APIs: str(b, <encoding>[, <errors>]) is equivalent to b.decode(<encoding>[, <errors>]), and bytes(s, <encoding>[, <errors>]) is equivalent to s.encode(<encoding>[, <errors>]).
There is one exception: we can convert from bytes (or bytearray) to str without specifying an encoding by writing str(b). This produces the same result as repr(b). This exception is necessary because of the general promise that any object can be printed, and printing is just a special case of conversion to str. There is however no promise that printing a bytes object interprets the individual bytes as characters (unlike in Python 2.x).
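The equivalences, sketched in released Python 3:

```python
s = 'héllo'
b = s.encode('utf-8')

assert bytes(s, 'utf-8') == b                    # bytes(s, enc) == s.encode(enc)
assert str(b, 'utf-8') == b.decode('utf-8') == s
assert str(b) == repr(b)                         # the no-encoding exception
```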
The str type currently implements the PEP 3118 buffer API. While this is perhaps occasionally convenient, it is also potentially confusing, because the bytes accessed via the buffer API represent a platform-dependent encoding: depending on the platform byte order and a compile-time configuration option, the encoding could be UTF-16-BE, UTF-16-LE, UTF-32-BE, or UTF-32-LE. Worse, a different implementation of the str type might completely change the bytes representation, e.g. to UTF-8, or even make it impossible to access the data as a contiguous array of bytes at all. Therefore, the PEP 3118 buffer API will be removed from the str type.
The basestring Type
The basestring type will be removed from the language. Code that used to say isinstance(x, basestring) should be changed to use isinstance(x, str) instead.
Pickling
Left as an exercise for the reader.
Copyright
This document has been placed in the public domain.
pep-3138 String representation in Python 3000
| PEP: | 3138 |
|---|---|
| Title: | String representation in Python 3000 |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Atsuo Ishimoto <ishimoto--at--gembook.org> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 05-May-2008 |
| Post-History: | 05-May-2008, 05-Jun-2008 |
Contents
Abstract
This PEP proposes a new string representation form for Python 3000. In Python prior to Python 3000, the repr() built-in function converted arbitrary objects to printable ASCII strings for debugging and logging. For Python 3000, a wider range of characters, based on the Unicode standard, should be considered 'printable'.
Motivation
The current repr() converts 8-bit strings to ASCII using the following algorithm.
- Convert CR, LF, TAB and '\' to '\r', '\n', '\t', '\\'.
- Convert other non-printable characters (0x00-0x1f, 0x7f) and non-ASCII characters (>= 0x80) to '\xXX'.
- Backslash-escape quote characters (apostrophe, ') and add the quote character at the beginning and the end.
For Unicode strings, the following additional conversions are done.
- Convert leading surrogate pair characters without trailing character (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\uXXXX'.
- Convert 16-bit characters (>= 0x100) to '\uXXXX'.
- Convert 21-bit characters (>= 0x10000) and surrogate pair characters to '\U00xxxxxx'.
This algorithm converts any string to printable ASCII, and repr() is used as a handy and safe way to print strings for debugging or for logging. Although all non-ASCII characters are escaped, this does not matter when most of the string's characters are ASCII. But for other languages, such as Japanese where most characters in a string are not ASCII, this is very inconvenient.
We can use print(aJapaneseString) to get a readable string, but we don't have a similar workaround for printing strings from collections such as lists or tuples. print(listOfJapaneseStrings) uses repr() to build the string to be printed, so the resulting strings are always hex-escaped. Or when open(japaneseFilename) raises an exception, the error message is something like IOError: [Errno 2] No such file or directory: '\u65e5\u672c\u8a9e', which isn't helpful.
Python 3000 has a lot of nice features for non-Latin users such as non-ASCII identifiers, so it would be helpful if Python could also progress in a similar way for printable output.
Some users might be concerned that such output will mess up their console if they print binary data like images. But this is unlikely to happen in practice because bytes and strings are different types in Python 3000, so printing an image to the console won't mess it up.
This issue was once discussed by Hye-Shik Chang [1], but was rejected.
Specification
- Add a new function to the Python C API: int Py_UNICODE_ISPRINTABLE(Py_UNICODE ch). This function returns 0 if repr() should escape the Unicode character ch; otherwise it returns 1. Characters that should be escaped are defined in the Unicode character database as:
- Cc (Other, Control)
- Cf (Other, Format)
- Cs (Other, Surrogate)
- Co (Other, Private Use)
- Cn (Other, Not Assigned)
- Zl (Separator, Line), refers to LINE SEPARATOR ('\u2028').
- Zp (Separator, Paragraph), refers to PARAGRAPH SEPARATOR ('\u2029').
- Zs (Separator, Space) other than ASCII space ('\x20'). Characters in this category should be escaped to avoid ambiguity.
- The algorithm to build repr() strings should be changed to:
- Convert CR, LF, TAB and '\' to '\r', '\n', '\t', '\\'.
- Convert non-printable ASCII characters (0x00-0x1f, 0x7f) to '\xXX'.
- Convert leading surrogate pair characters without trailing character (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\uXXXX'.
- Convert non-printable characters (Py_UNICODE_ISPRINTABLE() returns 0) to '\xXX', '\uXXXX' or '\U00xxxxxx'.
- Backslash-escape quote characters (apostrophe, 0x27) and add a quote character at the beginning and the end.
- Set the Unicode error-handler for sys.stderr to 'backslashreplace' by default.
- Add a new function to the Python C API: PyObject *PyObject_ASCII(PyObject *o). This function converts any Python object to a string using PyObject_Repr() and then hex-escapes all non-ASCII characters. PyObject_ASCII() generates the same string as PyObject_Repr() in Python 2.
- Add a new built-in function, ascii(). This function converts any Python object to a string using repr() and then hex-escapes all non-ASCII characters. ascii() generates the same string as repr() in Python 2.
- Add a '%a' string format operator. '%a' converts any Python object to a string using repr() and then hex-escapes all non-ASCII characters. The '%a' format operator generates the same string as '%r' in Python 2. Also, add a '!a' conversion flag to the str.format() method and a '%A' operator to PyUnicode_FromFormat(). They convert any object to an ASCII string just as the '%a' string format operator does.
- Add an isprintable() method to the string type. str.isprintable() returns False if repr() would escape any character in the string; otherwise returns True. The isprintable() method calls the Py_UNICODE_ISPRINTABLE() function internally.
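The new builtins behave like this in released Python 3, which implemented this PEP:

```python
s = '日本語'
assert s.isprintable()                        # nothing repr() would escape
assert not '\u2028'.isprintable()             # LINE SEPARATOR (category Zl)
assert repr(s) == "'日本語'"                   # repr() keeps printable non-ASCII
assert ascii(s) == "'\\u65e5\\u672c\\u8a9e'"  # ascii() hex-escapes everything
```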
Rationale
The repr() in Python 3000 should be Unicode, not ASCII based, just like Python 3000 strings. Also, conversion should not be affected by the locale setting, because the locale is not necessarily the same as the output device's locale. For example, it is common for a daemon process to be invoked in an ASCII setting but to write UTF-8 to its log files. Also, web applications might want to report error information in a more readable form based on the HTML page's encoding.
Characters not supported by the user's console could be hex-escaped on printing, by the Unicode encoder's error-handler. If the error-handler of the output file is 'backslashreplace', such characters are hex-escaped without raising UnicodeEncodeError. For example, if the default encoding is ASCII, print('Hello ¢') will print 'Hello \xa2'. If the encoding is ISO-8859-1, 'Hello ¢' will be printed.
The default error-handler for sys.stdout is 'strict'. Other applications reading the output might not understand hex-escaped characters, so unsupported characters should be trapped when writing. If unsupported characters must be escaped, the error-handler should be changed explicitly. Unlike sys.stdout, sys.stderr doesn't raise UnicodeEncodeError by default, because its default error-handler is 'backslashreplace'. So printing error messages containing non-ASCII characters to sys.stderr will not raise an exception. Also, information about uncaught exceptions (exception object, traceback) is printed by the interpreter without raising exceptions.
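The 'backslashreplace' behavior can be demonstrated directly with str.encode():

```python
# With 'backslashreplace', unencodable characters become \xXX escapes
# instead of raising UnicodeEncodeError.
assert 'Hello ¢'.encode('ascii', 'backslashreplace') == b'Hello \\xa2'
assert 'Hello ¢'.encode('iso-8859-1') == b'Hello \xa2'   # natively encodable
```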
Alternate Solutions
To help debugging in non-Latin languages without changing repr(), other suggestions were made.
Supply a tool to print lists or dicts.
Strings to be printed for debugging are contained not only in lists or dicts, but also in many other types of object. File objects contain a file name in Unicode, exception objects contain a message in Unicode, etc. These strings should be printed in readable form when repr()ed. It is unlikely to be possible to implement a tool to print all possible object types.
Use sys.displayhook and sys.excepthook.
For interactive sessions, we can write hooks to restore hex escaped characters to the original characters. But these hooks are called only when printing the result of evaluating an expression entered in an interactive Python session, and don't work for the print() function, for non-interactive sessions or for logging.debug("%r", ...), etc.
Subclass sys.stdout and sys.stderr.
It is difficult to implement a subclass that restores hex-escaped characters, since there isn't enough information left by the time the text reaches the stream to undo the escaping correctly in all cases. For example, print("\\"+"u0041") should be printed as '\u0041', not 'A', but by then the stream object has no way to tell the two cases apart.
Make the encoding used by unicode_repr() adjustable, and make the existing repr() the default.
With adjustable repr(), the result of using repr() is unpredictable and would make it impossible to write correct code involving repr(). And if current repr() is the default, then the old convention remains intact and users may expect ASCII strings as the result of repr(). Third party applications or libraries could be confused when a custom repr() function is used.
Backwards Compatibility
Changing repr() may break some existing code, especially testing code. Five of Python's regression tests fail with this modification. If you need repr() strings without non-ASCII characters, as in Python 2, you can use the following function.
def repr_ascii(obj):
    return str(repr(obj).encode("ASCII", "backslashreplace"), "ASCII")
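For instance, the helper turns a non-ASCII repr() into the Python 2 style hex-escaped form (a quick check, restating the function so the snippet is self-contained):

```python
def repr_ascii(obj):
    # Same helper as above: hex-escape anything outside ASCII.
    return str(repr(obj).encode("ASCII", "backslashreplace"), "ASCII")

assert repr_ascii('abc') == repr('abc')                 # pure ASCII is unchanged
assert repr_ascii('日本語') == "'\\u65e5\\u672c\\u8a9e'"
assert repr_ascii('日本語') == ascii('日本語')           # same as the ascii() builtin
```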
For logging or for debugging, the following code can raise UnicodeEncodeError.
log = open("logfile", "w")
log.write(repr(data)) # UnicodeEncodeError will be raised
# if data contains unsupported characters.
To avoid exceptions being raised, you can explicitly specify the error-handler.
log = open("logfile", "w", errors="backslashreplace")
log.write(repr(data)) # Unsupported characters will be escaped.
For a console that uses a Unicode-based encoding, for example en_US.utf8 or de_DE.utf8, the backslashreplace trick doesn't work, since printable characters are never escaped. This can cause confusion between similar-looking characters in Western, Greek and Cyrillic scripts. These scripts use similar (but different) alphabets (descended from a common ancestor) and contain letters that look alike but have different character codes. For example, it is hard to distinguish Latin 'a', 'e' and 'o' from Cyrillic 'а', 'е' and 'о'. (The visual representation, of course, depends very much on the fonts used, but usually these letters are almost indistinguishable.) To avoid the problem, the user can adjust the terminal encoding to get a result suitable for their environment.
Rejected Proposals
Add encoding and errors arguments to the builtin print() function, with defaults of sys.getfilesystemencoding() and 'backslashreplace'.
Complicated to implement, and in general, this is not seen as a good idea. [2]
Use character names to escape characters, instead of hex character codes. For example, repr('\u03b1') can be converted to "\N{GREEK SMALL LETTER ALPHA}".
Using character names can be very verbose compared to hex-escape. e.g., repr("\ufbf9") is converted to "\N{ARABIC LIGATURE UIGHUR KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA ISOLATED FORM}".
Default error-handler of sys.stdout should be 'backslashreplace'.
Stuff written to stdout might be consumed by another program that might misinterpret the \ escapes. For interactive sessions, it is possible to make the 'backslashreplace' error-handler the default, but this may add confusion of the kind "it works in interactive mode but not when redirecting to a file".
Implementation
The author wrote a patch in http://bugs.python.org/issue2630; this was committed to the Python 3.0 branch in revision 64138 on 06-11-2008.
References
| [1] | Multibyte string on string::string_print (http://bugs.python.org/issue479898) |
| [2] | [Python-3000] Displaying strings containing unicode escapes (http://mail.python.org/pipermail/python-3000/2008-April/013366.html) |
Copyright
This document has been placed in the public domain.
pep-3139 Cleaning out sys and the "interpreter" module
| PEP: | 3139 |
|---|---|
| Title: | Cleaning out sys and the "interpreter" module |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Benjamin Peterson <benjamin at python.org> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 4-April-2008 |
| Python-Version: | 3.0 |
Contents
Rejection Notice
Guido's -0.5 put an end to this PEP. See http://mail.python.org/pipermail/python-3000/2008-April/012977.html.
Abstract
This PEP proposes a new low-level module for CPython-specific interpreter functions in order to clean out the sys module and separate general Python functionality from implementation details.
Rationale
The sys module currently contains functions and data that can be put into two major groups:
- Data and functions that are available in all Python implementations and deal with the general running of a Python virtual machine.
- argv
- byteorder
- path, path_hooks, meta_path, path_importer_cache, and modules
- copyright, hexversion, version, and version_info
- displayhook, __displayhook__
- excepthook, __excepthook__, exc_info, and exc_clear
- exec_prefix and prefix
- executable
- exit
- flags, py3kwarning, dont_write_bytecode, and warn_options
- getfilesystemencoding
- get/setprofile
- get/settrace, call_tracing
- getwindowsversion
- maxint and maxunicode
- platform
- ps1 and ps2
- stdin, stderr, stdout, __stdin__, __stderr__, __stdout__
- tracebacklimit
- Data and functions that affect the CPython interpreter.
- get/setrecursionlimit
- get/setcheckinterval
- _getframe and _current_frame
- getrefcount
- get/setdlopenflags
- settscdumps
- api_version
- winver
- dllhandle
- float_info
- _compact_freelists
- _clear_type_cache
- subversion
- builtin_module_names
- callstats
- intern
The second collection of items has been steadily growing over the years, causing clutter in sys. Guido has even said he doesn't recognize some of the things in it [1]!
Moving these items off to another module would send a clear message to other Python implementations about which functions need and need not be implemented.
It has also been proposed that the contents of the types module be distributed across the standard library [2]; the interpreter module would provide an excellent resting place for internal types like frames and code objects.
Specification
A new builtin module named "interpreter" (see Naming) will be added.
The second list of items above will be split into the stdlib as follows:
- The interpreter module
- get/setrecursionlimit
- get/setcheckinterval
- _getframe and _current_frame
- get/setdlopenflags
- settscdumps
- api_version
- winver
- dllhandle
- float_info
- _clear_type_cache
- subversion
- builtin_module_names
- callstats
- intern
- The gc module:
- getrefcount
- _compact_freelists
Transition Plan
Once implemented in 3.x, the interpreter module will be back-ported to 2.6. Py3k warnings will be added to the sys functions it replaces.
Open Issues
What should move?
dont_write_bytecode
Some believe that the writing of bytecode is an implementation detail and should be moved [3]. The counterargument is that all current, complete Python implementations do write some sort of bytecode, so it is valuable to be able to disable it. Also, if it is moved, some wish to put it in the imp module.
Move some to imp?
It was noted that dont_write_bytecode or maybe builtin_module_names might fit nicely in the imp module.
References
| [1] | http://bugs.python.org/issue1522 |
| [2] | http://mail.python.org/pipermail/stdlib-sig/2008-April/000172.html |
| [3] | http://mail.python.org/pipermail/stdlib-sig/2008-April/000217.html |
| [4] | http://mail.python.org/pipermail/python-3000/2007-November/011351.html |
| [5] | http://mail.python.org/pipermail/stdlib-sig/2008-April/000223.html |
Copyright
This document has been placed in the public domain.
pep-3140 str(container) should call str(item), not repr(item)
| PEP: | 3140 |
|---|---|
| Title: | str(container) should call str(item), not repr(item) |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Oleg Broytmann <phd at phd.pp.ru>, Jim J. Jewett <jimjjewett at gmail.com> |
| Discussions-To: | <python-3000 at python.org> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 27-May-2008 |
| Post-History: | 28-May-2008 |
Rejection
Guido said this would cause too much disturbance too close to beta. See http://mail.python.org/pipermail/python-3000/2008-May/013876.html.
Abstract
This document discusses the advantages and disadvantages of the current implementation of str(container). It also discusses the pros and cons of a different approach - to call str(item) instead of repr(item).
Motivation
Currently str(container) calls repr on items. Arguments for it:
-- containers refuse to guess what the user wants to see on
str(container) - surroundings, delimiters, and so on;
-- repr(item) usually displays type information - apostrophes
around strings, class names, etc.
Arguments against:
-- it's illogical; str() is expected to call __str__ if it exists,
not __repr__;
-- there is no standard way to print a container's content calling
   items' __str__, which is inconvenient in cases where __str__ and
   __repr__ return different results;
-- repr(item) sometimes does the wrong thing (e.g. hex-escapes
   non-ASCII strings).
This PEP proposes to change how str(container) works. It is
proposed to mimic how repr(container) works except for one detail
- call str on items instead of repr. This allows a user to choose
what results she wants to get - from item.__repr__ or item.__str__.
Current situation
Most container types (tuples, lists, dicts, sets, etc.) do not
implement a __str__ method, so str(container) calls
container.__repr__, and container.__repr__, once called, does not
know it was called from str and always calls repr on the
container's items.
This behaviour has advantages and disadvantages. One advantage is
that most items are represented with type information - strings
are surrounded by apostrophes, instances may have both class name
and instance data:
>>> print([42, '42'])
[42, '42']
>>> print([Decimal('42'), datetime.now()])
[Decimal("42"), datetime.datetime(2008, 5, 27, 19, 57, 43, 485028)]
The disadvantage is that __repr__ often returns technical data
(like '<object at address>') or an unreadable string (a
hex-escaped string if the input is a non-ASCII string):
>>> print(['тест'])
['\xd4\xc5\xd3\xd4']
One of the motivations for PEP 3138 is that neither repr nor str
will allow the sensible printing of dicts whose keys are non-ASCII
text strings. Now that Unicode identifiers are allowed, it
includes Python's own attribute dicts. This also includes JSON
serialization (and caused some hoops for the json lib).
PEP 3138 proposes to fix this by breaking the "repr is safe ASCII"
invariant, and changing the way repr (which is used for
persistence) outputs some objects, with system-dependent failures.
Changing how str(container) works would allow easy debugging in
the normal case, and retain the safety of ASCII-only for the
machine-readable case. The only downside is that str(x) and
repr(x) would more often be different -- but only in those cases
where the current almost-the-same version is insufficient.
It also seems illogical that str(container) calls repr on items
instead of str. It's only logical to expect following code
class Test:
    def __str__(self):
        return "STR"
    def __repr__(self):
        return "REPR"

test = Test()
print(test)
print(repr(test))
print([test])
print(str([test]))
to print
STR
REPR
[STR]
[STR]
where it actually prints
STR
REPR
[REPR]
[REPR]
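The asymmetry is easy to reproduce (a runnable version of the same hypothetical Test class):

```python
class Test:
    def __str__(self):
        return "STR"
    def __repr__(self):
        return "REPR"

assert str(Test()) == "STR"
assert repr(Test()) == "REPR"
assert str([Test()]) == "[REPR]"   # str(container) repr()s its items
assert repr([Test()]) == "[REPR]"
```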
It is especially illogical that print in Python 2 uses str
when called on what looks like a tuple:
>>> print Decimal('42'), datetime.now()
42 2008-05-27 20:16:22.534285
where on an actual tuple it prints
>>> print((Decimal('42'), datetime.now()))
(Decimal("42"), datetime.datetime(2008, 5, 27, 20, 16, 27, 937911))
A different approach - call str(item)
For example, with numbers it is often only the value that people
care about.
>>> print Decimal('3')
3
But putting the value in a list forces users to read the type
information, exactly as if repr had been called for the benefit of
a machine:
>>> print [Decimal('3')]
[Decimal("3")]
After this change, the type information would not clutter the str
output:
>>> print "%s" % [Decimal('3')]
[3]
>>> str([Decimal('3')]) # ==
[3]
But it would still be available if desired:
>>> print "%r" % [Decimal('3')]
[Decimal('3')]
>>> repr([Decimal('3')]) # ==
[Decimal('3')]
There are a number of strategies to fix the problem. The most
radical is to change __repr__ so it accepts a new parameter (flag)
"called from str, so call str on items, not repr". The
drawback of the proposal is that every __repr__ implementation
must be changed. Introspection could help a bit (inspect __repr__
before calling if it accepts 2 or 3 parameters), but introspection
doesn't work on classes written in C, like all built-in containers.
A less radical proposal is to implement __str__ methods for
built-in container types. The obvious drawback is a duplication
of effort - all those __str__ and __repr__ implementations would
differ in only one small detail - whether they call str or repr
on items.
The most conservative proposal is not to change str at all but
to allow developers to implement their own application- or
library-specific pretty-printers. The drawback is again
a multiplication of effort and proliferation of many small
specific container-traversal algorithms.
Backward compatibility
In those cases where type information is more important than usual, it will still be possible to get the current results by calling repr explicitly.
Copyright
This document has been placed in the public domain.
pep-3141 A Type Hierarchy for Numbers
| PEP: | 3141 |
|---|---|
| Title: | A Type Hierarchy for Numbers |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Jeffrey Yasskin <jyasskin at google.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 23-Apr-2007 |
| Post-History: | 25-Apr-2007, 16-May-2007, 02-Aug-2007 |
Contents
Abstract
This proposal defines a hierarchy of Abstract Base Classes (ABCs) (PEP 3119) to represent number-like classes. It proposes a hierarchy of Number :> Complex :> Real :> Rational :> Integral where A :> B means "A is a supertype of B". The hierarchy is inspired by Scheme's numeric tower [4].
Rationale
Functions that take numbers as arguments should be able to determine the properties of those numbers, and if and when overloading based on types is added to the language, should be overloadable based on the types of the arguments. For example, slicing requires its arguments to be Integrals, and the functions in the math module require their arguments to be Real.
Specification
This PEP specifies a set of Abstract Base Classes, and suggests a general strategy for implementing some of the methods. It uses terminology from PEP 3119, but the hierarchy is intended to be meaningful for any systematic method of defining sets of classes.
The type checks in the standard library should use these classes instead of the concrete built-ins.
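The hierarchy described here eventually shipped as Python's numbers module, so such checks can be sketched against it directly; the ABCs accept any registered type, not just the concrete built-ins:

```python
from fractions import Fraction
from numbers import Complex, Integral, Rational, Real

# isinstance against the ABCs accepts every registered numeric type.
assert isinstance(3, Integral)
assert isinstance(Fraction(1, 3), Rational)
assert isinstance(3.5, Real) and not isinstance(3.5, Rational)
assert isinstance(2 + 1j, Complex) and not isinstance(2 + 1j, Real)
```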
Numeric Classes
We begin with a Number class to make it easy for people to be fuzzy about what kind of number they expect. This class only helps with overloading; it doesn't provide any operations.
class Number(metaclass=ABCMeta): pass
Most implementations of complex numbers will be hashable, but if you need to rely on that, you'll have to check it explicitly: mutable numbers are supported by this hierarchy.
class Complex(Number):
    """Complex defines the operations that work on the builtin complex type.
    In short, those are: conversion to complex, bool(), .real, .imag,
    +, -, *, /, **, abs(), .conjugate(), ==, and !=.
    If it is given heterogeneous arguments, and doesn't have special
    knowledge about them, it should fall back to the builtin complex
    type as described below.
    """
    @abstractmethod
    def __complex__(self):
        """Return a builtin complex instance."""
    def __bool__(self):
        """True if self != 0."""
        return self != 0
    @abstractproperty
    def real(self):
        """Retrieve the real component of this number.
        This should subclass Real.
        """
        raise NotImplementedError
    @abstractproperty
    def imag(self):
        """Retrieve the imaginary component of this number.
        This should subclass Real.
        """
        raise NotImplementedError
    @abstractmethod
    def __add__(self, other):
        raise NotImplementedError
    @abstractmethod
    def __radd__(self, other):
        raise NotImplementedError
    @abstractmethod
    def __neg__(self):
        raise NotImplementedError
    def __pos__(self):
        """Coerces self to whatever class defines the method."""
        raise NotImplementedError
    def __sub__(self, other):
        return self + -other
    def __rsub__(self, other):
        return -self + other
    @abstractmethod
    def __mul__(self, other):
        raise NotImplementedError
    @abstractmethod
    def __rmul__(self, other):
        raise NotImplementedError
    @abstractmethod
    def __div__(self, other):
        """a/b; should promote to float or complex when necessary."""
        raise NotImplementedError
    @abstractmethod
    def __rdiv__(self, other):
        raise NotImplementedError
    @abstractmethod
    def __pow__(self, exponent):
        """a**b; should promote to float or complex when necessary."""
        raise NotImplementedError
    @abstractmethod
    def __rpow__(self, base):
        raise NotImplementedError
    @abstractmethod
    def __abs__(self):
        """Returns the Real distance from 0."""
        raise NotImplementedError
    @abstractmethod
    def conjugate(self):
        """(x+y*i).conjugate() returns (x-y*i)."""
        raise NotImplementedError
    @abstractmethod
    def __eq__(self, other):
        raise NotImplementedError
    # __ne__ is inherited from object and negates whatever __eq__ does.
The Real ABC indicates that the value is on the real line, and supports the operations of the float builtin. Real numbers are totally ordered except for NaNs (which this PEP basically ignores).
class Real(Complex):
    """To Complex, Real adds the operations that work on real numbers.
    In short, those are: conversion to float, trunc(), math.floor(),
    math.ceil(), round(), divmod(), //, %, <, <=, >, and >=.
    Real also provides defaults for some of the derived operations.
    """
    # XXX What to do about the __int__ implementation that's
    # currently present on float? Get rid of it?
    @abstractmethod
    def __float__(self):
        """Any Real can be converted to a native float object."""
        raise NotImplementedError
    @abstractmethod
    def __trunc__(self):
        """Truncates self to an Integral.
        Returns an Integral i such that:
          * i * self >= 0 (i has the same sign as self, or is 0);
          * abs(i) <= abs(self);
          * for any Integral j satisfying the first two conditions,
            abs(i) >= abs(j) [i.e. i has "maximal" abs among those];
        i.e. "truncate towards 0".
        """
        raise NotImplementedError
    @abstractmethod
    def __floor__(self):
        """Finds the greatest Integral <= self."""
        raise NotImplementedError
    @abstractmethod
    def __ceil__(self):
        """Finds the least Integral >= self."""
        raise NotImplementedError
    @abstractmethod
    def __round__(self, ndigits:Integral=None):
        """Rounds self to ndigits decimal places, defaulting to 0.
        If ndigits is omitted or None, returns an Integral,
        otherwise returns a Real, preferably of the same type as
        self. Types may choose which direction to round half. For
        example, float rounds half toward even.
        """
        raise NotImplementedError
    def __divmod__(self, other):
        """The pair (self // other, self % other).
        Sometimes this can be computed faster than the pair of
        operations.
        """
        return (self // other, self % other)
    def __rdivmod__(self, other):
        """The pair (other // self, other % self).
        Sometimes this can be computed faster than the pair of
        operations.
        """
        return (other // self, other % self)
    @abstractmethod
    def __floordiv__(self, other):
        """The floor() of self/other. Integral."""
        raise NotImplementedError
    @abstractmethod
    def __rfloordiv__(self, other):
        """The floor() of other/self."""
        raise NotImplementedError
    @abstractmethod
    def __mod__(self, other):
        """self % other
        See
        http://mail.python.org/pipermail/python-3000/2006-May/001735.html
        and consider using "self/other - trunc(self/other)"
        instead if you're worried about round-off errors.
        """
        raise NotImplementedError
    @abstractmethod
    def __rmod__(self, other):
        """other % self"""
        raise NotImplementedError
    @abstractmethod
    def __lt__(self, other):
        """< on Reals defines a total ordering, except perhaps for NaN."""
        raise NotImplementedError
    @abstractmethod
    def __le__(self, other):
        raise NotImplementedError
    # __gt__ and __ge__ are automatically done by reversing the arguments.
    # (But __le__ is not computed as the opposite of __gt__!)
    # Concrete implementations of Complex abstract methods.
    # Subclasses may override these, but don't have to.
    def __complex__(self):
        return complex(float(self))
    @property
    def real(self):
        return +self
    @property
    def imag(self):
        return 0
    def conjugate(self):
        """Conjugate is a no-op for Reals."""
        return +self
We should clean up Demo/classes/Rat.py and promote it into rational.py in the standard library. Then it will implement the Rational ABC.
class Rational(Real, Exact):
    """.numerator and .denominator should be in lowest terms."""
    @abstractproperty
    def numerator(self):
        raise NotImplementedError
    @abstractproperty
    def denominator(self):
        raise NotImplementedError
    # Concrete implementation of Real's conversion to float.
    # (This invokes Integer.__div__().)
    def __float__(self):
        return self.numerator / self.denominator
And finally integers:
class Integral(Rational):
    """Integral adds a conversion to int and the bit-string operations."""
    @abstractmethod
    def __int__(self):
        raise NotImplementedError
    def __index__(self):
        """__index__() exists because float has __int__()."""
        return int(self)
    def __lshift__(self, other):
        return int(self) << int(other)
    def __rlshift__(self, other):
        return int(other) << int(self)
    def __rshift__(self, other):
        return int(self) >> int(other)
    def __rrshift__(self, other):
        return int(other) >> int(self)
    def __and__(self, other):
        return int(self) & int(other)
    def __rand__(self, other):
        return int(other) & int(self)
    def __xor__(self, other):
        return int(self) ^ int(other)
    def __rxor__(self, other):
        return int(other) ^ int(self)
    def __or__(self, other):
        return int(self) | int(other)
    def __ror__(self, other):
        return int(other) | int(self)
    def __invert__(self):
        return ~int(self)
    # Concrete implementations of Rational and Real abstract methods.
    def __float__(self):
        """float(self) == float(int(self))"""
        return float(int(self))
    @property
    def numerator(self):
        """Integers are their own numerators."""
        return +self
    @property
    def denominator(self):
        """Integers have a denominator of 1."""
        return 1
Changes to operations and __magic__ methods
To support more precise narrowing from float to int (and more generally, from Real to Integral), we propose the following new __magic__ methods, to be called from the corresponding library functions. All of these return Integrals rather than Reals.
- __trunc__(self), called from a new builtin trunc(x), which returns the Integral closest to x between 0 and x.
- __floor__(self), called from math.floor(x), which returns the greatest Integral <= x.
- __ceil__(self), called from math.ceil(x), which returns the least Integral >= x.
- __round__(self), called from round(x), which returns the Integral closest to x, rounding half as the type chooses. float will change in 3.0 to round half toward even. There is also a 2-argument version, __round__(self, ndigits), called from round(x, ndigits), which should return a Real.
In 2.6, math.floor, math.ceil, and round will continue to return floats.
The int() conversion implemented by float is equivalent to trunc(). In general, the int() conversion should try __int__() first and if it is not found, try __trunc__().
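These conversions are all observable on today's builtins (a quick sketch; the proposed builtin trunc ultimately lives in the math module as math.trunc):

```python
import math

assert math.trunc(-1.5) == -1   # toward 0
assert math.floor(-1.5) == -2   # greatest Integral <= x
assert math.ceil(-1.5) == -1    # least Integral >= x
assert round(0.5) == 0          # float rounds half toward even
assert round(1.5) == 2
assert isinstance(round(2.5, 1), float)  # 2-argument round returns a Real
```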
complex.__{divmod,mod,floordiv,int,float}__ also go away. It would be nice to provide a helpful error message for confused porters, but not appearing in help(complex) is more important.
Notes for type implementors
Implementors should be careful to make equal numbers equal and hash them to the same values. This may be subtle if there are two different extensions of the real numbers. For example, a complex type could reasonably implement hash() as follows:
def __hash__(self):
    return hash(complex(self))
but should be careful of any values that fall outside of the built-in complex's range or precision.
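Python's built-in numeric types already meet this requirement: equal values compare and hash identically across the whole tower.

```python
from fractions import Fraction

# Equal numbers must be equal and hash alike, whatever their type.
assert 2 == 2.0 == 2 + 0j == Fraction(2)
assert hash(2) == hash(2.0) == hash(2 + 0j) == hash(Fraction(2))
```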
Adding More Numeric ABCs
There are, of course, more possible ABCs for numbers, and this would be a poor hierarchy if it precluded the possibility of adding those. You can add MyFoo between Complex and Real with:
class MyFoo(Complex): ...
MyFoo.register(Real)
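A runnable sketch of that insertion using ABC registration against the numbers module as it later shipped (MyFoo is the hypothetical class from the text):

```python
from numbers import Complex, Real

class MyFoo(Complex):
    """Hypothetical ABC sitting between Complex and Real."""

MyFoo.register(Real)

# Real, and everything registered below it (such as float),
# now counts as a virtual subclass of MyFoo.
assert issubclass(Real, MyFoo)
assert isinstance(3.0, MyFoo)
```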
Implementing the arithmetic operations
We want to implement the arithmetic operations so that mixed-mode operations either call an implementation whose author knew about the types of both arguments, or convert both to the nearest built-in type and do the operation there. For subtypes of Integral, this means that __add__ and __radd__ should be defined as:
class MyIntegral(Integral):
    def __add__(self, other):
        if isinstance(other, MyIntegral):
            return do_my_adding_stuff(self, other)
        elif isinstance(other, OtherTypeIKnowAbout):
            return do_my_other_adding_stuff(self, other)
        else:
            return NotImplemented
    def __radd__(self, other):
        if isinstance(other, MyIntegral):
            return do_my_adding_stuff(other, self)
        elif isinstance(other, OtherTypeIKnowAbout):
            return do_my_other_adding_stuff(other, self)
        elif isinstance(other, Integral):
            return int(other) + int(self)
        elif isinstance(other, Real):
            return float(other) + float(self)
        elif isinstance(other, Complex):
            return complex(other) + complex(self)
        else:
            return NotImplemented
There are 5 different cases for a mixed-type operation on subclasses of Complex. I'll refer to all of the above code that doesn't refer to MyIntegral and OtherTypeIKnowAbout as "boilerplate". a will be an instance of A, which is a subtype of Complex (a : A <: Complex), and b : B <: Complex. I'll consider a + b:
- If A defines an __add__ which accepts b, all is well.
- If A falls back to the boilerplate code, and it were to return a value from __add__, we'd miss the possibility that B defines a more intelligent __radd__, so the boilerplate should return NotImplemented from __add__. (Or A may not implement __add__ at all.)
- Then B's __radd__ gets a chance. If it accepts a, all is well.
- If it falls back to the boilerplate, there are no more possible methods to try, so this is where the default implementation should live.
- If B <: A, Python tries B.__radd__ before A.__add__. This is ok, because it was implemented with knowledge of A, so it can handle those instances before delegating to Complex.
If A <: Complex and B <: Real without sharing any other knowledge, then the appropriate shared operation is the one involving the built-in complex, and both __radd__s land there, so a + b == b + a.
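The subclass rule in the last case can be checked directly with toy classes (not the PEP's MyIntegral): when B subclasses A and overrides __radd__, Python tries B.__radd__ before A.__add__.

```python
class A:
    def __add__(self, other):
        return "A.__add__"

class B(A):
    def __radd__(self, other):
        return "B.__radd__"

# B is a subclass of A and overrides __radd__, so for a + b
# Python tries B.__radd__ before A.__add__.
assert A() + B() == "B.__radd__"
assert A() + A() == "A.__add__"
```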
Rejected Alternatives
The initial version of this PEP defined an algebraic hierarchy inspired by a Haskell Numeric Prelude [3] including MonoidUnderPlus, AdditiveGroup, Ring, and Field, and mentioned several other possible algebraic types before getting to the numbers. We had expected this to be useful to people using vectors and matrices, but the NumPy community really wasn't interested, and we ran into the issue that even if x is an instance of X <: MonoidUnderPlus and y is an instance of Y <: MonoidUnderPlus, x + y may still not make sense.
Then we gave the numbers a much more branching structure to include things like the Gaussian Integers and Z/nZ, which could be Complex but wouldn't necessarily support things like division. The community decided that this was too much complication for Python, so I've now scaled back the proposal to resemble the Scheme numeric tower much more closely.
The Decimal Type
After consultation with its authors it has been decided that the Decimal type should not at this time be made part of the numeric tower.
References
| [1] | Introducing Abstract Base Classes (http://www.python.org/dev/peps/pep-3119/) |
| [2] | Possible Python 3K Class Tree?, wiki page by Bill Janssen (http://wiki.python.org/moin/AbstractBaseClasses) |
| [3] | NumericPrelude: An experimental alternative hierarchy of numeric type classes (http://darcs.haskell.org/numericprelude/docs/html/index.html) |
| [4] | The Scheme numerical tower (http://www.swiss.ai.mit.edu/ftpdir/scheme-reports/r5rs-html/r5rs_8.html#SEC50) |
Acknowledgements
Thanks to Neal Norwitz for encouraging me to write this PEP in the first place, to Travis Oliphant for pointing out that the numpy people didn't really care about the algebraic concepts, to Alan Isaac for reminding me that Scheme had already done this, and to Guido van Rossum and lots of other people on the mailing list for refining the concept.
Copyright
This document has been placed in the public domain.
pep-3142 Add a "while" clause to generator expressions
| PEP: | 3142 |
|---|---|
| Title: | Add a "while" clause to generator expressions |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Gerald Britton <gerald.britton at gmail.com> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 12-Jan-2009 |
| Python-Version: | 3.0 |
| Post-History: | |
| Resolution: | http://mail.python.org/pipermail/python-dev/2013-May/126136.html |
Abstract
This PEP proposes an enhancement to generator expressions, adding a "while" clause to complement the existing "if" clause.
Rationale
A generator expression (PEP 289 [1]) is a concise method to serve dynamically-generated objects to list comprehensions (PEP 202 [2]). Current generator expressions allow for an "if" clause to filter the objects that are returned to those meeting some set of criteria. However, since the "if" clause is evaluated for every object that may be returned, in some cases it is possible that all objects would be rejected after a certain point. For example:

    g = (n for n in range(100) if n*n < 50)

which is equivalent to using a generator function (PEP 255 [3]):

    def __gen(exp):
        for n in exp:
            if n*n < 50:
                yield n
    g = __gen(iter(range(100)))

would yield 0, 1, 2, 3, 4, 5, 6 and 7, but would also consider the numbers from 8 to 99 and reject them all since n*n >= 50 for numbers in that range. Allowing for a "while" clause would allow the redundant tests to be short-circuited:

    g = (n for n in range(100) while n*n < 50)

would also yield 0, 1, 2, 3, 4, 5, 6 and 7, but would stop at 8 since the condition (n*n < 50) is no longer true. This would be equivalent to the generator function:

    def __gen(exp):
        for n in exp:
            if n*n < 50:
                yield n
            else:
                break
    g = __gen(iter(range(100)))

Currently, in order to achieve the same result, one would need to either write a generator function such as the one above or use the takewhile function from itertools:

    from itertools import takewhile
    g = takewhile(lambda n: n*n < 50, range(100))

The takewhile code achieves the same result as the proposed syntax, albeit in a longer (some would say "less elegant") fashion. Also, the takewhile version requires an extra function call (the lambda in the example above) with the associated performance penalty. A simple test shows that:

    for n in (n for n in range(100) if 1): pass

performs about 10% better than:

    for n in takewhile(lambda n: 1, range(100)): pass

though they achieve similar results. (The first example uses a generator; takewhile is an iterator.)

If similarly implemented, a "while" clause should perform about the same as the "if" clause does today. The reader may ask whether the "if" and "while" clauses should be mutually exclusive. There are good examples that show that there are times when both may be used to good advantage. For example:

    p = (p for p in primes() if p > 100 while p < 1000)

should return prime numbers found between 100 and 1000, assuming I have a primes() generator that yields prime numbers. Adding a "while" clause to generator expressions maintains the compact form while adding a useful facility for short-circuiting the expression.
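The proposed "while" clause was never added (the PEP was rejected), but the takewhile equivalence described above is runnable today:

```python
from itertools import takewhile

# The filtering "if" clause scans all 100 values...
g = (n for n in range(100) if n * n < 50)
assert list(g) == [0, 1, 2, 3, 4, 5, 6, 7]

# ...while takewhile stops at the first failing value (n == 8).
h = takewhile(lambda n: n * n < 50, range(100))
assert list(h) == [0, 1, 2, 3, 4, 5, 6, 7]
```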
Acknowledgements
Raymond Hettinger first proposed the concept of generator expressions in January 2002.
References
| [1] | PEP 289: Generator Expressions http://www.python.org/dev/peps/pep-0289/ |
| [2] | PEP 202: List Comprehensions http://www.python.org/dev/peps/pep-0202/ |
| [3] | PEP 255: Simple Generators http://www.python.org/dev/peps/pep-0255/ |
Copyright
This document has been placed in the public domain.
pep-3143 Standard daemon process library
| PEP: | 3143 |
|---|---|
| Title: | Standard daemon process library |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Ben Finney <ben+python at benfinney.id.au> |
| Status: | Deferred |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 2009-01-26 |
| Python-Version: | 3.x |
| Post-History: |
Contents
Abstract
Writing a program to become a well-behaved Unix daemon is somewhat complex and tricky to get right, yet the steps are largely similar for any daemon regardless of what else the program may need to do.
This PEP introduces a package to the Python standard library that provides a simple interface to the task of becoming a daemon process.
PEP Deferral
Further exploration of the concepts covered in this PEP has been deferred for lack of a current champion interested in promoting the goals of the PEP and collecting and incorporating feedback, and with sufficient available time to do so effectively.
Specification
Example usage
Simple example of direct DaemonContext usage:
import daemon
from spam import do_main_program
with daemon.DaemonContext():
    do_main_program()
More complex example usage:
import os
import grp
import signal
import daemon
import lockfile
from spam import (
    initial_program_setup,
    do_main_program,
    program_cleanup,
    reload_program_config,
    )
context = daemon.DaemonContext(
    working_directory='/var/lib/foo',
    umask=0o002,
    pidfile=lockfile.FileLock('/var/run/spam.pid'),
    )
context.signal_map = {
    signal.SIGTERM: program_cleanup,
    signal.SIGHUP: 'terminate',
    signal.SIGUSR1: reload_program_config,
    }
mail_gid = grp.getgrnam('mail').gr_gid
context.gid = mail_gid
important_file = open('spam.data', 'w')
interesting_file = open('eggs.data', 'w')
context.files_preserve = [important_file, interesting_file]
initial_program_setup()
with context:
    do_main_program()
Interface
A new package, daemon, is added to the standard library.
A class, DaemonContext, is defined to represent the settings and process context for the program running as a daemon process.
DaemonContext objects
A DaemonContext instance represents the behaviour settings and process context for the program when it becomes a daemon. The behaviour and environment is customised by setting options on the instance, before calling the open method.
Each option can be passed as a keyword argument to the DaemonContext constructor, or subsequently altered by assigning to an attribute on the instance at any time prior to calling open. That is, for options named wibble and wubble, the following invocation:
foo = daemon.DaemonContext(wibble=bar, wubble=baz)
foo.open()
is equivalent to:
foo = daemon.DaemonContext()
foo.wibble = bar
foo.wubble = baz
foo.open()
The following options are defined.
- files_preserve
Default: None
List of files that should not be closed when starting the daemon. If None, all open file descriptors will be closed.
Elements of the list are file descriptors (as returned by a file object's fileno() method) or Python file objects. Each specifies a file that is not to be closed during daemon start.
- chroot_directory
Default: None
Full path to a directory to set as the effective root directory of the process. If None, specifies that the root directory is not to be changed.
- working_directory
Default: '/'
Full path of the working directory to which the process should change on daemon start.
Since a filesystem cannot be unmounted if a process has its current working directory on that filesystem, this should either be left at default or set to a directory that is a sensible “home directory” for the daemon while it is running.
- umask
Default: 0
File access creation mask (“umask”) to set for the process on daemon start.
Since a process inherits its umask from its parent process, starting the daemon will reset the umask to this value so that files are created by the daemon with access modes as it expects.
- pidfile
Default: None
Context manager for a PID lock file. When the daemon context opens and closes, it enters and exits the pidfile context manager.
- detach_process
Default: None
If True, detach the process context when opening the daemon context; if False, do not detach.
If unspecified (None) during initialisation of the instance, this will be set to True by default, and False only if detaching the process is determined to be redundant; for example, in the case when the process was started by init, by initd, or by inetd.
- signal_map
Default: system-dependent
Mapping from operating system signals to callback actions.
The mapping is used when the daemon context opens, and determines the action for each signal's signal handler:
- A value of None will ignore the signal (by setting the signal action to signal.SIG_IGN).
- A string value will be used as the name of an attribute on the DaemonContext instance. The attribute's value will be used as the action for the signal handler.
- Any other value will be used as the action for the signal handler.
The default value depends on which signals are defined on the running system. Each item from the list below whose signal is actually defined in the signal module will appear in the default map:
- signal.SIGTTIN: None
- signal.SIGTTOU: None
- signal.SIGTSTP: None
- signal.SIGTERM: 'terminate'
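The described construction of the default map can be sketched as follows (an illustrative sketch, not python-daemon's actual code):

```python
import signal

# Keep only the entries whose signal exists on the running platform.
candidates = [
    ("SIGTTIN", None),
    ("SIGTTOU", None),
    ("SIGTSTP", None),
    ("SIGTERM", "terminate"),
]
default_map = {
    getattr(signal, name): action
    for name, action in candidates
    if hasattr(signal, name)
}
assert default_map[signal.SIGTERM] == "terminate"
```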
Depending on how the program will interact with its child processes, it may need to specify a signal map that includes the signal.SIGCHLD signal (received when a child process exits). See the specific operating system's documentation for more detail on how to determine what circumstances dictate the need for signal handlers.
- uid
Default: os.getuid()
- gid
Default: os.getgid()
The user ID (“UID”) value and group ID (“GID”) value to switch the process to on daemon start.
The default values, the real UID and GID of the process, will relinquish any effective privilege elevation inherited by the process.
- prevent_core
Default: True
If true, prevents the generation of core files, in order to avoid leaking sensitive information from daemons run as root.
- stdin
Default: None
- stdout
Default: None
- stderr
Default: None
Each of stdin, stdout, and stderr is a file-like object which will be used as the new file for the standard I/O stream sys.stdin, sys.stdout, and sys.stderr respectively. The file should therefore be open, with a minimum of mode 'r' in the case of stdin, and mode 'w+' in the case of stdout and stderr.
If the object has a fileno() method that returns a file descriptor, the corresponding file will be excluded from being closed during daemon start (that is, it will be treated as though it were listed in files_preserve).
If None, the corresponding system stream is re-bound to the file named by os.devnull.
The following methods are defined.
- open()
Return: None
Open the daemon context, turning the current program into a daemon process. This performs the following steps:
- If this instance's is_open property is true, return immediately. This makes it safe to call open multiple times on an instance.
- If the prevent_core attribute is true, set the resource limits for the process to prevent any core dump from the process.
- If the chroot_directory attribute is not None, set the effective root directory of the process to that directory (via os.chroot).
  This allows running the daemon process inside a “chroot gaol” as a means of limiting the system's exposure to rogue behaviour by the process. Note that the specified directory needs to already be set up for this purpose.
- Set the process UID and GID to the uid and gid attribute values.
- Close all open file descriptors. This excludes those listed in the files_preserve attribute, and those that correspond to the stdin, stdout, or stderr attributes.
- Change current working directory to the path specified by the working_directory attribute.
- Reset the file access creation mask to the value specified by the umask attribute.
- If the detach_process option is true, detach the current process into its own process group, and disassociate from any controlling terminal.
- Set signal handlers as specified by the signal_map attribute.
- If any of the attributes stdin, stdout, stderr are not None, bind the system streams sys.stdin, sys.stdout, and/or sys.stderr to the files represented by the corresponding attributes. Where the attribute has a file descriptor, the descriptor is duplicated (instead of re-binding the name).
- If the pidfile attribute is not None, enter its context manager.
- Mark this instance as open (for the purpose of future open and close calls).
- Register the close method to be called during Python's exit processing.
When the function returns, the running program is a daemon process.
- close()
Return: None
Close the daemon context. This performs the following steps:
- If this instance's is_open property is false, return immediately. This makes it safe to call close multiple times on an instance.
- If the pidfile attribute is not None, exit its context manager.
- Mark this instance as closed (for the purpose of future open and close calls).
- is_open
Return: True if the instance is open, False otherwise.
This property exposes the state indicating whether the instance is currently open. It is True if the instance's open method has been called and the close method has not subsequently been called.
- terminate(signal_number, stack_frame)
Return: None
Signal handler for the signal.SIGTERM signal. Performs the following step:
- Raise a SystemExit exception explaining the signal.
The class also implements the context manager protocol via __enter__ and __exit__ methods.
- __enter__()
Return: The DaemonContext instance
Call the instance's open() method, then return the instance.
- __exit__(exc_type, exc_value, exc_traceback)
Return: True or False as defined by the context manager protocol
Call the instance's close() method, then return True if the exception was handled or False if it was not.
Motivation
The majority of programs written to be Unix daemons either implement behaviour very similar to that in the specification, or are poorly behaved when measured against correct daemon behaviour.
Since these steps should be much the same in most implementations but are very particular and easy to omit or implement incorrectly, they are a prime target for a standard well-tested implementation in the standard library.
Rationale
Correct daemon behaviour
According to Stevens in [stevens] §2.6, a program should perform the following steps to become a Unix daemon process.
- Close all open file descriptors.
- Change current working directory.
- Reset the file access creation mask.
- Run in the background.
- Disassociate from process group.
- Ignore terminal I/O signals.
- Disassociate from control terminal.
- Don't reacquire a control terminal.
- Correctly handle the following circumstances:
- Started by System V init process.
- Daemon termination by SIGTERM signal.
- Children generate SIGCLD signal.
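The core of Stevens' recipe can be sketched in a few lines; this is a bare-bones illustration only (it omits closing all open descriptors, signal handling, and PID-file locking, all of which python-daemon handles):

```python
import os
import sys

def daemonize(umask=0o022, workdir="/"):
    """Sketch of the classic double-fork recipe: the second fork ensures
    the daemon is not a session leader and so can never reacquire a
    controlling terminal."""
    if os.fork() > 0:        # first fork: return control to the shell
        os._exit(0)
    os.setsid()              # become session leader, detach from the tty
    if os.fork() > 0:        # second fork: no longer a session leader
        os._exit(0)
    os.chdir(workdir)        # don't pin a mounted filesystem
    os.umask(umask)          # reset the file-creation mask
    # Re-bind the standard streams to /dev/null.
    with open(os.devnull, "rb") as null_in, open(os.devnull, "ab") as null_out:
        os.dup2(null_in.fileno(), sys.stdin.fileno())
        os.dup2(null_out.fileno(), sys.stdout.fileno())
        os.dup2(null_out.fileno(), sys.stderr.fileno())
```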
The daemon tool [slack-daemon] lists (in its summary of features) behaviour that should be performed when turning a program into a well-behaved Unix daemon process. It differs from this PEP's intent in that it invokes a separate program as a daemon process. The following features are appropriate for a daemon that starts itself once the program is already running:
- Sets up the correct process context for a daemon.
- Behaves sensibly when started by initd(8) or inetd(8).
- Revokes any suid or sgid privileges to reduce security risks in case daemon is incorrectly installed with special privileges.
- Prevents the generation of core files to prevent leaking sensitive information from daemons run as root (optional).
- Names the daemon by creating and locking a PID file to guarantee that only one daemon with the given name can execute at any given time (optional).
- Sets the user and group under which to run the daemon (optional, root only).
- Creates a chroot gaol (optional, root only).
- Captures the daemon's stdout and stderr and directs them to syslog (optional).
A daemon is not a service
This PEP addresses only Unix-style daemons, for which the above correct behaviour is relevant, as opposed to comparable behaviours on other operating systems.
There is a related concept in many systems, called a “service”. A service differs from the model in this PEP, in that rather than having the current program continue to run as a daemon process, a service starts an additional process to run in the background, and the current process communicates with that additional process via some defined channels.
The Unix-style daemon model in this PEP can be used, among other things, to implement the background-process part of a service; but this PEP does not address the other aspects of setting up and managing a service.
Reference Implementation
The python-daemon package [python-daemon].
Other daemon implementations
Prior to this PEP, several existing third-party Python libraries or tools implemented some of this PEP's correct daemon behaviour.
The reference implementation is a fairly direct successor from the following implementations:
- Many good ideas were contributed by the community to Python cookbook recipes #66012 [cookbook-66012] and #278731 [cookbook-278731].
- The bda.daemon library [bda.daemon] is an implementation of [cookbook-66012]. It is the predecessor of [python-daemon].
Other Python daemon implementations that differ from this PEP:
- The zdaemon tool [zdaemon] was written for the Zope project. Like [slack-daemon], it differs from this specification because it is used to run another program as a daemon process.
- The Python library daemon [clapper-daemon] is (according to its homepage) no longer maintained. As of version 1.0.1, it implements the basic steps from [stevens].
- The daemonize library [seutter-daemonize] also implements the basic steps from [stevens].
- Ray Burr's daemon.py module [burr-daemon] provides the [stevens] procedure as well as PID file handling and redirection of output to syslog.
- Twisted [twisted] includes, perhaps unsurprisingly, an implementation of a process daemonisation API that is integrated with the rest of the Twisted framework; it differs significantly from the API in this PEP.
- The Python initd library [dagitses-initd], which uses [clapper-daemon], implements an equivalent of Unix initd(8) for controlling a daemon process.
References
| [stevens] | (1, 2, 3, 4) Unix Network Programming, W. Richard Stevens, 1994 Prentice Hall. |
| [slack-daemon] | (1, 2) The (non-Python) “libslack” implementation of a daemon tool http://www.libslack.org/daemon/ by “raf” <raf@raf.org>. |
| [python-daemon] | (1, 2) The python-daemon library http://pypi.python.org/pypi/python-daemon/ by Ben Finney et al. |
| [cookbook-66012] | (1, 2) Python Cookbook recipe 66012, “Fork a daemon process on Unix” http://code.activestate.com/recipes/66012/. |
| [cookbook-278731] | Python Cookbook recipe 278731, “Creating a daemon the Python way” http://code.activestate.com/recipes/278731/. |
| [bda.daemon] | The bda.daemon library http://pypi.python.org/pypi/bda.daemon/ by Robert Niederreiter et al. |
| [zdaemon] | The zdaemon tool http://pypi.python.org/pypi/zdaemon/ by Guido van Rossum et al. |
| [clapper-daemon] | (1, 2) The daemon library http://pypi.python.org/pypi/daemon/ by Brian Clapper. |
| [seutter-daemonize] | The daemonize library http://daemonize.sourceforge.net/ by Jerry Seutter. |
| [burr-daemon] | The daemon.py module http://www.nightmare.com/~ryb/code/daemon.py by Ray Burr. |
| [twisted] | The Twisted application framework http://pypi.python.org/pypi/Twisted/ by Glyph Lefkowitz et al. |
| [dagitses-initd] | The Python initd library http://pypi.python.org/pypi/initd/ by Michael Andreas Dagitses. |
Copyright
This work is hereby placed in the public domain. To the extent that placing a work in the public domain is not legally possible, the copyright holder hereby grants to all recipients of this work all rights and freedoms that would otherwise be restricted by copyright.
pep-3144 IP Address Manipulation Library for the Python Standard Library
| PEP: | 3144 |
|---|---|
| Title: | IP Address Manipulation Library for the Python Standard Library |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Peter Moody <pmoody at google.com> |
| BDFL-Delegate: | Nick Coghlan |
| Discussions-To: | <ipaddr-py-dev at googlegroups.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 6-Feb-2012 |
| Python-Version: | 3.3 |
| Resolution: | http://mail.python.org/pipermail/python-dev/2012-May/119474.html |
Abstract:
This PEP proposes a design for an IP address manipulation module for
Python.
PEP Acceptance:
This PEP was accepted by Nick Coghlan on the 15th of May, 2012.
Motivation:
Several very good IP address modules for Python already exist.
The truth is that all of them struggle with the balance between
adherence to Pythonic principles and the shorthand upon which
network engineers and administrators rely. ipaddress aims to
strike the right balance.
Rationale:
The existence of several Python IP address manipulation modules is
evidence of an outstanding need for the functionality this module
seeks to provide.
Background:
PEP 3144 and ipaddr have been up for inclusion before. The
version of the library specified here is backwards incompatible
with the version on PyPI and the one which was discussed before.
In order to avoid confusing users of the current ipaddr, I've
renamed this version of the library "ipaddress".
The main differences between ipaddr and ipaddress are:
* ipaddress *Network classes are equivalent to the ipaddr *Network
class counterparts with the strict flag set to True.
* ipaddress *Interface classes are equivalent to the ipaddr
*Network class counterparts with the strict flag set to False.
* The factory functions in ipaddress were renamed to disambiguate
them from classes.
* A few attributes were renamed to disambiguate their purpose as
well (e.g. network, network_address).
* A number of methods and functions which returned containers in ipaddr now
return iterators. This includes subnets, address_exclude,
summarize_address_range and collapse_address_list.
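The iterator behaviour can be seen with the stdlib ipaddress module that resulted from this PEP (a minimal sketch; behaviour as in Python 3.3+):

```python
import ipaddress

net = ipaddress.ip_network("192.0.2.0/24")

# subnets() now returns an iterator rather than a list, so large
# networks can be split lazily without materializing every subnet.
halves = net.subnets(prefixlen_diff=1)
print(list(halves))
# [IPv4Network('192.0.2.0/25'), IPv4Network('192.0.2.128/25')]
```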
Due to the backwards incompatible API changes between ipaddress and ipaddr,
the proposal is to add the module using the new provisional API status:
* http://docs.python.org/dev/glossary.html#term-provisional-package
Relevant messages on python-dev:
* http://mail.python.org/pipermail/python-dev/2012-January/116016.html
* http://mail.python.org/pipermail/python-dev/2012-February/116656.html
* http://mail.python.org/pipermail/python-dev/2012-February/116688.html
Specification:
The ipaddress module defines a total of 6 new public classes, 3 for
manipulating IPv4 objects and 3 for manipulating IPv6 objects.
The classes are as follows:
IPv4Address/IPv6Address - These define individual addresses, for
example the IPv4 address returned by an A record query for
www.google.com (74.125.224.84) or the IPv6 address returned by a
AAAA record query for ipv6.google.com (2001:4860:4001:801::1011).
IPv4Network/IPv6Network - These define networks or groups of
addresses, for example the IPv4 network reserved for multicast use
(224.0.0.0/4) or the IPv6 network reserved for multicast
(ff00::/8, wow, that's big).
IPv4Interface/IPv6Interface - These hybrid classes refer to an
individual address on a given network. For example, the IPV4
address 192.0.2.1 on the network 192.0.2.0/24 could be referred to
as 192.0.2.1/24. Likewise, the IPv6 address 2001:DB8::1 on the
network 2001:DB8::/96 could be referred to as 2001:DB8::1/96.
It's very common to refer to addresses assigned to computer
network interfaces like this, hence the Interface name.
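The three kinds of class can be seen side by side with the stdlib ipaddress module that resulted from this PEP (a minimal sketch using the module's factory functions):

```python
import ipaddress

# An individual address, a network, and an interface (address + network).
addr = ipaddress.ip_address("192.0.2.1")
net = ipaddress.ip_network("192.0.2.0/24")
iface = ipaddress.ip_interface("192.0.2.1/24")

assert addr in net            # membership test against a network
assert iface.network == net   # an Interface knows its Network...
assert iface.ip == addr       # ...and its bare Address
```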
All IPv4 classes share certain characteristics and methods: the
number of bits needed to represent them, whether or not they
belong to certain special IPv4 network ranges, etc. Similarly,
all IPv6 classes share characteristics and methods.
ipaddress makes extensive use of inheritance to avoid code
duplication as much as possible. The parent classes are private,
but they are outlined here:
_IPAddrBase - Provides methods common to all ipaddr objects.
_BaseAddress - Provides methods common to IPv4Address and
IPv6Address.
_BaseInterface - Provides methods common to IPv4Interface and
IPv6Interface, as well as IPv4Network and IPv6Network (ipaddress
treats the Network classes as a special case of Interface).
_BaseV4 - Provides methods and variables (e.g. _max_prefixlen)
common to all IPv4 classes.
_BaseV6 - Provides methods and variables common to all IPv6 classes.
Comparisons between objects of differing IP versions result in a
TypeError [1]. Additionally, comparisons of objects with
different _Base parent classes result in a TypeError. The effect
of the _Base parent class limitation is that IPv4Interface objects
can be compared to IPv4Network objects, and IPv6Interface objects
can be compared to IPv6Network objects.
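The cross-version rule can be demonstrated with the stdlib ipaddress module (a minimal sketch; ordering comparisons between IPv4 and IPv6 objects raise TypeError):

```python
import ipaddress

v4 = ipaddress.ip_address("192.0.2.1")
v6 = ipaddress.ip_address("2001:db8::1")

try:
    v4 < v6                    # ordering across IP versions
except TypeError:
    print("mixed-version comparison raises TypeError")
```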
Reference Implementation:
The current reference implementation can be found at:
http://code.google.com/p/ipaddress-py/source/browse/ipaddress.py
Or see the tarball, which includes the README and unit tests:
http://code.google.com/p/ipaddress-py/downloads/detail?name=ipaddress-1.0.tar.gz
More information about using the reference implementation can be
found at: http://code.google.com/p/ipaddr-py/wiki/Using3144
References:
[1] Appealing to authority is a logical fallacy, but Vint Cerf is
an authority who can't be ignored. Full text of the email
follows:
"""
I have seen a substantial amount of traffic about IPv4 and
IPv6 comparisons and the general consensus is that these are
not comparable.
If we were to take a very simple minded view, we might treat
these as pure integers in which case there is an ordering but
not a useful one.
In the IPv4 world, "length" is important because we take
longest (most specific) address first for routing. Length is
determine by the mask, as you know.
Assuming that the same style of argument works in IPv6, we
would have to conclude that treating an IPv6 value purely as
an integer for comparison with IPv4 would lead to some really
strange results.
All of IPv4 space would lie in the host space of 0::0/96
prefix of IPv6. For any useful interpretation of IPv4, this is
a non-starter.
I think the only sensible conclusion is that IPv4 values and
IPv6 values should be treated as non-comparable.
Vint
"""
Copyright:
This document has been placed in the public domain.
pep-3145 Asynchronous I/O For subprocess.Popen
| PEP: | 3145 |
|---|---|
| Title: | Asynchronous I/O For subprocess.Popen |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | (James) Eric Pruitt, Charles R. McCreary, Josiah Carlson |
| Status: | Withdrawn |
| Type: | Standards Track |
| Content-Type: | text/plain |
| Created: | 04-Aug-2009 |
| Python-Version: | 3.2 |
| Post-History: |
Abstract:
In its present form, the subprocess.Popen implementation is prone to
deadlocking and to blocking the parent Python script while waiting on data
from the child process. This PEP proposes to make
subprocess.Popen more asynchronous to help alleviate these
problems.
PEP Deferral:
Further exploration of the concepts covered in this PEP has been deferred
at least until after PEP 3156 has been resolved.
PEP Withdrawal:
This can be dealt with in the bug tracker.
A specific proposal is attached to http://bugs.python.org/issue18823
Motivation:
A search for "python asynchronous subprocess" will turn up numerous
accounts of people wanting to execute a child process and communicate with
it from time to time, reading only the data that is available instead of
blocking to wait for the program to produce data [1] [2] [3]. The current
behavior of the subprocess module is that when a user sends or receives
data via the stdin, stderr and stdout file objects, deadlocks are common
and documented [4] [5]. While communicate can be used to alleviate some of
the buffering issues, it still causes the parent process to block while
attempting to read data when none is available to be read from the child
process.
Rationale:
There is a documented need for asynchronous, non-blocking functionality in
subprocess.Popen [6] [7] [2] [3]. Inclusion of the code would improve the
utility of the Python standard library on both Unix-based and Windows
builds of Python. Practically every I/O object in Python has a
file-like wrapper of some sort. Sockets already act as such, and for
strings there is StringIO. Popen can be made to act like a file by simply
using the methods attached to the subprocess.Popen.stderr, stdout and
stdin file-like objects. But when using the read and write methods of
those objects, you do not have the benefit of asynchronous I/O. In the
proposed solution the wrapper wraps the asynchronous methods to mimic a
file object.
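For context, the kind of non-blocking read this PEP is after can be sketched today with only the stdlib (a POSIX-only sketch using the selectors module; this is not the PEP's proposed API):

```python
import selectors
import subprocess

proc = subprocess.Popen(["echo", "hello"], stdout=subprocess.PIPE)
sel = selectors.DefaultSelector()
sel.register(proc.stdout, selectors.EVENT_READ)

chunks = []
while True:
    if not sel.select(timeout=1.0):  # nothing readable yet; could do other work
        continue
    data = proc.stdout.read1(4096)   # read only what is already available
    if not data:                     # EOF: the child closed its stdout
        break
    chunks.append(data)

sel.unregister(proc.stdout)
proc.wait()
assert b"".join(chunks) == b"hello\n"
```

Note that selectors cannot watch pipe handles on Windows, which is exactly the platform gap the PEP's ctypes-based implementation addresses.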
Reference Implementation:
I have been maintaining a Google Code repository that contains all of my
changes, including tests and documentation [9], as well as a blog detailing
the problems I have come across in the development process [10].
I have been working on implementing non-blocking asynchronous I/O in the
subprocess.Popen module as well as a wrapper class for subprocess.Popen
that makes it so that an executed process can take the place of a file by
duplicating all of the methods and attributes that file objects have.
There are two base functions that have been added to the subprocess.Popen
class: Popen.send and Popen._recv, each with two separate implementations,
one for Windows and one for Unix-based systems. The Windows
implementation uses ctypes to access the functions needed to control pipes
in the kernel32 DLL in an asynchronous manner. On Unix-based systems,
the Python interface for file control serves the same purpose. The
different implementations of Popen.send and Popen._recv have identical
arguments to make code that uses these functions work across multiple
platforms.
The Popen._recv function requires the pipe name to be passed as an
argument, so the Popen.recv function exists to select stdout as the
pipe for Popen._recv by default, and Popen.recv_err
selects stderr as the pipe by default. Popen.recv and Popen.recv_err
are much easier to read and understand than Popen._recv('stdout' ...) and
Popen._recv('stderr' ...) respectively.
Since the Popen._recv function does not wait on data to be produced
before returning a value, it may return empty bytes. Popen.asyncread
handles this issue by returning all data read over a given time
interval.
The ProcessIOWrapper class uses the asyncread and asyncwrite functions to
allow a process to act like a file so that there are no blocking issues
that can arise from using the stdout and stdin file objects produced from
a subprocess.Popen call.
References:
[1] [ python-Feature Requests-1191964 ] asynchronous Subprocess
http://mail.python.org/pipermail/python-bugs-list/2006-December/
036524.html
[2] Daily Life in an Ivory Basement : /feb-07/problems-with-subprocess
http://ivory.idyll.org/blog/feb-07/problems-with-subprocess
[3] How can I run an external command asynchronously from Python? - Stack
Overflow
http://stackoverflow.com/questions/636561/how-can-i-run-an-external-
command-asynchronously-from-python
[4] 18.1. subprocess - Subprocess management - Python v2.6.2 documentation
http://docs.python.org/library/subprocess.html#subprocess.Popen.wait
[5] 18.1. subprocess - Subprocess management - Python v2.6.2 documentation
http://docs.python.org/library/subprocess.html#subprocess.Popen.kill
[6] Issue 1191964: asynchronous Subprocess - Python tracker
http://bugs.python.org/issue1191964
[7] Module to allow Asynchronous subprocess use on Windows and Posix
platforms - ActiveState Code
http://code.activestate.com/recipes/440554/
[8] subprocess.rst - subprocdev - Project Hosting on Google Code
http://code.google.com/p/subprocdev/source/browse/doc/subprocess.rst?spec=svn2c925e935cad0166d5da85e37c742d8e7f609de5&r=2c925e935cad0166d5da85e37c742d8e7f609de5#437
[9] subprocdev - Project Hosting on Google Code
http://code.google.com/p/subprocdev
[10] Python Subprocess Dev
http://subdev.blogspot.com/
Copyright:
This PEP is licensed under the Open Publication License:
http://www.opencontent.org/openpub/.
pep-3146 Merging Unladen Swallow into CPython
| PEP: | 3146 |
|---|---|
| Title: | Merging Unladen Swallow into CPython |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Collin Winter <collinwinter at google.com>, Jeffrey Yasskin <jyasskin at google.com>, Reid Kleckner <rnk at mit.edu> |
| Status: | Withdrawn |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 1-Jan-2010 |
| Python-Version: | 3.3 |
| Post-History: |
Contents
PEP Withdrawal
With Unladen Swallow going the way of the Norwegian Blue [1] [2], this PEP has been deemed to have been withdrawn.
Abstract
This PEP proposes the merger of the Unladen Swallow project [3] into CPython's source tree. Unladen Swallow is an open-source branch of CPython focused on performance. Unladen Swallow is source-compatible with valid Python 2.6.4 applications and C extension modules.
Unladen Swallow adds a just-in-time (JIT) compiler to CPython, allowing for the compilation of selected Python code to optimized machine code. Beyond classical static compiler optimizations, Unladen Swallow's JIT compiler takes advantage of data collected at runtime to make checked assumptions about code behaviour, allowing the production of faster machine code.
This PEP proposes to integrate Unladen Swallow into CPython's development tree in a separate py3k-jit branch, targeted for eventual merger with the main py3k branch. While Unladen Swallow is by no means finished or perfect, we feel that Unladen Swallow has reached sufficient maturity to warrant incorporation into CPython's roadmap. We have sought to create a stable platform that the wider CPython development team can build upon, a platform that will yield increasing performance for years to come.
This PEP will detail Unladen Swallow's implementation and how it differs from CPython 2.6.4; the benchmarks used to measure performance; the tools used to ensure correctness and compatibility; the impact on CPython's current platform support; and the impact on the CPython core development process. The PEP concludes with a proposed merger plan and brief notes on possible directions for future work.
We seek the following from the BDFL:
- Approval for the overall concept of adding a just-in-time compiler to CPython, following the design laid out below.
- Permission to continue working on the just-in-time compiler in the CPython source tree.
- Permission to eventually merge the just-in-time compiler into the py3k branch once all blocking issues [32] have been addressed.
- A pony.
Rationale, Implementation
Many companies and individuals would like Python to be faster, to enable its use in more projects. Google is one such company.
Unladen Swallow is a Google-sponsored branch of CPython, initiated to improve the performance of Google's numerous Python libraries, tools and applications. To make the adoption of Unladen Swallow as easy as possible, the project initially aimed at four goals:
- A performance improvement of 5x over the baseline of CPython 2.6.4 for single-threaded code.
- 100% source compatibility with valid CPython 2.6 applications.
- 100% source compatibility with valid CPython 2.6 C extension modules.
- Design for eventual merger back into CPython.
We chose 2.6.4 as our baseline because Google uses CPython 2.4 internally, and jumping directly from CPython 2.4 to CPython 3.x was considered infeasible.
To achieve the desired performance, Unladen Swallow has implemented a just-in-time (JIT) compiler [52] in the tradition of Urs Hoelzle's work on Self [53], gathering feedback at runtime and using that to inform compile-time optimizations. This is similar to the approach taken by the current breed of JavaScript engines [60], [61]; most Java virtual machines [65]; Rubinius [62], MacRuby [64], and other Ruby implementations; Psyco [66]; and others.
We explicitly reject any suggestion that our ideas are original. We have sought to reuse the published work of other researchers wherever possible. If we have done any original work, it is by accident. We have tried, as much as possible, to take good ideas from all corners of the academic and industrial community. A partial list of the research papers that have informed Unladen Swallow is available on the Unladen Swallow wiki [55].
The key observation about optimizing dynamic languages is that they are only dynamic in theory; in practice, each individual function or snippet of code is relatively static, using a stable set of types and child functions. The current CPython bytecode interpreter assumes the worst about the code it is running, that at any moment the user might override the len() function or pass a never-before-seen type into a function. In practice this never happens, but user code pays for that support. Unladen Swallow takes advantage of the relatively static nature of user code to improve performance.
At a high level, the Unladen Swallow JIT compiler works by translating a function's CPython bytecode to platform-specific machine code, using data collected at runtime, as well as classical compiler optimizations, to improve the quality of the generated machine code. Because we only want to spend resources compiling Python code that will actually benefit the runtime of the program, an online heuristic is used to assess how hot a given function is. Once the hotness value for a function crosses a given threshold, it is selected for compilation and optimization. Until a function is judged hot, however, it runs in the standard CPython eval loop, which in Unladen Swallow has been instrumented to record interesting data about each bytecode executed. This runtime data is used to reduce the flexibility of the generated machine code, allowing us to optimize for the common case. For example, we collect data on
- Whether a branch was taken/not taken. If a branch is never taken, we will not compile it to machine code.
- Types used by operators. If we find that a + b is only ever adding integers, the generated machine code for that snippet will not support adding floats.
- Functions called at each callsite. If we find that a particular foo() callsite is always calling the same foo function, we can optimize the call or inline it away.
Refer to [56] for a complete list of data points gathered and how they are used.
However, if by chance the historically-untaken branch is now taken, or some integer-optimized a + b snippet receives two strings, we must support this. We cannot change Python semantics. Each of these sections of optimized machine code is preceded by a guard, which checks whether the simplifying assumptions we made when optimizing still hold. If the assumptions are still valid, we run the optimized machine code; if they are not, we revert back to the interpreter and pick up where we left off.
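The hotness-plus-guard scheme described above can be caricatured in pure Python (a conceptual sketch only; Unladen Swallow's real pipeline gathers feedback in the eval loop and emits LLVM IR, not Python):

```python
HOT_THRESHOLD = 3

def make_add():
    """Conceptual sketch: count calls, record type feedback, and switch
    to a guarded "specialized" path once the function is judged hot."""
    calls = 0
    int_only = True              # runtime type feedback so far

    def add(a, b):
        nonlocal calls, int_only
        calls += 1
        if calls > HOT_THRESHOLD and int_only:
            # Guard: the simplifying assumption must still hold.
            if type(a) is int and type(b) is int:
                return a + b     # stand-in for optimized machine code
            int_only = False     # guard failed: deoptimize for good
        if not (type(a) is int and type(b) is int):
            int_only = False     # feedback recorded by the "interpreter"
        return a + b             # generic interpreter path
    return add
```

A real JIT invalidates the compiled code on guard failure; here the flag simply routes every later call back to the generic path.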
We have chosen to reuse a set of existing compiler libraries called LLVM [4] for code generation and code optimization. This has saved our small team from needing to understand and debug code generation on multiple machine instruction sets and from needing to implement a large set of classical compiler optimizations. The project would not have been possible without such code reuse. We have found LLVM easy to modify and its community receptive to our suggestions and modifications.
In somewhat more depth, Unladen Swallow's JIT works by compiling CPython bytecode to LLVM's own intermediate representation (IR) [96], taking into account any runtime data from the CPython eval loop. We then run a set of LLVM's built-in optimization passes, producing a smaller, optimized version of the original LLVM IR. LLVM then lowers the IR to platform-specific machine code, performing register allocation, instruction scheduling, and any necessary relocations. This arrangement of the compilation pipeline allows the LLVM-based JIT to be easily omitted from a compiled python binary by passing --without-llvm to ./configure; various use cases for this flag are discussed later.
For a complete detailing of how Unladen Swallow works, consult the Unladen Swallow documentation [54], [56].
Unladen Swallow has focused on improving the performance of single-threaded, pure-Python code. We have not made an effort to remove CPython's global interpreter lock (GIL); we feel this is separate from our work, and due to its sensitivity, is best done in a mainline development branch. We considered making GIL-removal a part of Unladen Swallow, but were concerned by the possibility of introducing subtle bugs when porting our work from CPython 2.6 to 3.x.
A JIT compiler is an extremely versatile tool, and we have by no means exhausted its full potential. We have tried to create a sufficiently flexible framework that the wider CPython development community can build upon it for years to come, extracting increased performance in each subsequent release.
Alternatives
There are a number of alternative strategies for improving Python performance which we considered, but found unsatisfactory.
Cython, Shedskin: Cython [103] and Shedskin [104] are both static compilers for Python. We view these as useful-but-limited workarounds for CPython's historically-poor performance. Shedskin does not support the full Python standard library [105], while Cython requires manual Cython-specific annotations for optimum performance.
Static compilers like these are useful for writing extension modules without worrying about reference counting, but because they are static, ahead-of-time compilers, they cannot optimize the full range of code under consideration by a just-in-time compiler informed by runtime data.
IronPython: IronPython [108] is Python on Microsoft's .Net platform. It is not actively tested on Mono [109], meaning that it is essentially Windows-only, making it unsuitable as a general CPython replacement.
Jython: Jython [110] is a complete implementation of Python 2.5, but is significantly slower than Unladen Swallow (3-5x on measured benchmarks) and has no support for CPython extension modules [111], which would make migration of large applications prohibitively expensive.
Psyco: Psyco [66] is a specializing JIT compiler for CPython, implemented as an extension module. It primarily improves performance for numerical code. Pros: exists; makes some code faster. Cons: 32-bit only, with no plans for 64-bit support; supports x86 only; very difficult to maintain; incompatible with SSE2 optimized code due to alignment issues.
PyPy: PyPy [67] has good performance on numerical code, but is slower than Unladen Swallow on some workloads. Migration of large applications from CPython to PyPy would be prohibitively expensive: PyPy's JIT compiler supports only 32-bit x86 code generation; important modules, such as MySQLdb and pycrypto, do not build against PyPy; PyPy does not offer an embedding API, much less the same API as CPython.
PyV8: PyV8 [112] is an alpha-stage experimental Python-to-JavaScript compiler that runs on top of V8. PyV8 does not implement the whole Python language, and has no support for CPython extension modules.
WPython: WPython [106] is a wordcode-based reimplementation of CPython's interpreter loop. While it provides a modest improvement to interpreter performance [107], it is not an either-or substitute for a just-in-time compiler. An interpreter will never be as fast as optimized machine code. We view WPython and similar interpreter enhancements as complementary to our work, rather than as competitors.
Performance
Benchmarks
Unladen Swallow has developed a fairly large suite of benchmarks, ranging from synthetic microbenchmarks designed to test a single feature up through whole-application macrobenchmarks. The inspiration for these benchmarks has come variously from third-party contributors (in the case of the html5lib benchmark), Google's own internal workloads (slowspitfire, pickle, unpickle), as well as tools and libraries in heavy use throughout the wider Python community (django, 2to3, spambayes). These benchmarks are run through a single interface called perf.py that takes care of collecting memory usage information, graphing performance, and running statistics on the benchmark results to ensure significance.
The full list of available benchmarks is available on the Unladen Swallow wiki [44], including instructions on downloading and running the benchmarks for yourself. All our benchmarks are open-source; none are Google-proprietary. We believe this collection of benchmarks serves as a useful tool to benchmark any complete Python implementation, and indeed, PyPy is already using these benchmarks for their own performance testing [82], [97]. We welcome this, and we seek additional workloads for the benchmark suite from the Python community.
We have focused our efforts on collecting macrobenchmarks and benchmarks that simulate real applications as well as possible, when running a whole application is not feasible. Along a different axis, our benchmark collection originally focused on the kinds of workloads seen by Google's Python code (webapps, text processing), though we have since expanded the collection to include workloads Google cares nothing about. We have so far shied away from heavily-numerical workloads, since NumPy [81] already does an excellent job on such code and so improving numerical performance was not an initial high priority for the team; we have begun to incorporate such benchmarks into the collection [98] and have started work on optimizing numerical Python code.
Beyond these benchmarks, there are also a variety of workloads we are explicitly not interested in benchmarking. Unladen Swallow is focused on improving the performance of pure-Python code, so the performance of extension modules like NumPy is uninteresting since NumPy's core routines are implemented in C. Similarly, workloads that involve a lot of IO like GUIs, databases or socket-heavy applications would, we feel, fail to accurately measure interpreter or code generation optimizations. That said, there's certainly room to improve the performance of C-language extension modules in the standard library, and as such, we have added benchmarks for the cPickle and re modules.
Performance vs CPython 2.6.4
The charts below compare the arithmetic mean of multiple benchmark iterations for CPython 2.6.4 and Unladen Swallow. perf.py gathers more data than this, and indeed, arithmetic mean is not the whole story; we reproduce only the mean for the sake of conciseness. We include the t score from the Student's two-tailed T-test [45] at the 95% confidence interval to indicate the significance of the result. Most benchmarks are run for 100 iterations, though some longer-running whole-application benchmarks are run for fewer iterations.
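A two-sample t score of the kind reported below can be computed as follows (a simplified sketch of the significance test; perf.py's actual statistics live in its own source):

```python
import math
import statistics

def t_score(sample_a, sample_b):
    """Welch's two-sample t statistic: difference of means divided by
    the pooled standard error of the two samples."""
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    return (mean_a - mean_b) / math.sqrt(
        var_a / len(sample_a) + var_b / len(sample_b)
    )
```

Large |t| values (against the appropriate degrees of freedom) indicate that the difference between the two benchmark runs is unlikely to be noise.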
A description of each of these benchmarks is available on the Unladen Swallow wiki [44].
Command:
./perf.py -r -b default,apps ../a/python ../b/python
32-bit; gcc 4.0.3; Ubuntu Dapper; Intel Core2 Duo 6600 @ 2.4GHz; 2 cores; 4MB L2 cache; 4GB RAM
| Benchmark | CPython 2.6.4 | Unladen Swallow r988 | Change | Significance | Timeline |
|---|---|---|---|---|---|
| 2to3 | 25.13 s | 24.87 s | 1.01x faster | t=8.94 | http://tinyurl.com/yamhrpg |
| django | 1.08 s | 0.80 s | 1.35x faster | t=315.59 | http://tinyurl.com/y9mrn8s |
| html5lib | 14.29 s | 13.20 s | 1.08x faster | t=2.17 | http://tinyurl.com/y8tyslu |
| nbody | 0.51 s | 0.28 s | 1.84x faster | t=78.007 | http://tinyurl.com/y989qhg |
| rietveld | 0.75 s | 0.55 s | 1.37x faster | Insignificant | http://tinyurl.com/ye7mqd3 |
| slowpickle | 0.75 s | 0.55 s | 1.37x faster | t=20.78 | http://tinyurl.com/ybrsfnd |
| slowspitfire | 0.83 s | 0.61 s | 1.36x faster | t=2124.66 | http://tinyurl.com/yfknhaw |
| slowunpickle | 0.33 s | 0.26 s | 1.26x faster | t=15.12 | http://tinyurl.com/yzlakoo |
| spambayes | 0.31 s | 0.34 s | 1.10x slower | Insignificant | http://tinyurl.com/yem62ub |
64-bit; gcc 4.2.4; Ubuntu Hardy; AMD Opteron 8214 HE @ 2.2 GHz; 4 cores; 1MB L2 cache; 8GB RAM
| Benchmark | CPython 2.6.4 | Unladen Swallow r988 | Change | Significance | Timeline |
|---|---|---|---|---|---|
| 2to3 | 31.98 s | 30.41 s | 1.05x faster | t=8.35 | http://tinyurl.com/ybcrl3b |
| django | 1.22 s | 0.94 s | 1.30x faster | t=106.68 | http://tinyurl.com/ybwqll6 |
| html5lib | 18.97 s | 17.79 s | 1.06x faster | t=2.78 | http://tinyurl.com/yzlyqvk |
| nbody | 0.77 s | 0.27 s | 2.86x faster | t=133.49 | http://tinyurl.com/yeyqhbg |
| rietveld | 0.74 s | 0.80 s | 1.08x slower | t=-2.45 | http://tinyurl.com/yzjc6ff |
| slowpickle | 0.91 s | 0.62 s | 1.48x faster | t=28.04 | http://tinyurl.com/yf7en6k |
| slowspitfire | 1.01 s | 0.72 s | 1.40x faster | t=98.70 | http://tinyurl.com/yc8pe2o |
| slowunpickle | 0.51 s | 0.34 s | 1.51x faster | t=32.65 | http://tinyurl.com/yjufu4j |
| spambayes | 0.43 s | 0.45 s | 1.06x slower | Insignificant | http://tinyurl.com/yztbjfp |
Many of these benchmarks take a hit under Unladen Swallow because the current version blocks execution to compile Python functions down to machine code. This leads to the behaviour seen in the timeline graphs for the html5lib and rietveld benchmarks, for example, and slows down the overall performance of 2to3. We have an active development branch to fix this problem ([47], [48]), but working within the strictures of CPython's current threading system has complicated the process and required far more care and time than originally anticipated. We view this issue as critical to final merger into the py3k branch.
We have obviously not met our initial goal of a 5x performance improvement. A performance retrospective follows, which addresses why we failed to meet our initial performance goal. We maintain a list of yet-to-be-implemented performance work [51].
Memory Usage
The following table shows maximum memory usage (in kilobytes) for each of Unladen Swallow's default benchmarks for both CPython 2.6.4 and Unladen Swallow r988, as well as a timeline of memory usage across the lifetime of the benchmark. We include tables for both 32- and 64-bit binaries. Memory usage was measured on Linux 2.6 systems by summing the Private_ sections from the kernel's /proc/$pid/smaps pseudo-files [46].
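The smaps-based measurement can be sketched in a few lines (a Linux-only sketch of the approach described above; perf.py's own logic may differ in detail):

```python
def private_kb(pid):
    """Sum the Private_Clean and Private_Dirty fields (in kB) from
    /proc/<pid>/smaps, i.e. memory not shared with any other process."""
    total = 0
    with open(f"/proc/{pid}/smaps") as smaps:
        for line in smaps:
            # Lines look like: "Private_Dirty:        12 kB"
            if line.startswith(("Private_Clean:", "Private_Dirty:")):
                total += int(line.split()[1])
    return total
```

Summing only the Private_ sections excludes pages shared with other processes (such as the C library), giving a fairer per-process figure than total RSS.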
Command:
./perf.py -r --track_memory -b default,apps ../a/python ../b/python
32-bit
| Benchmark | CPython 2.6.4 | Unladen Swallow r988 | Change | Timeline |
|---|---|---|---|---|
| 2to3 | 26396 kb | 46896 kb | 1.77x | http://tinyurl.com/yhr2h4z |
| django | 10028 kb | 27740 kb | 2.76x | http://tinyurl.com/yhan8vs |
| html5lib | 150028 kb | 173924 kb | 1.15x | http://tinyurl.com/ybt44en |
| nbody | 3020 kb | 16036 kb | 5.31x | http://tinyurl.com/ya8hltw |
| rietveld | 15008 kb | 46400 kb | 3.09x | http://tinyurl.com/yhd5dra |
| slowpickle | 4608 kb | 16656 kb | 3.61x | http://tinyurl.com/ybukyvo |
| slowspitfire | 85776 kb | 97620 kb | 1.13x | http://tinyurl.com/y9vj35z |
| slowunpickle | 3448 kb | 13744 kb | 3.98x | http://tinyurl.com/yexh4d5 |
| spambayes | 7352 kb | 46480 kb | 6.32x | http://tinyurl.com/yem62ub |
64-bit
| Benchmark | CPython 2.6.4 | Unladen Swallow r988 | Change | Timeline |
|---|---|---|---|---|
| 2to3 | 51596 kb | 82340 kb | 1.59x | http://tinyurl.com/yljg6rs |
| django | 16020 kb | 38908 kb | 2.43x | http://tinyurl.com/ylqsebh |
| html5lib | 259232 kb | 324968 kb | 1.25x | http://tinyurl.com/yha6oee |
| nbody | 4296 kb | 23012 kb | 5.35x | http://tinyurl.com/yztozza |
| rietveld | 24140 kb | 73960 kb | 3.06x | http://tinyurl.com/ybg2nq7 |
| slowpickle | 4928 kb | 23300 kb | 4.73x | http://tinyurl.com/yk5tpbr |
| slowspitfire | 133276 kb | 148676 kb | 1.11x | http://tinyurl.com/y8bz2xe |
| slowunpickle | 4896 kb | 16948 kb | 3.46x | http://tinyurl.com/ygywwoc |
| spambayes | 10728 kb | 84992 kb | 7.92x | http://tinyurl.com/yhjban5 |
The increased memory usage comes from a) LLVM code generation, analysis and optimization libraries; b) native code; c) memory usage issues or leaks in LLVM; d) data structures needed to optimize and generate machine code; e) as-yet uncategorized other sources.
While we have made significant progress in reducing memory usage since the initial naive JIT implementation [43], there is obviously more to do. We believe that there are still memory savings to be made without sacrificing performance. We have tended to focus on raw performance, and we have not yet made a concerted push to reduce memory usage. We view reducing memory usage as a blocking issue for final merger into the py3k branch. We seek guidance from the community on an acceptable level of increased memory usage.
Start-up Time
Statically linking LLVM's code generation, analysis and optimization libraries increases the time needed to start the Python binary. C++ static initializers used by LLVM also increase start-up time, as does importing the collection of pre-compiled C runtime routines we want to inline to Python code.
Results from Unladen Swallow's startup benchmarks:
$ ./perf.py -r -b startup /tmp/cpy-26/bin/python /tmp/unladen/bin/python

### normal_startup ###
Min: 0.219186 -> 0.352075: 1.6063x slower
Avg: 0.227228 -> 0.364384: 1.6036x slower
Significant (t=-51.879098, a=0.95)
Stddev: 0.00762 -> 0.02532: 3.3227x larger
Timeline: http://tinyurl.com/yfe8z3r

### startup_nosite ###
Min: 0.105949 -> 0.264912: 2.5004x slower
Avg: 0.107574 -> 0.267505: 2.4867x slower
Significant (t=-703.557403, a=0.95)
Stddev: 0.00214 -> 0.00240: 1.1209x larger
Timeline: http://tinyurl.com/yajn8fa

### bzr_startup ###
Min: 0.067990 -> 0.097985: 1.4412x slower
Avg: 0.084322 -> 0.111348: 1.3205x slower
Significant (t=-37.432534, a=0.95)
Stddev: 0.00793 -> 0.00643: 1.2330x smaller
Timeline: http://tinyurl.com/ybdm537

### hg_startup ###
Min: 0.016997 -> 0.024997: 1.4707x slower
Avg: 0.026990 -> 0.036772: 1.3625x slower
Significant (t=-53.104502, a=0.95)
Stddev: 0.00406 -> 0.00417: 1.0273x larger
Timeline: http://tinyurl.com/ycout8m
bzr_startup and hg_startup measure how long it takes Bazaar and Mercurial, respectively, to display their help screens. startup_nosite runs python -S many times; usage of the -S option is rare, but we feel this gives a good indication of where increased startup time is coming from.
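The "Significant (t=..., a=0.95)" lines in the benchmark output come from a two-sample t-test over the timing samples. A minimal sketch of that statistic, assuming equal sample sizes (this is illustrative, not perf.py's actual implementation):

```python
import math

def t_statistic(a, b):
    """Two-sample t statistic for equal-sized samples a and b."""
    n = len(a)  # assumes len(a) == len(b)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # Sample variances (Bessel's correction).
    var_a = sum((x - mean_a) ** 2 for x in a) / (n - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n - 1)
    return (mean_a - mean_b) / math.sqrt((var_a + var_b) / n)
```

A |t| value beyond the critical value for the chosen confidence level (a=0.95) marks the difference as significant rather than noise.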
Unladen Swallow has made headway toward optimizing startup time, but there is still more work to do and further optimizations to implement. Improving start-up time is a high-priority item [34] in Unladen Swallow's merger punchlist.
Binary Size
Statically linking LLVM's code generation, analysis and optimization libraries significantly increases the size of the python binary. The tables below report stripped on-disk binary sizes; the binaries are stripped to better correspond with the configurations used by system package managers. We feel this is the most realistic measure of any change in binary size.
| Binary size | CPython 2.6.4 | CPython 3.1.1 | Unladen Swallow r1041 |
|---|---|---|---|
| 32-bit | 1.3M | 1.4M | 12M |
| 64-bit | 1.6M | 1.6M | 12M |
The increased binary size is caused by statically linking LLVM's code generation, analysis and optimization libraries into the python binary. This can be straightforwardly addressed by modifying LLVM to better support shared linking and then using that, instead of the current static linking. For the moment, though, static linking provides an accurate look at the cost of linking against LLVM.
Even when statically linking, we believe there is still headroom to improve on-disk binary size by narrowing Unladen Swallow's dependencies on LLVM. This issue is actively being addressed [33].
Performance Retrospective
Our initial goal for Unladen Swallow was a 5x performance improvement over CPython 2.6. We did not hit that goal, nor, to put it bluntly, did we even come close. Why did the project not hit that goal, and can an LLVM-based JIT ever hit it?
Why did Unladen Swallow not achieve its 5x goal? The primary reason was that LLVM required more work than we had initially anticipated. Based on the fact that Apple was shipping products based on LLVM [83], and other high-level languages had successfully implemented LLVM-based JITs ([62], [64], [84]), we had assumed that LLVM's JIT was relatively free of show-stopper bugs.
That turned out to be incorrect. We had to turn our attention away from performance to fix a number of critical bugs in LLVM's JIT infrastructure (for example, [85], [86]) as well as a number of nice-to-have enhancements that would enable further optimizations along various axes (for example, [88], [87], [89]). LLVM's static code generation facilities, tools and optimization passes are stable and stress-tested, but the just-in-time infrastructure was relatively untested and buggy. We have fixed this.
(Our hypothesis is that we hit these problems -- problems other projects had avoided -- because of the complexity and thoroughness of CPython's standard library test suite.)
We also diverted engineering effort away from performance and into support tools such as gdb and oProfile. gdb did not work well with JIT compilers at all, and LLVM previously had no integration with oProfile. Having JIT-aware debuggers and profilers has been very valuable to the project, and we do not regret channeling our time in these directions. See the Debugging and Profiling sections for more information.
Can an LLVM-based CPython JIT ever hit the 5x performance target? The benchmark results for JIT-based JavaScript implementations suggest that 5x is indeed possible, as do the results PyPy's JIT has delivered for numeric workloads. The experience of Self-92 [53] is also instructive.
Can LLVM deliver this? We believe that we have only begun to scratch the surface of what our LLVM-based JIT can deliver. The optimizations we have incorporated into this system thus far have borne significant fruit (for example, [90], [91], [92]). Our experience to date is that the limiting factor on Unladen Swallow's performance is the engineering cycles needed to implement the literature. We have found LLVM easy to work with and to modify, and its built-in optimizations have greatly simplified the task of implementing Python-level optimizations.
An overview of further performance opportunities is discussed in the Future Work section.
Correctness and Compatibility
Unladen Swallow's correctness test suite includes CPython's test suite (under Lib/test/), as well as a number of important third-party applications and libraries [6]. A full list of these applications and libraries is reproduced below. Any dependencies needed by these packages, such as zope.interface [35], are also tested indirectly as a part of testing the primary package, thus widening the corpus of tested third-party Python code.
- 2to3
- Cheetah
- cvs2svn
- Django
- Nose
- NumPy
- PyCrypto
- pyOpenSSL
- PyXML
- Setuptools
- SQLAlchemy
- SWIG
- SymPy
- Twisted
- ZODB
These applications pass all relevant tests when run under Unladen Swallow. Note that some tests that failed against our baseline of CPython 2.6.4 were disabled, as were tests that made assumptions about CPython internals such as exact bytecode numbers or bytecode format. Any package with disabled tests includes a README.unladen file that details the changes (for example, [38]).
In addition, Unladen Swallow is tested automatically against an array of internal Google Python libraries and applications. These include Google's internal Python bindings for BigTable [36], the Mondrian code review application [37], and Google's Python standard library, among others. The changes needed to run these projects under Unladen Swallow have consistently fallen into one of three camps:
- Adding CPython 2.6 C API compatibility. Since Google still primarily uses CPython 2.4 internally, we have needed to convert uses of int to Py_ssize_t and similar API changes.
- Fixing or disabling explicit, incorrect tests of the CPython version number.
- Conditionally disabling code that worked around or depended on bugs in CPython 2.4 that have since been fixed.
Testing against this wide range of public and proprietary applications and libraries has been instrumental in ensuring the correctness of Unladen Swallow. Testing has exposed bugs that we have duly corrected. Our automated regression testing regime has given us high confidence in our changes as we have moved forward.
In addition to third-party testing, we have added further tests to CPython's test suite for corner cases of the language or implementation that we felt were untested or underspecified (for example, [49], [50]). These have been especially important when implementing optimizations, helping make sure we have not accidentally broken the darker corners of Python.
We have also constructed a test suite focused solely on the LLVM-based JIT compiler and the optimizations implemented for it [39]. Because of the complexity and subtlety inherent in writing an optimizing compiler, we have attempted to exhaustively enumerate the constructs, scenarios and corner cases we are compiling and optimizing. The JIT tests also include tests for things like the JIT hotness model, making it easier for future CPython developers to maintain and improve.
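A hotness model of the kind tested above can be illustrated with a simple counter per code object; the threshold and the pure call-count heuristic here are hypothetical, not Unladen Swallow's actual weighting:

```python
HOTNESS_THRESHOLD = 1000  # hypothetical value

class HotnessCounter:
    """Toy hotness model: request JIT compilation once a code object is hot."""

    def __init__(self):
        self.count = 0
        self.jit_requested = False

    def record_call(self):
        self.count += 1
        if not self.jit_requested and self.count >= HOTNESS_THRESHOLD:
            self.jit_requested = True  # a real JIT would compile here
```

Tests for such a model exercise the boundary: a function called one time fewer than the threshold must stay interpreted, and one more call must trigger compilation.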
We have recently begun using fuzz testing [40] to stress-test the compiler. We have used both pyfuzz [41] and Fusil [42] in the past, and we recommend they be introduced as an automated part of the CPython testing process.
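A toy fuzzer in the spirit of pyfuzz and Fusil (purely illustrative, far simpler than either tool): generate random arithmetic expressions and check that compiling and evaluating them never raises:

```python
import random

def random_expr(depth=3):
    """Generate a random parenthesized integer arithmetic expression."""
    if depth == 0:
        return str(random.randint(0, 9))
    op = random.choice(["+", "-", "*"])
    return "(%s %s %s)" % (random_expr(depth - 1), op, random_expr(depth - 1))

for _ in range(100):
    src = random_expr()
    # Any crash or unexpected exception here is a compiler bug worth filing.
    result = eval(compile(src, "<fuzz>", "eval"))
    assert isinstance(result, int)
```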
Known Incompatibilities
The only application or library we know of that works with CPython 2.6.4 but not with Unladen Swallow is Psyco [66]. We are also aware of some libraries, such as PyGame [80], that work well with CPython 2.6.4 but suffer some performance degradation due to changes made in Unladen Swallow. We are tracking this issue [48] and are working to resolve these instances of degradation.
While Unladen Swallow is source-compatible with CPython 2.6.4, it is not binary compatible. C extension modules compiled against one will need to be recompiled to work with the other.
The merger of Unladen Swallow should have minimal impact on long-lived CPython optimization branches like WPython. WPython [106] and Unladen Swallow are largely orthogonal, and there is no technical reason why both could not be merged into CPython. The changes needed to make WPython compatible with a JIT-enhanced version of CPython should be minimal [115]. The same should be true for other CPython optimization projects (for example, [116]).
Invasive forks of CPython such as Stackless Python [117] are more challenging to support. Since Stackless is highly unlikely to be merged into CPython [118] and an increased maintenance burden is part and parcel of any fork, we consider compatibility with Stackless to be relatively low-priority. JIT-compiled stack frames use the C stack, so Stackless should be able to treat them the same as it treats calls through extension modules. If that turns out to be unacceptable, Stackless could either remove the JIT compiler or improve JIT code generation to better support heap-based stack frames [119], [120].
Platform Support
Unladen Swallow is inherently limited by the platform support provided by LLVM, especially LLVM's JIT compilation system [7]. LLVM's JIT has the best support on x86 and x86-64 systems, and these are the platforms where Unladen Swallow has received the most testing. We are confident in LLVM/Unladen Swallow's support for x86 and x86-64 hardware. PPC and ARM support exists, but is not widely used and may be buggy (for example, [101], [85], [102]).
Unladen Swallow is known to work on the following operating systems: Linux, Darwin, Windows. Unladen Swallow has received the most testing on Linux and Darwin, though it still builds and passes its tests on Windows.
In order to support hardware and software platforms where LLVM's JIT does not work, Unladen Swallow provides a ./configure --without-llvm option. This flag carves out any part of Unladen Swallow that depends on LLVM, yielding a Python binary that works and passes its tests, but has no performance advantages. This configuration is recommended for hardware unsupported by LLVM, or systems that care more about memory usage than performance.
Impact on CPython Development
Experimenting with Changes to Python or CPython Bytecode
Unladen Swallow's JIT compiler operates on CPython bytecode, and as such, it is immune to Python language changes that affect only the parser.
We recommend that changes to the CPython bytecode compiler or the semantics of individual bytecodes be prototyped in the interpreter loop first, then be ported to the JIT compiler once the semantics are clear. To make this easier, Unladen Swallow includes a --without-llvm configure-time option that strips out the JIT compiler and all associated infrastructure. This leaves the current burden of experimentation unchanged so that developers can prototype in the current low-barrier-to-entry interpreter loop.
Unladen Swallow began implementing its JIT compiler by doing straightforward, naive translations from bytecode implementations into LLVM API calls. We found this process to be easily understood, and we recommend the same approach for CPython. We include several sample changes from the Unladen Swallow repository here as examples of this style of development: [26], [27], [28], [29].
Debugging
The Unladen Swallow team implemented changes to gdb to make it easier to use gdb to debug JIT-compiled Python code. These changes were released in gdb 7.0 [17]. They make it possible for gdb to identify and unwind past JIT-generated call stack frames. This allows gdb to continue to function as before for CPython development if one is changing, for example, the list type or builtin functions.
Example backtrace after our changes, where baz, bar and foo are JIT-compiled:
Program received signal SIGSEGV, Segmentation fault.
0x00002aaaabe7d1a8 in baz ()
(gdb) bt
#0  0x00002aaaabe7d1a8 in baz ()
#1  0x00002aaaabe7d12c in bar ()
#2  0x00002aaaabe7d0aa in foo ()
#3  0x00002aaaabe7d02c in main ()
#4  0x0000000000b870a2 in llvm::JIT::runFunction (this=0x1405b70, F=0x14024e0, ArgValues=...)
    at /home/rnk/llvm-gdb/lib/ExecutionEngine/JIT/JIT.cpp:395
#5  0x0000000000baa4c5 in llvm::ExecutionEngine::runFunctionAsMain (this=0x1405b70, Fn=0x14024e0, argv=..., envp=0x7fffffffe3c0)
    at /home/rnk/llvm-gdb/lib/ExecutionEngine/ExecutionEngine.cpp:377
#6  0x00000000007ebd52 in main (argc=2, argv=0x7fffffffe3a8, envp=0x7fffffffe3c0)
    at /home/rnk/llvm-gdb/tools/lli/lli.cpp:208
Previously, the JIT-compiled frames would have caused gdb to unwind incorrectly, generating lots of obviously-incorrect #6 0x00002aaaabe7d0aa in ?? ()-style stack frames.
Highlights:
- gdb 7.0 is able to correctly parse JIT-compiled stack frames, allowing full use of gdb on non-JIT-compiled functions, that is, the vast majority of the CPython codebase.
- Disassembling inside a JIT-compiled stack frame automatically prints the full list of instructions making up that function. This is an advance over the state of gdb before our work: developers needed to guess the starting address of the function and manually disassemble the assembly code.
- Flexible underlying mechanism allows CPython to add more and more information, and eventually reach parity with C/C++ support in gdb for JIT-compiled machine code.
Lowlights:
- gdb cannot print local variables or tell you what line you're currently executing inside a JIT-compiled function. Nor can it step through JIT-compiled code, except for one instruction at a time.
- Not yet integrated with Apple's gdb or Microsoft's Visual Studio debuggers.
The Unladen Swallow team is working with Apple to get these changes incorporated into their future gdb releases.
Profiling
Unladen Swallow integrates with oProfile 0.9.4 and newer [18] to support assembly-level profiling on Linux systems. This means that oProfile will correctly symbolize JIT-compiled functions in its reports.
Example report, where the #u#-prefixed symbol names are JIT-compiled Python functions:
$ opreport -l ./python | less
CPU: Core 2, speed 1600 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        image name  symbol name
79589     4.2329  python      PyString_FromFormatV
62971     3.3491  python      PyEval_EvalCodeEx
62713     3.3354  python      tupledealloc
57071     3.0353  python      _PyEval_CallFunction
50009     2.6597  24532.jo    #u#force_unicode
47468     2.5246  python      PyUnicodeUCS2_Decode
45829     2.4374  python      PyFrame_New
45173     2.4025  python      lookdict_string
43082     2.2913  python      PyType_IsSubtype
39763     2.1148  24532.jo    #u#render5
38145     2.0287  python      _PyType_Lookup
37643     2.0020  python      PyObject_GC_UnTrack
37105     1.9734  python      frame_dealloc
36849     1.9598  python      PyEval_EvalFrame
35630     1.8950  24532.jo    #u#resolve
33313     1.7717  python      PyObject_IsInstance
33208     1.7662  python      PyDict_GetItem
33168     1.7640  python      PyTuple_New
30458     1.6199  python      PyCFunction_NewEx
This support is functional, but as-yet unpolished. Unladen Swallow maintains a punchlist of items we feel are important to improve in our oProfile integration to make it more useful to core CPython developers [19].
Highlights:
- Symbolization of JITted frames working in oProfile on Linux.
Lowlights:
- No work yet invested in improving symbolization of JIT-compiled frames for Apple's Shark [20] or Microsoft's Visual Studio profiling tools.
- Some polishing still desired for oProfile output.
We recommend using oProfile 0.9.5 (or newer) to avoid a bug in oProfile on x86-64 platforms that was fixed in that release. oProfile 0.9.4 works fine on 32-bit platforms, however.
Given the ease of integrating oProfile with LLVM [21] and Unladen Swallow [22], other profiling tools should be easy as well, provided they support a similar JIT interface [23].
We have documented the process for using oProfile to profile Unladen Swallow [24]. This document will be merged into CPython's Doc/ tree in the merge.
Addition of C++ to CPython
In order to use LLVM, Unladen Swallow has introduced C++ into the core CPython tree and build process. This is an unavoidable part of depending on LLVM; though LLVM offers a C API [8], it is limited and does not expose the functionality needed by CPython. Because of this, we have implemented the internal details of the Unladen Swallow JIT and its supporting infrastructure in C++. We do not propose converting the entire CPython codebase to C++.
Highlights:
- Easy use of LLVM's full, powerful code generation and related APIs.
- Convenient, abstract data structures simplify code.
- C++ is limited to relatively small corners of the CPython codebase.
- C++ can be disabled via ./configure --without-llvm, which even omits the dependency on libstdc++.
Lowlights:
- Developers must know two related languages, C and C++, to work on the full range of CPython's internals.
- A C++ style guide will need to be developed and enforced. PEP 7 will be extended [121] to encompass C++ by taking the relevant parts of the C++ style guides from Unladen Swallow [71], LLVM [72] and Google [73].
- Different C++ compilers emit different ABIs; this can cause problems if CPython is compiled with one C++ compiler and extension modules are compiled with a different C++ compiler.
Managing LLVM Releases, C++ API Changes
LLVM is released regularly every six months. This means that LLVM may be released two or three times during the course of development of a CPython 3.x release. Each LLVM release brings newer and more powerful optimizations, improved platform support and more sophisticated code generation.
LLVM releases usually include incompatible changes to the LLVM C++ API; the release notes for LLVM 2.6 [9] include a list of intentionally-introduced incompatibilities. Unladen Swallow has tracked LLVM trunk closely over the course of development. Our experience has been that LLVM API changes are obvious and easily or mechanically remedied. We include two such changes from the Unladen Swallow tree as references here: [10], [11].
Due to API incompatibilities, we recommend that an LLVM-based CPython target compatibility with a single version of LLVM at a time. This will lower the overhead on the core development team. Pegging to an LLVM version should not be a problem from a packaging perspective, because pre-built LLVM packages generally become available via standard system package managers fairly quickly following an LLVM release, and failing that, llvm.org itself includes binary releases.
Unladen Swallow has historically included a copy of the LLVM and Clang source trees in the Unladen Swallow tree; this was done to allow us to closely track LLVM trunk as we made patches to it. We do not recommend this model of development for CPython. CPython releases should be based on official LLVM releases. Pre-built LLVM packages are available from MacPorts [12] for Darwin, and from most major Linux distributions ([13], [14], [16]). LLVM itself provides additional binaries, such as for MinGW [25].
LLVM is currently intended to be statically linked; this means that binary releases of CPython will include the relevant parts (not all!) of LLVM. This will increase the binary size, as noted above. To simplify downstream package management, we will modify LLVM to better support shared linking. This issue will block final merger [99].
Unladen Swallow has tasked a full-time engineer with fixing any remaining critical issues in LLVM before LLVM's 2.7 release. We consider it essential that CPython 3.x be able to depend on a released version of LLVM, rather than closely tracking LLVM trunk as Unladen Swallow has done. We believe we will finish this work [100] before the release of LLVM 2.7, expected in May 2010.
Building CPython
In addition to a runtime dependency on LLVM, Unladen Swallow includes a build-time dependency on Clang [5], an LLVM-based C/C++ compiler. We use this to compile parts of the C-language Python runtime to LLVM's intermediate representation; this allows us to perform cross-language inlining, yielding increased performance. Clang is not required to run Unladen Swallow. Clang binary packages are available from most major Linux distributions (for example, [15]).
We examined the impact of Unladen Swallow on the time needed to build Python, including configure, full builds and incremental builds after touching a single C source file.
| ./configure | CPython 2.6.4 | CPython 3.1.1 | Unladen Swallow r988 |
|---|---|---|---|
| Run 1 | 0m20.795s | 0m16.558s | 0m15.477s |
| Run 2 | 0m15.255s | 0m16.349s | 0m15.391s |
| Run 3 | 0m15.228s | 0m16.299s | 0m15.528s |
| Full make | CPython 2.6.4 | CPython 3.1.1 | Unladen Swallow r988 |
|---|---|---|---|
| Run 1 | 1m30.776s | 1m22.367s | 1m54.053s |
| Run 2 | 1m21.374s | 1m22.064s | 1m49.448s |
| Run 3 | 1m22.047s | 1m23.645s | 1m49.305s |
Full builds take a hit due to a) additional .cc files needed for LLVM interaction, b) statically linking LLVM into libpython, c) compiling parts of the Python runtime to LLVM IR to enable cross-language inlining.
Incremental builds are also somewhat slower than mainline CPython. The table below shows incremental rebuild times after touching Objects/listobject.c.
| Incr make | CPython 2.6.4 | CPython 3.1.1 | Unladen Swallow r1024 |
|---|---|---|---|
| Run 1 | 0m1.854s | 0m1.456s | 0m6.680s |
| Run 2 | 0m1.437s | 0m1.442s | 0m5.310s |
| Run 3 | 0m1.440s | 0m1.425s | 0m7.639s |
As with full builds, this extra time comes from statically linking LLVM into libpython. If libpython were linked shared against LLVM, this overhead would go down.
Proposed Merge Plan
We propose focusing our efforts on eventual merger with CPython's 3.x line of development. The BDFL has indicated that 2.7 is to be the final release of CPython's 2.x line of development [30], and since 2.7 alpha 1 has already been released [31], we have missed the window. Python 3 is the future, and that is where we will target our performance efforts.
We recommend the following plan for merger of Unladen Swallow into the CPython source tree:
- Creation of a branch in the CPython SVN repository to work in, call it py3k-jit as a strawman. This will be a branch of the CPython py3k branch.
- We will keep this branch closely integrated to py3k. The further we deviate, the harder our work will be.
- Any JIT-related patches will go into the py3k-jit branch.
- Non-JIT-related patches will go into the py3k branch (once reviewed and approved) and be merged back into the py3k-jit branch.
- Potentially-contentious issues, such as the introduction of new command line flags or environment variables, will be discussed on python-dev.
Because Google uses CPython 2.x internally, Unladen Swallow is based on CPython 2.6. We would need to port our compiler to Python 3; this would be done as patches are applied to the py3k-jit branch, so that the branch remains a consistent implementation of Python 3 at all times.
We believe this approach will be minimally disruptive to the 3.2 or 3.3 release process while we iron out any remaining issues blocking final merger into py3k. Unladen Swallow maintains a punchlist of known issues needed before final merger [32], which includes all problems mentioned in this PEP; we trust the CPython community will have its own concerns. This punchlist is not static; other issues may emerge in the future that will block final merger into the py3k branch.
Changes will be committed directly to the py3k-jit branch, with only large, tricky or controversial changes sent for pre-commit code review.
Contingency Plans
There is a chance that we will not be able to reduce memory usage or startup time to a level satisfactory to the CPython community. Our primary contingency plan for this situation is to shift from an online just-in-time compilation strategy to an offline ahead-of-time strategy using an instrumented CPython interpreter loop to obtain feedback. This is the same model used by gcc's feedback-directed optimizations (-fprofile-generate) [113] and Microsoft Visual Studio's profile-guided optimizations [114]; we will refer to this as "feedback-directed optimization" here, or FDO.
We believe that an FDO compiler for Python would be inferior to a JIT compiler. FDO requires a high-quality, representative benchmark suite, which is a relative rarity in both open- and closed-source development. A JIT compiler can dynamically find and optimize the hot spots in any application -- benchmark suite or no -- allowing it to adapt to changes in application bottlenecks without human intervention.
If an ahead-of-time FDO compiler is required, it should be able to leverage a large percentage of the code and infrastructure already developed for Unladen Swallow's JIT compiler. Indeed, these two compilation strategies could exist side-by-side.
Future Work
A JIT compiler is an extremely flexible tool, and we have by no means exhausted its full potential. Unladen Swallow maintains a list of performance optimizations [51] that the team has not yet had time to implement. Examples:
- Python/Python inlining [68]. Our compiler currently performs no inlining between pure-Python functions. Work on this is on-going [70].
- Unboxing [69]. Unboxing is critical for numerical performance. PyPy in particular has demonstrated the value of unboxing to heavily-numeric workloads.
- Recompilation, adaptation. Unladen Swallow currently only compiles a Python function once, based on its usage pattern up to that point. If the usage pattern changes, limitations in LLVM [74] prevent us from recompiling the function to better serve the new usage pattern.
- JIT-compile regular expressions. Modern JavaScript engines reuse their JIT compilation infrastructure to boost regex performance [75]. Unladen Swallow has developed benchmarks for Python regular expression performance ([76], [77], [78]), but work on regex performance is still at an early stage [79].
- Trace compilation [93], [94]. Based on the results of PyPy and Tracemonkey [95], we believe that a CPython JIT should incorporate trace compilation to some degree. We initially avoided a purely-tracing JIT compiler in favor of a simpler, function-at-a-time compiler. However this function-at-a-time compiler has laid the groundwork for a future tracing compiler implemented in the same terms.
- Profile generation/reuse. The runtime data gathered by the JIT could be persisted to disk and reused by subsequent JIT compilations, or by external tools such as Cython [103] or a feedback-enhanced code coverage tool.
This list is by no means exhaustive. There is a vast literature on optimizations for dynamic languages that could and should be implemented in terms of Unladen Swallow's LLVM-based JIT compiler [55].
Unladen Swallow Community
We would like to thank the community of developers who have contributed to Unladen Swallow, in particular: James Abbatiello, Joerg Blank, Eric Christopher, Alex Gaynor, Chris Lattner, Nick Lewycky, Evan Phoenix and Thomas Wouters.
Licensing
All work on Unladen Swallow is licensed to the Python Software Foundation (PSF) under the terms of the Python Software Foundation License v2 [57] under the umbrella of Google's blanket Contributor License Agreement with the PSF.
LLVM is licensed [58] under the University of Illinois/NCSA Open Source License [59], a liberal, OSI-approved license. The University of Illinois Urbana-Champaign is the sole copyright holder for LLVM.
References
| [1] | http://qinsb.blogspot.com/2011/03/unladen-swallow-retrospective.html |
| [2] | http://en.wikipedia.org/wiki/Dead_Parrot_sketch |
| [3] | http://code.google.com/p/unladen-swallow/ |
| [4] | http://llvm.org/ |
| [5] | http://clang.llvm.org/ |
| [6] | http://code.google.com/p/unladen-swallow/wiki/Testing |
| [7] | http://llvm.org/docs/GettingStarted.html#hardware |
| [8] | http://llvm.org/viewvc/llvm-project/llvm/trunk/include/llvm-c/ |
| [9] | http://llvm.org/releases/2.6/docs/ReleaseNotes.html#whatsnew |
| [10] | http://code.google.com/p/unladen-swallow/source/detail?r=820 |
| [11] | http://code.google.com/p/unladen-swallow/source/detail?r=532 |
| [12] | http://trac.macports.org/browser/trunk/dports/lang/llvm/Portfile |
| [13] | http://packages.ubuntu.com/karmic/llvm |
| [14] | http://packages.debian.org/unstable/devel/llvm |
| [15] | http://packages.debian.org/sid/clang |
| [16] | http://koji.fedoraproject.org/koji/buildinfo?buildID=134384 |
| [17] | http://www.gnu.org/software/gdb/download/ANNOUNCEMENT |
| [18] | http://oprofile.sourceforge.net/news/ |
| [19] | http://code.google.com/p/unladen-swallow/issues/detail?id=63 |
| [20] | http://developer.apple.com/tools/sharkoptimize.html |
| [21] | http://llvm.org/viewvc/llvm-project?view=rev&revision=75279 |
| [22] | http://code.google.com/p/unladen-swallow/source/detail?r=986 |
| [23] | http://oprofile.sourceforge.net/doc/devel/jit-interface.html |
| [24] | http://code.google.com/p/unladen-swallow/wiki/UsingOProfile |
| [25] | http://llvm.org/releases/download.html |
| [26] | http://code.google.com/p/unladen-swallow/source/detail?r=359 |
| [27] | http://code.google.com/p/unladen-swallow/source/detail?r=376 |
| [28] | http://code.google.com/p/unladen-swallow/source/detail?r=417 |
| [29] | http://code.google.com/p/unladen-swallow/source/detail?r=517 |
| [30] | http://mail.python.org/pipermail/python-dev/2010-January/095682.html |
| [31] | http://www.python.org/dev/peps/pep-0373/ |
| [32] | (1, 2) http://code.google.com/p/unladen-swallow/issues/list?q=label:Merger |
| [33] | http://code.google.com/p/unladen-swallow/issues/detail?id=118 |
| [34] | http://code.google.com/p/unladen-swallow/issues/detail?id=64 |
| [35] | http://www.zope.org/Products/ZopeInterface |
| [36] | http://en.wikipedia.org/wiki/BigTable |
| [37] | http://www.niallkennedy.com/blog/2006/11/google-mondrian.html |
| [38] | http://code.google.com/p/unladen-swallow/source/browse/tests/lib/sqlalchemy/README.unladen |
| [39] | http://code.google.com/p/unladen-swallow/source/browse/trunk/Lib/test/test_llvm.py |
| [40] | http://en.wikipedia.org/wiki/Fuzz_testing |
| [41] | http://bitbucket.org/ebo/pyfuzz/overview/ |
| [42] | http://lwn.net/Articles/322826/ |
| [43] | http://code.google.com/p/unladen-swallow/issues/detail?id=68 |
| [44] | (1, 2) http://code.google.com/p/unladen-swallow/wiki/Benchmarks |
| [45] | http://en.wikipedia.org/wiki/Student's_t-test |
| [46] | http://bmaurer.blogspot.com/2006/03/memory-usage-with-smaps.html |
| [47] | http://code.google.com/p/unladen-swallow/source/browse/branches/background-thread |
| [48] | (1, 2) http://code.google.com/p/unladen-swallow/issues/detail?id=40 |
| [49] | http://code.google.com/p/unladen-swallow/source/detail?r=888 |
| [50] | http://code.google.com/p/unladen-swallow/source/diff?spec=svn576&r=576&format=side&path=/trunk/Lib/test/test_trace.py |
| [51] | (1, 2) http://code.google.com/p/unladen-swallow/issues/list?q=label:Performance |
| [52] | http://en.wikipedia.org/wiki/Just-in-time_compilation |
| [53] | (1, 2) http://research.sun.com/self/papers/urs-thesis.html |
| [54] | http://code.google.com/p/unladen-swallow/wiki/ProjectPlan |
| [55] | (1, 2) http://code.google.com/p/unladen-swallow/wiki/RelevantPapers |
| [56] | (1, 2) http://code.google.com/p/unladen-swallow/source/browse/trunk/Python/llvm_notes.txt |
| [57] | http://www.python.org/psf/license/ |
| [58] | http://llvm.org/docs/DeveloperPolicy.html#clp |
| [59] | http://www.opensource.org/licenses/UoI-NCSA.php |
| [60] | http://code.google.com/p/v8/ |
| [61] | http://webkit.org/blog/214/introducing-squirrelfish-extreme/ |
| [62] | (1, 2) http://rubini.us/ |
| [63] | http://lists.parrot.org/pipermail/parrot-dev/2009-September/002811.html |
| [64] | (1, 2) http://www.macruby.org/ |
| [65] | http://en.wikipedia.org/wiki/HotSpot |
| [66] | (1, 2, 3) http://psyco.sourceforge.net/ |
| [67] | http://codespeak.net/pypy/dist/pypy/doc/ |
| [68] | http://en.wikipedia.org/wiki/Inline_expansion |
| [69] | http://en.wikipedia.org/wiki/Object_type_(object-oriented_programming%29 |
| [70] | http://code.google.com/p/unladen-swallow/issues/detail?id=86 |
| [71] | http://code.google.com/p/unladen-swallow/wiki/StyleGuide |
| [72] | http://llvm.org/docs/CodingStandards.html |
| [73] | http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml |
| [74] | http://code.google.com/p/unladen-swallow/issues/detail?id=41 |
| [75] | http://code.google.com/p/unladen-swallow/wiki/ProjectPlan#Regular_Expressions |
| [76] | http://code.google.com/p/unladen-swallow/source/browse/tests/performance/bm_regex_compile.py |
| [77] | http://code.google.com/p/unladen-swallow/source/browse/tests/performance/bm_regex_v8.py |
| [78] | http://code.google.com/p/unladen-swallow/source/browse/tests/performance/bm_regex_effbot.py |
| [79] | http://code.google.com/p/unladen-swallow/issues/detail?id=13 |
| [80] | http://www.pygame.org/ |
| [81] | http://numpy.scipy.org/ |
| [82] | http://codespeak.net:8099/plotsummary.html |
| [83] | http://llvm.org/Users.html |
| [84] | http://www.ffconsultancy.com/ocaml/hlvm/ |
| [85] | (1, 2) http://llvm.org/PR5201 |
| [86] | http://llvm.org/viewvc/llvm-project?view=rev&revision=76828 |
| [87] | http://llvm.org/viewvc/llvm-project?rev=91611&view=rev |
| [88] | http://llvm.org/viewvc/llvm-project?rev=85182&view=rev |
| [89] | http://llvm.org/PR5735 |
| [90] | http://code.google.com/p/unladen-swallow/issues/detail?id=73 |
| [91] | http://code.google.com/p/unladen-swallow/issues/detail?id=88 |
| [92] | http://code.google.com/p/unladen-swallow/issues/detail?id=67 |
| [93] | http://www.ics.uci.edu/~franz/Site/pubs-pdf/C44Prepub.pdf |
| [94] | http://www.ics.uci.edu/~franz/Site/pubs-pdf/ICS-TR-07-12.pdf |
| [95] | https://wiki.mozilla.org/JavaScript:TraceMonkey |
| [96] | http://llvm.org/docs/LangRef.html |
| [97] | http://code.google.com/p/unladen-swallow/issues/detail?id=120 |
| [98] | http://code.google.com/p/unladen-swallow/source/browse/tests/performance/bm_nbody.py |
| [99] | http://code.google.com/p/unladen-swallow/issues/detail?id=130 |
| [100] | http://code.google.com/p/unladen-swallow/issues/detail?id=131 |
| [101] | http://llvm.org/PR4816 |
| [102] | http://llvm.org/PR6065 |
| [103] | (1, 2) http://www.cython.org/ |
| [104] | http://shed-skin.blogspot.com/ |
| [105] | http://shedskin.googlecode.com/files/shedskin-tutorial-0.3.html |
| [106] | (1, 2) http://code.google.com/p/wpython/ |
| [107] | http://www.mail-archive.com/python-dev@python.org/msg45143.html |
| [108] | http://ironpython.net/ |
| [109] | http://www.mono-project.com/ |
| [110] | http://www.jython.org/ |
| [111] | http://wiki.python.org/jython/JythonFaq/GeneralInfo |
| [112] | http://code.google.com/p/pyv8/ |
| [113] | http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html |
| [114] | http://msdn.microsoft.com/en-us/library/e7k32f4k.aspx |
| [115] | http://www.mail-archive.com/python-dev@python.org/msg44962.html |
| [116] | http://portal.acm.org/citation.cfm?id=1534530.1534550 |
| [117] | http://www.stackless.com/ |
| [118] | http://mail.python.org/pipermail/python-dev/2004-June/045165.html |
| [119] | http://www.nondot.org/sabre/LLVMNotes/ExplicitlyManagedStackFrames.txt |
| [120] | http://old.nabble.com/LLVM-and-coroutines-microthreads-td23080883.html |
| [121] | http://www.mail-archive.com/python-dev@python.org/msg45544.html |
Copyright
This document has been placed in the public domain.
pep-3147 PYC Repository Directories
| PEP: | 3147 |
|---|---|
| Title: | PYC Repository Directories |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Barry Warsaw <barry at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 2009-12-16 |
| Python-Version: | 3.2 |
| Post-History: | 2010-01-30, 2010-02-25, 2010-03-03, 2010-04-12 |
| Resolution: | http://mail.python.org/pipermail/python-dev/2010-April/099414.html |
Contents
Abstract
This PEP describes an extension to Python's import mechanism which improves sharing of Python source code files among multiple installed versions of the Python interpreter. It does this by allowing more than one byte compilation file (.pyc files) to be co-located with the Python source file (.py file). The extension described here can also be used to support different Python compilation caches, such as JIT output that may be produced by an Unladen Swallow [1] enabled CPython.
Background
CPython compiles its source code into "byte code", and for performance reasons, it caches this byte code on the file system whenever the source file changes. This makes loading of Python modules much faster because the compilation phase can be bypassed. When your source file is foo.py, CPython caches the byte code in a foo.pyc file right next to the source.
Byte code files contain two 32-bit big-endian numbers followed by the marshaled [2] code object. The 32-bit numbers represent a magic number and a timestamp. The magic number changes whenever Python changes the byte code format, e.g. by adding new byte codes to its virtual machine. This ensures that pyc files built for previous versions of the VM won't cause problems. The timestamp is used to make sure that the pyc file matches the py file that was used to create it. When either the magic number or the timestamp does not match, the py file is recompiled and a new pyc file is written.
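The header check described above can be sketched with the struct module. This is an illustrative parser built from the PEP's own description (two 32-bit big-endian numbers followed by the marshaled code object); read_pyc_header and stale are hypothetical helper names, not CPython internals:

```python
import struct

def read_pyc_header(data):
    """Split a pyc file's contents into (magic, timestamp, code_bytes),
    per the layout described above: two 32-bit big-endian numbers
    followed by the marshaled code object."""
    magic, timestamp = struct.unpack('>II', data[:8])
    return magic, timestamp, data[8:]

def stale(py_mtime, header_timestamp, current_magic, header_magic):
    # Recompile when either the magic number or the timestamp disagrees.
    return header_magic != current_magic or header_timestamp != py_mtime
```

Note that CPython's actual on-disk encoding differs in detail; the point here is only the two-field header plus staleness check.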
In practice, it is well known that pyc files are not compatible across Python major releases. A reading of import.c [3] in the Python source code proves that within recent memory, every new CPython major release has bumped the pyc magic number.
Rationale
Linux distributions such as Ubuntu [4] and Debian [5] provide more than one Python version at the same time to their users. For example, Ubuntu 9.10 Karmic Koala users can install Python 2.5, 2.6, and 3.1, with Python 2.6 being the default.
This causes a conflict for third party Python source files installed by the system, because you cannot compile a single Python source file for more than one Python version at a time. When Python finds a pyc file with a non-matching magic number, it falls back to the slower process of recompiling the source. Thus if your system installed a /usr/share/python/foo.py, two different versions of Python would fight over the pyc file and rewrite it each time the source is compiled. (The standard library is unaffected by this, since multiple versions of the stdlib are installed on such distributions.)
Furthermore, in order to ease the burden on operating system packagers for these distributions, the distribution packages do not contain Python version numbers [6]; they are shared across all Python versions installed on the system. Putting Python version numbers in the packages would be a maintenance nightmare, since all the packages - and their dependencies - would have to be updated every time a new Python release was added or removed from the distribution. Because of the sheer number of packages available, this amount of work is infeasible.
(PEP 384 [7] has been proposed to address binary compatibility issues of third party extension modules across different versions of Python.)
Because these distributions cannot share pyc files, elaborate mechanisms have been developed to put the resulting pyc files in non-shared locations while the source code is still shared. Examples include the symlink-based Debian regimes python-support [8] and python-central [9]. These approaches make for much more complicated, fragile, inscrutable, and fragmented policies for delivering Python applications to a wide range of users. Arguably more users get Python from their operating system vendor than from upstream tarballs. Thus, solving this pyc sharing problem for CPython is a high priority for such vendors.
This PEP proposes a solution to this problem.
Proposal
Python's import machinery is extended to write and search for byte code cache files in a single directory inside every Python package directory. This directory will be called __pycache__.
Further, pyc file names will contain a magic string (called a "tag") that differentiates the Python version they were compiled for. This allows multiple byte compiled cache files to co-exist for a single Python source file.
The magic tag is implementation defined, but should contain the implementation name and a version number shorthand, e.g. cpython-32. It must be unique among all versions of Python, and whenever the magic number is bumped, a new magic tag must be defined. An example pyc file for Python 3.2 is thus foo.cpython-32.pyc.
The magic tag is available in the imp module via the get_tag() function. This is parallel to the imp.get_magic() function.
This scheme has the added benefit of reducing the clutter in a Python package directory.
When a Python source file is imported for the first time, a __pycache__ directory will be created in the package directory, if one does not already exist. The pyc file for the imported source will be written to the __pycache__ directory, using the magic-tag formatted name. If either the creation of the __pycache__ directory or the pyc file inside that fails, the import will still succeed, just as it does in a pre-PEP-3147 world.
If the py source file is missing, the pyc file inside __pycache__ will be ignored. This eliminates the problem of accidental stale pyc file imports.
For backward compatibility, Python will still support pyc-only distributions; however, it will only do so when the pyc file lives in the directory where the py file would have been, i.e. not in the __pycache__ directory. pyc files outside of __pycache__ will only be imported if the py source file is missing.
Tools such as py_compile [15] and compileall [16] will be extended to create PEP 3147 formatted layouts automatically, but will have an option to create pyc-only distribution layouts.
Examples
What would this look like in practice?
Let's say we have a Python package named alpha which contains a sub-package name beta. The source directory layout before byte compilation might look like this:
alpha/
__init__.py
one.py
two.py
beta/
__init__.py
three.py
four.py
After byte compiling this package with Python 3.2, you would see the following layout:
alpha/
__pycache__/
__init__.cpython-32.pyc
one.cpython-32.pyc
two.cpython-32.pyc
__init__.py
one.py
two.py
beta/
__pycache__/
__init__.cpython-32.pyc
three.cpython-32.pyc
four.cpython-32.pyc
__init__.py
three.py
four.py
Note: listing order may differ depending on the platform.
Let's say that two new versions of Python are installed, one is Python 3.3 and another is Unladen Swallow. After byte compilation, the file system would look like this:
alpha/
__pycache__/
__init__.cpython-32.pyc
__init__.cpython-33.pyc
__init__.unladen-10.pyc
one.cpython-32.pyc
one.cpython-33.pyc
one.unladen-10.pyc
two.cpython-32.pyc
two.cpython-33.pyc
two.unladen-10.pyc
__init__.py
one.py
two.py
beta/
__pycache__/
__init__.cpython-32.pyc
__init__.cpython-33.pyc
__init__.unladen-10.pyc
three.cpython-32.pyc
three.cpython-33.pyc
three.unladen-10.pyc
four.cpython-32.pyc
four.cpython-33.pyc
four.unladen-10.pyc
__init__.py
three.py
four.py
As you can see, as long as the Python version identifier string is unique, any number of pyc files can co-exist. These identifier strings are described in more detail below.
A nice property of this layout is that the __pycache__ directories can generally be ignored, such that a normal directory listing would show something like this:
alpha/
__pycache__/
__init__.py
one.py
two.py
beta/
__pycache__/
__init__.py
three.py
four.py
This is much less cluttered than even today's Python.
Python behavior
When Python searches for a module to import (say foo), it may find one of several situations. As per current Python rules, the term "matching pyc" means that the magic number matches the current interpreter's magic number, and the source file's timestamp matches the timestamp in the pyc file exactly.
Case 0: The steady state
When Python is asked to import module foo, it searches for a foo.py file (or foo package, but that's not important for this discussion) along its sys.path. If found, Python looks to see if there is a matching __pycache__/foo.<magic>.pyc file, and if so, that pyc file is loaded.
Case 1: The first import
When Python locates the foo.py, if the __pycache__/foo.<magic>.pyc file is missing, Python will create it, also creating the __pycache__ directory if necessary. Python will parse and byte compile the foo.py file and save the byte code in __pycache__/foo.<magic>.pyc.
Case 2: The second import
When Python is asked to import module foo a second time (in a different process of course), it will again search for the foo.py file along its sys.path. When Python locates the foo.py file, it looks for a matching __pycache__/foo.<magic>.pyc and finding this, it reads the byte code and continues as usual.
Case 3: __pycache__/foo.<magic>.pyc with no source
It's possible that the foo.py file somehow got removed, while leaving the cached pyc file still on the file system. If the __pycache__/foo.<magic>.pyc file exists, but the foo.py file used to create it does not, Python will raise an ImportError when asked to import foo. In other words, Python will not import a pyc file from the cache directory unless the source file exists.
Case 4: legacy pyc files and source-less imports
Python will ignore all legacy pyc files when a source file exists next to them. In other words, if a foo.pyc file exists next to a foo.py file, the pyc file will be ignored in all cases.
In order to continue to support source-less distributions though, if the source file is missing, Python will import a lone pyc file if it lives where the source file would have been.
Case 5: read-only file systems
When the source lives on a read-only file system, or the __pycache__ directory or pyc file cannot otherwise be written, all the same rules apply. This is also the case when the __pycache__ directory exists but its permissions do not allow pyc files to be written into it.
Alternative Python implementations
Alternative Python implementations such as Jython [11], IronPython [12], PyPy [13], Pynie [14], and Unladen Swallow can also use the __pycache__ directory to store whatever compilation artifacts make sense for their platforms. For example, Jython could store the class file for the module in __pycache__/foo.jython-32.class.
Implementation strategy
This feature is targeted for Python 3.2, solving the problem for that and all future versions. It may be back-ported to Python 2.7. Vendors are free to backport the changes to earlier distributions as they see fit. For backports of this feature to Python 2, when the -U flag is used, a file such as foo.cpython-27u.pyc can be written.
Effects on existing code
Adoption of this PEP will affect existing code and idioms, both inside Python and outside. This section enumerates some of these effects.
Detecting PEP 3147 availability
The easiest way to detect whether your version of Python provides PEP 3147 functionality is to do the following check:
>>> import imp
>>> has3147 = hasattr(imp, 'get_tag')
__file__
In Python 3, when you import a module, its __file__ attribute points to its source py file (in Python 2, it points to the pyc file). A package's __file__ points to the py file for its __init__.py. E.g.:
>>> import foo
>>> foo.__file__
'foo.py'
# baz is a package
>>> import baz
>>> baz.__file__
'baz/__init__.py'
Nothing in this PEP would change the semantics of __file__.
This PEP proposes the addition of a __cached__ attribute to modules, which will always point to the actual pyc file that was read or written. When the environment variable $PYTHONDONTWRITEBYTECODE is set, or the -B option is given, or if the source lives on a read-only filesystem, then the __cached__ attribute will point to the location that the pyc file would have been written to if it didn't exist. This location of course includes the __pycache__ subdirectory in its path.
For alternative Python implementations which do not support pyc files, the __cached__ attribute may point to whatever information makes sense. E.g. on Jython, this might be the .class file for the module: __pycache__/foo.jython-32.class. Some implementations may use multiple compiled files to create the module, in which case __cached__ may be a tuple. The exact contents of __cached__ are Python implementation specific.
It is recommended that when nothing sensible can be calculated, implementations should set the __cached__ attribute to None.
py_compile and compileall
Python comes with two modules, py_compile [15] and compileall [16] which support compiling Python modules external to the built-in import machinery. py_compile in particular has intimate knowledge of byte compilation, so these will be updated to understand the new layout. The -b flag is added to compileall for writing legacy .pyc byte-compiled file path names.
bdist_wininst and the Windows installer
These tools also compile modules explicitly on installation. If they do not use py_compile and compileall, then they would also have to be modified to understand the new layout.
File extension checks
There exists some code which checks for files ending in .pyc and simply chops off the last character to find the matching .py file. This code will obviously fail once this PEP is implemented.
To support this use case, we'll add two new methods to the imp package [17]:
- imp.cache_from_source(py_path) -> pyc_path
- imp.source_from_cache(pyc_path) -> py_path
Alternative implementations are free to override these functions to return reasonable values based on their own support for this PEP. These methods are allowed to return None when the implementation (or PEP 302 loader [18] in effect) for whatever reason cannot calculate the appropriate file name. They should not raise exceptions.
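A pure-Python sketch of what the two proposed functions might compute; the cpython-32 tag here is just an example value (a real interpreter would supply its own), and these definitions are illustrative, not the proposed implementation:

```python
import os

TAG = 'cpython-32'  # example tag; imp.get_tag() would supply the real one

def cache_from_source(py_path, tag=TAG):
    """alpha/one.py -> alpha/__pycache__/one.<tag>.pyc"""
    directory, base = os.path.split(py_path)
    name = os.path.splitext(base)[0]
    return os.path.join(directory, '__pycache__',
                        '{0}.{1}.pyc'.format(name, tag))

def source_from_cache(pyc_path):
    """alpha/__pycache__/one.<tag>.pyc -> alpha/one.py, or None when
    the path is not in PEP 3147 form (mirroring the rule above that
    these functions return None rather than raise)."""
    cache_dir, base = os.path.split(pyc_path)
    directory, pycache = os.path.split(cache_dir)
    parts = base.split('.')
    if pycache != '__pycache__' or len(parts) != 3:
        return None
    return os.path.join(directory, parts[0] + '.py')
```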
Backports
For versions of Python earlier than 3.2 (and possibly 2.7), it is possible to backport this PEP. However, in Python 3.2 (and possibly 2.7), this behavior will be turned on by default, and in fact, it will replace the old behavior. Backports will need to support the old layout by default. We suggest supporting PEP 3147 through the use of an environment variable called $PYTHONENABLECACHEDIR or the command line switch -Xenablecachedir to enable the feature.
Makefiles and other dependency tools
Makefiles and other tools which calculate dependencies on .pyc files (e.g. to byte-compile the source if the .pyc is missing) will have to be updated to check the new paths.
Alternatives
This section describes some alternative approaches or details that were considered and rejected during the PEP's development.
Hexadecimal magic tags
pyc files inside of the __pycache__ directories contain a magic tag in their file names. These are mnemonic tags for the actual magic numbers used by the importer. We could have used the hexadecimal representation [10] of the binary magic number as a unique identifier. For example, in Python 3.2:
>>> from binascii import hexlify
>>> from imp import get_magic
>>> 'foo.{}.pyc'.format(hexlify(get_magic()).decode('ascii'))
'foo.580c0d0a.pyc'
This isn't particularly human friendly though, thus the magic tag proposed in this PEP.
PEP 304
There is some overlap between the goals of this PEP and PEP 304 [19], which has been withdrawn. However PEP 304 would allow a user to create a shadow file system hierarchy in which to store pyc files. This concept of a shadow hierarchy for pyc files could be used to satisfy the aims of this PEP. Although PEP 304 does not indicate why it was withdrawn, shadow directories have a number of problems. The location of the shadow pyc files would not be easily discovered and would depend on the proper and consistent use of the $PYTHONBYTECODE environment variable both by the system and by end users. There are also global implications, meaning that while the system might want to shadow pyc files, users might not want to, but the PEP defines only an all-or-nothing approach.
As an example of the problem, a common (though fragile) Python idiom for locating data files is to do something like this:
from os import dirname, join
import foo.bar
data_file = join(dirname(foo.bar.__file__), 'my.dat')
This would be problematic since foo.bar.__file__ will give the location of the pyc file in the shadow directory, and it may not be possible to find the my.dat file relative to the source directory from there.
Fat byte compilation files
An earlier version of this PEP described "fat" Python byte code files. These files would contain the equivalent of multiple pyc files in a single pyf file, with a lookup table keyed off the appropriate magic number. This was an extensible file format so that the first 5 parallel Python implementations could be supported fairly efficiently, but with extension lookup tables available to scale pyf byte code objects as large as necessary.
The fat byte compilation files were fairly complex, and inherently introduced difficult race conditions, so the current simplification of using directories was suggested. The same problem applies to using zip files as the fat pyc file format.
Multiple file extensions
The PEP author also considered an approach where multiple thin byte compiled files lived in the same place, but used different file extensions to designate the Python version. E.g. foo.pyc25, foo.pyc26, foo.pyc31 etc. This was rejected because of the clutter involved in writing so many different files. The multiple extension approach makes it more difficult (and an ongoing task) to update any tools that are dependent on the file extension.
.pyc
A proposal was floated to call the __pycache__ directory .pyc or some other dot-file name. This would have the effect on *nix systems of hiding the directory. There are many reasons why this was rejected by the BDFL [20] including the fact that dot-files are only special on some platforms, and we actually do not want to hide these completely from users.
Reference implementation
Work on this code is tracked in a Bazaar branch on Launchpad [22] until it's ready for merge into Python 3.2. The work-in-progress diff can also be viewed [23] and is updated automatically as new changes are uploaded.
A Rietveld code review issue [24] has been opened as of 2010-04-01 (no, this is not an April Fools joke :).
References
| [1] | PEP 3146 |
| [2] | The marshal module: http://www.python.org/doc/current/library/marshal.html |
| [3] | import.c: http://svn.python.org/view/python/branches/py3k/Python/import.c?view=markup |
| [4] | Ubuntu: http://www.ubuntu.com |
| [5] | Debian: http://www.debian.org |
| [6] | Debian Python Policy: http://www.debian.org/doc/packaging-manuals/python-policy/ |
| [7] | PEP 384 |
| [8] | python-support: http://wiki.debian.org/DebianPythonFAQ#Whatispython-support.3F |
| [9] | python-central: http://wiki.debian.org/DebianPythonFAQ#Whatispython-central.3F |
| [10] | binascii.hexlify(): http://www.python.org/doc/current/library/binascii.html#binascii.hexlify |
| [11] | Jython: http://www.jython.org/ |
| [12] | IronPython: http://ironpython.net/ |
| [13] | PyPy: http://codespeak.net/pypy/dist/pypy/doc/ |
| [14] | Pynie: http://code.google.com/p/pynie/ |
| [15] | (1, 2) py_compile: http://docs.python.org/library/py_compile.html |
| [16] | (1, 2) compileall: http://docs.python.org/library/compileall.html |
| [17] | imp: http://www.python.org/doc/current/library/imp.html |
| [18] | PEP 302 |
| [19] | PEP 304 |
| [20] | http://www.mail-archive.com/python-dev@python.org/msg45203.html |
| [21] | importlib: http://docs.python.org/3.1/library/importlib.html |
| [22] | https://code.launchpad.net/~barry/python/pep3147 |
| [23] | https://code.launchpad.net/~barry/python/pep3147/+merge/22648 |
| [24] | http://codereview.appspot.com/842043/show |
Acknowledgments
Barry Warsaw's original idea was for fat Python byte code files. Martin von Loewis reviewed an early draft of the PEP and suggested the simplification to store traditional pyc and pyo files in a directory. Many other people reviewed early versions of this PEP and provided useful feedback including but not limited to:
- David Malcolm
- Josselin Mouette
- Matthias Klose
- Michael Hudson
- Michael Vogt
- Piotr Ożarowski
- Scott Kitterman
- Toshio Kuratomi
Copyright
This document has been placed in the public domain.
pep-3148 futures - execute computations asynchronously
| PEP: | 3148 |
|---|---|
| Title: | futures - execute computations asynchronously |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Brian Quinlan <brian at sweetapp.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 16-Oct-2009 |
| Python-Version: | 3.2 |
| Post-History: |
Contents
Abstract
This PEP proposes a design for a package that facilitates the evaluation of callables using threads and processes.
Motivation
Python currently has powerful primitives to construct multi-threaded and multi-process applications, but parallelizing simple operations requires a lot of work, i.e. explicitly launching processes/threads, constructing a work/results queue, and waiting for completion or some other termination condition (e.g. failure, timeout). It is also difficult to design an application with a global process/thread limit when each component invents its own parallel execution strategy.
Specification
Naming
The proposed package would be called "futures" and would live in a new "concurrent" top-level package. The rationale behind pushing the futures library into a "concurrent" namespace has multiple components. The first, most simple one is to prevent any and all confusion with the existing "from __future__ import x" idiom, which has been in use within Python for a long time. Additionally, it is felt that adding the "concurrent" prefix to the name fully denotes what the library is related to - namely concurrency. This should clear up any additional ambiguity, as it has been noted that not everyone in the community is familiar with Java Futures, or with the Futures term except as it relates to the US stock market.
Finally, we are carving out a new namespace for the standard library - obviously named "concurrent". We hope to either add, or move existing, concurrency-related libraries to this in the future. A prime example is the multiprocessing.Pool work, as well as other "addons" included in that module, which work across thread and process boundaries.
Interface
The proposed package provides two core classes: Executor and Future. An Executor receives asynchronous work requests (in terms of a callable and its arguments) and returns a Future to represent the execution of that work request.
Executor
Executor is an abstract class that provides methods to execute calls asynchronously.
submit(fn, *args, **kwargs)
Schedules the callable to be executed as fn(*args, **kwargs) and returns a Future instance representing the execution of the callable.
This is an abstract method and must be implemented by Executor subclasses.
map(func, *iterables, timeout=None)
Equivalent to map(func, *iterables) but func is executed asynchronously and several calls to func may be made concurrently. The returned iterator raises a TimeoutError if __next__() is called and the result isn't available after timeout seconds from the original call to map(). If timeout is not specified or None then there is no limit to the wait time. If a call raises an exception then that exception will be raised when its value is retrieved from the iterator.
shutdown(wait=True)
Signal the executor that it should free any resources that it is using when the currently pending futures are done executing. Calls to Executor.submit and Executor.map made after shutdown will raise RuntimeError.
If wait is True then this method will not return until all the pending futures are done executing and the resources associated with the executor have been freed. If wait is False then this method will return immediately and the resources associated with the executor will be freed when all pending futures are done executing. Regardless of the value of wait, the entire Python program will not exit until all pending futures are done executing.
When using an executor as a context manager, __exit__ will call Executor.shutdown(wait=True).
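The submit, map, and context-manager behaviour described above can be combined in one short example. This uses the names exactly as proposed (and as they later shipped in the standard library's concurrent.futures):

```python
from concurrent.futures import ThreadPoolExecutor

# Leaving the with-block calls Executor.shutdown(wait=True), so every
# pending future has finished by the time execution continues below.
with ThreadPoolExecutor(max_workers=2) as executor:
    future = executor.submit(pow, 2, 10)                  # one call, one Future
    squares = list(executor.map(pow, [1, 2, 3], [2, 2, 2]))

print(future.result())  # -> 1024
print(squares)          # -> [1, 4, 9]
```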
ProcessPoolExecutor
The ProcessPoolExecutor class is an Executor subclass that uses a pool of processes to execute calls asynchronously. The callable objects and arguments passed to ProcessPoolExecutor.submit must be pickleable according to the same limitations as the multiprocessing module.
Calling Executor or Future methods from within a callable submitted to a ProcessPoolExecutor will result in deadlock.
__init__(max_workers)
Executes calls asynchronously using a pool of at most max_workers processes. If max_workers is None or not given then as many worker processes will be created as the machine has processors.
ThreadPoolExecutor
The ThreadPoolExecutor class is an Executor subclass that uses a pool of threads to execute calls asynchronously.
Deadlock can occur when the callable associated with a Future waits on the results of another Future. For example:
import time

def wait_on_b():
    time.sleep(5)
    print(b.result())   # b will never complete because it is waiting on a.
    return 5

def wait_on_a():
    time.sleep(5)
    print(a.result())   # a will never complete because it is waiting on b.
    return 6

executor = ThreadPoolExecutor(max_workers=2)
a = executor.submit(wait_on_b)
b = executor.submit(wait_on_a)
And:
def wait_on_future():
    f = executor.submit(pow, 5, 2)
    # This will never complete because there is only one worker thread and
    # it is executing this function.
    print(f.result())

executor = ThreadPoolExecutor(max_workers=1)
executor.submit(wait_on_future)
__init__(max_workers)
Executes calls asynchronously using a pool of at most max_workers threads.
Future Objects
The Future class encapsulates the asynchronous execution of a callable. Future instances are returned by Executor.submit.
cancel()
Attempt to cancel the call. If the call is currently being executed then it cannot be cancelled and the method will return False, otherwise the call will be cancelled and the method will return True.
cancelled()
Return True if the call was successfully cancelled.
running()
Return True if the call is currently being executed and cannot be cancelled.
done()
Return True if the call was successfully cancelled or finished running.
result(timeout=None)
Return the value returned by the call. If the call hasn't yet completed then this method will wait up to timeout seconds. If the call hasn't completed in timeout seconds then a TimeoutError will be raised. If timeout is not specified or None then there is no limit to the wait time.
If the future is cancelled before completing then CancelledError will be raised.
If the call raised then this method will raise the same exception.
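To illustrate the distinction between result() and exception() (the failing function is a made-up example):

```python
from concurrent.futures import ThreadPoolExecutor

def fail():
    raise ValueError("boom")

with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(fail)

# exception() returns the raised exception without re-raising it,
assert isinstance(future.exception(), ValueError)

# while result() re-raises the same exception in the caller's thread.
try:
    future.result()
except ValueError as exc:
    print("caught:", exc)  # caught: boom
```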
exception(timeout=None)
Return the exception raised by the call. If the call hasn't yet completed then this method will wait up to timeout seconds. If the call hasn't completed in timeout seconds then a TimeoutError will be raised. If timeout is not specified or None then there is no limit to the wait time.
If the future is cancelled before completing then CancelledError will be raised.
If the call completed without raising then None is returned.
add_done_callback(fn)
Attaches a callable fn to the future that will be called when the future is cancelled or finishes running. fn will be called with the future as its only argument.
Added callables are called in the order that they were added and are always called in a thread belonging to the process that added them. If the callable raises an Exception then it will be logged and ignored. If the callable raises another BaseException then behavior is not defined.
If the future has already completed or been cancelled then fn will be called immediately.
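A small sketch of add_done_callback (the callback and the pow call are illustrative); note that the callback may fire either in the worker thread or, if the future is already done, immediately in the adding thread:

```python
from concurrent.futures import ThreadPoolExecutor

completed = []

def on_done(future):
    # Called with the future as its only argument once it finishes.
    completed.append(future.result())

with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(pow, 3, 2)
    future.add_done_callback(on_done)

# The with-block waits for the future, so the callback has run by now.
assert completed == [9]
```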
Internal Future Methods
The following Future methods are meant for use in unit tests and Executor implementations.
set_running_or_notify_cancel()
Should be called by Executor implementations before executing the work associated with the Future.
If the method returns False then the Future was cancelled, i.e. Future.cancel was called and returned True. Any threads waiting on the Future completing (i.e. through as_completed() or wait()) will be woken up.
If the method returns True then the Future was not cancelled and has been put in the running state, i.e. calls to Future.running() will return True.
This method can only be called once and cannot be called after Future.set_result() or Future.set_exception() have been called.
set_result(result)
Sets the result of the work associated with the Future.
set_exception(exception)
Sets the result of the work associated with the Future to the given Exception.
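To show how these internal methods fit together, here is a hypothetical DirectExecutor (the class and its name are illustrative, not part of the proposal) that runs each call synchronously inside submit():

```python
from concurrent.futures import Executor, Future

class DirectExecutor(Executor):
    """Illustrative executor that evaluates calls eagerly in submit()."""

    def submit(self, fn, *args, **kwargs):
        future = Future()
        # Returns False only if the future was cancelled before starting.
        if future.set_running_or_notify_cancel():
            try:
                future.set_result(fn(*args, **kwargs))
            except BaseException as e:
                future.set_exception(e)
        return future

f = DirectExecutor().submit(pow, 2, 8)
print(f.result())  # 256
```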
Module Functions
wait(fs, timeout=None, return_when=ALL_COMPLETED)
Wait for the Future instances (possibly created by different Executor instances) given by fs to complete. Returns a named 2-tuple of sets. The first set, named "done", contains the futures that completed (finished or were cancelled) before the wait completed. The second set, named "not_done", contains uncompleted futures.
timeout can be used to control the maximum number of seconds to wait before returning. If timeout is not specified or None then there is no limit to the wait time.
return_when indicates when the method should return. It must be one of the following constants:
| Constant | Description |
|---|---|
| FIRST_COMPLETED | The method will return when any future finishes or is cancelled. |
| FIRST_EXCEPTION | The method will return when any future finishes by raising an exception. If no future raises an exception then it is equivalent to ALL_COMPLETED. |
| ALL_COMPLETED | The method will return when all calls finish. |
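A minimal sketch of wait() with FIRST_COMPLETED (the sleep durations and two-worker pool are illustrative): it returns as soon as the quicker call finishes, leaving the slower one in the not_done set.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

with ThreadPoolExecutor(max_workers=2) as executor:
    fast = executor.submit(time.sleep, 0.01)
    slow = executor.submit(time.sleep, 0.5)

    # Returns a named 2-tuple of sets as soon as any future completes.
    done, not_done = wait([fast, slow], return_when=FIRST_COMPLETED)

print(fast in done)  # True
```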
as_completed(fs, timeout=None)
Returns an iterator over the Future instances given by fs that yields futures as they complete (finished or were cancelled). Any futures that completed before as_completed() was called will be yielded first. The returned iterator raises a TimeoutError if __next__() is called and the result isn't available after timeout seconds from the original call to as_completed(). If timeout is not specified or None then there is no limit to the wait time.
The Future instances can have been created by different Executor instances.
Check Prime Example
from concurrent import futures
import math

PRIMES = [
    112272535095293,
    112582705942171,
    112272535095293,
    115280095190773,
    115797848077099,
    1099726899285419]

def is_prime(n):
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True

def main():
    with futures.ProcessPoolExecutor() as executor:
        for number, prime in zip(PRIMES, executor.map(is_prime, PRIMES)):
            print('%d is prime: %s' % (number, prime))

if __name__ == '__main__':
    main()
Web Crawl Example
from concurrent import futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

def load_url(url, timeout):
    return urllib.request.urlopen(url, timeout=timeout).read()

def main():
    with futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_url = dict(
            (executor.submit(load_url, url, 60), url)
            for url in URLS)

        for future in futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                print('%r page is %d bytes' % (
                    url, len(future.result())))
            except Exception as e:
                print('%r generated an exception: %s' % (
                    url, e))

if __name__ == '__main__':
    main()
Rationale
The proposed design of this module was heavily influenced by the Java java.util.concurrent package [1]. The conceptual basis of the module, as in Java, is the Future class, which represents the progress and result of an asynchronous computation. The Future class makes little commitment to the evaluation mode being used, e.g. it can be used to represent lazy or eager evaluation, for evaluation using threads, processes or remote procedure calls.
Futures are created by concrete implementations of the Executor class (called ExecutorService in Java). The reference implementation provides classes that use either a process or a thread pool to eagerly evaluate computations.
Futures have already been seen in Python as part of a popular Python cookbook recipe [2] and have been discussed on the Python-3000 mailing list [3].
The proposed design is explicit, i.e. it requires that clients be aware that they are consuming Futures. It would be possible to design a module that would return proxy objects (in the style of weakref) that could be used transparently. It is possible to build a proxy implementation on top of the proposed explicit mechanism.
The proposed design does not introduce any changes to Python language syntax or semantics. Special syntax could be introduced [4] to mark function and method calls as asynchronous. A proxy result would be returned while the operation is eagerly evaluated asynchronously, and execution would only block if the proxy object were used before the operation completed.
Anh Hai Trinh proposed a simpler but more limited API concept [5] and the API has been discussed in some detail on stdlib-sig [6].
The proposed design was discussed on the Python-Dev mailing list [7]. Following those discussions, the following changes were made:
- The Executor class was made into an abstract base class
- The Future.remove_done_callback method was removed due to a lack of convincing use cases
- The Future.add_done_callback method was modified to allow the same callable to be added many times
- The Future class's mutation methods were better documented to indicate that they are private to the Executor that created them
Reference Implementation
The reference implementation [8] contains a complete implementation of the proposed design. It has been tested on Linux and Mac OS X.
References
| [1] | java.util.concurrent package documentation http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/package-summary.html |
| [2] | Python Cookbook recipe 84317, "Easy threading with Futures" http://code.activestate.com/recipes/84317/ |
| [3] | Python-3000 thread, "mechanism for handling asynchronous concurrency" http://mail.python.org/pipermail/python-3000/2006-April/000960.html |
| [4] | Python 3000 thread, "Futures in Python 3000 (was Re: mechanism for handling asynchronous concurrency)" http://mail.python.org/pipermail/python-3000/2006-April/000970.html |
| [5] | A discussion of stream, a similar concept proposed by Anh Hai Trinh http://www.mail-archive.com/stdlib-sig@python.org/msg00480.html |
| [6] | A discussion of the proposed API on stdlib-sig http://mail.python.org/pipermail/stdlib-sig/2009-November/000731.html |
| [7] | A discussion of the PEP on python-dev http://mail.python.org/pipermail/python-dev/2010-March/098169.html |
| [8] | Reference futures implementation http://code.google.com/p/pythonfutures/source/browse/#svn/branches/feedback |
Copyright
This document has been placed in the public domain.
pep-3149 ABI version tagged .so files
| PEP: | 3149 |
|---|---|
| Title: | ABI version tagged .so files |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Barry Warsaw <barry at python.org> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 2010-07-09 |
| Python-Version: | 3.2 |
| Post-History: | 2010-07-14, 2010-07-22 |
| Resolution: | http://mail.python.org/pipermail/python-dev/2010-September/103408.html |
Contents
Abstract
PEP 3147 [1] described an extension to Python's import machinery that improved the sharing of Python source code, by allowing more than one byte compilation file (.pyc) to be co-located with each source file.
This PEP defines an adjunct feature which allows the co-location of extension module files (.so) in a similar manner. This optional, build-time feature will enable downstream distributions of Python to more easily provide more than one Python major version at a time.
Background
PEP 3147 defined the file system layout for a pure-Python package, where multiple versions of Python are available on the system. For example, where the alpha package containing source modules one.py and two.py exist on a system with Python 3.2 and 3.3, the post-byte compilation file system layout would be:
alpha/
    __pycache__/
        __init__.cpython-32.pyc
        __init__.cpython-33.pyc
        one.cpython-32.pyc
        one.cpython-33.pyc
        two.cpython-32.pyc
        two.cpython-33.pyc
    __init__.py
    one.py
    two.py
For packages with extension modules, a similar differentiation is needed for the module's .so files. Extension modules compiled for different Python major versions are incompatible with each other due to changes in the ABI. Different configuration/compilation options for the same Python version can result in different ABIs (e.g. --with-wide-unicode).
While PEP 384 [2] defines a stable ABI, it will minimize, but not eliminate, extension module incompatibilities between Python builds or major versions. Thus a mechanism for discriminating extension module file names is proposed.
Rationale
Linux distributions such as Ubuntu [3] and Debian [4] provide more than one Python version at the same time to their users. For example, Ubuntu 9.10 Karmic Koala users can install Python 2.5, 2.6, and 3.1, with Python 2.6 being the default.
In order to share as much as possible between the available Python versions, these distributions install third party package modules (.pyc and .so files) into /usr/share/pyshared and symlink to them from /usr/lib/pythonX.Y/dist-packages. The symlinks exist because in a pre-PEP 3147 world (i.e. < Python 3.2), the .pyc files resulting from byte compilation by the various installed Pythons will name collide with each other. For Python versions >= 3.2, all pure-Python packages can be shared, because the .pyc files will no longer cause file system naming conflicts. Eliminating these symlinks makes for a simpler, more robust Python distribution.
A similar situation arises with shared library extensions. Because extension modules are typically named foo.so for a foo extension module, these would also name collide if foo was provided for more than one Python version.
In addition, different configuration/compilation options for the same Python version can cause different ABIs to be presented to extension modules. On POSIX systems for example, the configure options --with-pydebug, --with-pymalloc, and --with-wide-unicode all change the ABI. This PEP proposes to encode build-time options in the file name of the .so extension module files.
PyPy [5] can also benefit from this PEP, allowing it to avoid name collisions in extension modules built for its API, but with a different .so tag.
Proposal
The configure/compilation options chosen at Python interpreter build-time will be encoded in the shared library file name for extension modules. This "tag" will appear between the module base name and the operating system's file extension for shared libraries.
The following information MUST be included in the shared library file name:
- The Python implementation (e.g. cpython, pypy, jython, etc.)
- The interpreter's major and minor version numbers
These two fields are separated by a hyphen and no dots are to appear between the major and minor version numbers. E.g. cpython-32.
Python implementations MAY include additional flags in the file name tag as appropriate. For example, on POSIX systems these flags will also contribute to the file name:
- --with-pydebug (flag: d)
- --with-pymalloc (flag: m)
- --with-wide-unicode (flag: u)
By default in Python 3.2, configure enables --with-pymalloc so shared library file names would appear as foo.cpython-32m.so. When the other two flags are also enabled, the file names would be foo.cpython-32dmu.so.
The shared library file name tag is used unconditionally; it cannot be changed. The tag and extension module suffix are available through the sysconfig module via the following variables:
>>> sysconfig.get_config_var('EXT_SUFFIX')
'.cpython-32mu.so'
>>> sysconfig.get_config_var('SOABI')
'cpython-32mu'
Note that $SOABI contains just the tag, while $EXT_SUFFIX includes the platform extension for shared library files, and is the exact suffix added to the extension module name.
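As a sketch of how a tool might derive the tagged file name for a module (the base name foo is illustrative; the exact suffix value varies by interpreter build):

```python
import sysconfig

# EXT_SUFFIX includes the tag plus the platform shared library
# extension, e.g. '.cpython-32mu.so' for the build shown above.
ext_suffix = sysconfig.get_config_var('EXT_SUFFIX')

# The full extension module file name is base name + suffix.
print('foo' + ext_suffix)
```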
For an arbitrary package foo, you might see these files when the distribution package was installed:
/usr/lib/python/foo.cpython-32m.so
/usr/lib/python/foo.cpython-33m.so
(These paths are for example purposes only. Distributions are free to use whatever filesystem layout they choose, and nothing in this PEP changes the locations where from-source builds of Python are installed.)
Python's dynamic module loader will recognize and import shared library extension modules with a tag that matches its build-time options. For backward compatibility, Python will also continue to import untagged extension modules, e.g. foo.so.
This shared library tag would be used globally for all distutils-based extension modules, regardless of where on the file system they are built. Extension modules built by means other than distutils would either have to calculate the tag manually, or fallback to the non-tagged .so file name.
Proven approach
The approach described here is already proven, in a sense, on Debian and Ubuntu systems, where different extensions are used for debug builds of Python and extension modules. Debug builds on Windows also already use a different file extension for dynamic libraries, and in fact encode (in a different way than proposed in this PEP) the Python major and minor version in the .dll file name.
Windows
This PEP only addresses build issues on POSIX systems that use the configure script. While Windows or other platform support is not explicitly disallowed under this PEP, platform expertise is needed in order to evaluate, describe, and implement support on such platforms. It is not currently clear that the facilities in this PEP are even useful for Windows.
PEP 384
PEP 384 defines a stable ABI for extension modules. In theory, universal adoption of PEP 384 would eliminate the need for this PEP because all extension modules could be compatible with any Python version. In practice of course, it will be impossible to achieve universal adoption, and as described above, different build-time flags still affect the ABI. Thus even with a stable ABI, this PEP may still be necessary. While a complete specification is reserved for PEP 384, here is a discussion of the relevant issues.
PEP 384 describes a change to PyModule_Create() where 3 is passed as the API version if the extension was compiled with Py_LIMITED_API. This should be formalized into an official macro called PYTHON_ABI_VERSION to mirror PYTHON_API_VERSION. If and when the ABI changes in an incompatible way, this version number would be bumped. To facilitate sharing, Python would be extended to search for extension modules with the PYTHON_ABI_VERSION number in its name. The prefix abi is reserved for Python's use.
Thus, an initial implementation of PEP 384, when Python is configured with the default set of flags, would search for the following file names when extension module foo is imported (in this order):
foo.cpython-XYm.so
foo.abi3.so
foo.so
The distutils [6] build_ext command would also have to be extended to compile to shared library files with the abi3 tag, when the module author indicates that their extension supports that version of the ABI. This could be done in a backward compatible way by adding a keyword argument to the Extension class, such as:
Extension('foo', ['foo.c'], abi=3)
Martin v. Lรถwis describes his thoughts [7] about the applicability of this PEP to PEP 384. In summary:
- --with-pydebug would not be supported by the stable ABI because this changes the layout of PyObject, which is an exposed structure.
- --with-pymalloc has no bearing on the issue.
- --with-wide-unicode is trickier, though Martin's inclination is to force the stable ABI to use a Py_UNICODE that matches the platform's wchar_t.
Alternatives
In the initial python-dev thread [8] where this idea was first introduced, several alternatives were suggested. For completeness they are listed here, along with the reasons for not adopting them.
Independent directories or symlinks
Debian and Ubuntu could simply add a version-specific directory to sys.path that would contain just the extension modules for that version of Python. Or the symlink trick eliminated in PEP 3147 could be retained for just shared libraries. This approach is rejected because it propagates the essential complexity that PEP 3147 tries to avoid, and adds potentially several additional directories to search for all modules, even when the number of extension modules is much fewer than the total number of Python packages. For example, if builds were made available both with and without wide unicode, with and without pydebug, and with and without pymalloc, the total number of directories searched would increase substantially.
Reference implementation
Work on this code is tracked in a Bazaar branch on Launchpad [9] until it's ready for merge into Python 3.2. The work-in-progress diff can also be viewed [10] and is updated automatically as new changes are uploaded.
References
| [1] | PEP 3147 |
| [2] | PEP 384 |
| [3] | Ubuntu: <http://www.ubuntu.com> |
| [4] | Debian: <http://www.debian.org> |
| [5] | http://codespeak.net/pypy/dist/pypy/doc/ |
| [6] | http://docs.python.org/py3k/distutils/index.html |
| [7] | http://mail.python.org/pipermail/python-dev/2010-August/103330.html |
| [8] | http://mail.python.org/pipermail/python-dev/2010-June/100998.html |
| [9] | https://code.edge.launchpad.net/~barry/python/sovers |
| [10] | https://code.edge.launchpad.net/~barry/python/sovers/+merge/29411 |
Copyright
This document has been placed in the public domain.
pep-3150 Statement local namespaces (aka "given" clause)
| PEP: | 3150 |
|---|---|
| Title: | Statement local namespaces (aka "given" clause) |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Nick Coghlan <ncoghlan at gmail.com> |
| Status: | Deferred |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 2010-07-09 |
| Python-Version: | 3.4 |
| Post-History: | 2010-07-14, 2011-04-21, 2011-06-13 |
| Resolution: | TBD |
Contents
Abstract
This PEP proposes the addition of an optional given clause to several Python statements that do not currently have an associated code suite. This clause will create a statement local namespace for additional names that are accessible in the associated statement, but do not become part of the containing namespace.
Adoption of a new symbol, ?, is proposed to denote a forward reference to the namespace created by running the associated code suite. It will be a reference to a types.SimpleNamespace object.
The primary motivation is to enable a more declarative style of programming, where the operation to be performed is presented to the reader first, and the details of the necessary subcalculations are presented in the following indented suite. As a key example, this would elevate ordinary assignment statements to be on par with class and def statements where the name of the item to be defined is presented to the reader in advance of the details of how the value of that item is calculated. It also allows named functions to be used in a "multi-line lambda" fashion, where the name is used solely as a placeholder in the current expression and then defined in the following suite.
A secondary motivation is to simplify interim calculations in module and class level code without polluting the resulting namespaces.
The intent is that the relationship between a given clause and a separate function definition that performs the specified operation will be similar to the existing relationship between an explicit while loop and a generator that produces the same sequence of operations as that while loop.
The specific proposal in this PEP has been informed by various explorations of this and related concepts over the years (e.g. [1], [2], [3], [6], [8]), and is inspired to some degree by the where and let clauses in Haskell. It avoids some problems that have been identified in past proposals, but has not yet itself been subject to the test of implementation.
Proposal
This PEP proposes the addition of an optional given clause to the syntax for simple statements which may contain an expression, or may substitute for such a statement for purely syntactic purposes. The current list of simple statements that would be affected by this addition is as follows:
- expression statement
- assignment statement
- augmented assignment statement
- del statement
- return statement
- yield statement
- raise statement
- assert statement
- pass statement
The given clause would allow subexpressions to be referenced by name in the header line, with the actual definitions following in the indented clause. As a simple example:
sorted_data = sorted(data, key=?.sort_key) given:
    def sort_key(item):
        return item.attr1, item.attr2
The new symbol ? is used to refer to the given namespace. It would be a types.SimpleNamespace instance, so ?.sort_key functions as a forward reference to a name defined in the given clause.
A docstring would be permitted in the given clause, and would be attached to the result namespace as its __doc__ attribute.
The pass statement is included to provide a consistent way to skip inclusion of a meaningful expression in the header line. While this is not an intended use case, it isn't one that can be prevented as multiple alternatives (such as ... and ()) remain available even if pass itself is disallowed.
The body of the given clause will execute in a new scope, using normal function closure semantics. To support early binding of loop variables and global references, as well as to allow access to other names defined at class scope, the given clause will also allow explicit binding operations in the header line:
# Explicit early binding via given clause
seq = []
for i in range(10):
    seq.append(?.f) given i=i in:
        def f():
            return i
assert [f() for f in seq] == list(range(10))
Semantics
The following statement:
op(?.f, ?.g) given bound_a=a, bound_b=b in:
    def f():
        return bound_a + bound_b
    def g():
        return bound_a - bound_b
Would be roughly equivalent to the following code (__var denotes a hidden compiler variable or simply an entry on the interpreter stack):
__arg1 = a
__arg2 = b

def __scope(bound_a, bound_b):
    def f():
        return bound_a + bound_b
    def g():
        return bound_a - bound_b
    return types.SimpleNamespace(**locals())

__ref = __scope(__arg1, __arg2)
__ref.__doc__ = __scope.__doc__
op(__ref.f, __ref.g)
A given clause is essentially a nested function which is created and then immediately executed. Unless explicitly passed in, names are looked up using normal scoping rules, and thus names defined at class scope will not be visible. Names declared as forward references are returned and used in the header statement, without being bound locally in the surrounding namespace.
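Aside from the proposed syntax itself, the expansion above runs unchanged in current Python. A runnable sketch, with illustrative values a=10, b=3 and print standing in for op:

```python
import types

a, b = 10, 3

def __scope(bound_a, bound_b):
    def f():
        return bound_a + bound_b
    def g():
        return bound_a - bound_b
    # locals() here contains bound_a, bound_b, f and g.
    return types.SimpleNamespace(**locals())

ref = __scope(a, b)
print(ref.f(), ref.g())  # 13 7
```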
Syntax Change
Current:
expr_stmt: testlist_star_expr (augassign (yield_expr|testlist) |
           ('=' (yield_expr|testlist_star_expr))*)
del_stmt: 'del' exprlist
pass_stmt: 'pass'
return_stmt: 'return' [testlist]
yield_stmt: yield_expr
raise_stmt: 'raise' [test ['from' test]]
assert_stmt: 'assert' test [',' test]
New:
expr_stmt: testlist_star_expr (augassign (yield_expr|testlist) |
           ('=' (yield_expr|testlist_star_expr))*) [given_clause]
del_stmt: 'del' exprlist [given_clause]
pass_stmt: 'pass' [given_clause]
return_stmt: 'return' [testlist] [given_clause]
yield_stmt: yield_expr [given_clause]
raise_stmt: 'raise' [test ['from' test]] [given_clause]
assert_stmt: 'assert' test [',' test] [given_clause]
given_clause: "given" [(NAME '=' test)+ "in"] ":" suite
(Note that expr_stmt in the grammar is a slight misnomer, as it covers assignment and augmented assignment in addition to simple expression statements)
Note
These proposed grammar changes don't yet cover the forward reference expression syntax for accessing names defined in the statement local namespace.
The new clause is added as an optional element of the existing statements rather than as a new kind of compound statement in order to avoid creating an ambiguity in the grammar. It is applied only to the specific elements listed so that nonsense like the following is disallowed:
break given:
    a = b = 1

import sys given:
    a = b = 1
However, the precise Grammar change described above is inadequate, as it creates problems for the definition of simple_stmt (which allows chaining of multiple single line statements with ";" rather than "\n").
So the above syntax change should instead be taken as a statement of intent. Any actual proposal would need to resolve the simple_stmt parsing problem before it could be seriously considered. This would likely require a non-trivial restructuring of the grammar, breaking up small_stmt and flow_stmt to separate the statements that potentially contain arbitrary subexpressions and then allowing a single one of those statements with a given clause at the simple_stmt level. Something along the lines of:
stmt: simple_stmt | given_stmt | compound_stmt
simple_stmt: small_stmt (';' (small_stmt | subexpr_stmt))* [';'] NEWLINE
small_stmt: (pass_stmt | flow_stmt | import_stmt |
             global_stmt | nonlocal_stmt)
flow_stmt: break_stmt | continue_stmt
given_stmt: subexpr_stmt (given_clause |
            (';' (small_stmt | subexpr_stmt))* [';']) NEWLINE
subexpr_stmt: expr_stmt | del_stmt | flow_subexpr_stmt | assert_stmt
flow_subexpr_stmt: return_stmt | raise_stmt | yield_stmt
given_clause: "given" (NAME '=' test)* ":" suite
For reference, here are the current definitions at that level:
stmt: simple_stmt | compound_stmt
simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE
small_stmt: (expr_stmt | del_stmt | pass_stmt | flow_stmt |
             import_stmt | global_stmt | nonlocal_stmt | assert_stmt)
flow_stmt: break_stmt | continue_stmt | return_stmt | raise_stmt | yield_stmt
In addition to the above changes, the definition of atom would be changed to also allow ?. The restriction of this usage to statements with an associated given clause would be handled by a later stage of the compilation process (likely AST construction, which already enforces other restrictions where the grammar is overly permissive in order to simplify the initial parsing step).
New PEP 8 Guidelines
As discussed on python-ideas ([7], [9]) new PEP 8 guidelines would also need to be developed to provide appropriate direction on when to use the given clause over ordinary variable assignments.
Based on the similar guidelines already present for try statements, this PEP proposes the following additions for given statements to the "Programming Conventions" section of PEP 8:
- for code that could reasonably be factored out into a separate function, but is not currently reused anywhere, consider using a given clause. This clearly indicates which variables are being used only to define subcomponents of another statement rather than to hold algorithm or application state. This is an especially useful technique when passing multi-line functions to operations which take callable arguments.
- keep given clauses concise. If they become unwieldy, either break them up into multiple steps or else move the details into a separate function.
Rationale
Function and class statements in Python have a unique property relative to ordinary assignment statements: to some degree, they are declarative. They present the reader of the code with some critical information about a name that is about to be defined, before proceeding on with the details of the actual definition in the function or class body.
The name of the object being declared is the first thing stated after the keyword. Other important information is also given the honour of preceding the implementation details:
- decorators (which can greatly affect the behaviour of the created object, and were placed ahead of even the keyword and name as a matter of practicality more so than aesthetics)
- the docstring (on the first line immediately following the header line)
- parameters, default values and annotations for function definitions
- parent classes, metaclass and optionally other details (depending on the metaclass) for class definitions
This PEP proposes to make a similar declarative style available for arbitrary assignment operations, by permitting the inclusion of a "given" suite following any simple assignment statement:
TARGET = [TARGET2 = ... TARGETN =] EXPR given:
    SUITE
By convention, code in the body of the suite should be oriented solely towards correctly defining the assignment operation carried out in the header line. The header line operation should also be adequately descriptive (e.g. through appropriate choices of variable names) to give a reader a reasonable idea of the purpose of the operation without reading the body of the suite.
However, while they are the initial motivating use case, limiting this feature solely to simple assignments would be overly restrictive. Once the feature is defined at all, it would be quite arbitrary to prevent its use for augmented assignments, return statements, yield expressions, comprehensions and arbitrary expressions that may modify the application state.
The given clause may also function as a more readable alternative to some uses of lambda expressions and similar constructs when passing one-off functions to operations like sorted() or in callback based event-driven programming.
In module and class level code, the given clause will serve as a clear and reliable replacement for usage of the del statement to keep interim working variables from polluting the resulting namespace.
One potentially useful way to think of the proposed clause is as a middle ground between conventional in-line code and separation of an operation out into a dedicated function, just as an inline while loop may eventually be factored out into a dedicated generator.
Design Discussion
Keyword Choice
This proposal initially used where based on the name of a similar construct in Haskell. However, it has been pointed out that there are existing Python libraries (such as Numpy [4]) that already use where in the SQL query condition sense, making that keyword choice potentially confusing.
While given may also be used as a variable name (and hence would be deprecated using the usual __future__ dance for introducing new keywords), it is associated much more strongly with the desired "here are some extra variables this expression may use" semantics for the new clause.
Reusing the with keyword has also been proposed. This has the advantage of avoiding the addition of a new keyword, but also has a high potential for confusion as the with clause and with statement would look similar but do completely different things. That way lies C++ and Perl :)
Relation to PEP 403
PEP 403 (General Purpose Decorator Clause) attempts to achieve the main goals of this PEP using a less radical language change inspired by the existing decorator syntax.
Despite having the same author, the two PEPs are in direct competition with each other. PEP 403 represents a minimalist approach that attempts to achieve useful functionality with a minimum of change from the status quo. This PEP instead aims for a more flexible standalone statement design, which requires a larger degree of change to the language.
Note that where PEP 403 is better suited to explaining the behaviour of generator expressions correctly, this PEP is better able to explain the behaviour of decorator clauses in general. Both PEPs support adequate explanations for the semantics of container comprehensions.
Explaining Container Comprehensions and Generator Expressions
One interesting feature of the proposed construct is that it can be used as a primitive to explain the scoping and execution order semantics of container comprehensions:
seq2 = [x for y in seq if p(y) for x in y if q(x)]

# would be equivalent to

seq2 = ?.result given seq=seq:
    result = []
    for y in seq:
        if p(y):
            for x in y:
                if q(x):
                    result.append(x)
The important point in this expansion is that it explains why comprehensions appear to misbehave at class scope: only the outermost iterator is evaluated at class scope, while all predicates, nested iterators and value expressions are evaluated inside a nested scope.
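That class-scope behaviour can be observed directly in current Python. The following is a minimal sketch (the class `C` and its attributes are invented for illustration): the outermost iterable is evaluated at class scope and works, while an inner loop referencing the same class-level name fails, because the rest of the comprehension runs in a nested scope that skips the class namespace.

```python
def class_scope_demo():
    """Show that only the outermost iterable sees class-level names."""
    try:
        class C:
            values = [1, 2, 3]
            # The first "for v in values" is evaluated at class scope,
            # but the inner "for w in values" runs inside the
            # comprehension's hidden function scope, which cannot see
            # class attributes, so class creation raises NameError.
            pairs = [(v, w) for v in values for w in values]
    except NameError as e:
        return str(e)
    return "no error"
```

Calling `class_scope_demo()` returns the NameError message complaining that `values` is not defined, exactly the "misbehaviour" the expansion above explains.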
Note that, unlike PEP 403, the current version of this PEP cannot provide a precisely equivalent expansion for a generator expression. The closest it can get is to define an additional level of scoping:
seq2 = ?.g(seq) given:
    def g(seq):
        for y in seq:
            if p(y):
                for x in y:
                    if q(x):
                        yield x
This limitation could be remedied by permitting the given clause to be a generator function, in which case ? would refer to a generator-iterator object rather than a simple namespace:
seq2 = ? given seq=seq in:
    for y in seq:
        if p(y):
            for x in y:
                if q(x):
                    yield x
However, this would make the meaning of "?" quite ambiguous, even more so than is already the case for the meaning of def statements (which will usually have a docstring indicating whether or not a function definition is actually a generator).
Explaining Decorator Clause Evaluation and Application
The standard explanation of decorator clause evaluation and application has to deal with the idea of hidden compiler variables in order to show steps in their order of execution. The given statement allows a decorated function definition like:
@classmethod
def classname(cls):
    return cls.__name__
To instead be explained as roughly equivalent to:
classname = ?.d1(?.classname) given:
    d1 = classmethod
    def classname(cls):
        return cls.__name__
Anticipated Objections
Two Ways To Do It
A lot of code may now be written with values defined either before the expression where they are used or afterwards in a given clause, creating two ways to do it, perhaps without an obvious way of choosing between them.
On reflection, I feel this is a misapplication of the "one obvious way" aphorism. Python already offers lots of ways to write code. We can use a for loop or a while loop, a functional style or an imperative style or an object oriented style. The language, in general, is designed to let people write code that matches the way they think. Since different people think differently, the way they write their code will change accordingly.
Such stylistic questions in a code base are rightly left to the development group responsible for that code. When does an expression get so complicated that the subexpressions should be taken out and assigned to variables, even though those variables are only going to be used once? When should an inline while loop be replaced with a generator that implements the same logic? Opinions differ, and that's OK.
However, explicit PEP 8 guidance will be needed for CPython and the standard library, and that is discussed in the proposal above.
Out of Order Execution
The given clause makes execution jump around a little strangely, as the body of the given clause is executed before the simple statement in the clause header. The closest any other part of Python comes to this is the out of order evaluation in list comprehensions, generator expressions and conditional expressions and the delayed application of decorator functions to the function they decorate (the decorator expressions themselves are executed in the order they are written).
While this is true, the syntax is intended for cases where people are themselves thinking about a problem out of sequence (at least as far as the language is concerned). As an example of this, consider the following thought in the mind of a Python user:
I want to sort the items in this sequence according to the values of attr1 and attr2 on each item.
If they're comfortable with Python's lambda expressions, then they might choose to write it like this:
sorted_list = sorted(original, key=(lambda v: (v.attr1, v.attr2)))
That gets the job done, but it hardly reaches the standard of executable pseudocode that fits Python's reputation.
If they don't like lambda specifically, the operator module offers an alternative that still allows the key function to be defined inline:
sorted_list = sorted(original,
                     key=operator.attrgetter('attr1', 'attr2'))

Again, it gets the job done, but even the most generous of readers would not consider that to be "executable pseudocode".
If they think both of the above options are ugly and confusing, or they need logic in their key function that can't be expressed as an expression (such as catching an exception), then Python currently forces them to reverse the order of their original thought and define the sorting criteria first:
def sort_key(item):
    return item.attr1, item.attr2

sorted_list = sorted(original, key=sort_key)
"Just define a function" has been the rote response to requests for multi-line lambda support for years. As with the above options, it gets the job done, but it really does represent a break between what the user is thinking and what the language allows them to express.
I believe the proposal in this PEP would finally let Python get close to the "executable pseudocode" bar for the kind of thought expressed above:
sorted_list = sorted(original, key=?.key) given:
    def key(item):
        return item.attr1, item.attr2
Everything is in the same order as it was in the user's original thought, and they don't even need to come up with a name for the sorting criteria: it is possible to reuse the keyword argument name directly.
A possible enhancement to this proposal would be to provide a convenient shorthand syntax to say "use the given clause contents as keyword arguments". Even without dedicated syntax, that can be written simply as **vars(?).
Harmful to Introspection
Poking around in module and class internals is an invaluable tool for white-box testing and interactive debugging. The given clause will be quite effective at preventing access to temporary state used during calculations (although no more so than current usage of del statements in that regard).
While this is a valid concern, design for testability is an issue that cuts across many aspects of programming. If a component needs to be tested independently, then a given statement should be refactored into separate statements so that information is exposed to the test suite. This isn't significantly different from refactoring an operation hidden inside a function or generator out into its own function purely to allow it to be tested in isolation.
Lack of Real World Impact Assessment
The examples in the current PEP are almost all relatively small "toy" examples. The proposal in this PEP needs to be subjected to the test of application to a large code base (such as the standard library or a large Twisted application) in a search for examples where the readability of real world code is genuinely enhanced.
This is more of a deficiency in the PEP rather than the idea, though. If it wasn't a real world problem, we wouldn't get so many complaints about the lack of multi-line lambda support and Ruby's block construct probably wouldn't be quite so popular.
Open Questions
Syntax for Forward References
The ? symbol is proposed for forward references to the given namespace as it is short, currently unused and suggests "there's something missing here that will be filled in later".
The proposal in the PEP doesn't neatly parallel any existing Python feature, so reusing an already used symbol has been deliberately avoided.
Handling of nonlocal and global
nonlocal and global are explicitly disallowed in the given clause suite and will be syntax errors if they occur. They will work normally if they appear within a def statement within that suite.
Alternatively, they could be defined as operating as if the anonymous functions were defined as in the expansion above.
Handling of break and continue
break and continue will operate as if the anonymous functions were defined as in the expansion above. They will be syntax errors if they occur in the given clause suite but will work normally if they appear within a for or while loop as part of that suite.
Handling of return and yield
return and yield are explicitly disallowed in the given clause suite and will be syntax errors if they occur. They will work normally if they appear within a def statement within that suite.
Examples
Defining callbacks for event driven programming:
# Current Python (definition before use)
def cb(sock):
    # Do something with socket

def eb(exc):
    logging.exception(
        "Failed connecting to %s:%s", host, port)

loop.create_connection((host, port), cb, eb)

# Becomes:
loop.create_connection((host, port), ?.cb, ?.eb) given:
    def cb(sock):
        # Do something with socket
    def eb(exc):
        logging.exception(
            "Failed connecting to %s:%s", host, port)
Defining "one-off" classes which typically only have a single instance:
# Current Python (instantiation after definition)
class public_name():
    ... # However many lines

public_name = public_name(*params)

# Current Python (custom decorator)
def singleton(*args, **kwds):
    def decorator(cls):
        return cls(*args, **kwds)
    return decorator

@singleton(*params)
class public_name():
    ... # However many lines

# Becomes:
public_name = ?.MeaningfulClassName(*params) given:
    class MeaningfulClassName():
        ... # Should trawl the stdlib for an example of doing this
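The custom decorator variant above already works in current Python. A minimal runnable sketch (the `config` class and its `retries` parameter are invented for illustration):

```python
def singleton(*args, **kwds):
    def decorator(cls):
        # Replace the class itself with a single instance of it
        return cls(*args, **kwds)
    return decorator

@singleton(3)
class config:
    def __init__(self, retries):
        self.retries = retries

# "config" is now bound to the one instance, not the class
assert not isinstance(config, type)
assert config.retries == 3
```

The cost, as noted, is that the class itself becomes inaccessible under its public name once the decorator has run.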
Calculating attributes without polluting the local namespace (from os.py):
# Current Python (manual namespace cleanup)
def _createenviron():
    ... # 27 line function

environ = _createenviron()
del _createenviron

# Becomes:
environ = ?._createenviron() given:
    def _createenviron():
        ... # 27 line function
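The manual cleanup idiom is valid current Python at module level. A minimal sketch with a hypothetical helper, showing the pattern the given clause would replace:

```python
def _build_table():
    # Interim helper used only to initialise TABLE; it serves no
    # further purpose once the module-level name has been bound
    return {i: i * i for i in range(5)}

TABLE = _build_table()
del _build_table  # keep the helper out of the resulting namespace
```

The drawback is the same one the PEP identifies: the reader only learns at the del statement that `_build_table` was a throwaway definition.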
Replacing default argument hack (from functools.lru_cache):
# Current Python (default argument hack)
def decorating_function(user_function,
                        tuple=tuple, sorted=sorted, len=len, KeyError=KeyError):
    ... # 60 line function
return decorating_function

# Becomes:
return ?.decorating_function given:
    # Cell variables rather than locals, but should give similar speedup
    tuple, sorted, len, KeyError = tuple, sorted, len, KeyError
    def decorating_function(user_function):
        ... # 60 line function
    # This example also nicely makes it clear that there is nothing in the
    # function after the nested function definition. Due to additional
    # nested functions, that isn't entirely clear in the current code.
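The default argument hack in the first half of this example is plain current Python. A minimal runnable sketch of why it is used (function names invented): the pre-bound names become fast local lookups instead of per-call global/builtin lookups, at the cost of polluting the function's signature.

```python
def sum_squares(n):
    # Ordinary version: "sum" and "range" are looked up as builtins
    # on every call
    return sum(i * i for i in range(n))

def sum_squares_fast(n, sum=sum, range=range):
    # Default argument hack: the builtins are pre-bound as defaults,
    # so inside the function they are fast local-variable accesses
    return sum(i * i for i in range(n))
```

Both functions compute the same result; the hack only changes lookup speed, which is why the given clause variant talks about achieving a "similar speedup" through cell variables without the signature pollution.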
Possible Additions
- The current proposal allows the addition of a given clause only for simple statements. Extending the idea to allow the use of compound statements would be quite possible (by appending the given clause as an independent suite at the end), but doing so raises serious readability concerns (as values defined in the given clause may be used well before they are defined, exactly the kind of readability trap that other features like decorators and with statements are designed to eliminate)
- The "explicit early binding" variant may be applicable to the discussions on python-ideas on how to eliminate the default argument hack. A given clause in the header line for functions (after the return type annotation) may be the answer to that question.
Rejected Alternatives
- An earlier version of this PEP allowed implicit forward references to the names in the trailing suite, and also used implicit early binding semantics. Both of these ideas substantially complicated the proposal without providing a sufficient increase in expressive power. The current proposal with explicit forward references and early binding brings the new construct into line with existing scoping semantics, greatly improving the chances the idea can actually be implemented.
- In addition to the proposals made here, there have also been suggestions of two suite "in-order" variants which provide the limited scoping of names without supporting out-of-order execution. I believe these suggestions largely miss the point of what people are complaining about when they ask for multi-line lambda support - it isn't that coming up with a name for the subexpression is especially difficult, it's that naming the function before the statement that uses it means the code no longer matches the way the developer thinks about the problem at hand.
- I've made some unpublished attempts to allow direct references to the closure implicitly created by the given clause, while still retaining the general structure of the syntax as defined in this PEP (for example, allowing a subexpression like ?given or :given to be used in expressions to indicate a direct reference to the implied closure, thus preventing it from being called automatically to create the local namespace). All such attempts have appeared unattractive and confusing compared to the simpler decorator-inspired proposal in PEP 403.
Reference Implementation
None as yet. If you want a crash course in Python namespace semantics and code compilation, feel free to try ;)
TO-DO
- Mention PEP 359 and possible uses for locals() in the given clause
- Figure out if this can be used internally to make the implementation of zero-argument super() calls less awful
References
| [1] | Explicitation lines in Python: http://mail.python.org/pipermail/python-ideas/2010-June/007476.html |
| [2] | 'where' statement in Python: http://mail.python.org/pipermail/python-ideas/2010-July/007584.html |
| [3] | Where-statement (Proposal for function expressions): http://mail.python.org/pipermail/python-ideas/2009-July/005132.html |
| [4] | Name conflict with NumPy for 'where' keyword choice: http://mail.python.org/pipermail/python-ideas/2010-July/007596.html |
| [5] | The "Status quo wins a stalemate" design principle: http://www.boredomandlaziness.org/2011/02/status-quo-wins-stalemate.html |
| [6] | Assignments in list/generator expressions: http://mail.python.org/pipermail/python-ideas/2011-April/009863.html |
| [7] | Possible PEP 3150 style guidelines (#1): http://mail.python.org/pipermail/python-ideas/2011-April/009869.html |
| [8] | Discussion of PEP 403 (statement local function definition): http://mail.python.org/pipermail/python-ideas/2011-October/012276.html |
| [9] | Possible PEP 3150 style guidelines (#2): http://mail.python.org/pipermail/python-ideas/2011-October/012341.html |
| [10] | Multi-line lambdas (again!) http://mail.python.org/pipermail/python-ideas/2013-August/022526.html |
Copyright
This document has been placed in the public domain.
pep-3151 Reworking the OS and IO exception hierarchy
| PEP: | 3151 |
|---|---|
| Title: | Reworking the OS and IO exception hierarchy |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Antoine Pitrou <solipsis at pitrou.net> |
| BDFL-Delegate: | Barry Warsaw |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 2010-07-21 |
| Python-Version: | 3.3 |
| Post-History: | |
| Resolution: | http://mail.python.org/pipermail/python-dev/2011-October/114033.html |
Contents
- Abstract
- Rationale
- Compatibility strategy
- Step 1: coalesce exception types
- Step 2: define additional subclasses
- Possible objections
- Earlier discussion
- Implementation
- Possible alternative
- Exceptions ignored by this PEP
- Appendix A: Survey of common errnos
- Appendix B: Survey of raised OS and IO errors
- Acknowledgments
- References
- Copyright
Abstract
The standard exception hierarchy is an important part of the Python language. It has two defining qualities: it is both generic and selective. Generic in that the same exception type can be raised - and handled - regardless of the context (for example, whether you are trying to add something to an integer, to call a string method, or to write an object on a socket, a TypeError will be raised for bad argument types). Selective in that it allows the user to easily handle (silence, examine, process, store or encapsulate...) specific kinds of error conditions while letting other errors bubble up to higher calling contexts. For example, you can choose to catch ZeroDivisionErrors without affecting the default handling of other ArithmeticErrors (such as OverflowErrors).
This PEP proposes changes to a part of the exception hierarchy in order to better embody the qualities mentioned above: the errors related to operating system calls (OSError, IOError, mmap.error, select.error, and all their subclasses).
Rationale
Lack of fine-grained exceptions
The current variety of OS-related exceptions doesn't allow the user to filter easily for the desired kinds of failures. As an example, consider the task of deleting a file if it exists. The Look Before You Leap (LBYL) idiom suffers from an obvious race condition:
if os.path.exists(filename):
    os.remove(filename)
If a file named as filename is created by another thread or process between the calls to os.path.exists and os.remove, it won't be deleted. This can produce bugs in the application, or even security issues.
Therefore, the solution is to try to remove the file, and ignore the error if the file doesn't exist (an idiom known as Easier to Ask Forgiveness than to get Permission, or EAFP). Careful code will read like the following (which works under both POSIX and Windows systems):
try:
    os.remove(filename)
except OSError as e:
    if e.errno != errno.ENOENT:
        raise
or even:
try:
    os.remove(filename)
except EnvironmentError as e:
    if e.errno != errno.ENOENT:
        raise
This is a lot more to type, and also forces the user to remember the various cryptic mnemonics from the errno module. It imposes an additional cognitive burden and gets tiresome rather quickly. Consequently, many programmers will instead write the following code, which silences exceptions too broadly:
try:
    os.remove(filename)
except OSError:
    pass
os.remove can raise an OSError not only when the file doesn't exist, but in other possible situations (for example, the filename points to a directory, or the current process doesn't have permission to remove the file), which all indicate bugs in the application logic and therefore shouldn't be silenced. What the programmer would like to write instead is something such as:
try:
    os.remove(filename)
except FileNotFoundError:
    pass
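On an interpreter where this PEP is implemented (CPython 3.3 and later), that pattern can be exercised directly. A sketch using a temporary file (the helper name `remove_if_exists` is invented for the example):

```python
import os
import tempfile

def remove_if_exists(path):
    """EAFP removal: silence only the 'file is missing' case."""
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        # Permission errors, directories, etc. still propagate
        return False

fd, path = tempfile.mkstemp()
os.close(fd)
assert remove_if_exists(path) is True   # file existed and was removed
assert remove_if_exists(path) is False  # already gone, quietly ignored
```

Unlike the bare `except OSError: pass` version, a PermissionError or IsADirectoryError would still surface as a bug.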
Compatibility strategy
Reworking the exception hierarchy will obviously change the exact semantics of at least some existing code. While it is not possible to improve on the current situation without changing exact semantics, it is possible to define a narrower type of compatibility, which we will call useful compatibility.
For this we first must explain what we will call careful and careless exception handling. Careless (or "naïve") code is defined as code which blindly catches any of OSError, IOError, socket.error, mmap.error, WindowsError, select.error without checking the errno attribute. This is because such exception types are much too broad to signify anything. Any of them can be raised for error conditions as diverse as: a bad file descriptor (which will usually indicate a programming error), an unconnected socket (ditto), a socket timeout, a file type mismatch, an invalid argument, a transmission failure, insufficient permissions, a non-existent directory, a full filesystem, etc.
(moreover, the use of certain of these exceptions is irregular; Appendix B exposes the case of the select module, which raises different exceptions depending on the implementation)
Careful code is defined as code which, when catching any of the above exceptions, examines the errno attribute to determine the actual error condition and takes action depending on it.
Then we can define useful compatibility as follows:
- useful compatibility doesn't make exception catching any narrower, but it can be broader for careless exception-catching code. Given the following kind of snippet, all exceptions caught before this PEP will also be caught after this PEP, but the reverse may be false (because the coalescing of OSError, IOError and others means the except clause throws a slightly broader net):

  try:
      ...
      os.remove(filename)
      ...
  except OSError:
      pass

- useful compatibility doesn't alter the behaviour of careful exception-catching code. Given the following kind of snippet, the same errors should be silenced or re-raised, regardless of whether this PEP has been implemented or not:

  try:
      os.remove(filename)
  except OSError as e:
      if e.errno != errno.ENOENT:
          raise
The rationale for this compromise is that careless code can't really be helped, but at least code which "works" won't suddenly raise errors and crash. This is important since such code is likely to be present in scripts used as cron tasks or automated system administration programs.
Careful code, on the other hand, should not be penalized. Actually, one purpose of this PEP is to ease writing careful code.
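With the PEP implemented (Python 3.3+), this can be checked concretely: careful errno-inspecting code behaves exactly as before, while the raised object is now a finer-grained subclass.

```python
import errno
import os

try:
    os.remove("/no/such/dir/definitely-missing-file")
except OSError as e:
    # Careful pre-PEP code keeps working: errno is still populated...
    assert e.errno == errno.ENOENT
    # ...while the actual type raised is the new, more selective subclass
    assert isinstance(e, FileNotFoundError)
```

The same `except OSError` clause matches in both worlds, so neither careless nor careful code suddenly starts crashing.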
Step 1: coalesce exception types
The first step of the resolution is to coalesce existing exception types. The following changes are proposed:
- alias both socket.error and select.error to OSError
- alias mmap.error to OSError
- alias both WindowsError and VMSError to OSError
- alias IOError to OSError
- coalesce EnvironmentError into OSError
Each of these changes doesn't preserve exact compatibility, but it does preserve useful compatibility (see "compatibility" section above).
Each of these changes can be accepted or refused individually, but of course it is considered that the greatest impact can be achieved if this first step is accepted in full. In this case, the IO exception sub-hierarchy would become:
+-- OSError (replacing IOError, WindowsError, EnvironmentError, etc.)
    +-- io.BlockingIOError
    +-- io.UnsupportedOperation (also inherits from ValueError)
    +-- socket.gaierror
    +-- socket.herror
    +-- socket.timeout
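On CPython 3.3 and later, where this step was accepted in full, the coalescing is observable as plain identity between the old names and OSError:

```python
import mmap
import select
import socket

# The old exception names survive, but only as aliases of OSError
assert IOError is OSError
assert EnvironmentError is OSError
assert socket.error is OSError
assert select.error is OSError
assert mmap.error is OSError
```

Because they are the same object, any existing `except IOError` clause now catches exactly what `except OSError` catches.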
Justification
Not only does this first step present the user a simpler landscape as explained in the rationale section, but it also allows for a better and more complete resolution of Step 2 (see Prerequisite).
The rationale for keeping OSError as the official name for generic OS-related exceptions is that it, precisely, is more generic than IOError. EnvironmentError is more tedious to type and also much lesser-known.
The survey in Appendix B shows that IOError is the dominant error today in the standard library. As for third-party Python code, Google Code Search shows IOError being ten times more popular than EnvironmentError in user code, and three times more popular than OSError [3]. However, with no intention to deprecate IOError in the middle term, the lesser popularity of OSError is not a problem.
Exception attributes
Since WindowsError is coalesced into OSError, the latter gains a winerror attribute under Windows. It is set to None under situations where it is not meaningful, as is already the case with the errno, filename and strerror attributes (for example when OSError is raised directly by Python code).
Deprecation of names
The following paragraphs outline a possible deprecation strategy for old exception names. However, it has been decided to keep them as aliases for the time being. This decision could be revised in time for Python 4.0.
built-in exceptions
Deprecating the old built-in exceptions cannot be done in a straightforward fashion by intercepting all lookups in the builtins namespace, since these are performance-critical. We also cannot work at the object level, since the deprecated names will be aliased to non-deprecated objects.
A solution is to recognize these names at compilation time, and then emit a separate LOAD_OLD_GLOBAL opcode instead of the regular LOAD_GLOBAL. This specialized opcode will handle the output of a DeprecationWarning (or PendingDeprecationWarning, depending on the policy decided upon) when the name doesn't exist in the globals namespace, but only in the builtins one. This will be enough to avoid false positives (for example if someone defines their own OSError in a module), and false negatives will be rare (for example when someone accesses OSError through the builtins module rather than directly).
module-level exceptions
The above approach cannot be used easily, since it would require special-casing some modules when compiling code objects. However, these names are by construction much less visible (they don't appear in the builtins namespace), and lesser-known too, so we might decide to let them live in their own namespaces.
Step 2: define additional subclasses
The second step of the resolution is to extend the hierarchy by defining subclasses which will be raised, rather than their parent, for specific errno values. Which errno values is subject to discussion, but a survey of existing exception matching practices (see Appendix A) helps us propose a reasonable subset of all values. Trying to map all errno mnemonics, indeed, seems foolish, pointless, and would pollute the root namespace.
Furthermore, in a couple of cases, different errno values could raise the same exception subclass. For example, EAGAIN, EALREADY, EWOULDBLOCK and EINPROGRESS are all used to signal that an operation on a non-blocking socket would block (and therefore needs trying again later). They could therefore all raise an identical subclass and let the user examine the errno attribute if (s)he so desires (see below "exception attributes").
Prerequisite
Step 1 is a loose prerequisite for this.
Prerequisite, because some errnos can currently be attached to different exception classes: for example, ENOENT can be attached to both OSError and IOError, depending on the context. If we don't want to break useful compatibility, we can't make an except OSError (or IOError) fail to match an exception where it would succeed today.
Loose, because we could decide for a partial resolution of step 2 if existing exception classes are not coalesced: for example, ENOENT could raise a hypothetical FileNotFoundError where an IOError was previously raised, but continue to raise OSError otherwise.
The dependency on step 1 could be totally removed if the new subclasses used multiple inheritance to match with all of the existing superclasses (or, at least, OSError and IOError, which are arguably the most prevalent ones). It would, however, make the hierarchy more complicated and therefore harder to grasp for the user.
New exception classes
The following tentative list of subclasses, along with a description and the list of errnos mapped to them, is submitted to discussion:
- FileExistsError: trying to create a file or directory which already exists (EEXIST)
- FileNotFoundError: for all circumstances where a file or directory is requested but doesn't exist (ENOENT)
- IsADirectoryError: file-level operation (open(), os.remove()...) requested on a directory (EISDIR)
- NotADirectoryError: directory-level operation requested on something else (ENOTDIR)
- PermissionError: trying to run an operation without the adequate access rights - for example filesystem permissions (EACCES, EPERM)
- BlockingIOError: an operation would block on an object (e.g. socket) set for non-blocking operation (EAGAIN, EALREADY, EWOULDBLOCK, EINPROGRESS); this is the existing io.BlockingIOError with an extended role
- BrokenPipeError: trying to write on a pipe while the other end has been closed, or trying to write on a socket which has been shutdown for writing (EPIPE, ESHUTDOWN)
- InterruptedError: a system call was interrupted by an incoming signal (EINTR)
- ConnectionAbortedError: connection attempt aborted by peer (ECONNABORTED)
- ConnectionRefusedError: connection attempt refused by peer (ECONNREFUSED)
- ConnectionResetError: connection reset by peer (ECONNRESET)
- TimeoutError: connection timed out (ETIMEDOUT); this can be re-cast as a generic timeout exception, replacing socket.timeout and also useful for other types of timeout (for example in Lock.acquire())
- ChildProcessError: operation on a child process failed (ECHILD); this is raised mainly by the wait() family of functions.
- ProcessLookupError: the given process (as identified by, e.g., its process id) doesn't exist (ESRCH).
In addition, the following exception class is proposed for inclusion:
- ConnectionError: a base class for ConnectionAbortedError, ConnectionRefusedError and ConnectionResetError
The following drawing tries to sum up the proposed additions, along with the corresponding errno values (where applicable). The root of the sub-hierarchy (OSError, assuming Step 1 is accepted in full) is not shown:
+-- BlockingIOError            EAGAIN, EALREADY, EWOULDBLOCK, EINPROGRESS
+-- ChildProcessError          ECHILD
+-- ConnectionError
    +-- BrokenPipeError        EPIPE, ESHUTDOWN
    +-- ConnectionAbortedError ECONNABORTED
    +-- ConnectionRefusedError ECONNREFUSED
    +-- ConnectionResetError   ECONNRESET
+-- FileExistsError            EEXIST
+-- FileNotFoundError          ENOENT
+-- InterruptedError           EINTR
+-- IsADirectoryError          EISDIR
+-- NotADirectoryError         ENOTDIR
+-- PermissionError            EACCES, EPERM
+-- ProcessLookupError         ESRCH
+-- TimeoutError               ETIMEDOUT
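As implemented in Python 3.3+, these subclasses make selective catching straightforward. For example, a would-block condition on a non-blocking socket (one of the EAGAIN/EWOULDBLOCK cases listed above) can be caught without touching errno at all:

```python
import errno
import socket

# A connected pair of sockets; nothing has been sent yet
a, b = socket.socketpair()
a.setblocking(False)
try:
    a.recv(1)  # no data available, so this cannot complete
except BlockingIOError as e:
    # The errno attribute remains available for code that wants it
    assert e.errno in (errno.EAGAIN, errno.EWOULDBLOCK)
finally:
    b.close()
    a.close()
```

The `except BlockingIOError` clause is both shorter and more selective than the pre-PEP `except socket.error` plus errno comparison.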
Naming
Various naming controversies can arise. One of them is whether all exception class names should end in "Error". In favour is consistency with the rest of the exception hierarchy, against is concision (especially with long names such as ConnectionAbortedError).
Exception attributes
In order to preserve useful compatibility, these subclasses should still set adequate values for the various exception attributes defined on the superclass (for example errno, filename, and optionally winerror).
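As shipped in Python 3.3, this means the subclasses remain drop-in replacements for errno-checking code; for example:

```python
import errno

# The new subclasses keep the OSError attributes intact:
# errno, strerror and filename are still populated.
try:
    open("/nonexistent-path/for-demo")
except FileNotFoundError as exc:
    assert exc.errno == errno.ENOENT
    assert exc.filename == "/nonexistent-path/for-demo"
```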
Implementation
Since it is proposed that the subclasses are raised based purely on the value of errno, little or no change should be required in extension modules (either standard or third-party).
The first possibility is to adapt the PyErr_SetFromErrno() family of functions (PyErr_SetFromWindowsErr() under Windows) to raise the appropriate OSError subclass. This wouldn't cover, however, Python code raising OSError directly, using the following idiom (seen in Lib/tempfile.py):
raise IOError(_errno.EEXIST, "No usable temporary file name found")
A second possibility, suggested by Marc-Andre Lemburg, is to adapt OSError.__new__ to instantiate the appropriate subclass. This has the benefit of also covering Python code such as the above.
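The __new__-based dispatch can be sketched with a toy stand-in for OSError (class names here are illustrative, not the real implementation):

```python
import errno

class MappedError(Exception):
    """Toy stand-in for OSError, illustrating the suggested
    __new__-based dispatch; not the real implementation."""
    _errno_map = {}

    def __new__(cls, err=None, *args):
        # Only redirect when the base class itself is instantiated with
        # a known errno; explicit subclass instantiation is left alone.
        if cls is MappedError and err in cls._errno_map:
            cls = cls._errno_map[err]
        return super().__new__(cls, err, *args)

class FileExistsSketch(MappedError):
    pass

MappedError._errno_map[errno.EEXIST] = FileExistsSketch

# The tempfile.py idiom quoted above now produces the subclass automatically:
e = MappedError(errno.EEXIST, "No usable temporary file name found")
assert type(e) is FileExistsSketch
```

This is the approach the reference implementation ultimately took for OSError itself.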
Possible objections
Namespace pollution
Making the exception hierarchy finer-grained makes the root (or builtins) namespace larger. This is to be moderated, however, as:
- only a handful of additional classes are proposed;
- while standard exception types live in the root namespace, they are visually distinguished by the fact that they use the CamelCase convention, while almost all other builtins use lowercase naming (except True, False, None, Ellipsis and NotImplemented)
An alternative would be to provide a separate module containing the finer-grained exceptions, but that would defeat the purpose of encouraging careful code over careless code, since the user would first have to import the new module instead of using names already accessible.
Earlier discussion
While this is the first time such a formal proposal has been made, the idea has received informal support in the past [1], both for the introduction of finer-grained exception classes and for the coalescing of OSError and IOError.
The removal of WindowsError alone has been discussed and rejected as part of another PEP [2], but there seemed to be a consensus that the distinction with OSError wasn't meaningful. This supports at least its aliasing with OSError.
Implementation
The reference implementation has been integrated into Python 3.3. It was formerly developed in http://hg.python.org/features/pep-3151/ in branch pep-3151, and also tracked on the bug tracker at http://bugs.python.org/issue12555. It has been successfully tested on a variety of systems: Linux, Windows, OpenIndiana and FreeBSD buildbots.
One source of trouble has been with the respective constructors of OSError and WindowsError, which were incompatible. The way it is solved is by keeping the OSError signature and adding a fourth optional argument to allow passing the Windows error code (which is different from the POSIX errno). The fourth argument is stored as winerror and its POSIX translation as errno. The PyErr_SetFromWindowsErr* functions have been adapted to use the right constructor call.
A slight complication is when the PyErr_SetExcFromWindowsErr* functions are called with OSError rather than WindowsError: the errno attribute of the exception object would store the Windows error code (such as 109 for ERROR_BROKEN_PIPE) rather than its POSIX translation (such as 32 for EPIPE), which it does now. For non-socket error codes, this only occurs in the private _multiprocessing module for which there is no compatibility concern.
Note
For socket errors, the "POSIX errno" as reflected by the errno module is numerically equal to the Windows Socket error code returned by the WSAGetLastError system call:
>>> errno.EWOULDBLOCK
10035
>>> errno.WSAEWOULDBLOCK
10035
Possible alternative
Pattern matching
Another possibility would be to introduce an advanced pattern matching syntax when catching exceptions. For example:
try:
os.remove(filename)
except OSError as e if e.errno == errno.ENOENT:
pass
Several problems with this proposal:
- it introduces new syntax, which is perceived by the author to be a heavier change compared to reworking the exception hierarchy
- it doesn't decrease typing effort significantly
- it doesn't relieve the programmer from the burden of having to remember errno mnemonics
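For comparison, the adopted hierarchy handles the same case without new syntax: the errno bookkeeping of the old idiom simply becomes an except clause (Python 3.3+ behavior):

```python
import errno
import os

path = "/nonexistent-path/for-demo"

# Old idiom: catch the broad class, re-raise everything but ENOENT.
try:
    os.remove(path)
except OSError as e:
    if e.errno != errno.ENOENT:
        raise

# With the finer-grained hierarchy: no errno mnemonic to remember.
try:
    os.remove(path)
except FileNotFoundError:
    pass
```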
Exceptions ignored by this PEP
This PEP ignores EOFError, which signals a truncated input stream in various protocol and file format implementations (for example GzipFile). EOFError is not OS- or IO-related, it is a logical error raised at a higher level.
This PEP also ignores SSLError, which is raised by the ssl module in order to propagate errors signalled by the OpenSSL library. Ideally, SSLError would benefit from a similar but separate treatment since it defines its own constants for error types (ssl.SSL_ERROR_WANT_READ, etc.). In Python 3.2, SSLError is already replaced with socket.timeout when it signals a socket timeout (see issue 10272).
Finally, the fate of socket.gaierror and socket.herror is not settled. While they would deserve less cryptic names, this can be handled separately from the exception hierarchy reorganization effort.
Appendix A: Survey of common errnos
This is a quick inventory of the various errno mnemonics checked for in the standard library and its tests, as part of except clauses.
Common errnos with OSError
- EBADF: bad file descriptor (usually means the file descriptor was closed)
- EEXIST: file or directory exists
- EINTR: interrupted function call
- EISDIR: is a directory
- ENOTDIR: not a directory
- ENOENT: no such file or directory
- EOPNOTSUPP: operation not supported on socket (possible confusion with the existing io.UnsupportedOperation)
- EPERM: operation not permitted (when using e.g. os.setuid())
Common errnos with IOError
- EACCES: permission denied (for filesystem operations)
- EBADF: bad file descriptor (with select.epoll); read operation on a write-only GzipFile, or vice-versa
- EBUSY: device or resource busy
- EISDIR: is a directory (when trying to open())
- ENODEV: no such device
- ENOENT: no such file or directory (when trying to open())
- ETIMEDOUT: connection timed out
Common errnos with socket.error
All these errors may also be associated with a plain IOError, for example when calling read() on a socket's file descriptor.
- EAGAIN: resource temporarily unavailable (during a non-blocking socket call except connect())
- EALREADY: connection already in progress (during a non-blocking connect())
- EINPROGRESS: operation in progress (during a non-blocking connect())
- EINTR: interrupted function call
- EISCONN: the socket is connected
- ECONNABORTED: connection aborted by peer (during an accept() call)
- ECONNREFUSED: connection refused by peer
- ECONNRESET: connection reset by peer
- ENOTCONN: socket not connected
- ESHUTDOWN: cannot send after transport endpoint shutdown
- EWOULDBLOCK: same reasons as EAGAIN
Common errnos with select.error
- EINTR: interrupted function call
Appendix B: Survey of raised OS and IO errors
About VMSError
VMSError is completely unused by the interpreter core and the standard library. It was added as part of the OpenVMS patches submitted in 2002 by Jean-François Piéronne [4]; the motivation for including VMSError was that it could be raised by third-party packages.
Interpreter core
Handling of PYTHONSTARTUP raises IOError (but the error gets discarded):
$ PYTHONSTARTUP=foox ./python
Python 3.2a0 (py3k:82920M, Jul 16 2010, 22:53:23)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Could not open PYTHONSTARTUP
IOError: [Errno 2] No such file or directory: 'foox'
PyObject_Print() raises IOError when ferror() signals an error on the FILE * parameter (which, in the source tree, is always either stdout or stderr).
Unicode encoding and decoding using the mbcs encoding can raise WindowsError for some error conditions.
Standard library
bz2
Raises IOError throughout (OSError is unused):
>>> bz2.BZ2File("foox", "rb")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory
>>> bz2.BZ2File("LICENSE", "rb").read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: invalid data stream
>>> bz2.BZ2File("/tmp/zzz.bz2", "wb").read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: file is not ready for reading
curses
Not examined.
dbm.gnu, dbm.ndbm
_dbm.error and _gdbm.error inherit from IOError:
>>> dbm.gnu.open("foox")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
_gdbm.error: [Errno 2] No such file or directory
fcntl
Raises IOError throughout (OSError is unused).
imp module
Raises IOError for bad file descriptors:
>>> imp.load_source("foo", "foo", 123)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: [Errno 9] Bad file descriptor
io module
Raises IOError when trying to open a directory under Unix:
>>> open("Python/", "r")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: [Errno 21] Is a directory: 'Python/'
Raises IOError or io.UnsupportedOperation (which inherits from the former) for unsupported operations:
>>> open("LICENSE").write("bar")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: not writable
>>> io.StringIO().fileno()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
io.UnsupportedOperation: fileno
>>> open("LICENSE").seek(1, 1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: can't do nonzero cur-relative seeks
Raises either IOError or TypeError when the inferior I/O layer misbehaves (i.e. violates the API it is expected to implement).
Raises IOError when the underlying OS resource becomes invalid:
>>> f = open("LICENSE")
>>> os.close(f.fileno())
>>> f.read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: [Errno 9] Bad file descriptor
...or for implementation-specific optimizations:
>>> f = open("LICENSE")
>>> next(f)
'A. HISTORY OF THE SOFTWARE\n'
>>> f.tell()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: telling position disabled by next() call
Raises BlockingIOError (inheriting from IOError) when a call on a non-blocking object would block.
mmap
Under Unix, raises its own mmap.error (inheriting from EnvironmentError) throughout:
>>> mmap.mmap(123, 10)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
mmap.error: [Errno 9] Bad file descriptor
>>> mmap.mmap(os.open("/tmp", os.O_RDONLY), 10)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
mmap.error: [Errno 13] Permission denied
Under Windows, however, it mostly raises WindowsError (the source code also shows a few occurrences of mmap.error):
>>> fd = os.open("LICENSE", os.O_RDONLY)
>>> m = mmap.mmap(fd, 16384)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
WindowsError: [Error 5] Accès refusé
>>> sys.last_value.errno
13
>>> errno.errorcode[13]
'EACCES'
>>> m = mmap.mmap(-1, 4096)
>>> m.resize(16384)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
WindowsError: [Error 87] Paramètre incorrect
>>> sys.last_value.errno
22
>>> errno.errorcode[22]
'EINVAL'
multiprocessing
Not examined.
os / posix
The os (or posix) module raises OSError throughout, except under Windows where WindowsError can be raised instead.
ossaudiodev
Raises IOError throughout (OSError is unused):
>>> ossaudiodev.open("foo", "r")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory: 'foo'
readline
Raises IOError in various file-handling functions:
>>> readline.read_history_file("foo")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory
>>> readline.read_init_file("foo")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory
>>> readline.write_history_file("/dev/nonexistent")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: [Errno 13] Permission denied
select
- select() and poll objects raise select.error, which doesn't inherit from anything (but poll.modify() raises IOError);
- epoll objects raise IOError;
- kqueue objects raise both OSError and IOError.
As a side-note, not deriving from EnvironmentError means select.error does not get the useful errno attribute. User code must check args[0] instead:
>>> signal.alarm(1); select.select([], [], [])
0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
select.error: (4, 'Interrupted system call')
>>> e = sys.last_value
>>> e
error(4, 'Interrupted system call')
>>> e.errno == errno.EINTR
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'error' object has no attribute 'errno'
>>> e.args[0] == errno.EINTR
True
signal
signal.ItimerError inherits from IOError.
socket
socket.error inherits from IOError.
sys
sys.getwindowsversion() raises WindowsError with a bogus error number if the GetVersionEx() call fails.
time
Raises IOError for internal errors in time.time() and time.sleep().
zipimport
zipimporter.get_data() can raise IOError.
Acknowledgments
Significant input has been received from Nick Coghlan.
References
| [1] | "IO module precisions and exception hierarchy": http://mail.python.org/pipermail/python-dev/2009-September/092130.html |
| [2] | Discussion of "Removing WindowsError" in PEP 348: http://www.python.org/dev/peps/pep-0348/#removing-windowserror |
| [3] | Google Code Search of IOError in Python code: around 40000 results; OSError: around 15200 results; EnvironmentError: around 3000 results |
| [4] | http://bugs.python.org/issue614055 |
Copyright
This document has been placed in the public domain.
pep-3152 Cofunctions
| PEP: | 3152 |
|---|---|
| Title: | Cofunctions |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Gregory Ewing <greg.ewing at canterbury.ac.nz> |
| Status: | Rejected |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 13-Feb-2009 |
| Python-Version: | 3.3 |
| Post-History: |
Contents
Abstract
A syntax is proposed for defining and calling a special type of generator called a 'cofunction'. It is designed to provide a streamlined way of writing generator-based coroutines, and allow the early detection of certain kinds of error that are easily made when writing such code, which otherwise tend to cause hard-to-diagnose symptoms.
This proposal builds on the 'yield from' mechanism described in PEP 380, and describes some of the semantics of cofunctions in terms of it. However, it would be possible to define and implement cofunctions independently of PEP 380 if so desired.
Specification
Cofunction definitions
A new keyword codef is introduced which is used in place of def to define a cofunction. A cofunction is a special kind of generator having the following characteristics:
- A cofunction is always a generator, even if it does not contain any yield or yield from expressions.
- A cofunction cannot be called the same way as an ordinary function. An exception is raised if an ordinary call to a cofunction is attempted.
Cocalls
Calls from one cofunction to another are made by marking the call with a new keyword cocall. The expression
cocall f(*args, **kwds)
is semantically equivalent to
yield from f.__cocall__(*args, **kwds)
except that the object returned by __cocall__ is expected to be an iterator, so the step of calling iter() on it is skipped.
The full syntax of a cocall expression is described by the following grammar lines:
atom: cocall | <existing alternatives for atom>
cocall: 'cocall' atom cotrailer* '(' [arglist] ')'
cotrailer: '[' subscriptlist ']' | '.' NAME
The cocall keyword is syntactically valid only inside a cofunction. A SyntaxError will result if it is used in any other context.
Objects which implement __cocall__ are expected to return an object obeying the iterator protocol. Cofunctions respond to __cocall__ the same way as ordinary generator functions respond to __call__, i.e. by returning a generator-iterator.
Certain objects that wrap other callable objects, notably bound methods, will be given __cocall__ implementations that delegate to the underlying object.
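Since the proposed syntax was never implemented, the protocol can only be sketched with today's generators; the class and helper below are illustrative stand-ins for the proposed codef/cocall machinery, not the real implementation:

```python
class cofunction:
    """Illustrative stand-in for the proposed codef: wraps a generator
    function, exposing it via __cocall__ instead of an ordinary call."""
    def __init__(self, genfunc):
        self.genfunc = genfunc

    def __call__(self, *args, **kwds):
        # An ordinary call to a cofunction must fail, per the specification.
        raise TypeError("cofunctions cannot be called directly")

    def __cocall__(self, *args, **kwds):
        # Like a generator function's __call__: return a generator-iterator.
        return self.genfunc(*args, **kwds)

def cocall(f, *args, **kwds):
    # Stand-in for the expansion of the proposed 'cocall' keyword.
    return (yield from f.__cocall__(*args, **kwds))

@cofunction
def add_later(a, b):
    yield          # suspend once, as a coroutine might
    return a + b

def driver():
    result = yield from cocall(add_later, 1, 2)
    return result

g = driver()
try:
    while True:
        next(g)
except StopIteration as stop:
    assert stop.value == 3
```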
New builtins, attributes and C API functions
To facilitate interfacing cofunctions with non-coroutine code, there will be a built-in function costart whose definition is equivalent to
def costart(obj, *args, **kwds):
return obj.__cocall__(*args, **kwds)
There will also be a corresponding C API function
PyObject *PyObject_CoCall(PyObject *obj, PyObject *args, PyObject *kwds)
It is left unspecified for now whether a cofunction is a distinct type of object or, like a generator function, is simply a specially-marked function instance. If the latter, a read-only boolean attribute __iscofunction__ should be provided to allow testing whether a given function object is a cofunction.
Motivation and Rationale
The yield from syntax is reasonably self-explanatory when used for the purpose of delegating part of the work of a generator to another function. It can also be used to good effect in the implementation of generator-based coroutines, but it reads somewhat awkwardly when used for that purpose, and tends to obscure the true intent of the code.
Furthermore, using generators as coroutines is somewhat error-prone. If one forgets to use yield from when it should have been used, or uses it when it shouldn't have, the symptoms that result can be obscure and confusing.
Finally, sometimes there is a need for a function to be a coroutine even though it does not yield anything, and in these cases it is necessary to resort to kludges such as if 0: yield to force it to be a generator.
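The kludge referred to here looks like this with plain generators:

```python
import types

def handshake():
    # A coroutine that yields nothing still has to be a generator so that
    # its caller can 'yield from' it; the dead 'if 0: yield' forces that.
    if 0:
        yield
    return "done"

g = handshake()
assert isinstance(g, types.GeneratorType)
try:
    next(g)
except StopIteration as stop:
    assert stop.value == "done"
```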
The codef and cocall constructs address the first issue by making the syntax directly reflect the intent, that is, that the function forms part of a coroutine.
The second issue is addressed by making it impossible to mix coroutine and non-coroutine code in ways that don't make sense. If the rules are violated, an exception is raised that points out exactly what and where the problem is.
Lastly, the need for dummy yields is eliminated by making the form of definition determine whether the function is a coroutine, rather than what it contains.
Prototype Implementation
An implementation in the form of patches to Python 3.1.2 can be found here:
http://www.cosc.canterbury.ac.nz/greg.ewing/python/generators/cofunctions.html
Copyright
This document has been placed in the public domain.
pep-3153 Asynchronous IO support
| PEP: | 3153 |
|---|---|
| Title: | Asynchronous IO support |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Laurens Van Houtven <_ at lvh.cc> |
| Status: | Superseded |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 29-May-2011 |
| Post-History: | TBD |
| Superseded-By: | 3156 |
Contents
Abstract
This PEP describes an abstraction of asynchronous IO for the Python standard library.
The goal is to reach an abstraction that can be implemented by many different asynchronous IO backends and that provides a target for library developers to write code portable between those different backends.
Rationale
People who want to write asynchronous code in Python right now have a few options:
- asyncore and asynchat
- something bespoke, most likely based on the select module
- using a third party library, such as Twisted [2] or gevent [3]
Unfortunately, each of these options has its downsides, which this PEP tries to address.
Despite having been part of the Python standard library for a long time, the asyncore module suffers from fundamental flaws following from an inflexible API that does not stand up to the expectations of a modern asynchronous networking module.
Moreover, its approach is too simplistic to provide developers with all the tools they need in order to fully exploit the potential of asynchronous networking.
The most popular solution right now used in production involves the use of third party libraries. These often provide satisfactory solutions, but there is a lack of compatibility between these libraries, which tends to make codebases very tightly coupled to the library they use.
This current lack of portability between different asynchronous IO libraries causes a lot of duplicated effort for third party library developers. A sufficiently powerful abstraction could mean that asynchronous code gets written once, but used everywhere.
An eventual added goal would be for standard library implementations of wire and network protocols to evolve towards being real protocol implementations, as opposed to standalone libraries that do everything including calling recv() blockingly. This means they could be easily reused for both synchronous and asynchronous code.
Communication abstractions
Transports
Transports provide a uniform API for reading bytes from and writing bytes to different kinds of connections. Transports in this PEP are always ordered, reliable, bidirectional, stream-oriented two-endpoint connections. This might be a TCP socket, an SSL connection, a pipe (named or otherwise), a serial port... It may abstract a file descriptor on POSIX platforms or a Handle on Windows or some other data structure appropriate to a particular platform. It encapsulates all of the particular implementation details of using that platform data structure and presents a uniform interface for application developers.
Transports talk to two things: the other side of the connection on one hand, and a protocol on the other. It's a bridge between the specific underlying transfer mechanism and the protocol. Its job can be described as allowing the protocol to just send and receive bytes, taking care of all of the magic that needs to happen to those bytes to be eventually sent across the wire.
The primary feature of a transport is sending bytes to a protocol and receiving bytes from the underlying connection. Writing to the transport is done using the write and write_sequence methods. The latter method is a performance optimization, to allow software to take advantage of specific capabilities in some transport mechanisms. Specifically, this allows transports to use writev [4] instead of write [5] or send [6], also known as scatter/gather IO.
A transport can be paused and resumed. This will cause it to buffer data coming from protocols and stop sending received data to the protocol.
A transport can also be closed, half-closed and aborted. A closed transport will finish writing all of the data queued in it to the underlying mechanism, and will then stop reading or writing data. Aborting a transport stops it, closing the connection without sending any data that is still queued.
Further writes will result in exceptions being raised. A half-closed transport may not be written to anymore, but will still accept incoming data.
Protocols
Protocols are probably more familiar to new users. The terminology is consistent with what you would expect from something called a protocol: the protocols most people think of first, like HTTP, IRC, SMTP... are all examples of something that would be implemented in a protocol.
The shortest useful definition of a protocol is a (usually two-way) bridge between the transport and the rest of the application logic. A protocol will receive bytes from a transport and translates that information into some behavior, typically resulting in some method calls on an object. Similarly, application logic calls some methods on the protocol, which the protocol translates into bytes and communicates to the transport.
One of the simplest protocols is a line-based protocol, where data is delimited by \r\n. The protocol will receive bytes from the transport and buffer them until there is at least one complete line. Once that's done, it will pass this line along to some object. Ideally that would be accomplished using a callable or even a completely separate object composed by the protocol, but it could also be implemented by subclassing (as is the case with Twisted's LineReceiver). For the other direction, the protocol could have a write_line method, which adds the required \r\n and passes the new bytes buffer on to the transport.
This PEP suggests a generalized LineReceiver called ChunkProtocol, where a "chunk" is a message in a stream, delimited by the specified delimiter. Instances take a delimiter and a callable that will be called with a chunk of data once it's received (as opposed to Twisted's subclassing behavior). ChunkProtocol also has a write_chunk method analogous to the write_line method described above.
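A minimal sketch of the proposed ChunkProtocol (the method names follow the PEP's description, but the implementation is illustrative):

```python
class ChunkProtocol:
    """Sketch of the proposed generalized LineReceiver; illustrative only."""
    def __init__(self, delimiter, callback, transport=None):
        self.delimiter = delimiter
        self.callback = callback      # called with each complete chunk
        self.transport = transport
        self._buffer = b""

    def data_received(self, data):
        # Buffer until at least one complete delimited chunk is available.
        self._buffer += data
        while self.delimiter in self._buffer:
            chunk, self._buffer = self._buffer.split(self.delimiter, 1)
            self.callback(chunk)

    def write_chunk(self, chunk):
        # Outgoing direction: append the delimiter and hand off the bytes.
        self.transport.write(chunk + self.delimiter)

chunks = []
proto = ChunkProtocol(b"\r\n", chunks.append)
proto.data_received(b"HELO ")                 # incomplete: buffered
proto.data_received(b"example.com\r\nMAIL")   # completes the first chunk
assert chunks == [b"HELO example.com"]
```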
Why separate protocols and transports?
This separation between protocol and transport often confuses people who first come across it. In fact, the standard library itself does not make this distinction in many cases, particularly not in the API it provides to users.
It is nonetheless a very useful distinction. In the worst case, it simplifies the implementation by clear separation of concerns. However, it often serves the far more useful purpose of being able to reuse protocols across different transports.
Consider a simple RPC protocol. The same bytes may be transferred across many different transports, for example pipes or sockets. To help with this, we separate the protocol out from the transport. The protocol just reads and writes bytes, and doesn't really care what mechanism is used to eventually transfer those bytes.
This also allows for protocols to be stacked or nested easily, allowing for even more code reuse. A common example of this is JSON-RPC: according to the specification, it can be used across both sockets and HTTP [1]. In practice, it tends to be primarily encapsulated in HTTP. The protocol-transport abstraction allows us to build a stack of protocols and transports that allow you to use HTTP as if it were a transport. For JSON-RPC, that might get you a stack somewhat like this:
- TCP socket transport
- HTTP protocol
- HTTP-based transport
- JSON-RPC protocol
- Application code
Flow control
Consumers
Consumers consume bytes produced by producers. Together with producers, they make flow control possible.
Consumers primarily play a passive role in flow control. They get called whenever a producer has some data available. They then process that data, and typically yield control back to the producer.
Consumers typically implement buffers of some sort. They make flow control possible by telling their producer about the current status of those buffers. A consumer can instruct a producer to stop producing entirely, stop producing temporarily, or resume producing if it has been told to pause previously.
Producers are registered to the consumer using the register method.
Producers
Where consumers consume bytes, producers produce them.
Producers are modeled after the IPushProducer [7] interface found in Twisted. Although there is an IPullProducer [8] as well, it is on the whole far less interesting and therefore probably out of the scope of this PEP.
Although producers can be told to stop producing entirely, the two most interesting methods they have are pause and resume. These are usually called by the consumer, to signify whether it is ready to process ("consume") more data or not. Consumers and producers cooperate to make flow control possible.
In addition to the Twisted IPushProducer [7] interface, producers have a half_register method which is called with the consumer when the consumer tries to register that producer. In most cases, this will just be a case of setting self.consumer = consumer, but some producers may require more complex preconditions or behavior when a consumer is registered. End-users are not supposed to call this method directly.
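The registration and pause/resume cooperation described above can be sketched as follows (class names and the high-water policy are illustrative, not part of the proposal):

```python
class BufferingConsumer:
    """Illustrative consumer: buffers data and applies back-pressure."""
    def __init__(self, high_water=3):
        self.buffer = []
        self.high_water = high_water
        self.producer = None

    def register(self, producer):
        self.producer = producer
        producer.half_register(self)   # the PEP's producer-side hook

    def write(self, data):
        self.buffer.append(data)
        if len(self.buffer) >= self.high_water:
            self.producer.pause()      # tell the producer to back off

class Producer:
    """Illustrative push producer with the pause/resume methods."""
    def half_register(self, consumer):
        self.consumer = consumer       # the common case mentioned above
        self.paused = False

    def pause(self):
        self.paused = True

    def resume(self):
        self.paused = False

    def produce(self, data):
        if not self.paused:
            self.consumer.write(data)

c = BufferingConsumer(high_water=2)
p = Producer()
c.register(p)
p.produce(b"a"); p.produce(b"b"); p.produce(b"c")   # third write is dropped
assert c.buffer == [b"a", b"b"] and p.paused
```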
Considered API alternatives
Generators as producers
Generators have been suggested as a way to implement producers. However, there appear to be a few problems with this.
First of all, there is a conceptual problem. A generator, in a sense, is "passive". It needs to be told, through a method call, to take action. A producer is "active": it initiates those method calls. A real producer has a symmetric relationship with its consumer. In the case of a generator-turned-producer, only the consumer would have a reference, and the producer is blissfully unaware of the consumer's existence.
This conceptual problem translates into a few technical issues as well. After a successful write method call on its consumer, a (push) producer is free to take action once more. In the case of a generator, it would need to be told, either by asking for the next object through the iteration protocol (a process which could block indefinitely), or perhaps by throwing some kind of signal exception into it.
This signaling setup may provide a technically feasible solution, but it is still unsatisfactory. For one, this introduces unwarranted complexity in the consumer, which now not only needs to understand how to receive and process data, but also how to ask for new data and deal with the case of no new data being available.
This latter edge case is particularly problematic. It needs to be taken care of, since the entire operation is not allowed to block. However, generators cannot raise an exception on iteration without terminating, thereby losing the state of the generator. As a result, signaling a lack of available data would have to be done using a sentinel value, instead of the exception mechanism.
Last but not least, nobody produced actually working code demonstrating how they could be used.
References
| [1] | Sections 2.1 and 2.2 of the JSON-RPC specification. |
| [2] | http://www.twistedmatrix.com/ |
| [3] | http://www.gevent.org/ |
| [4] | http://pubs.opengroup.org/onlinepubs/009695399/functions/writev.html |
| [5] | http://pubs.opengroup.org/onlinepubs/009695399/functions/write.html |
| [6] | http://pubs.opengroup.org/onlinepubs/009695399/functions/send.html |
| [7] | (1, 2) http://twistedmatrix.com/documents/current/api/twisted.internet.interfaces.IPushProducer.html |
| [8] | http://twistedmatrix.com/documents/current/api/twisted.internet.interfaces.IPullProducer.html |
Copyright
This document has been placed in the public domain.
pep-3154 Pickle protocol version 4
| PEP: | 3154 |
|---|---|
| Title: | Pickle protocol version 4 |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Antoine Pitrou <solipsis at pitrou.net> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 2011-08-11 |
| Python-Version: | 3.4 |
| Post-History: | http://mail.python.org/pipermail/python-dev/2011-August/112821.html |
| Resolution: | https://mail.python.org/pipermail/python-dev/2013-November/130439.html |
Contents
Abstract
Data serialized using the pickle module must be portable across Python versions. It should also support the latest language features as well as implementation-specific features. For this reason, the pickle module knows about several protocols (currently numbered from 0 to 3), each of which appeared in a different Python version. Using a low-numbered protocol version makes it possible to exchange data with old Python versions, while using a high-numbered protocol allows access to newer features and sometimes more efficient resource use (both CPU time required for (de)serializing, and disk size / network bandwidth required for data transfer).
Rationale
The latest current protocol, coincidentally named protocol 3, appeared with Python 3.0 and supports the new incompatible features in the language (mainly, unicode strings by default and the new bytes object). The opportunity was not taken at the time to improve the protocol in other ways.
This PEP is an attempt to foster a number of incremental improvements in a new pickle protocol version. The PEP process is used in order to gather as many improvements as possible, because the introduction of a new pickle protocol should be a rare occurrence.
Proposed changes
Framing
Traditionally, when unpickling an object from a stream (by calling load() rather than loads()), many small read() calls can be issued on the file-like object, with a potentially huge performance impact.
Protocol 4, by contrast, features binary framing. The general structure of a pickle is thus the following:
+------+------+
| 0x80 | 0x04 |            protocol header (2 bytes)
+------+------+
|  OP  |                   FRAME opcode (1 byte)
+------+------+-----------+
| MM MM MM MM MM MM MM MM |    frame size (8 bytes, little-endian)
+------+------+-----------+
| .... |                      first frame contents (M bytes)
+------+
|  OP  |                   FRAME opcode (1 byte)
+------+------+-----------+
| NN NN NN NN NN NN NN NN |    frame size (8 bytes, little-endian)
+------+------+-----------+
| .... |                      second frame contents (N bytes)
+------+
etc.
To keep the implementation simple, it is forbidden for a pickle opcode to straddle frame boundaries. The pickler takes care not to produce such pickles, and the unpickler refuses them. Also, there is no "last frame" marker. The last frame is simply the one which ends with a STOP opcode.
A well-written C implementation doesn't need additional memory copies for the framing layer, preserving general (un)pickling efficiency.
Note
How the pickler decides to partition the pickle stream into frames is an implementation detail. For example, "closing" a frame as soon as it reaches ~64 KiB is a reasonable choice for both performance and pickle size overhead.
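The layout above can be checked against a real protocol-4 pickle. The sketch below assumes CPython's pickler, which emits a single frame for a small object like this; the `pickle.FRAME` constant is the module-level opcode byte defined in `Lib/pickle.py`.

```python
import pickle
import struct

# A protocol-4 pickle of a modest object: the pickler writes the 2-byte
# protocol header, then a FRAME opcode followed by an 8-byte frame size.
data = pickle.dumps(list(range(10)), protocol=4)

header = data[:2]                    # b"\x80\x04": PROTO opcode + version 4
frame_opcode = data[2:3]             # pickle.FRAME == b"\x95"
(frame_size,) = struct.unpack("<Q", data[3:11])  # 8 bytes, little-endian
# For a small pickle everything after the size field sits in one frame,
# so frame_size equals the number of remaining bytes (including STOP).
```

Since this pickle fits in a single frame, `frame_size` equals `len(data) - 11`.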
Binary encoding for all opcodes
The GLOBAL opcode, which is still used in protocol 3, uses the so-called "text" mode of the pickle protocol, which involves looking for newlines in the pickle stream. It also complicates the implementation of binary framing.
Protocol 4 forbids use of the GLOBAL opcode and replaces it with STACK_GLOBAL, a new opcode which takes its operand from the stack.
Serializing more "lookupable" objects
By default, pickle is only able to serialize module-global functions and classes. Supporting other kinds of objects, such as unbound methods [4], is a common request. Actually, third-party support for some of them, such as bound methods, is implemented in the multiprocessing module [5].
The __qualname__ attribute from PEP 3155 makes it possible to look up many more objects by name. Making the STACK_GLOBAL opcode accept dot-separated names would allow the standard pickle implementation to support all those kinds of objects.
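A dotted-name lookup of this kind can be sketched in a few lines. The helper below is hypothetical (it is not part of the pickle API); it merely mirrors what an unpickler handling STACK_GLOBAL must do: import the module, then walk the dot-separated qualified name.

```python
import functools
import importlib

def lookup_qualname(module_name, qualname):
    """Hypothetical helper mirroring STACK_GLOBAL's lookup:
    import the module, then follow each dotted component."""
    module = importlib.import_module(module_name)
    return functools.reduce(getattr, qualname.split("."), module)

# e.g. a classmethod, which a plain module-global lookup could not name:
fromkeys = lookup_qualname("collections", "OrderedDict.fromkeys")
```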
64-bit opcodes for large objects
Current protocol versions export object sizes for various built-in types (str, bytes) as 32-bit ints. This forbids serialization of large data [1]. New opcodes are required to support very large bytes and str objects.
Native opcodes for sets and frozensets
Many common built-in types (such as str, bytes, dict, list, tuple) have dedicated opcodes to improve resource consumption when serializing and deserializing them; however, sets and frozensets don't. Adding such opcodes would be an obvious improvement. Also, dedicated set support could help lift the current restriction that self-referential sets cannot be pickled [2].
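The new set opcodes can be observed in any modern CPython (3.4+), where protocol 4 is implemented, by disassembling a pickle with the standard pickletools module:

```python
import io
import pickle
import pickletools

def listing(obj):
    # Symbolic disassembly of obj's protocol-4 pickle
    buf = io.StringIO()
    pickletools.dis(pickle.dumps(obj, protocol=4), out=buf)
    return buf.getvalue()

set_listing = listing({1, 2})                 # EMPTY_SET then ADDITEMS
frozenset_listing = listing(frozenset({1, 2}))  # FROZENSET
```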
Calling __new__ with keyword arguments
Currently, classes whose __new__ mandates the use of keyword-only arguments cannot be pickled (or, rather, unpickled) [3]. Both a new special method (__getnewargs_ex__) and a new opcode (NEWOBJ_EX) are needed. The __getnewargs_ex__ method, if it exists, must return a two-tuple (args, kwargs) where the first item is the tuple of positional arguments and the second item is the dict of keyword arguments for the class's __new__ method.
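A minimal example, runnable against any Python where protocol 4 is implemented (3.4+); the Point class is illustrative, not taken from the PEP:

```python
import pickle

class Point:
    """Construction requires keyword-only arguments."""
    def __new__(cls, *, x=0, y=0):
        self = super().__new__(cls)
        self.x, self.y = x, y
        return self

    def __getnewargs_ex__(self):
        # (args, kwargs) handed back to cls.__new__ on unpickling
        return (), {"x": self.x, "y": self.y}

# Round-trip with protocol 4 emits NEWOBJ_EX under the hood
p = pickle.loads(pickle.dumps(Point(x=1, y=2), protocol=4))
```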
Better string encoding
Short str objects currently have their length coded as a 4-byte integer, which is wasteful. A specific opcode with a 1-byte length would make many pickles smaller.
Smaller memoization
The PUT opcodes all require an explicit index to select in which entry of the memo dictionary the top-of-stack is memoized. However, in practice those numbers are allocated in sequential order. A new opcode, MEMOIZE, will instead store the top-of-stack at the index equal to the current size of the memo dictionary. This allows for shorter pickles, since PUT opcodes are emitted for all non-atomic datatypes.
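The difference is visible by comparing the opcode streams of the same object at protocols 2 and 4, using pickletools.genops() (available in any modern CPython):

```python
import pickle
import pickletools

def opcode_names(obj, proto):
    # Set of opcode names appearing in obj's pickle at the given protocol
    return {op.name for op, arg, pos
            in pickletools.genops(pickle.dumps(obj, protocol=proto))}

ops_v4 = opcode_names([1, [2]], 4)  # memoizes with implicit indices
ops_v2 = opcode_names([1, [2]], 2)  # memoizes with explicit BINPUT n
```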
Summary of new opcodes
These reflect the state of the proposed implementation (thanks mostly to Alexandre Vassalotti's work):
- FRAME: introduce a new frame (followed by the 8-byte frame size and the frame contents).
- SHORT_BINUNICODE: push a UTF-8-encoded str object with a one-byte size prefix (therefore less than 256 bytes long).
- BINUNICODE8: push a UTF-8-encoded str object with an eight-byte size prefix (for strings longer than 2**32 bytes, which therefore cannot be serialized using BINUNICODE).
- BINBYTES8: push a bytes object with an eight-byte size prefix (for bytes objects longer than 2**32 bytes, which therefore cannot be serialized using BINBYTES).
- EMPTY_SET: push a new empty set object on the stack.
- ADDITEMS: add the topmost stack items to the set (to be used with EMPTY_SET).
- FROZENSET: create a frozenset object from the topmost stack items, and push it on the stack.
- NEWOBJ_EX: take the three topmost stack items cls, args and kwargs, and push the result of calling cls.__new__(*args, **kwargs).
- STACK_GLOBAL: take the two topmost stack items module_name and qualname, and push the result of looking up the dotted qualname in the module named module_name.
- MEMOIZE: store the top-of-stack object in the memo dictionary with an index equal to the current size of the memo dictionary.
Alternative ideas
Prefetching
Serhiy Storchaka suggested replacing framing with a special PREFETCH opcode (with a 2- or 4-byte argument) to declare known pickle chunks explicitly. Large data may be pickled outside such chunks. A naïve unpickler should be able to skip the PREFETCH opcode and still decode pickles properly, but good error handling would require checking that the PREFETCH length falls on an opcode boundary.
Acknowledgments
In alphabetic order:
References
| [1] | "pickle not 64-bit ready": http://bugs.python.org/issue11564 |
| [2] | "Cannot pickle self-referencing sets": http://bugs.python.org/issue9269 |
| [3] | "pickle/copyreg doesn't support keyword only arguments in __new__": http://bugs.python.org/issue4727 |
| [4] | "pickle should support methods": http://bugs.python.org/issue9276 |
| [5] | Lib/multiprocessing/forking.py: http://hg.python.org/cpython/file/baea9f5f973c/Lib/multiprocessing/forking.py#l54 |
| [6] | Implement PEP 3154, by Alexandre Vassalotti: http://bugs.python.org/issue17810 |
| [7] | Implement PEP 3154, by Stefan Mihaila: http://bugs.python.org/issue15642 |
Copyright
This document has been placed in the public domain.
pep-3155 Qualified name for classes and functions
| PEP: | 3155 |
|---|---|
| Title: | Qualified name for classes and functions |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Antoine Pitrou <solipsis at pitrou.net> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 2011-10-29 |
| Python-Version: | 3.3 |
| Post-History: | |
| Resolution: | http://mail.python.org/pipermail/python-dev/2011-November/114545.html |
Contents
Rationale
Python's introspection facilities have long had poor support for nested classes. Given a class object, it is impossible to know whether it was defined inside another class or at module top-level; and, if the former, it is also impossible to know in which class it was defined. While use of nested classes is often considered poor style, the only reason for them to have second-class introspection support is a lousy pun.
Python 3 adds insult to injury by dropping what was formerly known as unbound methods. In Python 2, given the following definition:
class C:
    def f():
        pass
you can then walk up from the C.f object to its defining class:
>>> C.f.im_class
<class '__main__.C'>
This possibility is gone in Python 3:
>>> C.f.im_class
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'function' object has no attribute 'im_class'
>>> dir(C.f)
['__annotations__', '__call__', '__class__', '__closure__', '__code__',
 '__defaults__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__',
 '__format__', '__ge__', '__get__', '__getattribute__', '__globals__',
 '__gt__', '__hash__', '__init__', '__kwdefaults__', '__le__', '__lt__',
 '__module__', '__name__', '__ne__', '__new__', '__reduce__',
 '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__',
 '__subclasshook__']
This again limits the introspection capabilities available to the user. It can produce actual issues when porting software to Python 3, for example in Twisted Core, where the issue of introspecting method objects came up several times. It also limits pickling support [1].
Proposal
This PEP proposes the addition of a __qualname__ attribute to functions and classes. For top-level functions and classes, the __qualname__ attribute is equal to the __name__ attribute. For nested classes, methods, and nested functions, the __qualname__ attribute contains a dotted path leading to the object from the module top-level. A function's local namespace is represented in that dotted path by a component named <locals>.
The repr() and str() of functions and classes is modified to use __qualname__ rather than __name__.
Example with nested classes
>>> class C:
...   def f(): pass
...   class D:
...     def g(): pass
...
>>> C.__qualname__
'C'
>>> C.f.__qualname__
'C.f'
>>> C.D.__qualname__
'C.D'
>>> C.D.g.__qualname__
'C.D.g'
Example with nested functions
>>> def f():
...   def g(): pass
...   return g
...
>>> f.__qualname__
'f'
>>> f().__qualname__
'f.<locals>.g'
Limitations
With nested functions (and classes defined inside functions), the dotted path will not be walkable programmatically as a function's namespace is not available from the outside. It will still be more helpful to the human reader than the bare __name__.
Like the __name__ attribute, the __qualname__ attribute is computed statically and will not automatically follow rebinding.
Discussion
Excluding the module name
Like __name__, __qualname__ doesn't include the module name. This makes it independent of module aliasing and rebinding, and also allows computing it at compile time.
Reviving unbound methods
Reviving unbound methods would only solve a fraction of the problems this PEP solves, at a higher price (an additional object type and an additional indirection, rather than an additional attribute).
Naming choice
"Qualified name" is the best approximation, as a short phrase, of what the additional attribute is about. It is not a "full name" or "fully qualified name" since it (deliberately) does not include the module name. Calling it a "path" would risk confusion with filesystem paths and the __file__ attribute.
The first proposal for the attribute name was to call it __qname__ but many people (who are not aware of previous use of such jargon in e.g. the XML specification [2]) found it obscure and non-obvious, which is why the slightly less short and more explicit __qualname__ was finally chosen.
References
| [1] | "pickle should support methods": http://bugs.python.org/issue9276 |
| [2] | "QName" entry in Wikipedia: http://en.wikipedia.org/wiki/QName |
Copyright
This document has been placed in the public domain.
pep-3156 Asynchronous IO Support Rebooted: the "asyncio" Module
| PEP: | 3156 |
|---|---|
| Title: | Asynchronous IO Support Rebooted: the "asyncio" Module |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | Guido van Rossum <guido at python.org> |
| BDFL-Delegate: | Antoine Pitrou <antoine@python.org> |
| Discussions-To: | <python-tulip at googlegroups.com> |
| Status: | Final |
| Type: | Standards Track |
| Content-Type: | text/x-rst |
| Created: | 12-Dec-2012 |
| Post-History: | 21-Dec-2012 |
| Resolution: | https://mail.python.org/pipermail/python-dev/2013-November/130419.html |
Contents
- Abstract
- Introduction
- Event Loop Interface Specification
- Coroutines and the Scheduler
- Synchronization
- Miscellaneous
- Wish List
- Open Issues
- References
- Acknowledgments
- Copyright
Abstract
This is a proposal for asynchronous I/O in Python 3, starting at Python 3.3. Consider this the concrete proposal that is missing from PEP 3153. The proposal includes a pluggable event loop, transport and protocol abstractions similar to those in Twisted, and a higher-level scheduler based on yield from (PEP 380). The proposed package name is asyncio.
Introduction
Status
A reference implementation exists under the code name Tulip. The Tulip repo is linked from the References section at the end. Packages based on this repo will be provided on PyPI (see References) to enable using the asyncio package with Python 3.3 installations.
As of October 20th 2013, the asyncio package has been checked into the Python 3.4 repository and released with Python 3.4-alpha-4, with "provisional" API status. This is an expression of confidence and intended to increase early feedback on the API, and not intended to force acceptance of the PEP. The expectation is that the package will keep provisional status in Python 3.4 and progress to final status in Python 3.5. Development continues to occur primarily in the Tulip repo, with changes occasionally merged into the CPython repo.
Dependencies
Python 3.3 is required for many of the proposed features. The reference implementation (Tulip) requires no new language or standard library features beyond Python 3.3, no third-party modules or packages, and no C code, except for the (optional) IOCP support on Windows.
Module Namespace
The specification here lives in a new top-level package, asyncio. Different components live in separate submodules of the package. The package will import common APIs from their respective submodules and make them available as package attributes (similar to the way the email package works). For such common APIs, the name of the submodule that actually defines them is not part of the specification. Less common APIs may have to explicitly be imported from their respective submodule, and in this case the submodule name is part of the specification.
Classes and functions defined without a submodule name are assumed to live in the namespace of the top-level package. (But do not confuse these with methods of various classes, which for brevity are also used without a namespace prefix in certain contexts.)
Interoperability
The event loop is the place where most interoperability occurs. It should be easy for (Python 3.3 ports of) frameworks like Twisted, Tornado, or even gevent to either adapt the default event loop implementation to their needs using a lightweight adapter or proxy, or to replace the default event loop implementation with an adaptation of their own event loop implementation. (Some frameworks, like Twisted, have multiple event loop implementations. This should not be a problem since these all have the same interface.)
In most cases it should be possible for two different third-party frameworks to interoperate, either by sharing the default event loop implementation (each using its own adapter), or by sharing the event loop implementation of either framework. In the latter case two levels of adaptation would occur (from framework A's event loop to the standard event loop interface, and from there to framework B's event loop). Which event loop implementation is used should be under control of the main program (though a default policy for event loop selection is provided).
For this interoperability to be effective, the preferred direction of adaptation in third party frameworks is to keep the default event loop and adapt it to the framework's API. Ideally all third party frameworks would give up their own event loop implementation in favor of the standard implementation. But not all frameworks may be satisfied with the functionality provided by the standard implementation.
In order to support both directions of adaptation, two separate APIs are specified:
- An interface for managing the current event loop
- The interface of a conforming event loop
An event loop implementation may provide additional methods and guarantees, as long as these are called out in the documentation as non-standard. An event loop implementation may also leave certain methods unimplemented if they cannot be implemented in the given environment; however, such deviations from the standard API should be considered only as a last resort, and only if the platform or environment forces the issue. (An example would be a platform where there is a system event loop that cannot be started or stopped; see "Embedded Event Loops" below.)
The event loop API does not depend on yield from. Rather, it uses a combination of callbacks, additional interfaces (transports and protocols), and Futures. The latter are similar to those defined in PEP 3148, but have a different implementation and are not tied to threads. In particular, the result() method raises an exception instead of blocking when a result is not yet ready; the user is expected to use callbacks (or yield from) to wait for the result.
All event loop methods specified as returning a coroutine are allowed to return either a Future or a coroutine, at the implementation's choice (the standard implementation always returns coroutines). All event loop methods documented as accepting coroutine arguments must accept both Futures and coroutines for such arguments. (A convenience function, async(), exists to convert an argument that is either a coroutine or a Future into a Future.)
For users (like myself) who don't like using callbacks, a scheduler is provided for writing asynchronous I/O code as coroutines using the PEP 380 yield from expressions. The scheduler is not pluggable; pluggability occurs at the event loop level, and the standard scheduler implementation should work with any conforming event loop implementation. (In fact this is an important litmus test for conforming implementations.)
For interoperability between code written using coroutines and other async frameworks, the scheduler defines a Task class that behaves like a Future. A framework that interoperates at the event loop level can wait for a Future to complete by adding a callback to the Future. Likewise, the scheduler offers an operation to suspend a coroutine until a callback is called.
The event loop API provides limited interoperability with threads: there is an API to submit a function to an executor (see PEP 3148) which returns a Future that is compatible with the event loop, and there is a method to schedule a callback with an event loop from another thread in a thread-safe manner.
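The thread-safe scheduling path can be sketched with today's asyncio names (which postdate some drafts of this PEP; new_event_loop() is described later in this specification). A worker thread hands two callbacks to the loop's thread via call_soon_threadsafe():

```python
import asyncio
import threading

loop = asyncio.new_event_loop()
results = []

def in_worker_thread():
    # call_soon_threadsafe() is the only loop method that may be
    # invoked from a thread other than the one running the loop.
    loop.call_soon_threadsafe(results.append, "from-thread")
    loop.call_soon_threadsafe(loop.stop)

t = threading.Thread(target=in_worker_thread)
t.start()
loop.run_forever()   # wakes up when the callbacks arrive, stops on stop()
t.join()
loop.close()
```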
Transports and Protocols
For those not familiar with Twisted, a quick explanation of the relationship between transports and protocols is in order. At the highest level, the transport is concerned with how bytes are transmitted, while the protocol determines which bytes to transmit (and to some extent when).
A different way of saying the same thing: a transport is an abstraction for a socket (or similar I/O endpoint) while a protocol is an abstraction for an application, from the transport's point of view.
Yet another view is simply that the transport and protocol interfaces together define an abstract interface for using network I/O and interprocess I/O.
There is almost always a 1:1 relationship between transport and protocol objects: the protocol calls transport methods to send data, while the transport calls protocol methods to pass it data that has been received. Neither transport nor protocol methods "block" -- they set events into motion and then return.
The most common type of transport is a bidirectional stream transport. It represents a pair of buffered streams (one in each direction) that each transmit a sequence of bytes. The most common example of a bidirectional stream transport is probably a TCP connection. Another common example is an SSL/TLS connection. But there are some other things that can be viewed this way, for example an SSH session or a pair of UNIX pipes. Typically there aren't many different transport implementations, and most of them come with the event loop implementation. However, there is no requirement that all transports must be created by calling an event loop method: a third party module may well implement a new transport and provide a constructor or factory function for it that simply takes an event loop as an argument or calls get_event_loop().
Note that transports don't need to use sockets, not even if they use TCP -- sockets are a platform-specific implementation detail.
A bidirectional stream transport has two "ends": one end talks to the network (or another process, or whatever low-level interface it wraps), and the other end talks to the protocol. The former uses whatever API is necessary to implement the transport; but the interface between transport and protocol is standardized by this PEP.
A protocol can represent some kind of "application-level" protocol such as HTTP or SMTP; it can also implement an abstraction shared by multiple protocols, or a whole application. A protocol's primary interface is with the transport. While some popular protocols (and other abstractions) may have standard implementations, often applications implement custom protocols. It also makes sense to have libraries of useful third party protocol implementations that can be downloaded and installed from PyPI.
The general notion of transport and protocol includes other interfaces, where the transport wraps some other communication abstraction. Examples include interfaces for sending and receiving datagrams (e.g. UDP), or a subprocess manager. The separation of concerns is the same as for bidirectional stream transports and protocols, but the specific interface between transport and protocol is different in each case.
Details of the interfaces defined by the various standard types of transports and protocols are given later.
Event Loop Interface Specification
Event Loop Policy: Getting and Setting the Current Event Loop
Event loop management is controlled by an event loop policy, which is a global (per-process) object. There is a default policy, and an API to change the policy. A policy defines the notion of context; a policy manages a separate event loop per context. The default policy's notion of context is defined as the current thread.
Certain platforms or programming frameworks may change the default policy to something more suitable to the expectations of the users of that platform or framework. Such platforms or frameworks must document their policy and at what point during their initialization sequence the policy is set, in order to avoid undefined behavior when multiple active frameworks want to override the default policy. (See also "Embedded Event Loops" below.)
To get the event loop for current context, use get_event_loop(). This returns an event loop object implementing the interface specified below, or raises an exception in case no event loop has been set for the current context and the current policy does not specify to create one. It should never return None.
To set the event loop for the current context, use set_event_loop(event_loop), where event_loop is an event loop object, i.e. an instance of AbstractEventLoop, or None. It is okay to set the current event loop to None, in which case subsequent calls to get_event_loop() will raise an exception. This is useful for testing code that should not depend on the existence of a default event loop.
It is expected that get_event_loop() returns a different event loop object depending on the context (in fact, this is the definition of context). It may create a new event loop object if none is set and creation is allowed by the policy. The default policy will create a new event loop only in the main thread (as defined by threading.py, which uses a special subclass for the main thread), and only if get_event_loop() is called before set_event_loop() is ever called. (To reset this state, reset the policy.) In other threads an event loop must be explicitly set. Other policies may behave differently. Event loop creation by the default policy is lazy; i.e. the first call to get_event_loop() creates an event loop instance if necessary and if allowed by the current policy.
For the benefit of unit tests and other special cases there's a third policy function: new_event_loop(), which creates and returns a new event loop object according to the policy's default rules. To make this the current event loop, you must call set_event_loop() with it.
To change the event loop policy, call set_event_loop_policy(policy), where policy is an event loop policy object or None. If not None, the policy object must be an instance of AbstractEventLoopPolicy that defines methods get_event_loop(), set_event_loop(loop) and new_event_loop(), all behaving like the functions described above.
Passing a policy value of None restores the default event loop policy (overriding the alternate default set by the platform or framework). The default event loop policy is an instance of the class DefaultEventLoopPolicy. The current event loop policy object can be retrieved by calling get_event_loop_policy().
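The policy functions described above are all reachable through the asyncio package in modern Python; a minimal sketch, using asyncio.sleep() only as a convenient stand-in coroutine:

```python
import asyncio

policy = asyncio.get_event_loop_policy()   # the process-global policy
loop = policy.new_event_loop()             # a fresh loop, not yet "current"

try:
    # run_until_complete() works on any conforming loop,
    # whether or not it has been installed with set_event_loop().
    result = loop.run_until_complete(asyncio.sleep(0, result=42))
finally:
    loop.close()
```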
Passing an Event Loop Around Explicitly
It is possible to write code that uses an event loop without relying on a global or per-thread default event loop. For this purpose, all APIs that need access to the current event loop (and aren't methods on an event loop) take an optional keyword argument named loop. If this argument is None or unspecified, such APIs will call get_event_loop() to get the default event loop, but if the loop keyword argument is set to an event loop object, they will use that event loop, and pass it along to any other such APIs they call. For example, Future(loop=my_loop) will create a Future tied to the event loop my_loop. When the default current event loop is None, the loop keyword argument is effectively mandatory.
Note that an explicitly passed event loop must still belong to the current thread; the loop keyword argument does not magically change the constraints on how an event loop can be used.
Specifying Times
As usual in Python, all timeouts, intervals and delays are measured in seconds, and may be ints or floats. However, absolute times are not specified as POSIX timestamps. The accuracy, precision and epoch of the clock are up to the implementation.
The default implementation uses time.monotonic(). Books could be written about the implications of this choice. Better read the docs for the standard library time module.
Embedded Event Loops
On some platforms an event loop is provided by the system. Such a loop may already be running when the user code starts, and there may be no way to stop or close it without exiting from the program. In this case, the methods for starting, stopping and closing the event loop may not be implementable, and is_running() may always return True.
Event Loop Classes
There is no actual class named EventLoop. There is an AbstractEventLoop class which defines all the methods without implementations, and serves primarily as documentation. The following concrete classes are defined:
- SelectorEventLoop is a concrete implementation of the full API based on the selectors module (new in Python 3.4). The constructor takes one optional argument, a selectors.Selector object. By default an instance of selectors.DefaultSelector is created and used.
- ProactorEventLoop is a concrete implementation of the API except for the I/O event handling and signal handling methods. It is only defined on Windows (or on other platforms which support a similar API for "overlapped I/O"). The constructor takes one optional argument, a Proactor object. By default an instance of IocpProactor is created and used. (The IocpProactor class is not specified by this PEP; it is just an implementation detail of the ProactorEventLoop class.)
Event Loop Methods Overview
The methods of a conforming event loop are grouped into several categories. The first set of categories must be supported by all conforming event loop implementations, with the exception that embedded event loops may not implement the methods for starting, stopping and closing. (However, a partially-conforming event loop is still better than nothing. :-)
- Starting, stopping and closing: run_forever(), run_until_complete(), stop(), is_running(), close().
- Basic and timed callbacks: call_soon(), call_later(), call_at(), time().
- Thread interaction: call_soon_threadsafe(), run_in_executor(), set_default_executor().
- Internet name lookups: getaddrinfo(), getnameinfo().
- Internet connections: create_connection(), create_server(), create_datagram_endpoint().
- Wrapped socket methods: sock_recv(), sock_sendall(), sock_connect(), sock_accept().
The second set of categories may be supported by conforming event loop implementations. If not supported, they will raise NotImplementedError. (In the default implementation, SelectorEventLoop on UNIX systems supports all of these; SelectorEventLoop on Windows supports the I/O event handling category; ProactorEventLoop on Windows supports the pipes and subprocess category.)
- I/O callbacks: add_reader(), remove_reader(), add_writer(), remove_writer().
- Pipes and subprocesses: connect_read_pipe(), connect_write_pipe(), subprocess_shell(), subprocess_exec().
- Signal callbacks: add_signal_handler(), remove_signal_handler().
Event Loop Methods
Starting, Stopping and Closing
An (unclosed) event loop can be in one of two states: running or stopped. These methods deal with starting and stopping an event loop:
- run_forever(). Runs the event loop until stop() is called. This cannot be called when the event loop is already running. (This has a long name in part to avoid confusion with earlier versions of this PEP, where run() had different behavior, in part because there are already too many APIs that have a method named run(), and in part because there shouldn't be many places where this is called anyway.)
- run_until_complete(future). Runs the event loop until the Future is done. If the Future is done, its result is returned, or its exception is raised. This cannot be called when the event loop is already running.
- stop(). Stops the event loop as soon as it is convenient. It is fine to restart the loop with run_forever() or run_until_complete() subsequently; no scheduled callbacks will be lost if this is done. Note: stop() returns normally and the current callback is allowed to continue. How soon after this point the event loop stops is up to the implementation, but the intention is to stop short of polling for I/O, and not to run any callbacks scheduled in the future; the major freedom an implementation has is how much of the "ready queue" (callbacks already scheduled with call_soon()) it processes before stopping.
- is_running(). Returns True if the event loop is currently running, False if it is stopped.
- close(). Closes the event loop, releasing any resources it may hold, such as the file descriptor used by epoll() or kqueue(), and the default executor. This should not be called while the event loop is running. After it has been called the event loop should not be used again. It may be called multiple times; subsequent calls are no-ops.
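The lifecycle described in the list above can be exercised end to end; this sketch uses the modern asyncio names and a self-rescheduling call_soon() chain purely for illustration:

```python
import asyncio

loop = asyncio.new_event_loop()
calls = []

def step(n):
    calls.append(n)
    if n == 3:
        loop.stop()          # run_forever() returns after this callback
    else:
        loop.call_soon(step, n + 1)

loop.call_soon(step, 1)
loop.run_forever()           # runs steps 1..3, then stops
loop.close()                 # releases the loop's resources
```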
Basic Callbacks
Callbacks associated with the same event loop are strictly serialized: one callback must finish before the next one will be called. This is an important guarantee: when two or more callbacks use or modify shared state, each callback is guaranteed that while it is running, the shared state isn't changed by another callback.
- call_soon(callback, *args). This schedules a callback to be called as soon as possible. Returns a Handle (see below) representing the callback, whose cancel() method can be used to cancel the callback. It guarantees that callbacks are called in the order in which they were scheduled.
- call_later(delay, callback, *args). Arrange for callback(*args) to be called approximately delay seconds in the future, once, unless cancelled. Returns a Handle representing the callback, whose cancel() method can be used to cancel the callback. Callbacks scheduled in the past or at exactly the same time will be called in an undefined order.
- call_at(when, callback, *args). This is like call_later(), but the time is expressed as an absolute time. Returns a similar Handle. There is a simple equivalency: loop.call_later(delay, callback, *args) is the same as loop.call_at(loop.time() + delay, callback, *args).
- time(). Returns the current time according to the event loop's clock. This may be time.time() or time.monotonic() or some other system-specific clock, but it must return a float expressing the time in units of approximately one second since some epoch. (No clock is perfect -- see PEP 418.)
Note: A previous version of this PEP defined a method named call_repeatedly(), which promised to call a callback at regular intervals. This has been withdrawn because the design of such a function is overspecified. On the one hand, a simple timer loop can easily be emulated using a callback that reschedules itself using call_later(); it is also easy to write a coroutine containing a loop and a sleep() call (a top-level function in the module, see below). On the other hand, due to the complexities of accurate timekeeping there are many traps and pitfalls here for the unwary (see PEP 418), and different use cases require different behavior in edge cases. It is impossible to offer an API for this purpose that is bullet-proof in all cases, so it is deemed better to let application designers decide for themselves what kind of timer loop to implement.
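As an illustration of the self-rescheduling approach, here is a minimal sketch (not part of the PEP) of a repeating timer built from call_later(). The names tick and ticks are illustrative only:

```python
import asyncio

# A callback that reschedules itself with call_later() until it has fired
# three times, then stops the loop.
loop = asyncio.new_event_loop()
ticks = []

def tick():
    ticks.append(loop.time())       # record the loop's clock at each firing
    if len(ticks) < 3:
        loop.call_later(0.01, tick)  # reschedule ourselves
    else:
        loop.stop()

loop.call_soon(tick)
loop.run_forever()
loop.close()
```

Whether this, a coroutine with sleep(), or something more elaborate is appropriate depends on how the application wants to handle drift and missed deadlines.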
Thread interaction
- call_soon_threadsafe(callback, *args). Like call_soon(callback, *args), but when called from another thread while the event loop is blocked waiting for I/O, unblocks the event loop. Returns a Handle. This is the only method that is safe to call from another thread. (To schedule a callback for a later time in a threadsafe manner, you can use loop.call_soon_threadsafe(loop.call_later, when, callback, *args).) Note: this is not safe to call from a signal handler (since it may use locks). In fact, no API is signal-safe; if you want to handle signals, use add_signal_handler() described below.
- run_in_executor(executor, callback, *args). Arrange to call callback(*args) in an executor (see PEP 3148). Returns an asyncio.Future instance whose result on success is the return value of that call. This is equivalent to wrap_future(executor.submit(callback, *args)). If executor is None, the default executor set by set_default_executor() is used. If no default executor has been set yet, a ThreadPoolExecutor with a default number of threads is created and set as the default executor. (The default implementation uses 5 threads in this case.)
- set_default_executor(executor). Set the default executor used by run_in_executor(). The argument must be a PEP 3148 Executor instance or None, in order to reset the default executor.
See also the wrap_future() function described in the section about Futures.
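The following sketch (not part of the PEP) shows run_in_executor() together with call_soon_threadsafe(); blocking_work and results are illustrative names, and modern async def syntax is used in place of the PEP-era yield from style:

```python
import asyncio
import threading

loop = asyncio.new_event_loop()
results = []

def blocking_work(x):
    # Runs in a worker thread of the default executor; it must not touch the
    # loop directly -- call_soon_threadsafe() hands work back to the loop.
    loop.call_soon_threadsafe(results.append, ('from-thread', threading.get_ident()))
    return x * 2

async def main():
    # run_in_executor() returns a Future whose result is blocking_work's
    # return value.
    value = await loop.run_in_executor(None, blocking_work, 21)
    results.append(('result', value))

loop.run_until_complete(main())
loop.close()
```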
Internet name lookups
These methods are useful if you want to connect or bind a socket to an address without the risk of blocking for the name lookup. They are usually called implicitly by create_connection(), create_server() or create_datagram_endpoint().
getaddrinfo(host, port, family=0, type=0, proto=0, flags=0). Similar to the socket.getaddrinfo() function but returns a Future. The Future's result on success will be a list of the same format as returned by socket.getaddrinfo(), i.e. a list of (address_family, socket_type, socket_protocol, canonical_name, address) where address is a 2-tuple (ipv4_address, port) for IPv4 addresses and a 4-tuple (ipv6_address, port, flow_info, scope_id) for IPv6 addresses. If the family argument is zero or unspecified, the list returned may contain a mixture of IPv4 and IPv6 addresses; otherwise the addresses returned are constrained by the family value (similar for proto and flags). The default implementation calls socket.getaddrinfo() using run_in_executor(), but other implementations may choose to implement their own DNS lookup. The optional arguments must be specified as keyword arguments.
Note: implementations are allowed to implement a subset of the full socket.getaddrinfo() interface; e.g. they may not support symbolic port names, or they may ignore or incompletely implement the type, proto and flags arguments. However, if type and proto are ignored, the argument values passed in should be copied unchanged into the return tuples' socket_type and socket_protocol elements. (You can't ignore family, since IPv4 and IPv6 addresses must be looked up differently. The only permissible values for family are socket.AF_UNSPEC (0), socket.AF_INET and socket.AF_INET6, and the latter only if it is defined by the platform.)
getnameinfo(sockaddr, flags=0). Similar to socket.getnameinfo() but returns a Future. The Future's result on success will be a tuple (host, port). Same implementation remarks as for getaddrinfo().
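A quick sketch (not part of the PEP) of the return format, resolving localhost with the family constrained to IPv4:

```python
import asyncio
import socket

loop = asyncio.new_event_loop()
# loop.getaddrinfo() runs the lookup without blocking the event loop.
infos = loop.run_until_complete(
    loop.getaddrinfo('localhost', 80,
                     family=socket.AF_INET, type=socket.SOCK_STREAM))
loop.close()

# Each entry has the socket.getaddrinfo() shape:
# (family, type, proto, canonname, sockaddr)
family, socktype, proto, canonname, sockaddr = infos[0]
```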
Internet connections
These are the high-level interfaces for managing internet connections. Their use is recommended over the corresponding lower-level interfaces because they abstract away the differences between selector-based and proactor-based event loops.
Note that the client and server side of stream connections use the same transport and protocol interface. However, datagram endpoints use a different transport and protocol interface.
create_connection(protocol_factory, host, port, <options>). Creates a stream connection to a given internet host and port. This is a task that is typically called from the client side of the connection. It creates an implementation-dependent bidirectional stream Transport to represent the connection, then calls protocol_factory() to instantiate (or retrieve) the user's Protocol implementation, and finally ties the two together. (See below for the definitions of Transport and Protocol.) The user's Protocol implementation is created or retrieved by calling protocol_factory() without arguments(*). The coroutine's result on success is the (transport, protocol) pair; if a failure prevents the creation of a successful connection, an appropriate exception will be raised. Note that when the coroutine completes, the protocol's connection_made() method has not yet been called; that will happen when the connection handshake is complete.
(*) There is no requirement that protocol_factory is a class. If your protocol class needs to have specific arguments passed to its constructor, you can use lambda. You can also pass a trivial lambda that returns a previously constructed Protocol instance.
The <options> are all specified using optional keyword arguments:
- ssl: Pass True to create an SSL/TLS transport (by default a plain TCP transport is created). Or pass an ssl.SSLContext object to override the default SSL context object to be used. If a default context is created it is up to the implementation to configure reasonable defaults. The reference implementation currently uses PROTOCOL_SSLv23 and sets the OP_NO_SSLv2 option, calls set_default_verify_paths() and sets verify_mode to CERT_REQUIRED. In addition, whenever the context (default or otherwise) specifies a verify_mode of CERT_REQUIRED or CERT_OPTIONAL, if a hostname is given, immediately after a successful handshake ssl.match_hostname(peercert, hostname) is called, and if this raises an exception the connection is closed. (To avoid this behavior, pass in an SSL context that has verify_mode set to CERT_NONE. But this means you are not secure, and vulnerable to for example man-in-the-middle attacks.)
- family, proto, flags: Address family, protocol and flags to be passed through to getaddrinfo(). These all default to 0, which means "not specified". (The socket type is always SOCK_STREAM.) If any of these values are not specified, the getaddrinfo() method will choose appropriate values. Note: proto has nothing to do with the high-level Protocol concept or the protocol_factory argument.
- sock: An optional socket to be used instead of using the host, port, family, proto and flags arguments. If this is given, host and port must be explicitly set to None.
- local_addr: If given, a (host, port) tuple used to bind the socket locally. This is rarely needed, but on multi-homed servers you occasionally need to force a connection to come from a specific address; this is how you would do that. The host and port are looked up using getaddrinfo().
- server_hostname: This is only relevant when using SSL/TLS; it should not be used when ssl is not set. When ssl is set, this sets or overrides the hostname that will be verified. By default the value of the host argument is used. If host is empty, there is no default and you must pass a value for server_hostname. To disable hostname verification (which is a serious security risk) you must pass an empty string here and pass an ssl.SSLContext object whose verify_mode is set to ssl.CERT_NONE as the ssl argument.
create_server(protocol_factory, host, port, <options>). Enters a serving loop that accepts connections. This is a coroutine that completes once the serving loop is set up to serve. The return value is a Server object which can be used to stop the serving loop in a controlled fashion (see below). Multiple sockets may be bound if the specified address allows both IPv4 and IPv6 connections.
Each time a connection is accepted, protocol_factory is called without arguments(**) to create a Protocol, a bidirectional stream Transport is created to represent the network side of the connection, and the two are tied together by calling protocol.connection_made(transport).
(**) See previous footnote for create_connection(). However, since protocol_factory() is called once for each new incoming connection, it should return a new Protocol object each time it is called.
The <options> are all specified using optional keyword arguments:
- ssl: Pass an ssl.SSLContext object (or an object with the same interface) to override the default SSL context object to be used. (Unlike for create_connection(), passing True does not make sense here -- the SSLContext object is needed to specify the certificate and key.)
- backlog: Backlog value to be passed to the listen() call. The default is implementation-dependent; in the default implementation the default value is 100.
- reuse_address: Whether to set the SO_REUSEADDR option on the socket. The default is True on UNIX, False on Windows.
- family, flags: Address family and flags to be passed through to getaddrinfo(). The family defaults to AF_UNSPEC; the flags default to AI_PASSIVE. (The socket type is always SOCK_STREAM; the socket protocol is always set to 0, to let getaddrinfo() choose.)
- sock: An optional socket to be used instead of using the host, port, family and flags arguments. If this is given, host and port must be explicitly set to None.
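The interplay of create_server(), create_connection() and the (transport, protocol) pairs they produce can be sketched as a small echo exchange. This is an illustration, not part of the PEP; EchoServerProtocol, EchoClientProtocol and received are made-up names, and modern async def syntax stands in for the PEP-era yield from style:

```python
import asyncio

loop = asyncio.new_event_loop()
received = []

class EchoServerProtocol(asyncio.Protocol):
    # One instance is created per accepted connection.
    def connection_made(self, transport):
        self.transport = transport
    def data_received(self, data):
        self.transport.write(data)        # echo back to the client

class EchoClientProtocol(asyncio.Protocol):
    def __init__(self, done):
        self.done = done
    def connection_made(self, transport):
        transport.write(b'ping')          # send as soon as connected
    def data_received(self, data):
        received.append(data)
        self.done.set_result(None)

async def main():
    # Bind to port 0 so the OS picks a free port.
    server = await loop.create_server(EchoServerProtocol, '127.0.0.1', 0)
    port = server.sockets[0].getsockname()[1]
    done = loop.create_future()
    transport, protocol = await loop.create_connection(
        lambda: EchoClientProtocol(done), '127.0.0.1', port)
    await done
    transport.close()
    server.close()            # stop accepting new connections...
    await server.wait_closed()  # ...and wait for the serving loop to finish

loop.run_until_complete(main())
loop.close()
```

Note the lambda passed as protocol_factory for the client, as described in the footnote above: it lets the protocol constructor receive extra arguments.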
create_datagram_endpoint(protocol_factory, local_addr=None, remote_addr=None, <options>). Creates an endpoint for sending and receiving datagrams (typically UDP packets). Because of the nature of datagram traffic, there are no separate calls to set up client and server side, since usually a single endpoint acts as both client and server. This is a coroutine that returns a (transport, protocol) pair on success, or raises an exception on failure. If the coroutine returns successfully, the transport will call callbacks on the protocol whenever a datagram is received or the socket is closed; it is up to the protocol to call methods on the transport to send datagrams. The transport returned is a DatagramTransport. The protocol returned is a DatagramProtocol. These are described later.
Mandatory positional argument:
- protocol_factory: A class or factory function that will be called exactly once, without arguments, to construct the protocol object to be returned. The interface between datagram transport and protocol is described below.
Optional arguments that may be specified positionally or as keyword arguments:
- local_addr: An optional tuple indicating the address to which the socket will be bound. If given this must be a (host, port) pair. It will be passed to getaddrinfo() to be resolved and the result will be passed to the bind() method of the socket created. If getaddrinfo() returns more than one address, they will be tried in turn. If omitted, no bind() call will be made.
- remote_addr: An optional tuple indicating the address to which the socket will be "connected". (Since there is no such thing as a datagram connection, this just specifies a default value for the destination address of outgoing datagrams.) If given this must be a (host, port) pair. It will be passed to getaddrinfo() to be resolved and the result will be passed to sock_connect() together with the socket created. If getaddrinfo() returns more than one address, they will be tried in turn. If omitted, no sock_connect() call will be made.
The <options> are all specified using optional keyword arguments:
- family, proto, flags: Address family, protocol and flags to be passed through to getaddrinfo(). These all default to 0, which means "not specified". (The socket type is always SOCK_DGRAM.) If any of these values are not specified, the getaddrinfo() method will choose appropriate values.
Note that if both local_addr and remote_addr are present, all combinations of local and remote addresses with matching address family will be tried.
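A datagram round trip over the loopback interface, sketched for illustration (not part of the PEP; ServerProto, ClientProto and got are made-up names). The server endpoint uses local_addr, the client uses remote_addr so that sendto() needs no explicit destination:

```python
import asyncio

loop = asyncio.new_event_loop()
got = []

class ServerProto(asyncio.DatagramProtocol):
    def connection_made(self, transport):
        self.transport = transport
    def datagram_received(self, data, addr):
        got.append(data)
        self.transport.sendto(b'pong', addr)   # reply to the sender

class ClientProto(asyncio.DatagramProtocol):
    def __init__(self, done):
        self.done = done
    def connection_made(self, transport):
        transport.sendto(b'ping')   # remote_addr supplies the destination
    def datagram_received(self, data, addr):
        got.append(data)
        self.done.set_result(None)

async def main():
    s_tr, _ = await loop.create_datagram_endpoint(
        ServerProto, local_addr=('127.0.0.1', 0))
    port = s_tr.get_extra_info('sockname')[1]
    done = loop.create_future()
    c_tr, _ = await loop.create_datagram_endpoint(
        lambda: ClientProto(done), remote_addr=('127.0.0.1', port))
    await done
    c_tr.close()
    s_tr.close()

loop.run_until_complete(main())
loop.close()
```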
Wrapped Socket Methods
The following methods for doing async I/O on sockets are not for general use. They are primarily meant for transport implementations working with IOCP through the ProactorEventLoop class. However, they are easily implementable for other event loop types, so there is no reason not to require them. The socket argument has to be a non-blocking socket.
- sock_recv(sock, n). Receive up to n bytes from socket sock. Returns a Future whose result on success will be a bytes object.
- sock_sendall(sock, data). Send bytes data to socket sock. Returns a Future whose result on success will be None. Note: the name uses sendall instead of send, to reflect that the semantics and signature of this method echo those of the standard library socket method sendall() rather than send().
- sock_connect(sock, address). Connect to the given address. Returns a Future whose result on success will be None.
- sock_accept(sock). Accept a connection from a socket. The socket must be in listening mode and bound to an address. Returns a Future whose result on success will be a tuple (conn, peer) where conn is a connected non-blocking socket and peer is the peer address.
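The four wrapped socket methods compose as follows; this is a sketch for illustration only, assuming a selector-based loop on a platform where all four are implemented. The sockets must be non-blocking, as noted above:

```python
import asyncio
import socket

loop = asyncio.new_event_loop()

# A listening socket bound to an OS-chosen port; must be non-blocking.
listener = socket.socket()
listener.bind(('127.0.0.1', 0))
listener.listen(1)
listener.setblocking(False)

client = socket.socket()
client.setblocking(False)

async def main():
    await loop.sock_connect(client, listener.getsockname())
    conn, peer = await loop.sock_accept(listener)   # (socket, peer address)
    await loop.sock_sendall(client, b'hello')
    data = await loop.sock_recv(conn, 1024)
    conn.close()
    return data

data = loop.run_until_complete(main())
client.close()
listener.close()
loop.close()
```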
I/O Callbacks
These methods are primarily meant for transport implementations working with a selector. They are implemented by SelectorEventLoop but not by ProactorEventLoop. Custom event loop implementations may or may not implement them.
The fd arguments below may be integer file descriptors, or "file-like" objects with a fileno() method that wrap integer file descriptors. Not all file-like objects or file descriptors are acceptable. Sockets (and socket file descriptors) are always accepted. On Windows no other types are supported. On UNIX, pipes and possibly tty devices are also supported, but disk files are not. Exactly which special file types are supported may vary by platform and per selector implementation. (Experimentally, there is at least one kind of pseudo-tty on OS X that is supported by select and poll but not by kqueue: it is used by Emacs shell windows.)
- add_reader(fd, callback, *args). Arrange for callback(*args) to be called whenever file descriptor fd is deemed ready for reading. Calling add_reader() again for the same file descriptor implies a call to remove_reader() for the same file descriptor.
- add_writer(fd, callback, *args). Like add_reader(), but registers the callback for writing instead of for reading.
- remove_reader(fd). Cancels the current read callback for file descriptor fd, if one is set. If no callback is currently set for the file descriptor, this is a no-op and returns False. Otherwise, it removes the callback arrangement and returns True.
- remove_writer(fd). This is to add_writer() as remove_reader() is to add_reader().
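A sketch of add_reader() and remove_reader() using a socket pair (illustrative only; assumes a selector-based loop, where these methods are implemented):

```python
import asyncio
import socket

loop = asyncio.new_event_loop()
rsock, wsock = socket.socketpair()
rsock.setblocking(False)
state = {}

def on_readable():
    # Called by the loop whenever rsock is ready for reading.
    state['data'] = rsock.recv(1024)
    # Deregister; returns True because a callback was set.
    state['removed'] = loop.remove_reader(rsock.fileno())
    loop.stop()

loop.add_reader(rsock.fileno(), on_readable)
wsock.send(b'ready')       # makes rsock readable
loop.run_forever()
rsock.close()
wsock.close()
loop.close()
```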
Pipes and Subprocesses
These methods are supported by SelectorEventLoop on UNIX and ProactorEventLoop on Windows.
The transports and protocols used with pipes and subprocesses differ from those used with regular stream connections. These are described later.
Each of the methods below has a protocol_factory argument, similar to create_connection(); this will be called exactly once, without arguments, to construct the protocol object to be returned.
Each method is a coroutine that returns a (transport, protocol) pair on success, or raises an exception on failure.
- connect_read_pipe(protocol_factory, pipe): Create a unidirectional stream connection from a file-like object wrapping the read end of a UNIX pipe, which must be in non-blocking mode. The transport returned is a ReadTransport.
- connect_write_pipe(protocol_factory, pipe): Create a unidirectional stream connection from a file-like object wrapping the write end of a UNIX pipe, which must be in non-blocking mode. The transport returned is a WriteTransport; it does not have any read-related methods. The protocol returned is a BaseProtocol.
- subprocess_shell(protocol_factory, cmd, <options>): Create a subprocess from cmd, which is a string using the platform's "shell" syntax. This is similar to the standard library subprocess.Popen() class called with shell=True. The remaining arguments and return value are described below.
- subprocess_exec(protocol_factory, *args, <options>): Create a subprocess from one or more string arguments, where the first string specifies the program to execute, and the remaining strings specify the program's arguments. (Thus, together the string arguments form the sys.argv value of the program, assuming it is a Python script.) This is similar to the standard library subprocess.Popen() class called with shell=False and the list of strings passed as the first argument; however, where Popen() takes a single argument which is a list of strings, subprocess_exec() takes multiple string arguments. The remaining arguments and return value are described below.
Apart from the way the program to execute is specified, the two subprocess_*() methods behave the same. The transport returned is a SubprocessTransport which has a different interface than the common bidirectional stream transport. The protocol returned is a SubprocessProtocol which also has a custom interface.
The <options> are all specified using optional keyword arguments:
- stdin: Either a file-like object representing the pipe to be connected to the subprocess's standard input stream using connect_write_pipe(), or the constant subprocess.PIPE (the default). By default a new pipe will be created and connected.
- stdout: Either a file-like object representing the pipe to be connected to the subprocess's standard output stream using connect_read_pipe(), or the constant subprocess.PIPE (the default). By default a new pipe will be created and connected.
- stderr: Either a file-like object representing the pipe to be connected to the subprocess's standard error stream using connect_read_pipe(), or one of the constants subprocess.PIPE (the default) or subprocess.STDOUT. By default a new pipe will be created and connected. When subprocess.STDOUT is specified, the subprocess's standard error stream will be connected to the same pipe as the standard output stream.
- bufsize: The buffer size to be used when creating a pipe; this is passed to subprocess.Popen(). In the default implementation this defaults to zero, and on Windows it must be zero; these defaults deviate from subprocess.Popen().
- executable, preexec_fn, close_fds, cwd, env, startupinfo, creationflags, restore_signals, start_new_session, pass_fds: These optional arguments are passed to subprocess.Popen() without interpretation.
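A sketch of subprocess_exec() with a minimal SubprocessProtocol that collects the child's stdout (illustrative only; CollectProtocol and the two futures are made-up names, and modern async def syntax is used). Waiting for both process_exited() and the stdout pipe_connection_lost() avoids racing the final pipe data against process exit:

```python
import asyncio
import sys

loop = asyncio.new_event_loop()
output = []

class CollectProtocol(asyncio.SubprocessProtocol):
    def __init__(self, exited, stdout_closed):
        self.exited = exited
        self.stdout_closed = stdout_closed
    def pipe_data_received(self, fd, data):
        if fd == 1:                  # fd 1 is the child's stdout
            output.append(data)
    def pipe_connection_lost(self, fd, exc):
        if fd == 1:
            self.stdout_closed.set_result(None)
    def process_exited(self):
        self.exited.set_result(None)

async def main():
    exited = loop.create_future()
    stdout_closed = loop.create_future()
    transport, protocol = await loop.subprocess_exec(
        lambda: CollectProtocol(exited, stdout_closed),
        sys.executable, '-c', 'print("hi")')
    await exited          # child has terminated...
    await stdout_closed   # ...and its stdout pipe is fully drained
    rc = transport.get_returncode()
    transport.close()
    return rc

rc = loop.run_until_complete(main())
loop.close()
```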
Signal callbacks
These methods are only supported on UNIX.
- add_signal_handler(sig, callback, *args). Whenever signal sig is received, arrange for callback(*args) to be called. Specifying another callback for the same signal replaces the previous handler (only one handler can be active per signal). The sig must be a valid signal number defined in the signal module. If the signal cannot be handled this raises an exception: ValueError if it is not a valid signal or if it is an uncatchable signal (e.g. SIGKILL), RuntimeError if this particular event loop instance cannot handle signals (since signals are global per process, only an event loop associated with the main thread can handle signals).
- remove_signal_handler(sig). Removes the handler for signal sig, if one is set. Raises the same exceptions as add_signal_handler() (except that it may return False instead of raising RuntimeError for uncatchable signals). Returns True if a handler was removed successfully, False if no handler was set.
Note: If these methods are statically known to be unsupported, they may raise NotImplementedError instead of RuntimeError.
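A sketch of the signal API (illustrative only; UNIX-specific, and it assumes the loop is associated with the main thread as required above). The process signals itself with SIGUSR1 once the loop is running:

```python
import asyncio
import os
import signal

loop = asyncio.new_event_loop()
caught = []

def on_usr1():
    caught.append('SIGUSR1')
    loop.stop()

loop.add_signal_handler(signal.SIGUSR1, on_usr1)
# Deliver the signal from within the running loop.
loop.call_soon(os.kill, os.getpid(), signal.SIGUSR1)
loop.run_forever()
removed = loop.remove_signal_handler(signal.SIGUSR1)  # True: a handler was set
loop.close()
```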
Mutual Exclusion of Callbacks
An event loop should enforce mutual exclusion of callbacks, i.e. it should never start a callback while a previous callback is still running. This should apply across all types of callbacks, regardless of whether they are scheduled using call_soon(), call_later(), call_at(), call_soon_threadsafe(), add_reader(), add_writer(), or add_signal_handler().
Exceptions
There are two categories of exceptions in Python: those that derive from the Exception class and those that derive only from BaseException. Exceptions deriving from Exception will generally be caught and handled appropriately; for example, they will be passed through by Futures, and they will be logged and ignored when they occur in a callback.
However, exceptions deriving only from BaseException are typically not caught, and will usually cause the program to terminate with a traceback. In some cases they are caught and re-raised. (Examples of this category include KeyboardInterrupt and SystemExit; it is usually unwise to treat these the same as most other exceptions.)
Handles
The various methods for registering one-off callbacks (call_soon(), call_later(), call_at() and call_soon_threadsafe()) all return an object representing the registration that can be used to cancel the callback. This object is called a Handle. Handles are opaque and have only one public method:
- cancel(): Cancel the callback.
Note that add_reader(), add_writer() and add_signal_handler() do not return Handles.
Servers
The create_server() method returns a Server instance, which wraps the sockets (or other network objects) used to accept requests. This class has two public methods:
- close(): Close the service. This stops accepting new requests but does not cancel requests that have already been accepted and are currently being handled.
- wait_closed(): A coroutine that blocks until the service is closed and all accepted requests have been handled.
Futures
The asyncio.Future class here is intentionally similar to the concurrent.futures.Future class specified by PEP 3148, but there are slight differences. Whenever this PEP talks about Futures or futures this should be understood to refer to asyncio.Future unless concurrent.futures.Future is explicitly mentioned. The supported public API is as follows, indicating the differences with PEP 3148:
- cancel(). If the Future is already done (or cancelled), do nothing and return False. Otherwise, this attempts to cancel the Future and returns True. If the cancellation attempt is successful, eventually the Future's state will change to cancelled (so that cancelled() will return True) and the callbacks will be scheduled. For regular Futures, cancellation will always succeed immediately; but for Tasks (see below) the task may ignore or delay the cancellation attempt.
- cancelled(). Returns True if the Future was successfully cancelled.
- done(). Returns True if the Future is done. Note that a cancelled Future is considered done too (here and everywhere).
- result(). Returns the result set with set_result(), or raises the exception set with set_exception(). Raises CancelledError if cancelled. Difference with PEP 3148: This has no timeout argument and does not wait; if the future is not yet done, it raises an exception.
- exception(). Returns the exception if set with set_exception(), or None if a result was set with set_result(). Raises CancelledError if cancelled. Difference with PEP 3148: This has no timeout argument and does not wait; if the future is not yet done, it raises an exception.
- add_done_callback(fn). Add a callback to be run when the Future becomes done (or is cancelled). If the Future is already done (or cancelled), the callback is scheduled using call_soon(). Difference with PEP 3148: The callback is never called immediately, and always in the context of the caller -- typically this is a thread. You can think of this as calling the callback through call_soon(). Note that in order to match PEP 3148, the callback (unlike all other callbacks defined in this PEP, and ignoring the convention from the section "Callback Style" below) is always called with a single argument, the Future object. (The motivation for strictly serializing callbacks scheduled with call_soon() applies here too.)
- remove_done_callback(fn). Remove the argument from the list of callbacks. This method is not defined by PEP 3148. The argument must be equal (using ==) to the argument passed to add_done_callback(). Returns the number of times the callback was removed.
- set_result(result). The Future must not be done (nor cancelled) already. This makes the Future done and schedules the callbacks. Difference with PEP 3148: This is a public API.
- set_exception(exception). The Future must not be done (nor cancelled) already. This makes the Future done and schedules the callbacks. Difference with PEP 3148: This is a public API.
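A sketch of the Future life cycle tying these methods together (illustrative only; loop.create_future() is used here as a convenient way to obtain a Future associated with the loop):

```python
import asyncio

loop = asyncio.new_event_loop()
events = []

fut = loop.create_future()
# The done callback receives the Future itself as its single argument.
fut.add_done_callback(lambda f: events.append(('done', f.result())))
loop.call_soon(fut.set_result, 'payload')   # set_result() is a public API

result = loop.run_until_complete(fut)       # returns the Future's result
# Cancelling an already-done Future does nothing and returns False.
events.append(('cancel', fut.cancel()))
loop.close()
```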
The internal method set_running_or_notify_cancel() is not supported; there is no way to set the running state. Likewise, the method running() is not supported.
The following exceptions are defined:
- InvalidStateError. Raised whenever the Future is not in a state acceptable to the method being called (e.g. calling set_result() on a Future that is already done, or calling result() on a Future that is not yet done).
- InvalidTimeoutError. Raised by result() and exception() when a nonzero timeout argument is given.
- CancelledError. An alias for concurrent.futures.CancelledError. Raised when result() or exception() is called on a Future that is cancelled.
- TimeoutError. An alias for concurrent.futures.TimeoutError. May be raised by run_until_complete().
A Future is associated with an event loop when it is created.
An asyncio.Future object is not acceptable to the wait() and as_completed() functions in the concurrent.futures package. However, there are similar APIs asyncio.wait() and asyncio.as_completed(), described below.
An asyncio.Future object is acceptable to a yield from expression when used in a coroutine. This is implemented through the __iter__() interface on the Future. See the section "Coroutines and the Scheduler" below.
When a Future is garbage-collected, if it has an associated exception but neither result() nor exception() has ever been called, the exception is logged. (When a coroutine uses yield from to wait for a Future, that Future's result() method is called once the coroutine is resumed.)
In the future (pun intended) we may unify asyncio.Future and concurrent.futures.Future, e.g. by adding an __iter__() method to the latter that works with yield from. To prevent accidentally blocking the event loop by calling e.g. result() on a Future that's not done yet, the blocking operation may detect that an event loop is active in the current thread and raise an exception instead. However the current PEP strives to have no dependencies beyond Python 3.3, so changes to concurrent.futures.Future are off the table for now.
There are some public functions related to Futures:
- asyncio.async(arg). This takes an argument that is either a coroutine object or a Future (i.e., anything you can use with yield from) and returns a Future. If the argument is a Future, it is returned unchanged; if it is a coroutine object, it wraps it in a Task (remember that Task is a subclass of Future).
- asyncio.wrap_future(future). This takes a PEP 3148 Future (i.e., an instance of concurrent.futures.Future) and returns a Future compatible with the event loop (i.e., a asyncio.Future instance).
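A sketch of wrap_future() adapting a PEP 3148 future to the event loop (illustrative only; note that in later Python versions asyncio.async() was renamed asyncio.ensure_future()):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

loop = asyncio.new_event_loop()

async def main():
    with ThreadPoolExecutor(max_workers=1) as pool:
        cf = pool.submit(pow, 2, 10)           # a concurrent.futures.Future
        # wrap_future() returns an asyncio.Future usable with yield
        # from / await on this loop.
        return await asyncio.wrap_future(cf)

value = loop.run_until_complete(main())
loop.close()
```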
Transports
Transports and protocols are strongly influenced by Twisted and PEP 3153. Users rarely implement or instantiate transports -- rather, event loops offer utility methods to set up transports.
Transports work in conjunction with protocols. Protocols are typically written without knowing or caring about the exact type of transport used, and transports can be used with a wide variety of protocols. For example, an HTTP client protocol implementation may be used with either a plain socket transport or an SSL/TLS transport. The plain socket transport can be used with many different protocols besides HTTP (e.g. SMTP, IMAP, POP, FTP, IRC, SPDY).
The most common type of transport is a bidirectional stream transport. There are also unidirectional stream transports (used for pipes) and datagram transports (used by the create_datagram_endpoint() method).
Methods For All Transports
- get_extra_info(name, default=None). This is a catch-all method that returns implementation-specific information about a transport. The first argument is the name of the extra field to be retrieved. The optional second argument is a default value to be returned. Consult the implementation documentation to find out the supported extra field names. For an unsupported name, the default is always returned.
Bidirectional Stream Transports
A bidirectional stream transport is an abstraction on top of a socket or something similar (for example, a pair of UNIX pipes or an SSL/TLS connection).
Most connections have an asymmetric nature: the client and server usually have very different roles and behaviors. Hence, the interface between transport and protocol is also asymmetric. From the protocol's point of view, writing data is done by calling the write() method on the transport object; this buffers the data and returns immediately. However, the transport takes a more active role in reading data: whenever some data is read from the socket (or other data source), the transport calls the protocol's data_received() method.
Nevertheless, the interface between transport and protocol used by bidirectional streams is the same for clients as it is for servers, since the connection between a client and a server is essentially a pair of streams, one in each direction.
Bidirectional stream transports have the following public methods:
write(data). Write some bytes. The argument must be a bytes object. Returns None. The transport is free to buffer the bytes, but it must eventually cause the bytes to be transferred to the entity at the other end, and it must maintain stream behavior. That is, t.write(b'abc'); t.write(b'def') is equivalent to t.write(b'abcdef'), as well as to:
  t.write(b'a')
  t.write(b'b')
  t.write(b'c')
  t.write(b'd')
  t.write(b'e')
  t.write(b'f')
writelines(iterable). Equivalent to:
  for data in iterable:
      self.write(data)

write_eof(). Close the writing end of the connection. Subsequent calls to write() are not allowed. Once all buffered data is transferred, the transport signals to the other end that no more data will be received. Some protocols don't support this operation; in that case, calling write_eof() will raise an exception. (Note: This used to be called half_close(), but unless you already know what it is for, that name doesn't indicate which end is closed.)
can_write_eof(). Return True if the protocol supports write_eof(), False if it does not. (This method typically returns a fixed value that depends only on the specific Transport class, not on the state of the Transport object. It is needed because some protocols need to change their behavior when write_eof() is unavailable. For example, in HTTP, to send data whose size is not known ahead of time, the end of the data is typically indicated using write_eof(); however, SSL/TLS does not support this, and an HTTP protocol implementation would have to use the "chunked" transfer encoding in this case. But if the data size is known ahead of time, the best approach in both cases is to use the Content-Length header.)
get_write_buffer_size(). Return the current size of the transport's write buffer in bytes. This only knows about the write buffer managed explicitly by the transport; buffering in other layers of the network stack or elsewhere is not reported.
set_write_buffer_limits(high=None, low=None). Set the high- and low-water limits for flow control.
These two values control when to call the protocol's pause_writing() and resume_writing() methods. If specified, the low-water limit must be less than or equal to the high-water limit. Neither value can be negative.
The defaults are implementation-specific. If only the high-water limit is given, the low-water limit defaults to an implementation-specific value less than or equal to the high-water limit. Setting high to zero forces low to zero as well, and causes pause_writing() to be called whenever the buffer becomes non-empty. Setting low to zero causes resume_writing() to be called only once the buffer is empty. Use of zero for either limit is generally sub-optimal as it reduces opportunities for doing I/O and computation concurrently.
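The pause/resume contract can be sketched with a hypothetical sender protocol and a stand-in transport (both class names are invented for illustration; a real protocol would subclass asyncio.Protocol and receive a real transport):

```python
class ChunkedSender:
    """Sketch: sends a payload in chunks, honoring pause_writing()/resume_writing()."""
    def __init__(self, transport, payload, chunk=2):
        self.transport = transport
        self.payload = payload
        self.chunk = chunk
        self.paused = False

    def pause_writing(self):
        self.paused = True    # the write buffer crossed the high-water mark

    def resume_writing(self):
        self.paused = False   # the buffer drained below the low-water mark
        self.send_some()

    def send_some(self):
        # Keep writing until we run out of data or are asked to pause.
        while self.payload and not self.paused:
            self.transport.write(self.payload[:self.chunk])
            self.payload = self.payload[self.chunk:]

class FakeTransport:
    """Test scaffolding: pretends its buffer fills up after every second write."""
    def __init__(self):
        self.protocol = None
        self.chunks = []

    def write(self, data):
        self.chunks.append(data)
        if len(self.chunks) % 2 == 0:
            self.protocol.pause_writing()

t = FakeTransport()
p = ChunkedSender(t, b"abcdef")
t.protocol = p
p.send_some()        # writes b"ab", b"cd", then gets paused
p.resume_writing()   # drains the rest: b"ef"
print(t.chunks)
```

The key point is that the protocol stops calling write() voluntarily; the transport never refuses data, it only signals.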
pause_reading(). Suspend delivery of data to the protocol until a subsequent resume_reading() call. Between pause_reading() and resume_reading(), the protocol's data_received() method will not be called.
resume_reading(). Restart delivery of data to the protocol via data_received(). Note that "paused" is a binary state -- pause_reading() should only be called when the transport is not paused, while resume_reading() should only be called when the transport is paused.
close(). Sever the connection with the entity at the other end. Any data buffered by write() will (eventually) be transferred before the connection is actually closed. The protocol's data_received() method will not be called again. Once all buffered data has been flushed, the protocol's connection_lost() method will be called with None as the argument. Note that this method does not wait for all that to happen.
abort(). Immediately sever the connection. Any data still buffered by the transport is thrown away. Soon, the protocol's connection_lost() method will be called with None as the argument.
Unidirectional Stream Transports
A writing stream transport supports the write(), writelines(), write_eof(), can_write_eof(), close() and abort() methods described for bidirectional stream transports.
A reading stream transport supports the pause_reading(), resume_reading() and close() methods described for bidirectional stream transports.
A writing stream transport calls only connection_made() and connection_lost() on its associated protocol.
A reading stream transport can call all protocol methods specified in the Protocols section below (i.e., the previous two plus data_received() and eof_received()).
Datagram Transports
Datagram transports have these methods:
- sendto(data, addr=None). Sends a datagram (a bytes object). The optional second argument is the destination address. If omitted, remote_addr must have been specified in the create_datagram_endpoint() call that created this transport. If present, and remote_addr was specified, they must match. The (data, addr) pair may be sent immediately or buffered. The return value is None.
- abort(). Immediately close the transport. Buffered data will be discarded.
- close(). Close the transport. Buffered data will be transmitted asynchronously.
Datagram transports call the following methods on the associated protocol object: connection_made(), connection_lost(), error_received() and datagram_received(). ("Connection" in these method names is a slight misnomer, but the concepts still exist: connection_made() means the transport representing the endpoint has been created, and connection_lost() means the transport is closed.)
Subprocess Transports
Subprocess transports have the following methods:
- get_pid(). Return the process ID of the subprocess.
- get_returncode(). Return the process return code, if the process has exited; otherwise None.
- get_pipe_transport(fd). Return the pipe transport (a unidirectional stream transport) corresponding to the argument, which should be 0, 1 or 2 representing stdin, stdout or stderr (of the subprocess). If there is no such pipe transport, return None. For stdin, this is a writing transport; for stdout and stderr this is a reading transport. You must use this method to get a transport you can use to write to the subprocess's stdin.
- send_signal(signal). Send a signal to the subprocess.
- terminate(). Terminate the subprocess.
- kill(). Kill the subprocess. On Windows this is an alias for terminate().
- close(). This is an alias for terminate().
Note that send_signal(), terminate() and kill() wrap the corresponding methods in the standard library subprocess module.
Protocols
Protocols are always used in conjunction with transports. While a few common protocols are provided (e.g. decent though not necessarily excellent HTTP client and server implementations), most protocols will be implemented by user code or third-party libraries.
As with transports, we distinguish between stream protocols, datagram protocols, and perhaps other custom protocols. The most common type of protocol is a bidirectional stream protocol. (There are no unidirectional protocols.)
Stream Protocols
A (bidirectional) stream protocol must implement the following methods, which will be called by the transport. Think of these as callbacks that are always called by the event loop in the right context. (See the "Context" section way above.)
connection_made(transport). Indicates that the transport is ready and connected to the entity at the other end. The protocol should probably save the transport reference as an instance variable (so it can call its write() and other methods later), and may write an initial greeting or request at this point.
data_received(data). The transport has read some bytes from the connection. The argument is always a non-empty bytes object. There are no guarantees about the minimum or maximum size of the data passed along this way. p.data_received(b'abcdef') should be treated exactly equivalent to:
```python
p.data_received(b'abc')
p.data_received(b'def')
```
eof_received(). This is called when the other end called write_eof() (or something equivalent). If this returns a false value (including None), the transport will close itself. If it returns a true value, closing the transport is up to the protocol. However, for SSL/TLS connections this is ignored, because the TLS standard requires that no more data is sent and the connection is closed as soon as a "closure alert" is received.
The default implementation returns None.
pause_writing(). Asks that the protocol temporarily stop writing data to the transport. Heeding the request is optional, but the transport's buffer may grow without bounds if you keep writing. The buffer size at which this is called can be controlled through the transport's set_write_buffer_limits() method.
resume_writing(). Tells the protocol that it is safe to start writing data to the transport again. Note that this may be called directly by the transport's write() method (as opposed to being called indirectly using call_soon()), so that the protocol may be aware of its paused state immediately after write() returns.
connection_lost(exc). The transport has been closed or aborted, has detected that the other end has closed the connection cleanly, or has encountered an unexpected error. In the first three cases the argument is None; for an unexpected error, the argument is the exception that caused the transport to give up.
Here is a table indicating the order and multiplicity of the basic calls:
- connection_made() -- exactly once
- data_received() -- zero or more times
- eof_received() -- at most once
- connection_lost() -- exactly once
Calls to pause_writing() and resume_writing() occur in pairs and only between #1 and #4. These pairs will not be nested. The final resume_writing() call may be omitted; i.e. a paused connection may be lost and never be resumed.
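Because data_received() carries no chunking guarantees, a stream protocol typically buffers incoming bytes and re-frames them itself. A minimal sketch (the class name is invented; the callbacks are driven by hand here, where a real transport would call them):

```python
import asyncio

class LineCollector(asyncio.Protocol):
    """Accumulates bytes and extracts complete lines, tolerating arbitrary chunking."""
    def connection_made(self, transport):
        self.transport = transport
        self.buffer = b""
        self.lines = []
        self.closed = False

    def data_received(self, data):
        # Chunk boundaries are arbitrary, so buffer and split on newlines.
        self.buffer += data
        while b"\n" in self.buffer:
            line, self.buffer = self.buffer.split(b"\n", 1)
            self.lines.append(line)

    def eof_received(self):
        return False  # falsy: let the transport close itself

    def connection_lost(self, exc):
        self.closed = True

p = LineCollector()
p.connection_made(None)       # a real transport object would be passed here
p.data_received(b"ab")
p.data_received(b"c\nde\n")   # same result as one call with b"abc\nde\n"
p.connection_lost(None)
print(p.lines)
```

Driving the callbacks manually like this is also a convenient way to unit-test a protocol without a real connection.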
Datagram Protocols
Datagram protocols have connection_made() and connection_lost() methods with the same signatures as stream protocols. (As explained in the section about datagram transports, we prefer the slightly odd nomenclature over defining different method names to indicate the opening and closing of the socket.)
In addition, they have the following methods:
- datagram_received(data, addr). Indicates that a datagram data (a bytes object) was received from remote address addr (an IPv4 2-tuple or an IPv6 4-tuple).
- error_received(exc). Indicates that a send or receive operation raised an OSError exception. Since datagram errors may be transient, it is up to the protocol to call the transport's close() method if it wants to close the endpoint.
Here is a chart indicating the order and multiplicity of calls:
- connection_made() -- exactly once
- datagram_received(), error_received() -- zero or more times
- connection_lost() -- exactly once
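A datagram protocol sketch, driven by hand against a stand-in transport (the FakeTransport is test scaffolding, not part of the API; a real transport is obtained via create_datagram_endpoint()):

```python
import asyncio

class UpperEchoProtocol(asyncio.DatagramProtocol):
    """Echoes each datagram back to its sender, uppercased."""
    def connection_made(self, transport):
        self.transport = transport

    def datagram_received(self, data, addr):
        self.transport.sendto(data.upper(), addr)

    def error_received(self, exc):
        pass  # datagram errors may be transient; deliberately ignore them

class FakeTransport:
    """Stand-in transport that records sendto() calls."""
    def __init__(self):
        self.sent = []

    def sendto(self, data, addr=None):
        self.sent.append((data, addr))

proto = UpperEchoProtocol()
t = FakeTransport()
proto.connection_made(t)
proto.datagram_received(b"ping", ("127.0.0.1", 9999))
print(t.sent)
```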
Subprocess Protocol
Subprocess protocols have connection_made(), connection_lost(), pause_writing() and resume_writing() methods with the same signatures as stream protocols. In addition, they have the following methods:
- pipe_data_received(fd, data). Called when the subprocess writes data to its stdout or stderr. fd is the file descriptor (1 for stdout, 2 for stderr). data is a bytes object. (TBD: No pipe_eof_received()?)
- pipe_connection_lost(fd, exc). Called when the subprocess closes its stdin, stdout or stderr. fd is the file descriptor. exc is an exception or None.
- process_exited(). Called when the subprocess has exited. To retrieve the exit status, use the transport's get_returncode() method.
Note that depending on the behavior of the subprocess it is possible that process_exited() is called either before or after pipe_connection_lost(). For example, if the subprocess creates a sub-subprocess that shares its stdin/stdout/stderr and then itself exits, process_exited() may be called while all the pipes are still open. On the other hand when the subprocess closes its stdin/stdout/stderr but does not exit, pipe_connection_lost() may be called for all three pipes without process_exited() being called. If (as is the more common case) the subprocess exits and thereby implicitly closes all pipes, the calling order is undefined.
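The undefined ordering of process_exited() and pipe_connection_lost() means a protocol that wants "all output plus the exit status" must wait for both events. A sketch (written with the async/await syntax that later superseded yield from, since generator-based coroutines no longer run on current interpreters):

```python
import asyncio
import sys

class OutputCollector(asyncio.SubprocessProtocol):
    """Collects stdout and resolves a future once BOTH the stdout pipe has
    closed and the process has exited -- the order of the two is undefined."""
    def __init__(self, done):
        self.done = done
        self.output = b""
        self.pending = {"exit", "stdout"}

    def pipe_data_received(self, fd, data):
        if fd == 1:  # stdout
            self.output += data

    def pipe_connection_lost(self, fd, exc):
        if fd == 1:
            self._finish("stdout")

    def process_exited(self):
        self._finish("exit")

    def _finish(self, event):
        self.pending.discard(event)
        if not self.pending and not self.done.done():
            self.done.set_result(None)

async def main():
    loop = asyncio.get_running_loop()
    done = loop.create_future()
    transport, protocol = await loop.subprocess_exec(
        lambda: OutputCollector(done),
        sys.executable, "-c", "print('hi')",
        stdin=None)
    await done
    returncode = transport.get_returncode()
    transport.close()
    return protocol.output, returncode

output, returncode = asyncio.run(main())
print(output, returncode)
```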
Callback Style
Most interfaces taking a callback also take positional arguments. For instance, to arrange for foo("abc", 42) to be called soon, you call loop.call_soon(foo, "abc", 42). To schedule the call foo(), use loop.call_soon(foo). This convention greatly reduces the number of small lambdas required in typical callback programming.
This convention specifically does not support keyword arguments. Keyword arguments are used to pass optional extra information about the callback. This allows graceful evolution of the API without having to worry about whether a keyword might be significant to a callee somewhere. If you have a callback that must be called with a keyword argument, you can use a lambda. For example:
loop.call_soon(lambda: foo('abc', repeat=42))
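Both conventions can be exercised directly against an event loop; the following runnable sketch assumes a hypothetical callback foo defined for illustration:

```python
import asyncio

calls = []

def foo(a, b):
    calls.append((a, b))

loop = asyncio.new_event_loop()
loop.call_soon(foo, "abc", 42)            # positional args are passed through
loop.call_soon(lambda: foo("abc", b=42))  # keyword args need a lambda
loop.call_soon(loop.stop)                 # run the callbacks above, then stop
loop.run_forever()
loop.close()
print(calls)
```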
Coroutines and the Scheduler
This is a separate top-level section because its status is different from the event loop interface. Usage of coroutines is optional, and it is perfectly fine to write code using callbacks only. On the other hand, there is only one implementation of the scheduler/coroutine API, and if you're using coroutines, that's the one you're using.
Coroutines
A coroutine is a generator that follows certain conventions. For documentation purposes, all coroutines should be decorated with @asyncio.coroutine, but this cannot be strictly enforced.
Coroutines use the yield from syntax introduced in PEP 380, instead of the original yield syntax.
The word "coroutine", like the word "generator", is used for two different (though related) concepts:
- The function that defines a coroutine (a function definition decorated with asyncio.coroutine). If disambiguation is needed we will call this a coroutine function.
- The object obtained by calling a coroutine function. This object represents a computation or an I/O operation (usually a combination) that will complete eventually. If disambiguation is needed we will call it a coroutine object.
Things a coroutine can do:
- result = yield from future -- suspends the coroutine until the future is done, then returns the future's result, or raises an exception, which will be propagated. (If the future is cancelled, it will raise a CancelledError exception.) Note that tasks are futures, and everything said about futures also applies to tasks.
- result = yield from coroutine -- wait for another coroutine to produce a result (or raise an exception, which will be propagated). The coroutine expression must be a call to another coroutine.
- return expression -- produce a result to the coroutine that is waiting for this one using yield from.
- raise exception -- raise an exception in the coroutine that is waiting for this one using yield from.
Calling a coroutine does not start its code running -- it is just a generator, and the coroutine object returned by the call is really a generator object, which doesn't do anything until you iterate over it. In the case of a coroutine object, there are two basic ways to start it running: call yield from coroutine from another coroutine (assuming the other coroutine is already running!), or convert it to a Task (see below).
Coroutines (and tasks) can only run when the event loop is running.
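The underlying mechanics are just PEP 380 generator semantics and can be observed without an event loop. In this sketch (both function names invented for illustration), fetch() stands in for a coroutine that would suspend on a future, and the bare yield marks the suspension point:

```python
def fetch():
    # A generator-based coroutine: it suspends once (where a real coroutine
    # would wait on a future) and then delivers a result via 'return'.
    yield
    return 42

def compute():
    # 'yield from' waits for fetch() and receives its return value.
    result = yield from fetch()
    return result * 2

coro = compute()      # calling it does NOT start the code running
step = next(coro)     # iterating starts it; it suspends at the bare yield
try:
    coro.send(None)   # resume; both generators run to completion
except StopIteration as e:
    final = e.value   # the 'return' value surfaces as StopIteration.value
print(final)
```

This manual driving is exactly what the scheduler does for you when a coroutine is wrapped in a Task.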
Waiting for Multiple Coroutines
To wait for multiple coroutines or Futures, two APIs similar to the wait() and as_completed() APIs in the concurrent.futures package are provided:
asyncio.wait(fs, timeout=None, return_when=ALL_COMPLETED). This is a coroutine that waits for the Futures or coroutines given by fs to complete. Coroutine arguments will be wrapped in Tasks (see below). This returns a Future whose result on success is a tuple of two sets of Futures, (done, pending), where done is the set of original Futures (or wrapped coroutines) that are done (or cancelled), and pending is the rest, i.e. those that are still not done (nor cancelled). Note that with the defaults for timeout and return_when, pending will always be an empty set. Optional arguments timeout and return_when have the same meaning and defaults as for concurrent.futures.wait(): timeout, if not None, specifies a timeout for the overall operation; return_when specifies when to stop. The constants FIRST_COMPLETED, FIRST_EXCEPTION, ALL_COMPLETED are defined with the same values and the same meanings as in PEP 3148:
- ALL_COMPLETED (default): Wait until all Futures are done (or until the timeout occurs).
- FIRST_COMPLETED: Wait until at least one Future is done (or until the timeout occurs).
- FIRST_EXCEPTION: Wait until at least one Future is done (but not cancelled) with an exception set. (The exclusion of cancelled Futures from the condition is surprising, but PEP 3148 does it this way.)
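A runnable sketch of wait() with the defaults (using the async/await syntax that superseded yield from in Python 3.5; on current interpreters wait() also requires the coroutines to be wrapped in Tasks explicitly):

```python
import asyncio

async def work(n):
    await asyncio.sleep(0)  # stand-in for real I/O
    return n * 2

async def main():
    tasks = [asyncio.ensure_future(work(i)) for i in range(3)]
    done, pending = await asyncio.wait(tasks)  # ALL_COMPLETED is the default
    return sorted(t.result() for t in done), pending

results, pending = asyncio.run(main())
print(results, pending)
```

With the default return_when and no timeout, every task completes, so pending comes back empty.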
asyncio.as_completed(fs, timeout=None). Returns an iterator whose values are Futures or coroutines; waiting for successive values waits until the next Future or coroutine from the set fs completes, and returns its result (or raises its exception). The optional argument timeout has the same meaning and default as it does for concurrent.futures.wait(): when the timeout occurs, the next Future returned by the iterator will raise TimeoutError when waited for. Example of use:
```python
for f in as_completed(fs):
    result = yield from f  # May raise an exception.
    # Use result.
```

Note: if you do not wait for the values produced by the iterator, your for loop may not make progress (since you are not allowing other tasks to run).
asyncio.wait_for(f, timeout). This is a convenience to wait for a single coroutine or Future with a timeout. When a timeout occurs, it cancels the task and raises TimeoutError. To avoid the task cancellation, wrap it in shield().
asyncio.gather(f1, f2, ...). Returns a Future that waits until all arguments (Futures or coroutines) are done; its result is a list of their corresponding results. If one or more of the arguments is cancelled or raises an exception, the returned Future is cancelled or has its exception set (matching what happened to the first argument), and the remaining arguments are left running in the background. Cancelling the returned Future does not affect the arguments. Note that coroutine arguments are converted to Futures using asyncio.async().
asyncio.shield(f). Wait for a Future, shielding it from cancellation. This returns a Future whose result or exception is exactly the same as the argument; however, if the returned Future is cancelled, the argument Future is unaffected.
A use case for this function would be a coroutine that caches a query result for a coroutine that handles a request in an HTTP server. When the request is cancelled by the client, we could (arguably) want the query-caching coroutine to continue to run, so that when the client reconnects, the query result is (hopefully) cached. This could be written e.g. as follows:
```python
@asyncio.coroutine
def handle_request(self, request):
    ...
    cached_query = self.get_cache(...)
    if cached_query is None:
        cached_query = yield from asyncio.shield(self.fill_cache(...))
    ...
```
Sleeping
The coroutine asyncio.sleep(delay) returns after a given time delay.
Tasks
A Task is an object that manages an independently running coroutine. The Task interface is the same as the Future interface, and in fact Task is a subclass of Future. The task becomes done when its coroutine returns or raises an exception; if it returns a result, that becomes the task's result; if it raises an exception, that becomes the task's exception.
Cancelling a task that's not done yet throws an asyncio.CancelledError exception into the coroutine. If the coroutine doesn't catch this (or if it re-raises it) the task will be marked as cancelled (i.e., cancelled() will return True); but if the coroutine somehow catches and ignores the exception it may continue to execute (and cancelled() will return False).
Tasks are also useful for interoperating between coroutines and callback-based frameworks like Twisted. After converting a coroutine into a Task, callbacks can be added to the Task.
To convert a coroutine into a task, call the coroutine function and pass the resulting coroutine object to the asyncio.Task() constructor. You may also use asyncio.async() for this purpose.
You may ask, why not automatically convert all coroutines to Tasks? The @asyncio.coroutine decorator could do this. However, this would slow things down considerably in the case where one coroutine calls another (and so on), as switching to a "bare" coroutine has much less overhead than switching to a Task.
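The Task lifecycle can be observed directly (written with the later async/await syntax, and using loop.create_task(), the modern spelling of wrapping a coroutine object in asyncio.Task()):

```python
import asyncio

async def compute():
    await asyncio.sleep(0)
    return 42

loop = asyncio.new_event_loop()
task = loop.create_task(compute())  # wrap the coroutine object in a Task
assert not task.done()              # creating the Task does not run it yet

# Callbacks can be attached, which is how callback-based frameworks interoperate.
results = []
task.add_done_callback(lambda t: results.append(t.result()))

value = loop.run_until_complete(task)
loop.close()
print(value, results)
```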
The Scheduler
The scheduler has no public interface. You interact with it by using yield from future and yield from task. In fact, there is no single object representing the scheduler -- its behavior is implemented by the Task and Future classes using only the public interface of the event loop, so it will work with third-party event loop implementations, too.
Convenience Utilities
A few functions and classes are provided to simplify the writing of basic stream-based clients and servers, such as FTP or HTTP. These are:
asyncio.open_connection(host, port): A wrapper for EventLoop.create_connection() that does not require you to provide a Protocol factory or class. This is a coroutine that returns a (reader, writer) pair, where reader is an instance of StreamReader and writer is an instance of StreamWriter (both described below).
asyncio.start_server(client_connected_cb, host, port): A wrapper for EventLoop.create_server() that takes a simple callback function rather than a Protocol factory or class. This is a coroutine that returns a Server object just as create_server() does. Each time a client connection is accepted, client_connected_cb(reader, writer) is called, where reader is an instance of StreamReader and writer is an instance of StreamWriter (both described below). If the result returned by client_connected_cb() is a coroutine, it is automatically wrapped in a Task.
StreamReader: A class offering an interface not unlike that of a read-only binary stream, except that the various reading methods are coroutines. It is normally driven by a StreamReaderProtocol instance. Note that there should be only one reader. The interface for the reader is:
- readline(): A coroutine that reads a string of bytes representing a line of text ending in '\n', or until the end of the stream, whichever comes first.
- read(n): A coroutine that reads up to n bytes. If n is omitted or negative, it reads until the end of the stream.
- readexactly(n): A coroutine that reads exactly n bytes, or until the end of the stream, whichever comes first.
- exception(): Return the exception that has been set on the stream using set_exception(), or None if no exception is set.
The interface for the driver is:
- feed_data(data): Append data (a bytes object) to the internal buffer. This unblocks a blocked reading coroutine if it provides sufficient data to fulfill the reader's contract.
- feed_eof(): Signal the end of the buffer. This unblocks a blocked reading coroutine. No more data should be fed to the reader after this call.
- set_exception(exc): Set an exception on the stream. All subsequent reading methods will raise this exception. No more data should be fed to the reader after this call.
StreamWriter: A class offering an interface not unlike that of a write-only binary stream. It wraps a transport. The interface is an extended subset of the transport interface: the following methods behave the same as the corresponding transport methods: write(), writelines(), write_eof(), can_write_eof(), get_extra_info(), close(). Note that the writing methods are _not_ coroutines (this is the same as for transports, but different from the StreamReader class). The following method is in addition to the transport interface:
drain(): This should be called with yield from after writing significant data, for the purpose of flow control. The intended use is like this:
```python
writer.write(data)
yield from writer.drain()
```
Note that this is not technically a coroutine: it returns either a Future or an empty tuple (both can be passed to yield from). Use of this method is optional. However, when it is not used, the internal buffer of the transport underlying the StreamWriter may fill up with all data that was ever written to the writer. If an app does not have a strict limit on how much data it writes, it _should_ call yield from drain() occasionally to avoid filling up the transport buffer.
StreamReaderProtocol: A protocol implementation used as an adapter between the bidirectional stream transport/protocol interface and the StreamReader and StreamWriter classes. It acts as a driver for a specific StreamReader instance, calling its methods feed_data(), feed_eof(), and set_exception() in response to various protocol callbacks. It also controls the behavior of the drain() method of the StreamWriter instance.
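The pieces above compose into a complete client/server round trip. A minimal self-contained sketch (using the async/await syntax that superseded yield from; port 0 asks the OS for any free port):

```python
import asyncio

async def handle(reader, writer):
    # Server side: read one line, echo it back uppercased.
    line = await reader.readline()
    writer.write(line.upper())
    await writer.drain()   # flow control: wait for the buffer to drain
    writer.close()

async def main():
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    # Client side: connect, send a line, read the reply.
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(b"hello\n")
    await writer.drain()
    reply = await reader.readline()
    writer.close()

    server.close()
    await server.wait_closed()
    return reply

reply = asyncio.run(main())
print(reply)
```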
Synchronization
Locks, events, conditions and semaphores modeled after those in the threading module are implemented and can be accessed by importing the asyncio.locks submodule. Queues modeled after those in the queue module are implemented and can be accessed by importing the asyncio.queues submodule.
In general these have a close correspondence to their threaded counterparts, however, blocking methods (e.g. acquire() on locks, put() and get() on queues) are coroutines, and timeout parameters are not provided (you can use asyncio.wait_for() to add a timeout to a blocking call, however).
The docstrings in the modules provide more complete documentation.
Locks
The following classes are provided by asyncio.locks. For all these except Event, the with statement may be used in combination with yield from to acquire the lock and ensure that the lock is released regardless of how the with block is left, as follows:
```python
with (yield from my_lock):
    ...
```
- Lock: a basic mutex, with methods acquire() (a coroutine), locked(), and release().
- Event: an event variable, with methods wait() (a coroutine), set(), clear(), and is_set().
- Condition: a condition variable, with methods acquire(), wait(), wait_for(predicate) (all three coroutines), locked(), release(), notify(), and notify_all().
- Semaphore: a semaphore, with methods acquire() (a coroutine), locked(), and release(). The constructor argument is the initial value (default 1).
- BoundedSemaphore: a bounded semaphore; this is similar to Semaphore but the initial value is also the maximum value.
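A Lock in action (async with lock is the modern spelling of with (yield from lock); without the lock, the read-suspend-write sequence below would lose updates):

```python
import asyncio

async def main():
    lock = asyncio.Lock()
    counter = 0

    async def bump():
        nonlocal counter
        async with lock:
            current = counter
            await asyncio.sleep(0)   # another task could run here...
            counter = current + 1    # ...but the lock makes this safe

    await asyncio.gather(*(bump() for _ in range(5)))
    return counter

count = asyncio.run(main())
print(count)
```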
Queues
The following classes and exceptions are provided by asyncio.queues.
- Queue: a standard queue, with methods get(), put() (both coroutines), get_nowait(), put_nowait(), empty(), full(), qsize(), and maxsize().
- PriorityQueue: a subclass of Queue that retrieves entries in priority order (lowest first).
- LifoQueue: a subclass of Queue that retrieves the most recently added entries first.
- JoinableQueue: a subclass of Queue with task_done() and join() methods (the latter a coroutine).
- Empty, Full: exceptions raised when get_nowait() or put_nowait() is called on a queue that is empty or full, respectively.
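A producer/consumer sketch with a bounded Queue, using a None sentinel (an illustrative convention, not part of the API) to signal the end of the stream:

```python
import asyncio

async def main():
    q = asyncio.Queue(maxsize=2)  # put() blocks once 2 items are buffered

    async def producer():
        for i in range(4):
            await q.put(i)
        await q.put(None)         # sentinel: no more items

    async def consumer():
        items = []
        while True:
            item = await q.get()
            if item is None:
                return items
            items.append(item)

    _, items = await asyncio.gather(producer(), consumer())
    return items

items = asyncio.run(main())
print(items)
```

The maxsize bound provides backpressure: the producer is suspended whenever the consumer falls behind by more than two items.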
Miscellaneous
Logging
All logging performed by the asyncio package uses a single logging.Logger object, asyncio.logger. To customize logging you can use the standard Logger API on this object. (Do not replace the object though.)
SIGCHLD handling on UNIX
Efficient implementation of the process_exited() method on subprocess protocols requires a SIGCHLD signal handler. However, signal handlers can only be set on the event loop associated with the main thread. In order to support spawning subprocesses from event loops running in other threads, a mechanism exists to allow sharing a SIGCHLD handler between multiple event loops. There are two additional functions, asyncio.get_child_watcher() and asyncio.set_child_watcher(), and corresponding methods on the event loop policy.
There are two child watcher implementation classes, FastChildWatcher and SafeChildWatcher. Both use SIGCHLD. The SafeChildWatcher class is used by default; it is inefficient when many subprocesses exist simultaneously. The FastChildWatcher class is efficient, but it may interfere with other code (either C code or Python code) that spawns subprocesses without using an asyncio event loop. If you are sure you are not using other code that spawns subprocesses, to use the fast implementation, run the following in your main thread:
```python
watcher = asyncio.FastChildWatcher()
asyncio.set_child_watcher(watcher)
```
Wish List
(There is agreement that these features are desirable, but no implementation was available when Python 3.4 beta 1 was released, and the feature freeze for the rest of the Python 3.4 release cycle prohibits adding them in this late stage. However, they will hopefully be added in Python 3.5, and perhaps earlier in the PyPI distribution.)
- Support a "start TLS" operation to upgrade a TCP socket to SSL/TLS.
Former wish list items that have since been implemented (but aren't specified by the PEP):
- UNIX domain sockets.
- A per-loop error handling callback.
Open Issues
(Note that these have been resolved de facto in favor of the status quo by the acceptance of the PEP. However, the PEP's provisional status allows revising these decisions for Python 3.5.)
- Why do create_connection() and create_datagram_endpoint() have a proto argument but not create_server()? And why are the family, flag, proto arguments for getaddrinfo() sometimes zero and sometimes named constants (whose value is also zero)?
- Do we need another inquiry method to tell whether the loop is in the process of stopping?
- A fuller public API for Handle? What's the use case?
- A debugging API? E.g. something that logs a lot of stuff, or logs unusual conditions (like queues filling up faster than they drain) or even callbacks taking too much time...
- Do we need introspection APIs? E.g. asking for the read callback given a file descriptor. Or when the next scheduled call is. Or the list of file descriptors registered with callbacks. Right now these all require using internals.
- Do we need more socket I/O methods, e.g. sock_sendto() and sock_recvfrom(), and perhaps others like pipe_read()? I guess users can write their own (it's not rocket science).
- We may need APIs to control various timeouts. E.g. we may want to limit the time spent in DNS resolution, connecting, ssl/tls handshake, idle connection, close/shutdown, even per session. Possibly it's sufficient to add timeout keyword arguments to some methods, and other timeouts can probably be implemented by clever use of call_later() and Task.cancel(). But it's possible that some operations need default timeouts, and we may want to change the default for a specific operation globally (i.e., per event loop).
References
- PEP 380 describes the semantics of yield from.
- Greg Ewing's yield from tutorials: http://www.cosc.canterbury.ac.nz/greg.ewing/python/yield-from/yield_from.html
- PEP 3148 describes concurrent.futures.Future.
- PEP 3153, while rejected, has a good write-up explaining the need to separate transports and protocols.
- PEP 418 discusses the issues of timekeeping.
- Tulip repo: http://code.google.com/p/tulip/
- PyPI: the Python Package Index at http://pypi.python.org/
- Nick Coghlan wrote a nice blog post with some background, thoughts about different approaches to async I/O, gevent, and how to use futures with constructs like while, for and with: http://python-notes.boredomandlaziness.org/en/latest/pep_ideas/async_programming.html
- TBD: references to the relevant parts of Twisted, Tornado, ZeroMQ, pyftpdlib, libevent, libev, pyev, libuv, wattle, and so on.
Acknowledgments
Apart from PEP 3153, influences include PEP 380 and Greg Ewing's tutorial for yield from, Twisted, Tornado, ZeroMQ, pyftpdlib, and wattle (Steve Dower's counter-proposal). My previous work on asynchronous support in the NDB library for Google App Engine provided an important starting point.
I am grateful for the numerous discussions on python-ideas from September through December 2012, and many more on python-tulip since then; a Skype session with Steve Dower and Dino Viehland; email exchanges with and a visit by Ben Darnell; an audience with Niels Provos (original author of libevent); and in-person meetings (as well as frequent email exchanges) with several Twisted developers, including Glyph, Brian Warner, David Reid, and Duncan McGreggor.
Contributors to the implementation include Eli Bendersky, Gustavo Carneiro (Gambit Research), Saúl Ibarra Corretgé, Geert Jansen, A. Jesse Jiryu Davis, Nikolay Kim, Charles-François Natali, Richard Oudkerk, Antoine Pitrou, Giampaolo Rodolá, Andrew Svetlov, and many others who submitted bugs and/or fixes.
I thank Antoine Pitrou for his feedback in his role of official PEP BDFL.
Copyright
This document has been placed in the public domain.
pep-3333 Python Web Server Gateway Interface v1.0.1
| PEP: | 3333 |
|---|---|
| Title: | Python Web Server Gateway Interface v1.0.1 |
| Version: | $Revision$ |
| Last-Modified: | $Date$ |
| Author: | P.J. Eby <pje at telecommunity.com> |
| Discussions-To: | Python Web-SIG <web-sig at python.org> |
| Status: | Final |
| Type: | Informational |
| Content-Type: | text/x-rst |
| Created: | 26-Sep-2010 |
| Post-History: | 26-Sep-2010, 04-Oct-2010 |
| Replaces: | 333 |
Contents
Preface for Readers of PEP 333
This is an updated version of PEP 333, modified slightly to improve usability under Python 3, and to incorporate several long-standing de-facto amendments to the WSGI protocol. (Its code samples have also been ported to Python 3.)
While for procedural reasons [6], this must be a distinct PEP, no changes were made that invalidate previously-compliant servers or applications under Python 2.x. If your 2.x application or server is compliant to PEP 333, it is also compliant with this PEP.
Under Python 3, however, your app or server must also follow the rules outlined in the sections below titled, A Note On String Types, and Unicode Issues.
For detailed, line-by-line diffs between this document and PEP 333, you may view its SVN revision history [7], from revision 84854 forward.
Abstract
This document specifies a proposed standard interface between web servers and Python web applications or frameworks, to promote web application portability across a variety of web servers.
Original Rationale and Goals (from PEP 333)
Python currently boasts a wide variety of web application frameworks, such as Zope, Quixote, Webware, SkunkWeb, PSO, and Twisted Web -- to name just a few [1]. This wide variety of choices can be a problem for new Python users, because generally speaking, their choice of web framework will limit their choice of usable web servers, and vice versa.
By contrast, although Java has just as many web application frameworks available, Java's "servlet" API makes it possible for applications written with any Java web application framework to run in any web server that supports the servlet API.
The availability and widespread use of such an API in web servers for Python -- whether those servers are written in Python (e.g. Medusa), embed Python (e.g. mod_python), or invoke Python via a gateway protocol (e.g. CGI, FastCGI, etc.) -- would separate choice of framework from choice of web server, freeing users to choose a pairing that suits them, while freeing framework and server developers to focus on their preferred area of specialization.
This PEP, therefore, proposes a simple and universal interface between web servers and web applications or frameworks: the Python Web Server Gateway Interface (WSGI).
But the mere existence of a WSGI spec does nothing to address the existing state of servers and frameworks for Python web applications. Server and framework authors and maintainers must actually implement WSGI for there to be any effect.
However, since no existing servers or frameworks support WSGI, there is little immediate reward for an author who implements WSGI support. Thus, WSGI must be easy to implement, so that an author's initial investment in the interface can be reasonably low.
Thus, simplicity of implementation on both the server and framework sides of the interface is absolutely critical to the utility of the WSGI interface, and is therefore the principal criterion for any design decisions.
Note, however, that simplicity of implementation for a framework author is not the same thing as ease of use for a web application author. WSGI presents an absolutely "no frills" interface to the framework author, because bells and whistles like response objects and cookie handling would just get in the way of existing frameworks' handling of these issues. Again, the goal of WSGI is to facilitate easy interconnection of existing servers and applications or frameworks, not to create a new web framework.
Note also that this goal precludes WSGI from requiring anything that is not already available in deployed versions of Python. Therefore, new standard library modules are not proposed or required by this specification, and nothing in WSGI requires a Python version greater than 2.2.2. (It would be a good idea, however, for future versions of Python to include support for this interface in web servers provided by the standard library.)
In addition to ease of implementation for existing and future frameworks and servers, it should also be easy to create request preprocessors, response postprocessors, and other WSGI-based "middleware" components that look like an application to their containing server, while acting as a server for their contained applications.
If middleware can be both simple and robust, and WSGI is widely available in servers and frameworks, it allows for the possibility of an entirely new kind of Python web application framework: one consisting of loosely-coupled WSGI middleware components. Indeed, existing framework authors may even choose to refactor their frameworks' existing services to be provided in this way, becoming more like libraries used with WSGI, and less like monolithic frameworks. This would then allow application developers to choose "best-of-breed" components for specific functionality, rather than having to commit to all the pros and cons of a single framework.
Of course, as of this writing, that day is doubtless quite far off. In the meantime, it is a sufficient short-term goal for WSGI to enable the use of any framework with any server.
Finally, it should be mentioned that the current version of WSGI does not prescribe any particular mechanism for "deploying" an application for use with a web server or server gateway. At the present time, this is necessarily implementation-defined by the server or gateway. After a sufficient number of servers and frameworks have implemented WSGI to provide field experience with varying deployment requirements, it may make sense to create another PEP, describing a deployment standard for WSGI servers and application frameworks.
Specification Overview
The WSGI interface has two sides: the "server" or "gateway" side, and the "application" or "framework" side. The server side invokes a callable object that is provided by the application side. The specifics of how that object is provided are up to the server or gateway. It is assumed that some servers or gateways will require an application's deployer to write a short script to create an instance of the server or gateway, and supply it with the application object. Other servers and gateways may use configuration files or other mechanisms to specify where an application object should be imported from, or otherwise obtained.
In addition to "pure" servers/gateways and applications/frameworks, it is also possible to create "middleware" components that implement both sides of this specification. Such components act as an application to their containing server, and as a server to a contained application, and can be used to provide extended APIs, content transformation, navigation, and other useful functions.
Throughout this specification, we will use the term "a callable" to mean "a function, method, class, or an instance with a __call__ method". It is up to the server, gateway, or application implementing the callable to choose the appropriate implementation technique for their needs. Conversely, a server, gateway, or application that is invoking a callable must not have any dependency on what kind of callable was provided to it. Callables are only to be called, not introspected upon.
A Note On String Types
In general, HTTP deals with bytes, which means that this specification is mostly about handling bytes.
However, the content of those bytes often has some kind of textual interpretation, and in Python, strings are the most convenient way to handle text.
But in many Python versions and implementations, strings are Unicode, rather than bytes. This requires a careful balance between a usable API and correct translations between bytes and text in the context of HTTP... especially to support porting code between Python implementations with different str types.
WSGI therefore defines two kinds of "string":
- "Native" strings (which are always implemented using the type named str) that are used for request/response headers and metadata
- "Bytestrings" (which are implemented using the bytes type in Python 3, and str elsewhere), which are used for the bodies of requests and responses (e.g. POST/PUT input data and HTML page outputs).
Do not be confused however: even if Python's str type is actually Unicode "under the hood", the content of native strings must still be translatable to bytes via the Latin-1 encoding! (See the section on Unicode Issues later in this document for more details.)
In short: where you see the word "string" in this document, it refers to a "native" string, i.e., an object of type str, whether it is internally implemented as bytes or unicode. Where you see references to "bytestring", this should be read as "an object of type bytes under Python 3, or type str under Python 2".
And so, even though HTTP is in some sense "really just bytes", there are many API conveniences to be had by using whatever Python's default str type is.
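To make the distinction concrete, here is a short illustrative sketch (not part of the specification) showing, on Python 3, a native string that satisfies the Latin-1 translatability requirement and one that does not:

```python
def is_valid_native_string(s):
    """Return True if a native string is translatable to bytes via Latin-1,
    as this specification requires for headers and metadata."""
    try:
        s.encode('latin-1')
    except UnicodeEncodeError:
        return False
    return True

status = '200 OK'            # native string: header/metadata material
body = b'Hello world!\n'     # bytestring: request/response body material

assert is_valid_native_string(status)
assert not is_valid_native_string('\u2603')  # U+2603 is outside Latin-1
```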
The Application/Framework Side
The application object is simply a callable object that accepts two arguments. The term "object" should not be misconstrued as requiring an actual object instance: a function, method, class, or instance with a __call__ method are all acceptable for use as an application object. Application objects must be able to be invoked more than once, as virtually all servers/gateways (other than CGI) will make such repeated requests.
(Note: although we refer to it as an "application" object, this should not be construed to mean that application developers will use WSGI as a web programming API! It is assumed that application developers will continue to use existing, high-level framework services to develop their applications. WSGI is a tool for framework and server developers, and is not intended to directly support application developers.)
Here are two example application objects; one is a function, and the other is a class:
HELLO_WORLD = b"Hello world!\n"

def simple_app(environ, start_response):
    """Simplest possible application object"""
    status = '200 OK'
    response_headers = [('Content-type', 'text/plain')]
    start_response(status, response_headers)
    return [HELLO_WORLD]

class AppClass:
    """Produce the same output, but using a class

    (Note: 'AppClass' is the "application" here, so calling it
    returns an instance of 'AppClass', which is then the iterable
    return value of the "application callable" as required by
    the spec.

    If we wanted to use *instances* of 'AppClass' as application
    objects instead, we would have to implement a '__call__'
    method, which would be invoked to execute the application,
    and we would need to create an instance for use by the
    server or gateway.
    """

    def __init__(self, environ, start_response):
        self.environ = environ
        self.start = start_response

    def __iter__(self):
        status = '200 OK'
        response_headers = [('Content-type', 'text/plain')]
        self.start(status, response_headers)
        yield HELLO_WORLD
The Server/Gateway Side
The server or gateway invokes the application callable once for each request it receives from an HTTP client, that is directed at the application. To illustrate, here is a simple CGI gateway, implemented as a function taking an application object. Note that this simple example has limited error handling, because by default an uncaught exception will be dumped to sys.stderr and logged by the web server.
import os, sys

enc, esc = sys.getfilesystemencoding(), 'surrogateescape'

def unicode_to_wsgi(u):
    # Convert an environment variable to a WSGI "bytes-as-unicode" string
    return u.encode(enc, esc).decode('iso-8859-1')

def wsgi_to_bytes(s):
    return s.encode('iso-8859-1')

def run_with_cgi(application):
    environ = {k: unicode_to_wsgi(v) for k, v in os.environ.items()}
    environ['wsgi.input']        = sys.stdin.buffer
    environ['wsgi.errors']       = sys.stderr
    environ['wsgi.version']      = (1, 0)
    environ['wsgi.multithread']  = False
    environ['wsgi.multiprocess'] = True
    environ['wsgi.run_once']     = True

    if environ.get('HTTPS', 'off') in ('on', '1'):
        environ['wsgi.url_scheme'] = 'https'
    else:
        environ['wsgi.url_scheme'] = 'http'

    headers_set = []
    headers_sent = []

    def write(data):
        out = sys.stdout.buffer

        if not headers_set:
            raise AssertionError("write() before start_response()")

        elif not headers_sent:
            # Before the first output, send the stored headers
            status, response_headers = headers_sent[:] = headers_set
            out.write(wsgi_to_bytes('Status: %s\r\n' % status))
            for header in response_headers:
                out.write(wsgi_to_bytes('%s: %s\r\n' % header))
            out.write(wsgi_to_bytes('\r\n'))

        out.write(data)
        out.flush()

    def start_response(status, response_headers, exc_info=None):
        if exc_info:
            try:
                if headers_sent:
                    # Re-raise original exception if headers sent
                    raise exc_info[1].with_traceback(exc_info[2])
            finally:
                exc_info = None  # avoid dangling circular ref
        elif headers_set:
            raise AssertionError("Headers already set!")

        headers_set[:] = [status, response_headers]

        # Note: error checking on the headers should happen here,
        # *after* the headers are set.  That way, if an error
        # occurs, start_response can only be re-called with
        # exc_info set.

        return write

    result = application(environ, start_response)
    try:
        for data in result:
            if data:  # don't send headers until body appears
                write(data)
        if not headers_sent:
            write(b'')  # send headers now if body was empty
    finally:
        if hasattr(result, 'close'):
            result.close()
Middleware: Components that Play Both Sides
Note that a single object may play the role of a server with respect to some application(s), while also acting as an application with respect to some server(s). Such "middleware" components can perform such functions as:
- Routing a request to different application objects based on the target URL, after rewriting the environ accordingly.
- Allowing multiple applications or frameworks to run side-by-side in the same process
- Load balancing and remote processing, by forwarding requests and responses over a network
- Performing content postprocessing, such as applying XSL stylesheets
The presence of middleware in general is transparent to both the "server/gateway" and the "application/framework" sides of the interface, and should require no special support. A user who desires to incorporate middleware into an application simply provides the middleware component to the server, as if it were an application, and configures the middleware component to invoke the application, as if the middleware component were a server. Of course, the "application" that the middleware wraps may in fact be another middleware component wrapping another application, and so on, creating what is referred to as a "middleware stack".
For the most part, middleware must conform to the restrictions and requirements of both the server and application sides of WSGI. In some cases, however, requirements for middleware are more stringent than for a "pure" server or application, and these points will be noted in the specification.
Here is a (tongue-in-cheek) example of a middleware component that converts text/plain responses to pig latin, using Joe Strout's piglatin.py. (Note: a "real" middleware component would probably use a more robust way of checking the content type, and should also check for a content encoding. Also, this simple example ignores the possibility that a word might be split across a block boundary.)
from piglatin import piglatin

class LatinIter:

    """Transform iterated output to piglatin, if it's okay to do so

    Note that the "okayness" can change until the application yields
    its first non-empty bytestring, so 'transform_ok' has to be a mutable
    truth value.
    """

    def __init__(self, result, transform_ok):
        if hasattr(result, 'close'):
            self.close = result.close
        self._next = iter(result).__next__
        self.transform_ok = transform_ok

    def __iter__(self):
        return self

    def __next__(self):
        if self.transform_ok:
            return piglatin(self._next())  # call must be byte-safe on Py3
        else:
            return self._next()

class Latinator:

    # by default, don't transform output
    transform = False

    def __init__(self, application):
        self.application = application

    def __call__(self, environ, start_response):

        transform_ok = []

        def start_latin(status, response_headers, exc_info=None):

            # Reset ok flag, in case this is a repeat call
            del transform_ok[:]

            for name, value in response_headers:
                if name.lower() == 'content-type' and value == 'text/plain':
                    transform_ok.append(True)
                    # Strip content-length if present, else it'll be wrong
                    response_headers = [(name, value)
                        for name, value in response_headers
                        if name.lower() != 'content-length'
                    ]
                    break

            write = start_response(status, response_headers, exc_info)

            if transform_ok:
                def write_latin(data):
                    write(piglatin(data))  # call must be byte-safe on Py3
                return write_latin
            else:
                return write

        return LatinIter(self.application(environ, start_latin), transform_ok)

# Run foo_app under a Latinator's control, using the example CGI gateway
from foo_app import foo_app
run_with_cgi(Latinator(foo_app))
Specification Details
The application object must accept two positional arguments. For the sake of illustration, we have named them environ and start_response, but they are not required to have these names. A server or gateway must invoke the application object using positional (not keyword) arguments. (E.g. by calling result = application(environ, start_response) as shown above.)
The environ parameter is a dictionary object, containing CGI-style environment variables. This object must be a builtin Python dictionary (not a subclass, UserDict or other dictionary emulation), and the application is allowed to modify the dictionary in any way it desires. The dictionary must also include certain WSGI-required variables (described in a later section), and may also include server-specific extension variables, named according to a convention that will be described below.
The start_response parameter is a callable accepting two required positional arguments, and one optional argument. For the sake of illustration, we have named these arguments status, response_headers, and exc_info, but they are not required to have these names, and the application must invoke the start_response callable using positional arguments (e.g. start_response(status, response_headers)).
The status parameter is a status string of the form "999 Message here", and response_headers is a list of (header_name, header_value) tuples describing the HTTP response header. The optional exc_info parameter is described below in the sections on The start_response() Callable and Error Handling. It is used only when the application has trapped an error and is attempting to display an error message to the browser.
The start_response callable must return a write(body_data) callable that takes one positional parameter: a bytestring to be written as part of the HTTP response body. (Note: the write() callable is provided only to support certain existing frameworks' imperative output APIs; it should not be used by new applications or frameworks if it can be avoided. See the Buffering and Streaming section for more details.)
When called by the server, the application object must return an iterable yielding zero or more bytestrings. This can be accomplished in a variety of ways, such as by returning a list of bytestrings, or by the application being a generator function that yields bytestrings, or by the application being a class whose instances are iterable. Regardless of how it is accomplished, the application object must always return an iterable yielding zero or more bytestrings.
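The equivalent return forms just described can be sketched as follows (the application names are illustrative, not part of the specification):

```python
def list_app(environ, start_response):
    start_response('200 OK', [('Content-type', 'text/plain')])
    return [b'Hello ', b'world!\n']       # a plain list of bytestrings

def generator_app(environ, start_response):
    start_response('200 OK', [('Content-type', 'text/plain')])
    yield b'Hello '                       # a generator of bytestrings
    yield b'world!\n'

class IterableApp:
    """An application whose return value is any iterable of bytestrings."""
    def __call__(self, environ, start_response):
        start_response('200 OK', [('Content-type', 'text/plain')])
        return iter([b'Hello ', b'world!\n'])

def collect(app):
    """Invoke *app* with a dummy environ and join its yielded bytestrings."""
    return b''.join(app({}, lambda status, headers: None))

assert collect(list_app) == collect(generator_app) == collect(IterableApp())
```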
The server or gateway must transmit the yielded bytestrings to the client in an unbuffered fashion, completing the transmission of each bytestring before requesting another one. (In other words, applications should perform their own buffering. See the Buffering and Streaming section below for more on how application output must be handled.)
The server or gateway should treat the yielded bytestrings as binary byte sequences: in particular, it should ensure that line endings are not altered. The application is responsible for ensuring that the bytestring(s) to be written are in a format suitable for the client. (The server or gateway may apply HTTP transfer encodings, or perform other transformations for the purpose of implementing HTTP features such as byte-range transmission. See Other HTTP Features, below, for more details.)
If a call to len(iterable) succeeds, the server must be able to rely on the result being accurate. That is, if the iterable returned by the application provides a working __len__() method, it must return an accurate result. (See the Handling the Content-Length Header section for information on how this would normally be used.)
If the iterable returned by the application has a close() method, the server or gateway must call that method upon completion of the current request, whether the request was completed normally, or terminated early due to an application error during iteration or an early disconnect of the browser. (The close() method requirement is to support resource release by the application. This protocol is intended to complement PEP 342's generator support, and other common iterables with close() methods.)
Applications returning a generator or other custom iterator should not assume the entire iterator will be consumed, as it may be closed early by the server.
(Note: the application must invoke the start_response() callable before the iterable yields its first body bytestring, so that the server can send the headers before any body content. However, this invocation may be performed by the iterable's first iteration, so servers must not assume that start_response() has been called before they begin iterating over the iterable.)
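As an illustration of both points, here is a hypothetical generator application: its start_response call only happens during the first iteration, and its finally block runs whether the server exhausts it or closes it early (the "resource" is a stand-in):

```python
released = []  # stands in for a real resource (file, connection, ...)

def streaming_app(environ, start_response):
    # Runs on the *first* next() call, not when the app is invoked:
    start_response('200 OK', [('Content-type', 'text/plain')])
    try:
        for i in range(1000):
            yield b'chunk %d\n' % i
    finally:
        released.append(True)  # runs on normal exhaustion *or* close()

result = streaming_app({}, lambda status, headers: None)
next(result)      # server reads one chunk...
result.close()    # ...then the browser disconnects early
assert released == [True]
```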
Finally, servers and gateways must not directly use any other attributes of the iterable returned by the application, unless it is an instance of a type specific to that server or gateway, such as a "file wrapper" returned by wsgi.file_wrapper (see Optional Platform-Specific File Handling). In the general case, only attributes specified here, or accessed via e.g. the PEP 234 iteration APIs are acceptable.
environ Variables
The environ dictionary is required to contain these CGI environment variables, as defined by the Common Gateway Interface specification [2]. The following variables must be present, unless their value would be an empty string, in which case they may be omitted, except as otherwise noted below.
- REQUEST_METHOD
- The HTTP request method, such as "GET" or "POST". This cannot ever be an empty string, and so is always required.
- SCRIPT_NAME
- The initial portion of the request URL's "path" that corresponds to the application object, so that the application knows its virtual "location". This may be an empty string, if the application corresponds to the "root" of the server.
- PATH_INFO
- The remainder of the request URL's "path", designating the virtual "location" of the request's target within the application. This may be an empty string, if the request URL targets the application root and does not have a trailing slash.
- QUERY_STRING
- The portion of the request URL that follows the "?", if any. May be empty or absent.
- CONTENT_TYPE
- The contents of any Content-Type fields in the HTTP request. May be empty or absent.
- CONTENT_LENGTH
- The contents of any Content-Length fields in the HTTP request. May be empty or absent.
- SERVER_NAME, SERVER_PORT
- When combined with SCRIPT_NAME and PATH_INFO, these two strings can be used to complete the URL. Note, however, that HTTP_HOST, if present, should be used in preference to SERVER_NAME for reconstructing the request URL. See the URL Reconstruction section below for more detail. SERVER_NAME and SERVER_PORT can never be empty strings, and so are always required.
- SERVER_PROTOCOL
- The version of the protocol the client used to send the request. Typically this will be something like "HTTP/1.0" or "HTTP/1.1" and may be used by the application to determine how to treat any HTTP request headers. (This variable should probably be called REQUEST_PROTOCOL, since it denotes the protocol used in the request, and is not necessarily the protocol that will be used in the server's response. However, for compatibility with CGI we have to keep the existing name.)
- HTTP_ Variables
- Variables corresponding to the client-supplied HTTP request headers (i.e., variables whose names begin with "HTTP_"). The presence or absence of these variables should correspond with the presence or absence of the appropriate HTTP header in the request.
A server or gateway should attempt to provide as many other CGI variables as are applicable. In addition, if SSL is in use, the server or gateway should also provide as many of the Apache SSL environment variables [5] as are applicable, such as HTTPS=on and SSL_PROTOCOL. Note, however, that an application that uses any CGI variables other than the ones listed above is necessarily non-portable to web servers that do not support the relevant extensions. (For example, web servers that do not publish files will not be able to provide a meaningful DOCUMENT_ROOT or PATH_TRANSLATED.)
A WSGI-compliant server or gateway should document what variables it provides, along with their definitions as appropriate. Applications should check for the presence of any variables they require, and have a fallback plan in the event such a variable is absent.
Note: missing variables (such as REMOTE_USER when no authentication has occurred) should be left out of the environ dictionary. Also note that CGI-defined variables must be native strings, if they are present at all. It is a violation of this specification for any CGI variable's value to be of any type other than str.
In addition to the CGI-defined variables, the environ dictionary may also contain arbitrary operating-system "environment variables", and must contain the following WSGI-defined variables:
| Variable | Value |
|---|---|
| wsgi.version | The tuple (1, 0), representing WSGI version 1.0. |
| wsgi.url_scheme | A string representing the "scheme" portion of the URL at which the application is being invoked. Normally, this will have the value "http" or "https", as appropriate. |
| wsgi.input | An input stream (file-like object) from which the HTTP request body bytes can be read. (The server or gateway may perform reads on-demand as requested by the application, or it may pre- read the client's request body and buffer it in-memory or on disk, or use any other technique for providing such an input stream, according to its preference.) |
| wsgi.errors | An output stream (file-like object) to which error output can be written, for the purpose of recording program or other errors in a standardized and possibly centralized location. This should be a "text mode" stream; i.e., applications should use "\n" as a line ending, and assume that it will be converted to the correct line ending by the server/gateway. (On platforms where the str type is unicode, the error stream should accept and log arbitrary unicode without raising an error; it is allowed, however, to substitute characters that cannot be rendered in the stream's encoding.) For many servers, wsgi.errors will be the server's main error log. Alternatively, this may be sys.stderr, or a log file of some sort. The server's documentation should include an explanation of how to configure this or where to find the recorded output. A server or gateway may supply different error streams to different applications, if this is desired. |
| wsgi.multithread | This value should evaluate true if the application object may be simultaneously invoked by another thread in the same process, and should evaluate false otherwise. |
| wsgi.multiprocess | This value should evaluate true if an equivalent application object may be simultaneously invoked by another process, and should evaluate false otherwise. |
| wsgi.run_once | This value should evaluate true if the server or gateway expects (but does not guarantee!) that the application will only be invoked this one time during the life of its containing process. Normally, this will only be true for a gateway based on CGI (or something similar). |
Finally, the environ dictionary may also contain server-defined variables. These variables should be named using only lower-case letters, numbers, dots, and underscores, and should be prefixed with a name that is unique to the defining server or gateway. For example, mod_python might define variables with names like mod_python.some_variable.
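As a non-normative sketch, a server author might self-check a freshly built environ against the always-required variables above (SCRIPT_NAME and PATH_INFO are deliberately omitted from the check, since they may legitimately be absent when their value would be an empty string):

```python
import io
import sys

REQUIRED = [
    'REQUEST_METHOD', 'SERVER_NAME', 'SERVER_PORT', 'SERVER_PROTOCOL',
    'wsgi.version', 'wsgi.url_scheme', 'wsgi.input', 'wsgi.errors',
    'wsgi.multithread', 'wsgi.multiprocess', 'wsgi.run_once',
]

def missing_variables(environ):
    """Return the always-required keys that are absent from *environ*."""
    return [key for key in REQUIRED if key not in environ]

environ = {
    'REQUEST_METHOD': 'GET',
    'SERVER_NAME': 'localhost', 'SERVER_PORT': '80',
    'SERVER_PROTOCOL': 'HTTP/1.1',
    'wsgi.version': (1, 0), 'wsgi.url_scheme': 'http',
    'wsgi.input': io.BytesIO(b''), 'wsgi.errors': sys.stderr,
    'wsgi.multithread': False, 'wsgi.multiprocess': True,
    'wsgi.run_once': True,
}
assert missing_variables(environ) == []
```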
Input and Error Streams
The input and error streams provided by the server must support the following methods:
| Method | Stream | Notes |
|---|---|---|
| read(size) | input | 1 |
| readline() | input | 1, 2 |
| readlines(hint) | input | 1, 3 |
| __iter__() | input | |
| flush() | errors | 4 |
| write(str) | errors | |
| writelines(seq) | errors | |
The semantics of each method are as documented in the Python Library Reference, except for these notes as listed in the table above:
1. The server is not required to read past the client's specified Content-Length, and should simulate an end-of-file condition if the application attempts to read past that point. The application should not attempt to read more data than is specified by the CONTENT_LENGTH variable.
   A server should allow read() to be called without an argument, and return the remainder of the client's input stream.
   A server should return empty bytestrings from any attempt to read from an empty or exhausted input stream.
2. Servers should support the optional "size" argument to readline(), but as in WSGI 1.0, they are allowed to omit support for it. (In WSGI 1.0, the size argument was not supported, on the grounds that it might have been complex to implement, and was not often used in practice... but then the cgi module started using it, and so practical servers had to start supporting it anyway!)
3. Note that the hint argument to readlines() is optional for both caller and implementer. The application is free not to supply it, and the server or gateway is free to ignore it.
4. Since the errors stream may not be rewound, servers and gateways are free to forward write operations immediately, without buffering. In this case, the flush() method may be a no-op. Portable applications, however, cannot assume that output is unbuffered or that flush() is a no-op. They must call flush() if they need to ensure that output has in fact been written. (For example, to minimize intermingling of data from multiple processes writing to the same error log.)
The methods listed in the table above must be supported by all servers conforming to this specification. Applications conforming to this specification must not use any other methods or attributes of the input or errors objects. In particular, applications must not attempt to close these streams, even if they possess close() methods.
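A portable application-side sketch of reading the request body under these rules might look like this (the helper name is illustrative; real code would also bound how much it buffers):

```python
import io

def read_body(environ):
    """Read at most CONTENT_LENGTH bytes from wsgi.input, never past it."""
    try:
        length = int(environ.get('CONTENT_LENGTH') or 0)
    except ValueError:
        length = 0
    if length <= 0:
        return b''
    return environ['wsgi.input'].read(length)

environ = {
    'CONTENT_LENGTH': '11',
    'wsgi.input': io.BytesIO(b'hello=world&trailing-garbage'),
}
assert read_body(environ) == b'hello=world'
```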
The start_response() Callable
The second parameter passed to the application object is a callable of the form start_response(status, response_headers, exc_info=None). (As with all WSGI callables, the arguments must be supplied positionally, not by keyword.) The start_response callable is used to begin the HTTP response, and it must return a write(body_data) callable (see the Buffering and Streaming section, below).
The status argument is an HTTP "status" string like "200 OK" or "404 Not Found". That is, it is a string consisting of a Status-Code and a Reason-Phrase, in that order and separated by a single space, with no surrounding whitespace or other characters. (See RFC 2616, Section 6.1.1 for more information.) The string must not contain control characters, and must not be terminated with a carriage return, linefeed, or combination thereof.
The response_headers argument is a list of (header_name, header_value) tuples. It must be a Python list; i.e. type(response_headers) is ListType, and the server may change its contents in any way it desires. Each header_name must be a valid HTTP header field-name (as defined by RFC 2616, Section 4.2), without a trailing colon or other punctuation.
Each header_value must not include any control characters, including carriage returns or linefeeds, either embedded or at the end. (These requirements are to minimize the complexity of any parsing that must be performed by servers, gateways, and intermediate response processors that need to inspect or modify response headers.)
In general, the server or gateway is responsible for ensuring that correct headers are sent to the client: if the application omits a header required by HTTP (or other relevant specifications that are in effect), the server or gateway must add it. For example, the HTTP Date: and Server: headers would normally be supplied by the server or gateway.
(A reminder for server/gateway authors: HTTP header names are case-insensitive, so be sure to take that into consideration when examining application-supplied headers!)
Applications and middleware are forbidden from using HTTP/1.1 "hop-by-hop" features or headers, any equivalent features in HTTP/1.0, or any headers that would affect the persistence of the client's connection to the web server. These features are the exclusive province of the actual web server, and a server or gateway should consider it a fatal error for an application to attempt sending them, and raise an error if they are supplied to start_response(). (For more specifics on "hop-by-hop" features and headers, please see the Other HTTP Features section below.)
Servers should check for errors in the headers at the time start_response is called, so that an error can be raised while the application is still running.
However, the start_response callable must not actually transmit the response headers. Instead, it must store them for the server or gateway to transmit only after the first iteration of the application return value that yields a non-empty bytestring, or upon the application's first invocation of the write() callable. In other words, response headers must not be sent until there is actual body data available, or until the application's returned iterable is exhausted. (The only possible exception to this rule is if the response headers explicitly include a Content-Length of zero.)
This delaying of response header transmission is to ensure that buffered and asynchronous applications can replace their originally intended output with error output, up until the last possible moment. For example, the application may need to change the response status from "200 OK" to "500 Internal Error", if an error occurs while the body is being generated within an application buffer.
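A gateway's bookkeeping for this rule might be sketched as follows (`run_app` and the one-line header serialization are illustrative only, not a conforming implementation):

```python
def run_app(application, environ, send):
    """Sketch: store the headers passed to start_response and
    transmit them only when non-empty body data first appears."""
    state = {"response": None, "headers_sent": False}

    def start_response(status, response_headers, exc_info=None):
        state["response"] = (status, response_headers)
        return write

    def write(data):
        if data and not state["headers_sent"]:
            status, headers = state["response"]
            # stand-in for real status-line + header serialization:
            send(("%s\r\n" % status).encode("iso-8859-1"))
            state["headers_sent"] = True
        if data:
            send(data)

    for chunk in application(environ, start_response):
        write(chunk)

out = []

def app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    yield b""       # headers must NOT be transmitted for this empty chunk
    yield b"hello"  # headers are flushed just before this data

run_app(app, {}, out.append)
```

Because nothing is sent for the empty chunk, the application could still have replaced the stored status and headers via a second `start_response(..., exc_info)` call at that point.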
The exc_info argument, if supplied, must be a Python sys.exc_info() tuple. This argument should be supplied by the application only if start_response is being called by an error handler. If exc_info is supplied, and no HTTP headers have been output yet, start_response should replace the currently-stored HTTP response headers with the newly-supplied ones, thus allowing the application to "change its mind" about the output when an error has occurred.
However, if exc_info is provided, and the HTTP headers have already been sent, start_response must raise an error, and should re-raise using the exc_info tuple. That is:
raise exc_info[1].with_traceback(exc_info[2])
This will re-raise the exception trapped by the application, and in principle should abort the application. (It is not safe for the application to attempt error output to the browser once the HTTP headers have already been sent.) The application must not trap any exceptions raised by start_response, if it called start_response with exc_info. Instead, it should allow such exceptions to propagate back to the server or gateway. See Error Handling below, for more details.
The application may call start_response more than once, if and only if the exc_info argument is provided. More precisely, it is a fatal error to call start_response without the exc_info argument if start_response has already been called within the current invocation of the application. This includes the case where the first call to start_response raised an error. (See the example CGI gateway above for an illustration of the correct logic.)
Note: servers, gateways, or middleware implementing start_response should ensure that no reference is held to the exc_info parameter beyond the duration of the function's execution, to avoid creating a circular reference through the traceback and frames involved. The simplest way to do this is something like:
def start_response(status, response_headers, exc_info=None):
if exc_info:
try:
# do stuff w/exc_info here
finally:
exc_info = None # Avoid circular ref.
The example CGI gateway provides another illustration of this technique.
Handling the Content-Length Header
If the application supplies a Content-Length header, the server should not transmit more bytes to the client than the header allows, and should stop iterating over the response when enough data has been sent, or raise an error if the application tries to write() past that point. (Of course, if the application does not provide enough data to meet its stated Content-Length, the server should close the connection and log or otherwise report the error.)
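One way a server might enforce the header is with a truncating wrapper around the application's iterable (a sketch; the name is illustrative):

```python
def enforce_content_length(result, content_length):
    """Sketch: yield at most `content_length` bytes from the
    application's iterable, stopping iteration once satisfied."""
    remaining = content_length
    for data in result:
        if remaining <= 0:
            break                    # enough data has already been sent
        if len(data) > remaining:
            data = data[:remaining]  # truncate the final block
        remaining -= len(data)
        yield data
```

A real server would also report an error if the iterable is exhausted with `remaining` still positive.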
If the application does not supply a Content-Length header, a server or gateway may choose one of several approaches to handling it. The simplest of these is to close the client connection when the response is completed.
Under some circumstances, however, the server or gateway may be able to either generate a Content-Length header, or at least avoid the need to close the client connection. If the application does not call the write() callable, and returns an iterable whose len() is 1, then the server can automatically determine Content-Length by taking the length of the first bytestring yielded by the iterable.
And, if the server and client both support HTTP/1.1 "chunked encoding" [3], then the server may use chunked encoding to send a chunk for each write() call or bytestring yielded by the iterable, thus generating a Content-Length header for each chunk. This allows the server to keep the client connection alive, if it wishes to do so. Note that the server must comply fully with RFC 2616 when doing this, or else fall back to one of the other strategies for dealing with the absence of Content-Length.
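The chunk framing itself is simple; a sketch of the encoding (per RFC 2616, section 3.6.1, with trailers omitted) looks like this:

```python
def to_chunk(data):
    """Frame one bytestring as an HTTP/1.1 chunk: hex length,
    CRLF, the data, CRLF (RFC 2616, section 3.6.1)."""
    return ("%x\r\n" % len(data)).encode("ascii") + data + b"\r\n"

def chunked_body(result):
    """Sketch: chunk-encode each non-empty bytestring from the
    application iterable, then emit the terminating zero-size chunk."""
    for data in result:
        if data:            # a zero-length chunk would end the body early
            yield to_chunk(data)
    yield b"0\r\n\r\n"
```

Note the skip of empty bytestrings: encoding one would produce the "last chunk" marker prematurely.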
(Note: applications and middleware must not apply any kind of Transfer-Encoding to their output, such as chunking or gzipping; as "hop-by-hop" operations, these encodings are the province of the actual web server/gateway. See Other HTTP Features below, for more details.)
Buffering and Streaming
Generally speaking, applications will achieve the best throughput by buffering their (modestly-sized) output and sending it all at once. This is a common approach in existing frameworks such as Zope: the output is buffered in a StringIO or similar object, then transmitted all at once, along with the response headers.
The corresponding approach in WSGI is for the application to simply return a single-element iterable (such as a list) containing the response body as a single bytestring. This is the recommended approach for the vast majority of application functions, that render HTML pages whose text easily fits in memory.
For large files, however, or for specialized uses of HTTP streaming (such as multipart "server push"), an application may need to provide output in smaller blocks (e.g. to avoid loading a large file into memory). It's also sometimes the case that part of a response may be time-consuming to produce, but it would be useful to send ahead the portion of the response that precedes it.
In these cases, applications will usually return an iterator (often a generator-iterator) that produces the output in a block-by-block fashion. These blocks may be broken to coincide with multipart boundaries (for "server push"), or just before time-consuming tasks (such as reading another block of an on-disk file).
WSGI servers, gateways, and middleware must not delay the transmission of any block; they must either fully transmit the block to the client, or guarantee that they will continue transmission even while the application is producing its next block. A server/gateway or middleware may provide this guarantee in one of three ways:
- Send the entire block to the operating system (and request that any O/S buffers be flushed) before returning control to the application, OR
- Use a different thread to ensure that the block continues to be transmitted while the application produces the next block, OR
- (Middleware only) send the entire block to its parent gateway/server.
By providing this guarantee, WSGI allows applications to ensure that transmission will not become stalled at an arbitrary point in their output data. This is critical for proper functioning of e.g. multipart "server push" streaming, where data between multipart boundaries should be transmitted in full to the client.
Middleware Handling of Block Boundaries
In order to better support asynchronous applications and servers, middleware components must not block iteration waiting for multiple values from an application iterable. If the middleware needs to accumulate more data from the application before it can produce any output, it must yield an empty bytestring.
To put this requirement another way, a middleware component must yield at least one value each time its underlying application yields a value. If the middleware cannot yield any other value, it must yield an empty bytestring.
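As an example of this rule, a (hypothetical) middleware component that buffers output until it sees a blank line must still yield something each time through:

```python
def buffer_until_blank_line(result):
    """Sketch of the one-value-per-value rule: a middleware that
    accumulates output until it sees a blank line must still yield
    b"" for every input value it consumes without producing output."""
    buffered = b""
    for data in result:
        buffered += data
        if b"\n\n" in buffered:
            out, _, buffered = buffered.partition(b"\n\n")
            yield out + b"\n\n"
        else:
            yield b""        # keep the server's iteration loop moving
    if buffered:
        yield buffered       # flush whatever remains at the end
```

The empty bytestring carries no body data, but it returns control to the server, which can service other requests in the meantime.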
This requirement ensures that asynchronous applications and servers can conspire to reduce the number of threads that are required to run a given number of application instances simultaneously.
Note also that this requirement means that middleware must return an iterable as soon as its underlying application returns an iterable. It is also forbidden for middleware to use the write() callable to transmit data that is yielded by an underlying application. Middleware may only use their parent server's write() callable to transmit data that the underlying application sent using a middleware-provided write() callable.
The write() Callable
Some existing application framework APIs support unbuffered output in a different manner than WSGI. Specifically, they provide a "write" function or method of some kind to write an unbuffered block of data, or else they provide a buffered "write" function and a "flush" mechanism to flush the buffer.
Unfortunately, such APIs cannot be implemented in terms of WSGI's "iterable" application return value, unless threads or other special mechanisms are used.
Therefore, to allow these frameworks to continue using an imperative API, WSGI includes a special write() callable, returned by the start_response callable.
New WSGI applications and frameworks should not use the write() callable if it is possible to avoid doing so. The write() callable is strictly a hack to support imperative streaming APIs. In general, applications should produce their output via their returned iterable, as this makes it possible for web servers to interleave other tasks in the same Python thread, potentially providing better throughput for the server as a whole.
The write() callable is returned by the start_response() callable, and it accepts a single parameter: a bytestring to be written as part of the HTTP response body, which is treated exactly as though it had been yielded by the output iterable. In other words, before write() returns, it must guarantee that the passed-in bytestring was either completely sent to the client, or that it is buffered for transmission while the application proceeds onward.
An application must return an iterable object, even if it uses write() to produce all or part of its response body. The returned iterable may be empty (i.e. yield no non-empty bytestrings), but if it does yield non-empty bytestrings, that output must be treated normally by the server or gateway (i.e., it must be sent or queued immediately). Applications must not invoke write() from within their return iterable, and therefore any bytestrings yielded by the iterable are transmitted after all bytestrings passed to write() have been sent to the client.
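The following sketch shows an application combining both mechanisms, driven by a toy harness (the harness is not a conforming gateway; it exists only to show the ordering):

```python
def streaming_app(environ, start_response):
    """Sketch: push part of the body via write(), return the rest
    as an iterable; the iterable's output follows the write() data."""
    write = start_response("200 OK", [("Content-Type", "text/plain")])
    write(b"pushed ")       # sent or queued before the iterable runs
    return [b"pulled"]      # transmitted after all write() data

# Toy harness just to demonstrate the ordering:
sent = []

def fake_start_response(status, response_headers, exc_info=None):
    return sent.append      # a trivial write() that records the data

for data in streaming_app({}, fake_start_response):
    sent.append(data)
```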
Unicode Issues
HTTP does not directly support Unicode, and neither does this interface. All encoding/decoding must be handled by the application; all strings passed to or from the server must be of type str or bytes, never unicode. The result of using a unicode object where a string object is required is undefined.
Note also that strings passed to start_response() as a status or as response headers must follow RFC 2616 with respect to encoding. That is, they must either be ISO-8859-1 characters, or use RFC 2047 MIME encoding.
On Python platforms where the str or StringType type is in fact Unicode-based (e.g. Jython, IronPython, Python 3, etc.), all "strings" referred to in this specification must contain only code points representable in ISO-8859-1 encoding (\u0000 through \u00FF, inclusive). It is a fatal error for an application to supply strings containing any other Unicode character or code point. Similarly, servers and gateways must not supply strings to an application containing any other Unicode characters.
Again, all objects referred to in this specification as "strings" must be of type str or StringType, and must not be of type unicode or UnicodeType. And, even if a given platform allows for more than 8 bits per character in str/StringType objects, only the lower 8 bits may be used, for any value referred to in this specification as a "string".
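Under Python 3, this restriction can be checked mechanically; a minimal sketch (the function name and the choice of ValueError are illustrative):

```python
def check_native_string(value):
    """Sketch (Python 3): verify that a value used as a 'string' in
    this spec contains only code points representable in ISO-8859-1
    (U+0000 through U+00FF, inclusive)."""
    try:
        value.encode("iso-8859-1")
    except UnicodeEncodeError:
        raise ValueError("non ISO-8859-1 'string': %r" % (value,))
    return value
```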
For values referred to in this specification as "bytestrings" (i.e., values read from wsgi.input, passed to write() or yielded by the application), the value must be of type bytes under Python 3, and str in earlier versions of Python.
Error Handling
In general, applications should try to trap their own, internal errors, and display a helpful message in the browser. (It is up to the application to decide what "helpful" means in this context.)
However, to display such a message, the application must not have actually sent any data to the browser yet, or else it risks corrupting the response. WSGI therefore provides a mechanism to either allow the application to send its error message, or be automatically aborted: the exc_info argument to start_response. Here is an example of its use:
try:
    # regular application code here
    status = "200 Froody"
    response_headers = [("content-type", "text/plain")]
    start_response(status, response_headers)
    return [b"normal body goes here"]
except:
    # XXX should trap runtime issues like MemoryError, KeyboardInterrupt
    #     in a separate handler before this bare 'except:'...
    status = "500 Oops"
    response_headers = [("content-type", "text/plain")]
    start_response(status, response_headers, sys.exc_info())
    return [b"error body goes here"]
If no output has been written when an exception occurs, the call to start_response will return normally, and the application will return an error body to be sent to the browser. However, if any output has already been sent to the browser, start_response will reraise the provided exception. This exception should not be trapped by the application, and so the application will abort. The server or gateway can then trap this (fatal) exception and abort the response.
Servers should trap and log any exception that aborts an application or the iteration of its return value. If a partial response has already been written to the browser when an application error occurs, the server or gateway may attempt to add an error message to the output, if the already-sent headers indicate a text/* content type that the server knows how to modify cleanly.
Some middleware may wish to provide additional exception handling services, or intercept and replace application error messages. In such cases, middleware may choose to not re-raise the exc_info supplied to start_response, but instead raise a middleware-specific exception, or simply return without an exception after storing the supplied arguments. This will then cause the application to return its error body iterable (or invoke write()), allowing the middleware to capture and modify the error output. These techniques will work as long as application authors:
- Always provide exc_info when beginning an error response
- Never trap errors raised by start_response when exc_info is being provided
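A middleware's interception hook might be sketched like this (all names are illustrative; a real component would forward the rewritten error output to its parent server afterwards):

```python
def make_capturing_start_response(parent_start_response, captured):
    """Sketch: a middleware-provided start_response that, on error,
    captures the status, headers, and body instead of re-raising, so
    the middleware can rewrite the error output before forwarding it."""
    def start_response(status, response_headers, exc_info=None):
        if exc_info is not None:
            try:
                captured[:] = [status, response_headers]
            finally:
                exc_info = None         # avoid a circular reference
            return captured.append      # write() that just records body data
        return parent_start_response(status, response_headers)
    return start_response
```

Because the application (per the rules above) never traps errors raised by start_response when exc_info is given, returning normally here is enough to make it proceed with its error body.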
HTTP 1.1 Expect/Continue
Servers and gateways that implement HTTP 1.1 must provide transparent support for HTTP 1.1's "expect/continue" mechanism. This may be done in any of several ways:
- Respond to requests containing an Expect: 100-continue header with an immediate "100 Continue" response, and proceed normally.
- Proceed with the request normally, but provide the application with a wsgi.input stream that will send the "100 Continue" response if/when the application first attempts to read from the input stream. The read request must then remain blocked until the client responds.
- Wait until the client decides that the server does not support expect/continue, and sends the request body on its own. (This is suboptimal, and is not recommended.)
Note that these behavior restrictions do not apply for HTTP 1.0 requests, or for requests that are not directed to an application object. For more information on HTTP 1.1 Expect/Continue, see RFC 2616, sections 8.2.3 and 10.1.1.
Other HTTP Features
In general, servers and gateways should "play dumb" and allow the application complete control over its output. They should only make changes that do not alter the effective semantics of the application's response. It is always possible for the application developer to add middleware components to supply additional features, so server/gateway developers should be conservative in their implementation. In a sense, a server should consider itself to be like an HTTP "gateway server", with the application being an HTTP "origin server". (See RFC 2616, section 1.3, for the definition of these terms.)
However, because WSGI servers and applications do not communicate via HTTP, what RFC 2616 calls "hop-by-hop" headers do not apply to WSGI internal communications. WSGI applications must not generate any "hop-by-hop" headers [4], attempt to use HTTP features that would require them to generate such headers, or rely on the content of any incoming "hop-by-hop" headers in the environ dictionary. WSGI servers must handle any supported inbound "hop-by-hop" headers on their own, such as by decoding any inbound Transfer-Encoding, including chunked encoding if applicable.
Applying these principles to a variety of HTTP features, it should be clear that a server may handle cache validation via the If-None-Match and If-Modified-Since request headers and the Last-Modified and ETag response headers. However, it is not required to do this, and the application should perform its own cache validation if it wants to support that feature, since the server/gateway is not required to do such validation.
Similarly, a server may re-encode or transport-encode an application's response, but the application should use a suitable content encoding on its own, and must not apply a transport encoding. A server may transmit byte ranges of the application's response if requested by the client, and the application doesn't natively support byte ranges. Again, however, the application should perform this function on its own if desired.
Note that these restrictions on applications do not necessarily mean that every application must reimplement every HTTP feature; many HTTP features can be partially or fully implemented by middleware components, thus freeing both server and application authors from implementing the same features over and over again.
Thread Support
Thread support, or lack thereof, is also server-dependent. Servers that can run multiple requests in parallel should also provide the option of running an application in a single-threaded fashion, so that applications or frameworks that are not thread-safe may still be used with that server.
Implementation/Application Notes
Server Extension APIs
Some server authors may wish to expose more advanced APIs that application or framework authors can use for specialized purposes. For example, a gateway based on mod_python might wish to expose part of the Apache API as a WSGI extension.
In the simplest case, this requires nothing more than defining an environ variable, such as mod_python.some_api. But, in many cases, the possible presence of middleware can make this difficult. For example, an API that offers access to the same HTTP headers that are found in environ variables, might return different data if environ has been modified by middleware.
In general, any extension API that duplicates, supplants, or bypasses some portion of WSGI functionality runs the risk of being incompatible with middleware components. Server/gateway developers should not assume that nobody will use middleware, because some framework developers specifically intend to organize or reorganize their frameworks to function almost entirely as middleware of various kinds.
So, to provide maximum compatibility, servers and gateways that provide extension APIs that replace some WSGI functionality, must design those APIs so that they are invoked using the portion of the API that they replace. For example, an extension API to access HTTP request headers must require the application to pass in its current environ, so that the server/gateway may verify that HTTP headers accessible via the API have not been altered by middleware. If the extension API cannot guarantee that it will always agree with environ about the contents of HTTP headers, it must refuse service to the application, e.g. by raising an error, returning None instead of a header collection, or whatever is appropriate to the API.
Similarly, if an extension API provides an alternate means of writing response data or headers, it should require the start_response callable to be passed in, before the application can obtain the extended service. If the object passed in is not the same one that the server/gateway originally supplied to the application, it cannot guarantee correct operation and must refuse to provide the extended service to the application.
These guidelines also apply to middleware that adds information such as parsed cookies, form variables, sessions, and the like to environ. Specifically, such middleware should provide these features as functions which operate on environ, rather than simply stuffing values into environ. This helps ensure that information is calculated from environ after any middleware has done any URL rewrites or other environ modifications.
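A sketch of this pattern (the `example.` environ key and the naive cookie parsing are illustrative only):

```python
def get_cookies(environ):
    """Sketch of the recommended pattern: compute derived data from
    environ on demand, memoizing the result in environ, rather than
    having middleware stuff precomputed values in up front."""
    header = environ.get("HTTP_COOKIE", "")
    cached = environ.get("example.cookies_cache")  # illustrative key
    if cached is not None and cached[0] == header:
        return cached[1]                # header unchanged; reuse the parse
    cookies = dict(
        pair.strip().split("=", 1)
        for pair in header.split(";") if "=" in pair
    )
    environ["example.cookies_cache"] = (header, cookies)
    return cookies
```

Because the parse happens at call time and is invalidated when HTTP_COOKIE changes, the function always agrees with environ, even after middleware modifies it.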
It is very important that these "safe extension" rules be followed by both server/gateway and middleware developers, in order to avoid a future in which middleware developers are forced to delete any and all extension APIs from environ to ensure that their mediation isn't being bypassed by applications using those extensions!
Application Configuration
This specification does not define how a server selects or obtains an application to invoke. These and other configuration options are highly server-specific matters. It is expected that server/gateway authors will document how to configure the server to execute a particular application object, and with what options (such as threading options).
Framework authors, on the other hand, should document how to create an application object that wraps their framework's functionality. The user, who has chosen both the server and the application framework, must connect the two together. However, since both the framework and the server now have a common interface, this should be merely a mechanical matter, rather than a significant engineering effort for each new server/framework pair.
Finally, some applications, frameworks, and middleware may wish to use the environ dictionary to receive simple string configuration options. Servers and gateways should support this by allowing an application's deployer to specify name-value pairs to be placed in environ. In the simplest case, this support can consist merely of copying all operating system-supplied environment variables from os.environ into the environ dictionary, since the deployer in principle can configure these externally to the server, or in the CGI case they may be able to be set via the server's configuration files.
Applications should try to keep such required variables to a minimum, since not all servers will support easy configuration of them. Of course, even in the worst case, persons deploying an application can create a script to supply the necessary configuration values:
from the_app import application

def new_app(environ, start_response):
    environ['the_app.configval1'] = 'something'
    return application(environ, start_response)
But, most existing applications and frameworks will probably only need a single configuration value from environ, to indicate the location of their application or framework-specific configuration file(s). (Of course, applications should cache such configuration, to avoid having to re-read it upon each invocation.)
URL Reconstruction
If an application wishes to reconstruct a request's complete URL, it may do so using the following algorithm, contributed by Ian Bicking:
from urllib import quote

url = environ['wsgi.url_scheme'] + '://'

if environ.get('HTTP_HOST'):
    url += environ['HTTP_HOST']
else:
    url += environ['SERVER_NAME']

    if environ['wsgi.url_scheme'] == 'https':
        if environ['SERVER_PORT'] != '443':
            url += ':' + environ['SERVER_PORT']
    else:
        if environ['SERVER_PORT'] != '80':
            url += ':' + environ['SERVER_PORT']

url += quote(environ.get('SCRIPT_NAME', ''))
url += quote(environ.get('PATH_INFO', ''))

if environ.get('QUERY_STRING'):
    url += '?' + environ['QUERY_STRING']
Note that such a reconstructed URL may not be precisely the same URI as requested by the client. Server rewrite rules, for example, may have modified the client's originally requested URL to place it in a canonical form.
Supporting Older (<2.2) Versions of Python
Some servers, gateways, or applications may wish to support older (<2.2) versions of Python. This is especially important if Jython is a target platform, since as of this writing a production-ready version of Jython 2.2 is not yet available.
For servers and gateways, this is relatively straightforward: servers and gateways targeting pre-2.2 versions of Python must simply restrict themselves to using only a standard "for" loop to iterate over any iterable returned by an application. This is the only way to ensure source-level compatibility with both the pre-2.2 iterator protocol (discussed further below) and "today's" iterator protocol (see PEP 234).
(Note that this technique necessarily applies only to servers, gateways, or middleware that are written in Python. Discussion of how to use iterator protocol(s) correctly from other languages is outside the scope of this PEP.)
For applications, supporting pre-2.2 versions of Python is slightly more complex:
- You may not return a file object and expect it to work as an iterable, since before Python 2.2, files were not iterable. (In general, you shouldn't do this anyway, because it will perform quite poorly most of the time!) Use wsgi.file_wrapper or an application-specific file wrapper class. (See Optional Platform-Specific File Handling for more on wsgi.file_wrapper, and an example class you can use to wrap a file as an iterable.)
- If you return a custom iterable, it must implement the pre-2.2 iterator protocol. That is, provide a __getitem__ method that accepts an integer key, and raises IndexError when exhausted. (Note that built-in sequence types are also acceptable, since they also implement this protocol.)
Finally, middleware that wishes to support pre-2.2 versions of Python, and iterates over application return values or itself returns an iterable (or both), must follow the appropriate recommendations above.
(Note: It should go without saying that to support pre-2.2 versions of Python, any server, gateway, application, or middleware must also use only language features available in the target version, use 1 and 0 instead of True and False, etc.)
Optional Platform-Specific File Handling
Some operating environments provide special high-performance file transmission facilities, such as the Unix sendfile() call. Servers and gateways may expose this functionality via an optional wsgi.file_wrapper key in the environ. An application may use this "file wrapper" to convert a file or file-like object into an iterable that it then returns, e.g.:
if 'wsgi.file_wrapper' in environ:
    return environ['wsgi.file_wrapper'](filelike, block_size)
else:
    return iter(lambda: filelike.read(block_size), '')
If the server or gateway supplies wsgi.file_wrapper, it must be a callable that accepts one required positional parameter, and one optional positional parameter. The first parameter is the file-like object to be sent, and the second parameter is an optional block size "suggestion" (which the server/gateway need not use). The callable must return an iterable object, and must not perform any data transmission until and unless the server/gateway actually receives the iterable as a return value from the application. (To do otherwise would prevent middleware from being able to interpret or override the response data.)
To be considered "file-like", the object supplied by the application must have a read() method that takes an optional size argument. It may have a close() method, and if so, the iterable returned by wsgi.file_wrapper must have a close() method that invokes the original file-like object's close() method. If the "file-like" object has any other methods or attributes with names matching those of Python built-in file objects (e.g. fileno()), the wsgi.file_wrapper may assume that these methods or attributes have the same semantics as those of a built-in file object.
The actual implementation of any platform-specific file handling must occur after the application returns, and the server or gateway checks to see if a wrapper object was returned. (Again, because of the presence of middleware, error handlers, and the like, it is not guaranteed that any wrapper created will actually be used.)
Apart from the handling of close(), the semantics of returning a file wrapper from the application should be the same as if the application had returned iter(filelike.read, ''). In other words, transmission should begin at the current position within the "file" at the time that transmission begins, and continue until the end is reached, or until Content-Length bytes have been written. (If the application doesn't supply a Content-Length, the server may generate one from the file using its knowledge of the underlying file implementation.)
Of course, platform-specific file transmission APIs don't usually accept arbitrary "file-like" objects. Therefore, a wsgi.file_wrapper has to introspect the supplied object for things such as a fileno() (Unix-like OSes) or a java.nio.FileChannel (under Jython) in order to determine if the file-like object is suitable for use with the platform-specific API it supports.
Note that even if the object is not suitable for the platform API, the wsgi.file_wrapper must still return an iterable that wraps read() and close(), so that applications using file wrappers are portable across platforms. Here's a simple platform-agnostic file wrapper class, suitable for old (pre 2.2) and new Pythons alike:
class FileWrapper:

    def __init__(self, filelike, blksize=8192):
        self.filelike = filelike
        self.blksize = blksize
        if hasattr(filelike, 'close'):
            self.close = filelike.close

    def __getitem__(self, key):
        data = self.filelike.read(self.blksize)
        if data:
            return data
        raise IndexError
and here is a snippet from a server/gateway that uses it to provide access to a platform-specific API:
environ['wsgi.file_wrapper'] = FileWrapper
result = application(environ, start_response)

try:
    if isinstance(result, FileWrapper):
        # check if result.filelike is usable w/platform-specific
        # API, and if so, use that API to transmit the result.
        # If not, fall through to the normal iterable handling
        # loop below.
        pass

    for data in result:
        pass  # etc.
finally:
    if hasattr(result, 'close'):
        result.close()
Questions and Answers
Why must environ be a dictionary? What's wrong with using a subclass?
The rationale for requiring a dictionary is to maximize portability between servers. The alternative would be to define some subset of a dictionary's methods as being the standard and portable interface. In practice, however, most servers will probably find a dictionary adequate to their needs, and thus framework authors will come to expect the full set of dictionary features to be available, since they will be there more often than not. But, if some server chooses not to use a dictionary, then there will be interoperability problems despite that server's "conformance" to spec. Therefore, making a dictionary mandatory simplifies the specification and guarantees interoperability.
Note that this does not prevent server or framework developers from offering specialized services as custom variables inside the environ dictionary. This is the recommended approach for offering any such value-added services.
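As a sketch of that recommended approach, a server might publish a value-added service under a prefixed key, with applications probing for it defensively. The key ``myserver.session_store`` below is invented purely for illustration; it is not defined by this specification:

```python
# Hypothetical server extension in environ, and an application that
# probes for it defensively.

def build_environ(base_environ, session_store):
    environ = dict(base_environ)
    # A distinct, server-specific prefix avoids collisions with the
    # CGI variables and the wsgi.* variables defined by this spec.
    environ['myserver.session_store'] = session_store
    return environ

def application(environ, start_response):
    store = environ.get('myserver.session_store')  # may be absent
    body = b'sessions available' if store is not None else b'plain'
    start_response('200 OK', [('Content-Type', 'text/plain'),
                              ('Content-Length', str(len(body)))])
    return [body]
```

An application written this way remains portable: on servers without the extension, it simply takes the fallback path.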
Why can you call write() and yield bytestrings/return an iterable? Shouldn't we pick just one way?
If we supported only the iteration approach, then current frameworks that assume the availability of "push" suffer. But, if we only support pushing via write(), then server performance suffers for transmission of e.g. large files (if a worker thread can't begin work on a new request until all of the output has been sent). Thus, this compromise allows an application framework to support both approaches, as appropriate, but with only a little more burden to the server implementor than a push-only approach would require.
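The same response, produced both ways. Neither function below is mandated by the spec; they simply illustrate the two styles the compromise permits:

```python
def iterable_app(environ, start_response):
    # "Pull" style: the server iterates and transmits each chunk, so
    # a large response never has to be buffered in full.
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'Hello, ', b'world!']

def push_app(environ, start_response):
    # "Push" style: the application transmits imperatively via the
    # write() callable that start_response returns, then returns an
    # empty iterable.
    write = start_response('200 OK', [('Content-Type', 'text/plain')])
    write(b'Hello, ')
    write(b'world!')
    return []
```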
What's the close() for?
When writes are done during the execution of an application object, the application can ensure that resources are released using a try/finally block. But, if the application returns an iterable, any resources used will not be released until the iterable is garbage collected. The close() idiom allows an application to release critical resources at the end of a request, and it's forward-compatible with the support for try/finally in generators that's proposed by PEP 325.
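A sketch of the close() idiom follows. (Since PEP 342, generators have their own close() method, so the try/finally support anticipated by PEP 325 now exists: calling result.close() raises GeneratorExit inside the generator, running its finally block.) The Resource class is a stand-in for a real handle such as a file or database connection:

```python
class Resource:
    def __init__(self):
        self.released = False
        self.chunks = [b'a', b'b', b'c']
    def __iter__(self):
        return iter(self.chunks)
    def release(self):
        self.released = True

def application(environ, start_response, resource):
    # resource is passed in explicitly only to keep the demo small.
    start_response('200 OK', [('Content-Type', 'text/plain')])
    try:
        for chunk in resource:
            yield chunk
    finally:
        resource.release()  # runs even if iteration is abandoned

def serve_one_chunk(app, resource):
    # Server side: transmit (here, just collect) some output, then
    # call close() whether or not the iterable was exhausted.
    result = app({}, lambda status, headers: None, resource)
    chunks = []
    try:
        for data in result:
            chunks.append(data)
            break  # simulate a client disconnecting early
    finally:
        if hasattr(result, 'close'):
            result.close()
    return chunks
```

Even though the server abandons the iterable after one chunk, the generator's finally block still releases the resource.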
Why is this interface so low-level? I want feature X! (e.g. cookies, sessions, persistence, ...)
This isn't Yet Another Python Web Framework. It's just a way for frameworks to talk to web servers, and vice versa. If you want these features, you need to pick a web framework that provides the features you want. And if that framework lets you create a WSGI application, you should be able to run it in most WSGI-supporting servers. Also, some WSGI servers may offer additional services via objects provided in their environ dictionary; see the applicable server documentation for details. (Of course, applications that use such extensions will not be portable to other WSGI-based servers.)
Why use CGI variables instead of good old HTTP headers? And why mix them in with WSGI-defined variables?
Many existing web frameworks are built heavily upon the CGI spec, and existing web servers know how to generate CGI variables. In contrast, alternative ways of representing inbound HTTP information are fragmented and lack market share. Thus, using the CGI "standard" seems like a good way to leverage existing implementations. As for mixing them with WSGI variables, separating them would just require two dictionary arguments to be passed around, while providing no real benefits.
What about the status string? Can't we just use the number, passing in 200 instead of "200 OK"?
Doing this would complicate the server or gateway, by requiring them to have a table of numeric statuses and corresponding messages. By contrast, it is easy for an application or framework author to type the extra text to go with the specific response code they are using, and existing frameworks often already have a table containing the needed messages. So, on balance it seems better to make the application/framework responsible, rather than the server or gateway.
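Conversely, a framework that stores only numeric codes can still build the required "999 Message" string in one line, because (in Python 3, at least) the standard library already ships the reason-phrase table as the ``http.client.responses`` mapping:

```python
from http.client import responses

def status_line(code):
    # Fall back to a generic phrase for codes the table doesn't know.
    return '%d %s' % (code, responses.get(code, 'Unknown'))
```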
Why is wsgi.run_once not guaranteed to run the app only once?
Because it's merely a suggestion to the application that it should "rig for infrequent running". This is intended for application frameworks that have multiple modes of operation for caching, sessions, and so forth. In a "multiple run" mode, such frameworks may preload caches, and may not write e.g. logs or session data to disk after each request. In "single run" mode, such frameworks avoid preloading and flush all necessary writes after each request.
However, in order to test an application or framework to verify correct operation in the latter mode, it may be necessary (or at least expedient) to invoke it more than once. Therefore, an application should not assume that it will definitely not be run again, just because it is called with wsgi.run_once set to True.
Feature X (dictionaries, callables, etc.) is ugly for use in application code; why don't we use objects instead?
All of these implementation choices of WSGI are specifically intended to decouple features from one another; recombining these features into encapsulated objects makes it somewhat harder to write servers or gateways, and an order of magnitude harder to write middleware that replaces or modifies only small portions of the overall functionality.
In essence, middleware wants to have a "Chain of Responsibility" pattern, whereby it can act as a "handler" for some functions, while allowing others to remain unchanged. This is difficult to do with ordinary Python objects, if the interface is to remain extensible. For example, one must use __getattr__ or __getattribute__ overrides, to ensure that extensions (such as attributes defined by future WSGI versions) are passed through.
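For contrast, with the spec's plain dictionaries and callables, pass-through middleware needs no __getattr__ machinery at all. This (hypothetical) middleware overrides one aspect — appending a response header — and forwards everything else untouched, including environ keys and extensions it has never heard of:

```python
class AddHeaderMiddleware:
    def __init__(self, app, name, value):
        self.app = app
        self.header = (name, value)

    def __call__(self, environ, start_response):
        def wrapped_start_response(status, headers, exc_info=None):
            # Override one thing; pass everything else through.
            return start_response(status, headers + [self.header],
                                  exc_info)
        return self.app(environ, wrapped_start_response)
```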
This type of code is notoriously difficult to get 100% correct, and few people will want to write it themselves. They will therefore copy other people's implementations, but fail to update them when the person they copied from corrects yet another corner case.
Further, this necessary boilerplate would be pure excise, a developer tax paid by middleware developers to support a slightly prettier API for application framework developers. But, application framework developers will typically only be updating one framework to support WSGI, and in a very limited part of their framework as a whole. It will likely be their first (and maybe their only) WSGI implementation, and thus they will likely implement with this specification ready to hand. Thus, the effort of making the API "prettier" with object attributes and suchlike would likely be wasted for this audience.
We encourage those who want a prettier (or otherwise improved) WSGI interface for use in direct web application programming (as opposed to web framework development) to develop APIs or frameworks that wrap WSGI for convenient use by application developers. In this way, WSGI can remain conveniently low-level for server and middleware authors, while not being "ugly" for application developers.
Proposed/Under Discussion
These items are currently being discussed on the Web-SIG and elsewhere, or are on the PEP author's "to-do" list:
- Should wsgi.input be an iterator instead of a file? This would help for asynchronous applications and chunked-encoding input streams.
- Optional extensions are being discussed for pausing iteration of an application's output until input is available or until a callback occurs.
- Add a section about synchronous vs. asynchronous apps and servers, the relevant threading models, and issues/design goals in these areas.
Acknowledgements
Thanks go to the many folks on the Web-SIG mailing list whose thoughtful feedback made this revised draft possible. Especially:
- Gregory "Grisha" Trubetskoy, author of mod_python, who beat up on the first draft as not offering any advantages over "plain old CGI", thus encouraging me to look for a better approach.
- Ian Bicking, who helped nag me into properly specifying the multithreading and multiprocess options, as well as badgering me to provide a mechanism for servers to supply custom extension data to an application.
- Tony Lownds, who came up with the concept of a start_response function that took the status and headers, returning a write function. His input also guided the design of the exception handling facilities, especially in the area of allowing for middleware that overrides application error messages.
- Alan Kennedy, whose courageous attempts to implement WSGI-on-Jython (well before the spec was finalized) helped to shape the "supporting older versions of Python" section, as well as the optional wsgi.file_wrapper facility, and some of the early bytes/unicode decisions.
- Mark Nottingham, who reviewed the spec extensively for issues with HTTP RFC compliance, especially with regard to HTTP/1.1 features that I didn't even know existed until he pointed them out.
- Graham Dumpleton, who worked tirelessly (even in the face of my laziness and stupidity) to get some sort of Python 3 version of WSGI out, who proposed the "native strings" vs. "byte strings" concept, and thoughtfully wrestled through a great many HTTP, wsgi.input, and other amendments. Most, if not all, of the credit for this new PEP belongs to him.
References
| [1] | The Python Wiki "Web Programming" topic (http://www.python.org/cgi-bin/moinmoin/WebProgramming) |
| [2] | The Common Gateway Interface Specification, v 1.1, 3rd Draft (http://ken.coar.org/cgi/draft-coar-cgi-v11-03.txt) |
| [3] | "Chunked Transfer Coding" -- HTTP/1.1, section 3.6.1 (http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.6.1) |
| [4] | "End-to-end and Hop-by-hop Headers" -- HTTP/1.1, Section 13.5.1 (http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.5.1) |
| [5] | mod_ssl Reference, "Environment Variables" (http://www.modssl.org/docs/2.8/ssl_reference.html#ToC25) |
| [6] | Procedural issues regarding modifications to PEP 333 (http://mail.python.org/pipermail/python-dev/2010-September/104114.html) |
| [7] | SVN revision history for PEP 3333, showing differences from PEP 333 (http://svn.python.org/view/peps/trunk/pep-3333.txt?r1=84854&r2=HEAD) |
Copyright
This document has been placed in the public domain.